Added compatibility with CPU
This MR introduces the option `-DCPU`. Activating the option (`cmake -DCPU=ON ..`) produces the following:
- All code is compiled with the host compiler.
- No CUDA libraries or compiler are necessary.
- All device code is compiled with a corresponding CPU version (e.g. `cudaMemcpy`, `cudaMemset`, `atomicAdd` are translated; `__syncthreads()`, `__global__`, `__shared__` are removed; `blockDim`, `gridDim`, `blockIdx`, `threadIdx` are translated, and one instance of each is created and kept throughout the application, etc.).
- Effectively, this means the device is treated as the CPU in this configuration.
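As an illustration of the kind of translation listed above, here is a minimal sketch of a host-side compatibility layer. It is not taken from the Allen sources: the `Dim3` type, the simplified `cudaMemcpy` / `cudaMemset` / `atomicAdd` stand-ins (no error codes, `int` only), and the `count_positive` kernel are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical CPU stand-ins for CUDA constructs (illustrative only).
// Qualifiers mean nothing on the host, so they expand to nothing.
// Double-underscore names are reserved in C++; they are reused here
// only to mimic the CUDA spelling.
#define __global__
#define __shared__

// Simplified three-component index type (assumed name).
struct Dim3 {
  unsigned x, y, z;
};

// One instance of each built-in, created once and kept for the whole run.
// The single CPU "thread" is thread 0 of a block of size 1.
Dim3 threadIdx {0, 0, 0};
Dim3 blockIdx {0, 0, 0};
Dim3 blockDim {1, 1, 1};
Dim3 gridDim {1, 1, 1};

enum cudaMemcpyKind { cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost };

// Host and "device" share the same memory, so copies are plain memcpy.
inline void cudaMemcpy(void* dst, const void* src, std::size_t count, cudaMemcpyKind)
{
  std::memcpy(dst, src, count);
}

inline void cudaMemset(void* ptr, int value, std::size_t count)
{
  std::memset(ptr, value, count);
}

// Single-threaded replacement: adds and returns the old value, as CUDA does.
inline int atomicAdd(int* address, int value)
{
  const int old = *address;
  *address += value;
  return old;
}

// With one thread per block there is nothing to synchronise.
inline void __syncthreads() {}

// A kernel written in CUDA style that now builds with the host compiler.
__global__ void count_positive(const int* input, unsigned n, int* counter)
{
  constexpr unsigned tile_size = 8;
  assert(n <= tile_size);
  __shared__ int tile[tile_size];
  for (unsigned i = threadIdx.x; i < n; i += blockDim.x) {
    tile[i] = input[i];
  }
  __syncthreads();
  for (unsigned i = threadIdx.x; i < n; i += blockDim.x) {
    if (tile[i] > 0) atomicAdd(counter, 1);
  }
}

// Host-side driver: "copies" data to the device and runs the kernel once.
inline int run_count_positive()
{
  const int host_data[5] = {3, -1, 4, 0, 5};
  int device_data[5];
  cudaMemcpy(device_data, host_data, sizeof(host_data), cudaMemcpyHostToDevice);
  int counter;
  cudaMemset(&counter, 0, sizeof(counter));
  count_positive(device_data, 5, &counter);
  return counter;
}
```

Calling `run_count_positive()` executes the kernel as an ordinary function call, which is the sense in which the device becomes the CPU.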
`-DCPU` defines the preprocessor variable `CPU`, which can be used with an `ifdef` in various sections of the code to optimize for CPU. To avoid polluting the code, this has only been used to remove the additional copies in the prefix sum.
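A minimal sketch of how such an `ifdef` can select a CPU-specific path follows. The function `exclusive_prefix_sum` and its staging buffer are hypothetical and only stand in for the idea of skipping an extra copy; they do not reproduce the actual prefix sum code in Allen.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical exclusive prefix sum: turns a list of sizes into offsets.
// Both branches return the same result; only the data movement differs.
inline std::vector<unsigned> exclusive_prefix_sum(const std::vector<unsigned>& sizes)
{
#ifdef CPU
  // CPU build: write the offsets directly, with no staging copy.
  std::vector<unsigned> offsets(sizes.size());
  unsigned running = 0;
  for (std::size_t i = 0; i < sizes.size(); ++i) {
    offsets[i] = running;
    running += sizes[i];
  }
  return offsets;
#else
  // Default build: emulate an additional staging copy of the input,
  // then scan the copy in place.
  std::vector<unsigned> staging(sizes);
  unsigned running = 0;
  for (std::size_t i = 0; i < staging.size(); ++i) {
    const unsigned size = staging[i];
    staging[i] = running;
    running += size;
  }
  return staging;
#endif
}
```

Keeping the `ifdef` inside a single function, as in this sketch, confines the CPU-specific branch to one place instead of spreading it across callers.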
In order to support the previous CUDA code, all `for` loops and `if` statements have been converted to block dimension-strided `for` loops. This has two benefits:
- By construction, the code now runs with any number of threads per block in the configuration (more robust and flexible).
- As a result, it also runs with a single thread, which is a hard requirement for the CPU configuration.
As a consequence, this introduces a good practice in Allen: every loop or conditional whose execution depends on threadIdx must be written as a block dimension-strided loop. Code that meets this requirement can, conceptually, be processed by the CPU. This MR ensures the condition is met throughout the Allen codebase.
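The pattern can be sketched as follows. This is an illustration under assumptions, not code from Allen: `ThreadContext` is a hypothetical stand-in for `threadIdx.x` and `blockDim.x`, and `run_block` emulates a block by running its threads one after another on the host.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for threadIdx.x and blockDim.x.
struct ThreadContext {
  unsigned thread_idx;
  unsigned block_dim;
};

// CUDA-style original, correct only when the block has at least n threads:
//   if (threadIdx.x < n) output[threadIdx.x] = 2 * input[threadIdx.x];

// Block dimension-strided version: each thread starts at its own index
// and advances by the block size, so any block size covers all n elements.
inline void scale_by_two(const ThreadContext& ctx, const int* input, int* output, unsigned n)
{
  for (unsigned i = ctx.thread_idx; i < n; i += ctx.block_dim) {
    output[i] = 2 * input[i];
  }
}

// Emulates one block by executing its threads sequentially.
inline std::vector<int> run_block(unsigned block_dim, const std::vector<int>& input)
{
  assert(block_dim > 0);
  std::vector<int> output(input.size(), 0);
  const unsigned n = static_cast<unsigned>(input.size());
  for (unsigned t = 0; t < block_dim; ++t) {
    scale_by_two(ThreadContext {t, block_dim}, input.data(), output.data(), n);
  }
  return output;
}
```

With `block_dim` set to 1 the loop degenerates into a plain sequential loop over all elements, which corresponds to the single-thread case the CPU configuration relies on.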
The CPU version has been tested on processors of several architectures: x86, Power8 and ARM. The results are similar on all architectures (including GPU), but there are differences attributable to differing processor optimizations. Note that the degree of difference between CPUs is similar to that between CPUs and the GPU. Four BsPhiPhi runs on x86, Power8, ARM and GPU are attached for comparison:
- allen_result_bsphiphi_ibmpower8.txt
- allen_result_bsphiphi_thunderx2.txt
- allen_result_bsphiphi_x86_64.txt
- bsphiphi_result_gpu.txt
Some other byproducts of this cleanup:
- The output of the sequence is more homogeneous.
- The code has become slightly faster on the Quadro RTX 6000.
- By default, Tensor Cores are disabled. They can be re-enabled with `-DTENSOR=ON`.
- Added a script that auto-detects the CPU architecture in CMake, automatically changing the option `-march` into `-mcpu` depending on the underlying architecture.