Added compatibility with CPU (!196) · Merge requests · LHCb / Allen

Daniel Campora Perez requested to merge dcampora_cpu_version into master Aug 22, 2019

This MR introduces the option -DCPU. Activating the option (cmake -DCPU=ON ..) produces the following:

All code is compiled with the host compiler.
No CUDA libraries or compiler are necessary.
All device code is compiled as a corresponding CPU version: cudaMemcpy, cudaMemset and atomicAdd are translated; __syncthreads(), __global__ and __shared__ are removed; blockDim, gridDim, blockIdx and threadIdx are translated, with one instance of each created and kept for the lifetime of the application; and so on.
Effectively, this means the device is treated as the CPU in this configuration.
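To illustrate what such a translation layer can look like, here is a minimal sketch in plain C++. It is not the header added by this MR: the type definitions, the return values and the example kernel count_positive are assumptions made for illustration, and only the CUDA names listed above are taken from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// CUDA qualifiers and barriers expand to nothing on the CPU.
// (Sketch only: a real shim would live in a dedicated header.)
#define __global__
#define __shared__
#define __syncthreads()

// Stand-in for CUDA's dim3; every component defaults to 1.
struct dim3 {
  unsigned x, y, z;
  dim3(unsigned x_ = 1, unsigned y_ = 1, unsigned z_ = 1) : x(x_), y(y_), z(z_) {}
};

// One instance of each built-in, kept for the whole application.
// A single CPU "thread" means blockDim = 1 and threadIdx = 0.
dim3 blockDim, gridDim;
dim3 threadIdx(0, 0, 0), blockIdx(0, 0, 0);

// Assumed translations: host and "device" share the same memory.
enum cudaMemcpyKind { cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost };

inline int cudaMemcpy(void* dst, const void* src, std::size_t count, cudaMemcpyKind) {
  std::memcpy(dst, src, count);
  return 0;
}

inline int cudaMemset(void* ptr, int value, std::size_t count) {
  std::memset(ptr, value, count);
  return 0;
}

// No other thread can interfere, so a plain read-modify-write is enough.
template <typename T>
inline T atomicAdd(T* address, T val) {
  T old = *address;
  *address += val;
  return old;
}

// Hypothetical kernel, written exactly as it would be for CUDA.
__global__ void count_positive(const int* values, unsigned n, unsigned* count) {
  for (unsigned i = threadIdx.x; i < n; i += blockDim.x) {
    if (values[i] > 0) atomicAdd(count, 1u);
  }
  __syncthreads();
}

// Hypothetical host-side driver: copy in, run the kernel as a function, copy out.
inline unsigned run_count_positive(const std::vector<int>& host_values) {
  std::vector<int> dev_values(host_values.size());
  unsigned dev_count;
  cudaMemcpy(dev_values.data(), host_values.data(),
             host_values.size() * sizeof(int), cudaMemcpyHostToDevice);
  cudaMemset(&dev_count, 0, sizeof(dev_count));
  count_positive(dev_values.data(), static_cast<unsigned>(dev_values.size()), &dev_count);
  unsigned host_count = 0;
  cudaMemcpy(&host_count, &dev_count, sizeof(host_count), cudaMemcpyDeviceToHost);
  return host_count;
}
```

With a shim of this kind the kernel source itself stays unchanged. Only the definitions behind the CUDA names differ between the two builds.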

-DCPU defines the preprocessor variable CPU, which can be used with #ifdef in various sections of the code to optimize for CPU. To avoid polluting the code, this MR uses it only to remove the additional copies in the prefix sum.
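As a rough sketch of how the CPU variable can guard a CPU-specific path, the hypothetical helper below computes an exclusive prefix sum. The function name and the staging buffer are assumptions for illustration and do not reproduce the Allen prefix sum. The staging copy only stands in for an extra copy that a CPU build can skip.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: turns `data` into its exclusive prefix sum
// and returns the total of the original values.
inline unsigned exclusive_prefix_sum(std::vector<unsigned>& data) {
  unsigned running = 0;
#ifdef CPU
  // CPU build: host and "device" are the same memory, so scan in place.
  for (std::size_t i = 0; i < data.size(); ++i) {
    const unsigned current = data[i];
    data[i] = running;
    running += current;
  }
#else
  // Default build: an illustrative staging copy stands in for the
  // additional copy that the CPU configuration does not need.
  const std::vector<unsigned> staging(data);
  for (std::size_t i = 0; i < data.size(); ++i) {
    data[i] = running;
    running += staging[i];
  }
#endif
  return running;
}
```

Both branches produce the same result, so callers do not need to know which configuration was selected at build time.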

To keep supporting the existing CUDA code, all for loops and if statements that depend on threadIdx have been converted to block dimension-strided for loops. This has two benefits:

By construction, the code now runs with any number of threads per block in the configuration, which makes it more robust and flexible.
As a result, it also runs with a single thread, which is a hard requirement for the CPU configuration.

As a consequence, this MR introduces in Allen the good practice that every for loop or conditional whose execution depends on threadIdx must be written as a block dimension-strided loop. Code that meets this requirement can conceptually be processed by the CPU. This MR ensures that the condition is met throughout the Allen codebase.
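The difference between a threadIdx guard and a block dimension-strided loop can be sketched in plain C++ by passing threadIdx.x and blockDim.x as ordinary parameters. The function names below are hypothetical and the block is emulated by calling the body once per thread, sequentially.

```cpp
#include <cassert>
#include <vector>

// Guarded form (the pattern being replaced): each thread handles at most
// one element, so it is only correct when block_dim >= in.size().
inline void scale_guarded(const std::vector<float>& in, std::vector<float>& out,
                          float factor, unsigned thread_idx) {
  if (thread_idx < in.size()) out[thread_idx] = factor * in[thread_idx];
}

// Block dimension-strided form: thread t handles t, t + block_dim,
// t + 2 * block_dim, ... so every element is covered for any block_dim >= 1.
inline void scale_strided(const std::vector<float>& in, std::vector<float>& out,
                          float factor, unsigned thread_idx, unsigned block_dim) {
  for (unsigned i = thread_idx; i < in.size(); i += block_dim) {
    out[i] = factor * in[i];
  }
}

// Emulates one block of `block_dim` threads by running the body once per thread.
inline std::vector<float> run_block(const std::vector<float>& in, float factor,
                                    unsigned block_dim, bool strided) {
  std::vector<float> out(in.size(), 0.0f);
  for (unsigned t = 0; t < block_dim; ++t) {
    if (strided) {
      scale_strided(in, out, factor, t, block_dim);
    } else {
      scale_guarded(in, out, factor, t);
    }
  }
  return out;
}
```

With five input elements, the strided form gives the same output for block sizes of 1, 3 or 8, while the guarded form only fills the first element when run with a single thread. That single-thread case is the one the CPU configuration relies on.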

The CPU version has been tested on processors of several architectures: x86, Power8 and ARM. The results are similar on all architectures (including GPU), with differences attributable to differing processor optimizations. Note that the degree of difference between CPUs is similar to that between CPUs and the GPU. Four BsPhiPhi runs on x86, Power8, ARM and GPU are attached for comparison.

Some other byproducts of this cleanup:

The output of the sequence is more homogeneous.
The code has become slightly faster on the Quadro RTX 6000.
Tensor Cores are now disabled by default. They can be re-enabled with -DTENSOR=ON.
Added a CMake script that auto-detects the CPU architecture and switches the option -march to -mcpu depending on the underlying architecture.

Edited Aug 26, 2019 by Daniel Campora Perez

