No separable compilation and AMD compatibility
This MR makes the code use "no separable compilation". This has profound consequences:
- All classes, structs, functions, etc. must be defined within each compilation unit.
- Any functionality provided by a separate package must be provided within header files.
- Non-templated functions defined in header files to be used elsewhere must be declared inline.
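As a minimal sketch of what the last point means in practice, the header-style snippet below defines one non-templated function and one templated function. All names in it (`EXAMPLE_HOST_DEVICE`, `example::pack_hit`, `example::clamp_to`) are hypothetical and are not taken from Allen; the qualifier macro is only an assumption about how such a header could be shared between host, CUDA and HIP builds.

```cpp
// Hypothetical header content, for illustration only (not Allen code).
#include <cstdint>

// Under hipcc, pull in the HIP runtime header that provides the qualifiers.
#if defined(__HIPCC__)
#include <hip/hip_runtime.h>
#endif

// Expand to host/device qualifiers only when a CUDA or HIP compiler is used,
// so the same header also builds with a plain C++ compiler.
#if defined(__CUDACC__) || defined(__HIPCC__)
#define EXAMPLE_HOST_DEVICE __host__ __device__
#else
#define EXAMPLE_HOST_DEVICE
#endif

namespace example {
  // Non-templated function defined in a header: it is marked `inline` so that
  // every compilation unit including this header may carry its own definition
  // without causing multiple-definition errors at link time.
  EXAMPLE_HOST_DEVICE inline std::uint32_t pack_hit(std::uint16_t station, std::uint16_t channel)
  {
    return (static_cast<std::uint32_t>(station) << 16) | channel;
  }

  // Function templates may already be defined in several compilation units,
  // so no explicit `inline` is required here.
  template<typename T>
  EXAMPLE_HOST_DEVICE T clamp_to(T value, T low, T high)
  {
    return value < low ? low : (value > high ? high : value);
  }
} // namespace example
```

Because each compilation unit sees the full definitions, device code never has to be linked across object files, which is the property that non-separable compilation relies on.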
On the other hand, current compilers for heterogeneous architectures prefer this mode. nvcc performs better (e.g. !369 (merged)). In the case of HIP, no separable compilation is currently needed for the application to run on AMD hardware.
The changes in this branch enable the following:
- HIP compiles, and the Allen framework and all its algorithms run on the tested AMD hardware.
- hipcc works from version 3.3.
- hip-clang works since dev version 3.2.2.
The physics efficiency on AMD hardware has been found to be very similar to that on NVIDIA hardware. The throughput should be studied and improved. Concretely, there seems to be a problem during muon decoding that makes the sequence much slower on AMD hardware.
The forward sequence runs about 3.3x slower than on comparable NVIDIA hardware.
This MR comes on top of !369 (merged), !365 (merged) and !361 (merged).