Draft: Use cpu_cuda library as CPU backend
Use https://gitlab.cern.ch/ahennequ/cpu_cuda as the CPU backend.
Goals:
- Reimplement CUDA behavior in C++: all CUDA code should compile and behave exactly as it does on a GPU
- Reduce friction when implementing CUDA kernels in Allen, especially kernels that use warp intrinsics (removes the need for CPU specializations)
- Decouple the CPU backend from the Allen repository and increase modularity so that it can be reused in other projects
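
To illustrate the warp-intrinsics goal, here is a minimal sketch of how a shuffle-based warp reduction could be emulated in plain C++. It is not the cpu_cuda implementation: `WarpRegister`, `shfl_down` and `warp_reduce_sum` are hypothetical names chosen for this example, and the sketch assumes a full participation mask and a warp size of 32. It mirrors the documented behavior of `__shfl_down_sync`, where a lane whose source lane is out of range keeps its own value.

```cpp
#include <array>
#include <cstddef>

// Hypothetical names for illustration only; this is not the cpu_cuda API.
constexpr std::size_t warp_size = 32;

// One value per lane of the emulated warp.
using WarpRegister = std::array<int, warp_size>;

// Emulates __shfl_down_sync with a full mask: lane i reads the value held by
// lane i + delta. Lanes whose source would be out of range keep their own value.
WarpRegister shfl_down(const WarpRegister& v, unsigned delta) {
  WarpRegister out{};
  for (std::size_t lane = 0; lane < warp_size; ++lane) {
    const std::size_t src = lane + delta;
    out[lane] = src < warp_size ? v[src] : v[lane];
  }
  return out;
}

// Same reduction pattern as a typical CUDA kernel; the total ends up in lane 0.
int warp_reduce_sum(WarpRegister v) {
  for (unsigned delta = warp_size / 2; delta > 0; delta /= 2) {
    const WarpRegister shifted = shfl_down(v, delta);
    for (std::size_t lane = 0; lane < warp_size; ++lane) {
      v[lane] += shifted[lane];
    }
  }
  return v[0];
}
```

With this kind of emulation the reduction loop can be written once, in its CUDA form, instead of maintaining a separate CPU specialization.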
TODO:
- Dispatch blocks across multiple threads to achieve true parallelism
- Add ARM64 support
- Fix all corner cases
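
For the block-dispatch item, one possible approach is sketched below. It is only an assumption about how this could be done, not what cpu_cuda currently does: `launch_blocks` is a hypothetical helper that spreads block indices over a fixed number of host threads in a strided fashion. It assumes blocks are independent of each other, as CUDA requires, and leaves the execution of threads within a block to the kernel callable.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper for illustration only; not part of cpu_cuda.
// Calls kernel(block_idx) once for every block in [0, grid_dim), distributing
// the blocks over at most n_workers host threads.
void launch_blocks(std::size_t grid_dim, std::size_t n_workers,
                   const std::function<void(std::size_t)>& kernel) {
  // Never start more workers than blocks, and always start at least one.
  n_workers = std::max<std::size_t>(1, std::min(n_workers, grid_dim));

  std::vector<std::thread> workers;
  workers.reserve(n_workers);
  for (std::size_t w = 0; w < n_workers; ++w) {
    workers.emplace_back([=, &kernel] {
      // Worker w handles blocks w, w + n_workers, w + 2 * n_workers, ...
      for (std::size_t block = w; block < grid_dim; block += n_workers) {
        kernel(block);
      }
    });
  }
  // Joining all workers plays the role of a device synchronization.
  for (auto& t : workers) t.join();
}
```

A strided assignment keeps the sketch short; a work queue might balance load better when blocks have uneven cost.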