Several fixes for single-precision+vector(+CUDA) mode
This fixes the compilation errors in single-precision+vector mode and all failing ctest tests in Debug mode. Some tests still fail in Release mode: the compilation options we use for nvcc do not produce numerically equivalent results, and this is not (only) due to the use of fast math.
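
The cause of the remaining Release-mode differences is not identified above. One well-known candidate, given here only as an assumption and not as a diagnosis, is fused multiply-add contraction, which nvcc enables by default (`-fmad=true`) independently of fast math. The host-side sketch below shows how fusing a single-precision multiply-add can change the result; the function names are hypothetical and not taken from this code base.

```cpp
#include <cmath>

// Hypothetical helper: multiply, round the product to float, then add.
// The volatile store keeps the compiler from contracting this into an FMA.
float unfused_mul_add(float a, float b, float c) {
    volatile float product = a * b;  // product is rounded to single precision here
    return product + c;
}

// Hypothetical helper: fused multiply-add, a*b+c with a single final rounding.
float fused_mul_add(float a, float b, float c) {
    return std::fma(a, b, c);
}

// Inputs chosen so that the exact product 1 + 2^-11 + 2^-24 is not
// representable in single precision and rounds to 1 + 2^-11.
const float kA = 1.0f + std::ldexp(1.0f, -12);
const float kC = -(1.0f + std::ldexp(1.0f, -11));
```

With these inputs, `unfused_mul_add(kA, kA, kC)` returns exactly `0.0f`, while `fused_mul_add(kA, kA, kC)` returns `2^-24`. If contraction does turn out to contribute, building the CUDA sources with `-fmad=false` would be one way to test that hypothesis.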