Draft: Add the option to use multiple GPUs on a single machine.
Distributing the beam dt, dE and id arrays across several GPUs allows allocating more memory than a single GPU provides. This should make simulations with many bunches and many macro particles feasible in a reasonable time.
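As an illustration of the idea only, here is a minimal sketch of how the particle indices of dt, dE and id could be split into contiguous chunks, one per GPU. The helper name `split_particles` is hypothetical and not part of this PR; the actual implementation may partition the arrays differently.

```python
def split_particles(n_particles, n_gpus):
    """Return one (start, stop) index range per GPU.

    Hypothetical helper for illustration: the ranges are contiguous,
    cover all particles, and differ in size by at most one particle.
    """
    base, extra = divmod(n_particles, n_gpus)
    ranges = []
    start = 0
    for gpu in range(n_gpus):
        # The first `extra` GPUs take one additional particle each.
        stop = start + base + (1 if gpu < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


# Example: 10 macro particles on 4 GPUs.
ranges = split_particles(10, 4)
```

Each GPU would then hold only its own slice of dt, dE and id, e.g. `dt[start:stop]`, so no device needs the full arrays.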
With 4 GPUs of 15 GB VRAM each, this should allow simulations with 10^9 macro particles.
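A rough back-of-the-envelope check of this estimate, as a sketch under stated assumptions: it assumes dt and dE are stored as 64-bit floats and id as a 64-bit integer (24 bytes per macro particle), and it counts only these three arrays. The function name `beam_bytes_per_gpu` is hypothetical, and the real dtypes and any temporary arrays may change the numbers.

```python
def beam_bytes_per_gpu(n_particles, n_gpus, bytes_per_particle=24):
    """Estimate the bytes needed per GPU for the dt, dE and id arrays.

    Assumption (not taken from the PR): 8 bytes each for dt, dE and id.
    """
    # Ceiling division: the largest chunk any single GPU has to hold.
    particles_per_gpu = -(-n_particles // n_gpus)
    return particles_per_gpu * bytes_per_particle


n_particles = 10**9
n_gpus = 4
vram_bytes = 15 * 10**9  # 15 GB per GPU, using 1 GB = 10^9 bytes

per_gpu_bytes = beam_bytes_per_gpu(n_particles, n_gpus)
fits = per_gpu_bytes <= vram_bytes
print(f"{per_gpu_bytes / 1e9:.1f} GB per GPU, fits: {fits}")
```

Under these assumptions the three beam arrays take about 6 GB of each 15 GB device, which leaves the remainder for other simulation data and temporary arrays.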
(Some edge cases might still allocate cloned arrays of dt etc., leading to errors, but none are known so far.)