Split device memory manager into a static and dynamic part
Once #500 is done, the only remaining communications between the host and device within the sequence (excluding the banks transfer at the start, and output transfer at the end) would be the sizes of the buffers (device->host) used for dynamic allocations.
These transfers are tiny (usually 4 bytes), cannot be batched, and cause unnecessary synchronisations.
Since we already use a pre-allocated memory pool for all device allocations, it does not really matter whether an allocation within the pool is made on the host or on the device. A very simple device-side allocator can be implemented and used as follows:
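struct DumbDeviceAllocator {
  char* m_base_ptr{};            // start of the pre-allocated pool
  unsigned m_next_available{0};  // offset (in bytes) of the first free byte

  // Reserve room for `size` elements of type T and return a pointer to it.
  // atomicAdd returns the previous offset, so concurrent callers get disjoint ranges.
  template<typename T>
  __device__ T* alloc(unsigned size) {
    return reinterpret_cast<T*>(m_base_ptr + atomicAdd(&m_next_available, static_cast<unsigned>(sizeof(T)) * size));
  }
};
__device__ DumbDeviceAllocator allocator;

__global__ void alloc_kernel(unsigned** myData, unsigned* offsets, unsigned n_offsets) {
  // A single thread performs the allocation for the whole grid.
  if (threadIdx.x == 0 && blockIdx.x == 0) *myData = allocator.alloc<unsigned>(offsets[n_offsets]);
}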
A more complex design could also handle freeing segments for reuse.
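As a rough illustration of what "freeing segments for reuse" could look like, here is a minimal sketch of a bump allocator with per-size-class free lists. It is a single-threaded, host-side C++ model so that it can be tried without a GPU. All names (`SizeClassAllocator`, `release`, `n_classes`, `max_free_per_class`) are hypothetical and are not part of the existing code. A device version would need atomic operations on both the bump counter and the free-list counters, which this sketch does not attempt.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch: bump allocator that recycles freed segments.
// Sizes are rounded up to a power of two so that a freed segment can
// serve any later request of the same size class.
struct SizeClassAllocator {
  static constexpr unsigned n_classes = 32;           // one class per power of two
  static constexpr unsigned max_free_per_class = 64;  // assumed bound, for illustration only

  char* m_base_ptr;
  unsigned m_capacity;
  unsigned m_next_available{0};
  unsigned m_free_offsets[n_classes][max_free_per_class]{};  // offsets of freed segments
  unsigned m_n_free[n_classes]{};                             // number of freed segments per class

  SizeClassAllocator(char* base, unsigned capacity) : m_base_ptr(base), m_capacity(capacity) {}

  // Index k of the smallest power of two 2^k that is >= size.
  static unsigned size_class(unsigned size) {
    unsigned k = 0;
    while (k < n_classes - 1 && (1u << k) < size) ++k;
    return k;
  }

  // Returns a segment of at least `size` bytes, or nullptr if none is available.
  char* alloc(unsigned size) {
    if (size == 0 || size > (1u << 31)) return nullptr;
    const unsigned k = size_class(size);
    // Prefer a previously released segment of the same class.
    if (m_n_free[k] > 0) return m_base_ptr + m_free_offsets[k][--m_n_free[k]];
    const unsigned rounded = 1u << k;
    if (rounded > m_capacity - m_next_available) return nullptr;
    const unsigned offset = m_next_available;
    m_next_available += rounded;  // on the device this would be an atomicAdd
    return m_base_ptr + offset;
  }

  // Makes a segment available again; `size` must be the size passed to alloc.
  void release(char* ptr, unsigned size) {
    assert(ptr >= m_base_ptr);
    const unsigned k = size_class(size);
    // If the free list is full, the segment stays unused until the pool is reset.
    if (m_n_free[k] < max_free_per_class)
      m_free_offsets[k][m_n_free[k]++] = static_cast<unsigned>(ptr - m_base_ptr);
  }
};
```

Rounding to powers of two wastes up to half of each segment, which is the price paid here for keeping the free lists trivial; whether that trade-off is acceptable for the actual buffers is an open question.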
Not all data has to be allocated from the device: at the very least, the structures receiving the results of device-side allocations need to be allocated from the host. The sizes of these allocations are static (known at configuration time), so a static allocator can be used for them. Effectively, this architecture would split the device memory into two pools: static and dynamic. The maximum size needed for the static pool can be found at compilation time and guaranteed never to be exceeded.
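To make the static pool idea more concrete, the sketch below computes the offset of every static buffer and the total pool size in a `constexpr` function, so the result is available at compile time. It is plain C++ (C++14) and only an illustration: `StaticBuffer`, `StaticLayout`, `make_static_layout`, `static_buffer_address`, the example buffer list, and the 256-byte alignment are all assumptions, not names or values taken from the existing code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of one statically sized buffer.
struct StaticBuffer {
  std::size_t element_size;  // size of one element in bytes
  std::size_t n_elements;    // element count, fixed once the configuration is known
};

// Rounds `offset` up to the next multiple of `alignment` (must be a power of two).
constexpr std::size_t align_up(std::size_t offset, std::size_t alignment) {
  return (offset + alignment - 1) & ~(alignment - 1);
}

// Result of the layout: one offset per buffer, plus the total pool size.
template<std::size_t N>
struct StaticLayout {
  std::size_t offsets[N]{};
  std::size_t total_size{0};
};

// Assigns consecutive, aligned offsets to all buffers.
// The default alignment of 256 bytes is an assumption made for this sketch.
template<std::size_t N>
constexpr StaticLayout<N> make_static_layout(const StaticBuffer (&buffers)[N], std::size_t alignment = 256) {
  StaticLayout<N> layout{};
  std::size_t cursor = 0;
  for (std::size_t i = 0; i < N; ++i) {
    cursor = align_up(cursor, alignment);
    layout.offsets[i] = cursor;
    cursor += buffers[i].element_size * buffers[i].n_elements;
  }
  layout.total_size = cursor;
  return layout;
}

// Turns an offset from the layout into an address inside the static pool.
template<std::size_t N>
char* static_buffer_address(char* pool_base, const StaticLayout<N>& layout, std::size_t index) {
  assert(index < N);
  return pool_base + layout.offsets[index];
}

// Made-up example: two buffers whose sizes depend only on the batch size.
constexpr std::size_t n_events = 1000;
constexpr StaticBuffer example_buffers[] = {
  {sizeof(unsigned), n_events + 1},  // e.g. one offset per event, plus a final entry
  {sizeof(float), 6 * n_events},     // e.g. a fixed number of values per event
};
constexpr auto example_layout = make_static_layout(example_buffers);
static_assert(example_layout.offsets[1] % 256 == 0, "buffers start on aligned offsets");
```

With 4-byte `unsigned` and `float`, the second buffer starts at offset 4096 and `example_layout.total_size` is 28096 bytes. Because the layout is a constant expression, the pool size can be checked with `static_assert` against whatever budget is chosen for the static pool.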
By defining a static allocation as an allocation that has a constant size (including sizes that depend on the batch size, i.e. the number of events, and on algorithm properties), we get the following proportions for a typical sequence (n_events = 1000):
| Category       | # buffers | Size (MB) |
|----------------|-----------|-----------|
| ALL            | 1391      | 892.04    |
| static host    | 314       | 101.02    |
| static device  | 284       | 333.93    |
| dynamic host   | 57        | 9.64      |
| dynamic device | 402       | 423.20    |
TODO: continue writing the issue
WIP