Enable dynamic shared memory
Dynamic shared memory (https://developer.nvidia.com/blog/using-shared-memory-cuda-cc/) is a feature that so far had no support in Allen. It requires quite specific syntax, and it was not trivial to generate code that would also be compatible with the CPU backend.
This MR brings to Allen the ability to define dynamic shared memory buffers. It does so by introducing the following:
- `Allen::KernelInvocationConfiguration` is intended to hold the kernel invocation configuration. At present, it is only used to support dynamic shared memory. This kind of object has been useful for the SYCL demonstrator (!443), so we might find other uses for it in the future.
- An attribute of type `Allen::KernelInvocationConfiguration` can be added to the `Parameters` of any algorithm. This kernel invocation configuration is populated automatically when a global kernel function is invoked.
- The macro `DYNAMIC_SHARED_MEMORY_BUFFER(__type, __instance, __config)` can be used to define a single shared memory buffer of type `__type` and instance `__instance`. The `__config` parameter must be of type `Allen::KernelInvocationConfiguration`.
- Global kernel invocation now accepts an optional input with the dynamic shared memory size. Invocation now follows the convention `grid dimension, block dimension, context, dynamic shared memory size (optional)`. It is reasonably close to CUDA's convention, `grid dimension, block dimension, dynamic shared memory size (optional), stream (optional)`, but since the context in Allen is not optional, the order of the last two arguments changes (see the comparison sketch after this list).
- Added a new property `hit_window_size` to the algorithms LFSearchInitialWindows and LFTripletSeeding. LFTripletSeeding now uses dynamic shared memory and becomes the first such example in Allen.
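For comparison, a minimal sketch of the two invocation conventions; `my_kernel`, its arguments and `shared_memory_bytes` are illustrative placeholders, not Allen code:

```cpp
// Plain CUDA launch: the dynamic shared memory size is the third launch
// parameter and the stream the fourth, both optional.
my_kernel<<<grid_dim, block_dim, shared_memory_bytes, stream>>>(arg0, arg1);

// Allen launch: the context is mandatory, so the optional dynamic shared
// memory size comes last.
global_function(my_kernel)(grid_dim, block_dim, context, shared_memory_bytes)(arg0, arg1);
```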
An example:
So far, shared memory buffers (i.e. variables declared with the `__shared__` qualifier in a kernel) were restricted to a static size. As a consequence, configuration settings that would normally be properties had to be compile-time constants. This was the case for the hit window size in the Forward tracking of Allen. This MR turns that specific case into a property with the aid of the new feature.
In order to do that, first a configuration is required as part of the `Parameters` of the algorithm requiring dynamic shared memory, which is `LFTripletSeeding`:
```cpp
struct Parameters {
  Allen::KernelInvocationConfiguration config;
  HOST_INPUT(host_number_of_events_t, unsigned) host_number_of_events;
  HOST_INPUT(host_number_of_reconstructed_ut_tracks_t, unsigned) host_number_of_reconstructed_ut_tracks;
  ...
  PROPERTY(hit_window_size_t, "hit_window_size", "maximum hit window size", unsigned) hit_window_size;
};
```
Now `config` is accessible from the kernel, and a property `hit_window_size_t` has been added.
The shared memory must be configured at runtime, and for that we need to pass an additional parameter specifying the dynamic shared memory size. Previously, the invocation was:
```cpp
global_function(lf_triplet_seeding)(
  dim3(size<dev_event_list_t>(arguments)),
  dim3(LookingForward::triplet_seeding_block_dim_x, 2),
  context)(
  arguments, constants.dev_looking_forward_constants);
```
and it now becomes:
```cpp
global_function(lf_triplet_seeding)(
  dim3(size<dev_event_list_t>(arguments)),
  dim3(LookingForward::triplet_seeding_block_dim_x, 2),
  context,
  3 * 2 * property<hit_window_size_t>() * sizeof(float))(
  arguments, constants.dev_looking_forward_constants);
```
The size of the dynamic shared memory buffer must be specified in bytes.
Finally, we can replace the previous static buffer in the global kernel, which looked like:
```cpp
__shared__ float shared_xs[3 * 2 * LookingForward::max_hit_window_size];
```
with:
```cpp
DYNAMIC_SHARED_MEMORY_BUFFER(float, shared_xs, parameters.config)
```
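Inside the kernel, the buffer can then be indexed with the runtime property. A hypothetical usage sketch (not the actual LFTripletSeeding code), assuming the property is accessible on the device as `parameters.hit_window_size`:

```cpp
// Hypothetical sketch: cooperatively zero-initialize the
// 3 * 2 * hit_window_size floats using all threads of the 2D block.
const unsigned window_size = parameters.hit_window_size;
const unsigned n_threads = blockDim.x * blockDim.y;
const unsigned thread_id = threadIdx.x + blockDim.x * threadIdx.y;
for (unsigned i = thread_id; i < 3 * 2 * window_size; i += n_threads) {
  shared_xs[i] = 0.f;
}
__syncthreads();
```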
The same restrictions as with CUDA dynamic shared memory buffers apply; that is, at most one such buffer can be defined per kernel. However, this restriction can be overcome by allocating "one big" dynamic shared memory buffer and creating pointers into it. For more information, please read https://developer.nvidia.com/blog/using-shared-memory-cuda-cc/ .
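For instance, in plain CUDA such a partition could look like the following sketch; the kernel name, pointer names and counts are illustrative:

```cpp
__global__ void partition_example(unsigned n_floats, unsigned n_ints)
{
  // One dynamic shared memory buffer, carved into two typed regions. The
  // launch must request n_floats * sizeof(float) + n_ints * sizeof(int) bytes.
  extern __shared__ char shared_buffer[];
  float* shared_xs = reinterpret_cast<float*>(shared_buffer);
  int* shared_indices = reinterpret_cast<int*>(shared_buffer + n_floats * sizeof(float));
  // ... use shared_xs[0 .. n_floats) and shared_indices[0 .. n_ints) ...
}
```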
Behind the scenes:
CUDA translates the `DYNAMIC_SHARED_MEMORY_BUFFER(float, shared_xs, parameters.config)` excerpt to:
```cpp
#define DYNAMIC_SHARED_MEMORY_BUFFER(_type, _instance, _config) extern __shared__ _type _instance[];
```
HIP translates it to:
```cpp
#define DYNAMIC_SHARED_MEMORY_BUFFER(_type, _instance, _config) HIP_DYNAMIC_SHARED(_type, _instance)
```
while the CPU backend translates it to:
```cpp
#define DYNAMIC_SHARED_MEMORY_BUFFER(_type, _instance, _config)                                                  \
  auto _dynamic_shared_memory_buffer = std::vector<_type>(_config.dynamic_shared_memory_size() / sizeof(_type)); \
  auto _instance = _dynamic_shared_memory_buffer.data();
```
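On the CPU backend, the example above therefore expands to (approximately):

```cpp
// CPU expansion of DYNAMIC_SHARED_MEMORY_BUFFER(float, shared_xs, parameters.config):
// a heap-backed buffer whose size comes from the invocation configuration.
auto _dynamic_shared_memory_buffer =
  std::vector<float>(parameters.config.dynamic_shared_memory_size() / sizeof(float));
auto shared_xs = _dynamic_shared_memory_buffer.data();
```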
This also means that the variable name `_dynamic_shared_memory_buffer` is implicitly used, but the chances of someone naming a variable like that are hopefully low...
Performance considerations:
Please note that there is a tradeoff between the expressiveness that dynamic shared memory allows, which can be nice in some cases, and the performance of the application. The compiler cannot know in advance the amount of resources the kernel will use, and this may result in worse performance. That is on top of the more obvious change from a `constexpr` to a runtime variable, which may also affect performance.
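To illustrate the second point, a compile-time bound lets the compiler fully unroll a loop, while a runtime property generally does not (a minimal sketch, not Allen code):

```cpp
// With a constexpr bound, the trip count is known at compile time,
// so the loop can be fully unrolled and constant-folded:
constexpr unsigned max_hit_window_size = 32;
for (unsigned i = 0; i < max_hit_window_size; ++i) {
  // ... fixed trip count ...
}

// With a runtime property, the trip count is only known at runtime,
// so these optimizations are generally lost:
const unsigned hit_window_size = parameters.hit_window_size;
for (unsigned i = 0; i < hit_window_size; ++i) {
  // ... variable trip count ...
}
```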