Enable dynamic shared memory

Dynamic shared memory (https://developer.nvidia.com/blog/using-shared-memory-cuda-cc/) is a feature that until now had no support in Allen. It requires rather specific syntax, and it was not trivial to generate code that is also compatible with the CPU backend.

This MR brings to Allen the ability to define dynamic shared memory buffers. It does so by introducing the following:

  • Allen::KernelInvocationConfiguration is intended to hold kernel invocation configuration. At present, it is only used to support dynamic shared memory. An object of this kind was also useful in the SYCL demonstrator (!443 (closed)), so we may find further uses for it in the future.
  • An attribute of type Allen::KernelInvocationConfiguration can be added to the Parameters of any algorithm. This kernel invocation config is populated automatically when invoking a global kernel function.
  • The macro DYNAMIC_SHARED_MEMORY_BUFFER(__type, __instance, __config) can be used to define a single shared memory buffer of type __type and instance __instance. The __config parameter must be of type Allen::KernelInvocationConfiguration.
  • global kernel invocation now accepts an optional input with the dynamic shared memory size. Invocation now follows the convention: grid dimension, block dimension, context, dynamic shared memory size (optional). It is reasonably close to CUDA's convention of grid dimension, block dimension, dynamic shared memory size (optional), context (optional), but since the context in Allen is not optional, the order of the last two arguments changes (see the sketch after this list).
  • Added a new property hit_window_size to algorithms LFSearchInitialWindows and LFTripletSeeding. LFTripletSeeding now uses dynamic shared memory and becomes the first such example in Allen.
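
For comparison, here is a minimal side-by-side sketch of the two conventions (kernel and argument names are purely illustrative):

  // Native CUDA: grid, block, dynamic shared memory size in bytes (optional), stream (optional)
  some_kernel<<<grid_dim, block_dim, shared_memory_bytes, stream>>>(args);

  // Allen: grid, block, context, dynamic shared memory size in bytes (optional)
  global_function(some_kernel)(grid_dim, block_dim, context, shared_memory_bytes)(args);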

An example:

So far, shared memory buffers (i.e. arrays declared with the __shared__ qualifier in a kernel) were restricted to a static size. As a consequence, configuration settings that would normally be properties had to be compile-time constants.

This was the case for the hit window size in the Forward tracking of Allen. This MR turns that specific case into a property with the aid of this new feature.

To do that, a configuration is first added to the Parameters of the algorithm that requires dynamic shared memory, in this case LFTripletSeeding:

  struct Parameters {
    Allen::KernelInvocationConfiguration config;
    HOST_INPUT(host_number_of_events_t, unsigned) host_number_of_events;
    HOST_INPUT(host_number_of_reconstructed_ut_tracks_t, unsigned) host_number_of_reconstructed_ut_tracks;
    ...
    PROPERTY(hit_window_size_t, "hit_window_size", "maximum hit window size", unsigned) hit_window_size;
  };

Now, config is accessible from within the kernel, and a property hit_window_size_t has been added.
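
For completeness, a sketch of the remaining wiring, assuming Allen's usual Property member convention (the default value of 32 is purely illustrative and not taken from this MR):

  // Inside the LFTripletSeeding algorithm struct
  Property<hit_window_size_t> m_hit_window_size {this, 32};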

The shared memory size must now be set at runtime, so an additional parameter specifying the dynamic shared memory size is passed at invocation. Previously, the invocation was:

  global_function(lf_triplet_seeding)(
    dim3(size<dev_event_list_t>(arguments)),
    dim3(LookingForward::triplet_seeding_block_dim_x, 2),
    context)(
    arguments, constants.dev_looking_forward_constants);

and now becomes:

  global_function(lf_triplet_seeding)(
    dim3(size<dev_event_list_t>(arguments)),
    dim3(LookingForward::triplet_seeding_block_dim_x, 2),
    context,
    3 * 2 * property<hit_window_size_t>() * sizeof(float))(
    arguments, constants.dev_looking_forward_constants);

The size of the dynamic shared memory buffer must be in bytes.

Finally, we can replace the previous statically sized buffer in the global kernel, which looked like:

__shared__ float shared_xs[3 * 2 * LookingForward::max_hit_window_size];

with:

DYNAMIC_SHARED_MEMORY_BUFFER(float, shared_xs, parameters.config)

The same restrictions as for CUDA dynamic shared memory buffers apply; that is, at most one such buffer can be defined per kernel. However, this restriction can be overcome by allocating one big dynamic shared memory buffer and creating pointers into it, as sketched below. For more information, please read https://developer.nvidia.com/blog/using-shared-memory-cuda-cc/ .
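
As a minimal sketch of that technique with the new macro (the buffer name, element counts and the second array are illustrative; the byte count requested at invocation would have to cover both sub-arrays):

  // One dynamic shared memory buffer, carved into two typed views.
  // Put the types with the largest alignment first to keep all views aligned.
  DYNAMIC_SHARED_MEMORY_BUFFER(float, shared_pool, parameters.config)
  float* shared_xs = shared_pool;
  int* shared_ids = reinterpret_cast<int*>(shared_pool + number_of_floats);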

Behind the scenes:

On the CUDA backend, the DYNAMIC_SHARED_MEMORY_BUFFER(float, shared_xs, parameters.config) excerpt expands through the definition:

#define DYNAMIC_SHARED_MEMORY_BUFFER(_type, _instance, _config) extern __shared__ _type _instance[];

On the HIP backend it expands through:

#define DYNAMIC_SHARED_MEMORY_BUFFER(_type, _instance, _config) HIP_DYNAMIC_SHARED(_type, _instance)

while on the CPU backend it expands through:

#define DYNAMIC_SHARED_MEMORY_BUFFER(_type, _instance, _config)                                                   \
  auto _dynamic_shared_memory_buffer = std::vector<_type>(_config.dynamic_shared_memory_size() / sizeof(_type)); \
  auto _instance = _dynamic_shared_memory_buffer.data();

This also means that the variable name _dynamic_shared_memory_buffer is implicitly introduced by the CPU backend, so a user-defined variable with that name in the same scope would clash; the chances of someone choosing that name are hopefully low...
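
As an illustration, applying the CPU definition to the example above is a mechanical substitution and yields:

  auto _dynamic_shared_memory_buffer =
    std::vector<float>(parameters.config.dynamic_shared_memory_size() / sizeof(float));
  auto shared_xs = _dynamic_shared_memory_buffer.data();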

Performance considerations:

Please note that there is a tradeoff between the expressiveness that dynamic shared memory allows, which can be nice in some cases, and the performance of the application. The compiler cannot know in advance the amount of resources the kernel will use, and this may result in worse performance. That is on top of the more obvious change from a constexpr to a runtime variable, which may also affect performance.
