WIP: New algorithm definition
This MR changes (almost completely) the way to define Allen algorithms. It introduces the following functionality:
Arguments:
- `Argument` files are (will be) gone.
- Algorithm arguments are specified alongside the algorithm definition.
- Host and device arguments are supported and managed by separate memory manager instances.
- An argument is defined as a struct (type): it has an inner type, it is either host or device, and it is either input or output.
- The argument itself can be used in a function definition, removing the need to specify the argument's type repeatedly.
- Arguments are safer: the type and whether it is input or output are carried over. An input argument is by definition `const`.

I.e.:
```cpp
namespace velo_estimate_input_size {
  // Arguments
  HOST_INPUT(host_number_of_selected_events_t, uint)
  DEVICE_INPUT(dev_event_list_t, uint)
  DEVICE_OUTPUT(dev_velo_raw_input_t, char)
  DEVICE_OUTPUT(dev_velo_raw_input_offsets_t, uint)
  DEVICE_OUTPUT(dev_estimated_input_size_t, uint)
  DEVICE_OUTPUT(dev_module_candidate_num_t, uint)
  DEVICE_OUTPUT(dev_cluster_candidates_t, uint)

  // Global function
  __global__ void velo_estimate_input_size(
    dev_velo_raw_input_t dev_velo_raw_input,
    dev_velo_raw_input_offsets_t dev_velo_raw_input_offsets,
    dev_estimated_input_size_t dev_estimated_input_size,
    dev_module_candidate_num_t dev_module_candidate_num,
    dev_cluster_candidates_t dev_cluster_candidates,
    dev_event_list_t dev_event_list,
    uint8_t* candidate_ks);

  ...
```
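For orientation, the sketch below shows roughly what such an argument macro could expand to; the tag names (`device_datatype`, `output_datatype`) and the exact macro body are assumptions for illustration, not the actual implementation. The point is that each argument is a plain struct carrying its inner type plus host/device and input/output properties, which the framework can later recover by inspecting what the struct inherits from.

```cpp
// Illustrative sketch only: tag names and macro body are assumptions,
// not the actual Allen definitions.
struct device_datatype {};
struct output_datatype {};

#define DEVICE_OUTPUT(ARGUMENT_NAME, ARGUMENT_TYPE)                        \
  struct ARGUMENT_NAME : device_datatype, output_datatype {                \
    using type = ARGUMENT_TYPE;                                            \
    ARGUMENT_TYPE* m_data;                                                 \
    __host__ __device__ operator ARGUMENT_TYPE*() const { return m_data; } \
  };
```

Because the struct converts to a pointer of its inner type, it can be used directly as a parameter type of the `__global__` function above, which is what removes the need to repeat the inner type.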
Algorithms:
- Handlers are gone. Visitors are gone.
- Algorithms are now defined as a struct that inherits from either `DeviceAlgorithm` or `HostAlgorithm`. The only difference is that `DeviceAlgorithm` has a built-in configurable `grid_dimension` and `block_dimension` that can be used (similarly to `Handlers` versus `CpuHandlers` before).
- Algorithms are expected to have a name (`constexpr auto name {"some name"}`).
- Algorithms can store any number of function wrappers. Each function wrapper can either be a `host_function` or a `global_function`.
- Algorithms are expected to have a `set_arguments_size` and an `operator()`. Similarly to the visitors before them, `set_arguments_size` is expected to set the size of the output arguments (see the sketch after the example below). `operator()` is where the algorithm's body is expected (a name similar to Gaudi). Function wrappers can be used to invoke the functions (more later).
- Algorithms are now templated. They expect an `std::tuple` with argument types. These argument types are accessed by checking for the elements they inherit from (more later).

I.e.:
```cpp
// Algorithm
template<typename Arguments>
struct velo_estimate_input_size_t : public DeviceAlgorithm {
  constexpr static auto name {"velo_estimate_input_size_t"};
  decltype(global_function(velo_estimate_input_size)) function {velo_estimate_input_size};

  void set_arguments_size(
    ArgumentRefManager<Arguments> arguments,
    const RuntimeOptions& runtime_options,
    const Constants& constants,
    const HostBuffers& host_buffers) const
  {
    ...
  }

  void operator()(
    const ArgumentRefManager<Arguments>& arguments,
    const RuntimeOptions& runtime_options,
    const Constants& constants,
    HostBuffers& host_buffers,
    cudaStream_t& cuda_stream,
    cudaEvent_t& cuda_generic_event) const
  {
    ...
  }
};
```
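As a rough illustration of what `set_arguments_size` is expected to do, the body simply sets the number of elements of each output argument. The `set_size` helper name and the sizes below are assumptions used for illustration; only `offset<T>(arguments)` appears elsewhere in this MR.

```cpp
// Illustrative sketch only: set_size<T> is assumed to be the helper that
// records an output argument's element count; the real name may differ.
void set_arguments_size(
  ArgumentRefManager<Arguments> arguments,
  const RuntimeOptions& runtime_options,
  const Constants& constants,
  const HostBuffers& host_buffers) const
{
  const auto number_of_events = offset<host_number_of_selected_events_t>(arguments)[0];
  const uint modules_per_event = 52; // illustrative constant

  // Sizes are purely illustrative.
  set_size<dev_module_candidate_num_t>(arguments, number_of_events);
  set_size<dev_estimated_input_size_t>(arguments, number_of_events * modules_per_event + 1);
}
```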
Function wrappers:
- Function wrappers have been simplified. There are no more `set_arguments` nor `set_opts`.
- `CpuFunction`s are (will be) now called `HostFunction`s (which is more correct).
- `GpuFunction`s are (will be) now called `GlobalFunction`s (`DeviceFunction` unfortunately has another meaning in CUDA terminology).
- In order to make a function wrapper, the helpers `host_function` and `global_function` can be used.
- Function wrappers only provide `operator()`. In the case of a `HostFunction`, this invokes the function (see the sketch after the example below).
- In the case of a `GlobalFunction`, `operator()` accepts the grid dimension, block dimension and CUDA stream, and returns a closure to the actual function. This mimics the syntax of CUDA kernel launches.

I.e.:
```cpp
// Algorithm
template<typename Arguments>
struct velo_estimate_input_size_t : public DeviceAlgorithm {
  constexpr static auto name {"velo_estimate_input_size_t"};
  decltype(global_function(velo_estimate_input_size)) function {velo_estimate_input_size};

  // [...]

  void operator()(
    const ArgumentRefManager<Arguments>& arguments,
    const RuntimeOptions& runtime_options,
    const Constants& constants,
    HostBuffers& host_buffers,
    cudaStream_t& cuda_stream,
    cudaEvent_t& cuda_generic_event) const
  {
    // [...]

    // Invoke kernel
    function(dim3(offset<host_number_of_selected_events_t>(arguments)[0]), block_dimension(), cuda_stream)(
      offset<dev_velo_raw_input_t>(arguments),
      offset<dev_velo_raw_input_offsets_t>(arguments),
      offset<dev_estimated_input_size_t>(arguments),
      offset<dev_module_candidate_num_t>(arguments),
      offset<dev_cluster_candidates_t>(arguments),
      offset<dev_event_list_t>(arguments),
      constants.dev_velo_candidate_ks.data());
  }
};
```
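For contrast, a `HostFunction` wrapper is invoked directly, without the (grid, block, stream) closure. A minimal sketch, where the wrapped function and the argument names are hypothetical; only the `host_function` helper itself is part of this MR:

```cpp
// Illustrative sketch only: copy_size_to_host and the argument names are hypothetical.
void copy_size_to_host(uint* host_total_size, const uint* dev_offsets, size_t n);

decltype(host_function(copy_size_to_host)) function {copy_size_to_host};

// Direct invocation: no grid dimension, block dimension or stream closure,
// unlike the GlobalFunction call shown above.
function(
  offset<host_total_size_t>(arguments),
  offset<dev_offsets_t>(arguments),
  1);
```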
Configurable parameters to algorithms:
- Algorithms must now be configured with parameters.
- Parameters can be organized in any manner desired.
- Host and device arguments can be used.
- Parameters are picked up according to whether they inherit from the expected type.
- For instance, as a consequence, it is now possible to have only one prefix sum algorithm and reuse it with different parameters (see the sketch after the example below).

I.e.:
```cpp
ARG(host_event_list_t, host_global_event_cut::host_event_list_t)
ARG(host_number_of_selected_events_t,
  host_global_event_cut::host_number_of_selected_events_t,
  velo_estimate_input_size::host_number_of_selected_events_t,
  velo_consolidate_tracks::host_number_of_selected_events_t)

...

SEQUENCE_T(
  host_global_event_cut::host_global_event_cut_t<
    std::tuple<dev_event_list_t, host_event_list_t, host_number_of_selected_events_t>>,
  velo_estimate_input_size::velo_estimate_input_size_t<std::tuple<
    host_number_of_selected_events_t,
    dev_velo_raw_input_t,
    dev_velo_raw_input_offsets_t,
    dev_estimated_input_size_t,
    dev_module_candidate_num_t,
    dev_cluster_candidates_t,
    dev_event_list_t>>,
  ...
```
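To illustrate the reuse point above, the same templated prefix sum algorithm could appear several times in the sequence with different argument tuples. The algorithm and argument names below are hypothetical:

```cpp
// Illustrative sketch only: algorithm and argument names are hypothetical.
SEQUENCE_T(
  ...,
  host_prefix_sum::host_prefix_sum_t<std::tuple<
    host_total_number_of_velo_clusters_t,
    dev_estimated_input_size_t>>,
  ...,
  host_prefix_sum::host_prefix_sum_t<std::tuple<
    host_total_number_of_ut_hits_t,
    dev_ut_hit_offsets_t>>,
  ...)
```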
Host and device memory managers:
- Two memory managers are now initialized. By default, the host memory manager has 10 MB.
- So far, the host memory manager is only used for single `uint`s which store the size of values (i.e. the total size of each prefix sum in the sequence).
- Only a single object of type `ArgumentRefManager<Arguments>` is passed to set sizes and to read offsets or sizes.
- The choice of memory manager is performed through SFINAE (see the sketch below).
- The current design is extensible and allows more memory managers to be added (i.e. for types used only when validation is on).
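A minimal sketch of the SFINAE-based choice, assuming tag types named `host_datatype` and `device_datatype` (all names here are illustrative, not necessarily the ones in the code): the base class an argument inherits from determines which memory manager serves it.

```cpp
#include <type_traits>

// Illustrative sketch only: tag and manager names are assumptions.
struct host_datatype {};
struct device_datatype {};

struct HostMemoryManager {};    // reserves from the (default 10 MB) host pool
struct DeviceMemoryManager {};  // reserves from the device pool

// Select the memory manager based on the tag an argument type inherits from.
template<typename Argument, typename Enable = void>
struct MemoryManagerFor;

template<typename Argument>
struct MemoryManagerFor<
  Argument,
  std::enable_if_t<std::is_base_of_v<host_datatype, Argument>>> {
  using type = HostMemoryManager;
};

template<typename Argument>
struct MemoryManagerFor<
  Argument,
  std::enable_if_t<std::is_base_of_v<device_datatype, Argument>>> {
  using type = DeviceMemoryManager;
};
```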
Even though the MR is rather big, I don't see how this functionality could be added in smaller chunks without making it more confusing to developers.
Thanks a lot to @clemenci, @frankm and @nnolte for the discussions on various design aspects.
TODO before merging:
- Move all algorithms to templated syntax.
- Generate a default sequence configuration.
- Verify the immutability condition and adapt the necessary algorithms (i.e. any parameter is an output parameter only once).
- Check at compile time that all outputs precede any input use.
Separate items that could spin off from this MR:
- Expose the configuration to the developer in an easy-to-configure way.