# Backend improvements

This MR heavily refactors the backend of Allen.

## Algorithm changes
- Use C++17's ability to destructure structs into tuples (via structured bindings) instead of Boost::HANA. This means that `Parameters` should now be written as follows:
Old:

```cpp
DEFINE_PARAMETERS(
  Parameters,
  (HOST_INPUT(host_number_of_events_t, unsigned), host_number_of_events),
  (HOST_INPUT(host_number_of_cluster_candidates_t, unsigned), host_number_of_cluster_candidates),
  (DEVICE_INPUT(dev_event_list_t, unsigned), dev_event_list),
  (DEVICE_INPUT(dev_candidates_offsets_t, unsigned), dev_candidates_offsets),
  (DEVICE_INPUT(dev_velo_raw_input_t, char), dev_velo_raw_input),
  (DEVICE_INPUT(dev_velo_raw_input_offsets_t, unsigned), dev_velo_raw_input_offsets),
  (DEVICE_OUTPUT(dev_estimated_input_size_t, unsigned), dev_estimated_input_size),
  (DEVICE_OUTPUT(dev_module_candidate_num_t, unsigned), dev_module_candidate_num),
  (DEVICE_OUTPUT(dev_cluster_candidates_t, unsigned), dev_cluster_candidates),
  (PROPERTY(block_dim_t, "block_dim", "block dimensions", DeviceDimensions), block_dim))
```
New:

```cpp
struct Parameters {
  HOST_INPUT(host_number_of_events_t, unsigned) host_number_of_events;
  HOST_INPUT(host_number_of_cluster_candidates_t, unsigned) host_number_of_cluster_candidates;
  DEVICE_INPUT(dev_event_list_t, unsigned) dev_event_list;
  DEVICE_INPUT(dev_candidates_offsets_t, unsigned) dev_candidates_offsets;
  DEVICE_INPUT(dev_velo_raw_input_t, char) dev_velo_raw_input;
  DEVICE_INPUT(dev_velo_raw_input_offsets_t, unsigned) dev_velo_raw_input_offsets;
  DEVICE_OUTPUT(dev_estimated_input_size_t, unsigned) dev_estimated_input_size;
  DEVICE_OUTPUT(dev_module_candidate_num_t, unsigned) dev_module_candidate_num;
  DEVICE_OUTPUT(dev_cluster_candidates_t, unsigned) dev_cluster_candidates;
  PROPERTY(block_dim_t, "block_dim", "block dimensions", DeviceDimensions) block_dim;
};
```
- Apart from the syntax change to a more C++-like one, `PROPERTY`s no longer have to have instances. That means that a `PROPERTY` may now be defined as below (note the absence of `block_dim`, an instance of that property, before the semicolon). It is therefore now possible for `PROPERTY`s that are not going to be used in a kernel to not be passed by value to the kernel.
```cpp
struct Parameters {
  ...
  PROPERTY(block_dim_t, "block_dim", "block dimensions", DeviceDimensions);
};
```
- The above has allowed all `std::string` properties to become instance-less; having `std::string` instances was not desirable for GPU kernels.
## Framework changes
The above transformation also required some framework changes. The recent developments done for the SYCL branch have been refactored, and good practices that are independent of the target code have been adopted. These developments, in turn, make the framework better designed and more easily translatable to SYCL in the future, should we decide to do so.
- `BackendCommonInterface` defines a common interface instead of using raw CUDA calls in the framework. Note that this does not affect the kernels, which are still written in CUDA. Here is the new interface:
```cpp
namespace Allen {
  // Holds an execution context. An execution context allows kernels
  // to be executed in parallel, and provides a way for execution
  // to be stopped.
  struct Context;

  // Memcpy kind used in memory transfers, analogous to cudaMemcpyKind
  enum memcpy_kind {
    memcpyHostToHost,
    memcpyHostToDevice,
    memcpyDeviceToHost,
    memcpyDeviceToDevice,
    memcpyDefault
  };

  enum host_register_kind {
    hostRegisterDefault,
    hostRegisterPortable,
    hostRegisterMapped
  };

  enum class error {
    success,
    errorMemoryAllocation
  };

  void malloc(void** devPtr, size_t size);
  void malloc_host(void** ptr, size_t size);
  void memcpy(void* dst, const void* src, size_t count, enum memcpy_kind kind);
  void memcpy_async(void* dst, const void* src, size_t count, enum memcpy_kind kind, const Context& context);
  void memset(void* devPtr, int value, size_t count);
  void memset_async(void* ptr, int value, size_t count, const Context& context);
  void free_host(void* ptr);
  void free(void* ptr);
  void synchronize(const Context& context);
  void device_reset();
  void peek_at_last_error();
  void host_unregister(void* ptr);
  void host_register(void* ptr, size_t size, enum host_register_kind flags);
} // namespace Allen
```
- An `Allen::Context` is used in all algorithms instead of a stream and an event. The `Context` holds a stream and an event behind the scenes when compiling for CUDA/HIP.
- The store of Allen has been moved from the parameters to an `std::array<ArgumentData, N>`. The indexing of parameters into this store is based on the parameter order, which is guaranteed and kept in the parsing / code generation steps. This has the benefit that `Parameters` objects are now trivially copyable, that the default implementation of the store makes sense (as opposed to `host_datatype` and `device_datatype` having virtual placeholders like before), and that the generated sequences are much shorter, since parameters don't have to override the behaviour of their stores. `ArgumentData` is a type-erased store for any datatype, and it looks as follows:
```cpp
struct ArgumentData {
private:
  char* m_base_pointer = nullptr;
  size_t m_size = 0;
  std::string m_name = "";

public:
  virtual char* pointer() const { return m_base_pointer; }
  virtual size_t size() const { return m_size; }
  virtual std::string name() const { return m_name; }
  virtual void set_pointer(char* pointer) { m_base_pointer = pointer; }
  virtual void set_size(size_t size) { m_size = size; }
  virtual void set_name(const std::string& name) { m_name = name; }
  virtual ~ArgumentData() {}
};
```
- `ArgumentData` is `virtual` in preparation for the Allen-Gaudi automatic algorithm conversion.
- Names of parameters are set at runtime, similarly to how algorithm names are set.
- The store functions and function names have been refactored: struct types have been given explicit visibility (private, public), and methods have more meaningful names (eg. `pointer` when it refers to a pointer).
- `HostFunction` and `GlobalFunction` have been refactored into `TargetFunction` and `TransformParameters`, which deal with the target functions and implement the functionality to transform arguments into parameters, respectively.
- General improvements to the logic of SchedulerMachinery, and cleanup of repeated functionality in TupleTools (`index_of` was implemented the same as `TupleContains`).
- Lines no longer pass `this` to the global function call. Instead, all calls to functions have been converted to static calls, using the `Derived` datatype.
- Global and host function calls no longer pass `this`. Instead, they pass their `m_properties` object, since the object instance was only needed to access its properties.
## Rework of memory managers
- Memory managers now have proper visibility in their members.
- An `m_name` member better identifies each memory manager and is used when printing its state (eg. when invoking `./Allen -p 1`).
- Memory managers now dispatch pointers when reserving (as opposed to an offset like before). This better decouples the functionality from Argument Managers, which used to store the base pointers. Now, memory managers store (if relevant) the base pointers.
- Memory managers have become templated: one can choose between a `Host` or `Device` memory manager, and between a `SingleAlloc` or `MultiAlloc` memory manager.
  - `SingleAlloc` memory managers act like before.
  - `MultiAlloc` memory managers dispatch malloc or free calls to the underlying memory manager implementation. This is rather slow, but it is much better for finding out-of-bound writes. The `MultiAlloc` memory managers can be enabled with the new CMake option `MALLOC_ENGINE`: configuring with `-DMALLOC_ENGINE=MULTI_ALLOC` will trigger using the new memory manager.