
WIP: SYCL support

Daniel Hugo Campora Perez requested to merge dcampora_sycl_2020_dev into master

This MR attempts to add SYCL 2020 support for Allen.

  • Current status: Allen VELO subsequence compiles, runs and gets the expected efficiency.

Core changes

  • A common interface has been defined for memory-related operations. This interface is used instead of raw CUDA calls when allocating, freeing, copying or setting memory, so no calls with "cuda" in their names remain across the framework. The methods provided by the common interface are listed below, followed by a usage sketch:
  // Memcpy kind used in memory transfers, analogous to cudaMemcpyKind
  enum memcpy_kind {
    memcpyHostToHost,
    memcpyHostToDevice,
    memcpyDeviceToHost,
    memcpyDeviceToDevice,
    memcpyDefault
  };

  enum host_register_kind {
    hostRegisterDefault,
    hostRegisterPortable,
    hostRegisterMapped
  };

  enum class error {
    success,
    errorMemoryAllocation
  };

  void malloc(void** devPtr, size_t size);
  void malloc_host(void** ptr, size_t size);
  void memcpy(void* dst, const void* src, size_t count, enum memcpy_kind kind);
  void memcpy_async(void* dst, const void* src, size_t count, enum memcpy_kind kind, const Context& context);
  void memset(void* devPtr, int value, size_t count);
  void memset_async(void* ptr, int value, size_t count, const Context& context);
  void free_host(void* ptr);
  void free(void* ptr);
  void synchronize(const Context& context);
  void device_reset();
  void peek_at_last_error();
  void host_unregister(void* ptr);
  void host_register(void* ptr, size_t size, enum host_register_kind flags);
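As a hypothetical usage sketch only (assuming the interface lives in the Allen namespace, which the listing above does not show explicitly), replacing raw CUDA calls looks like this:

  // Hypothetical usage sketch of the memory interface above.
  // Assumes the functions live in the Allen namespace and that a Context is available.
  #include <cstddef>

  void copy_input_to_device(const char* host_input, std::size_t size, const Allen::Context& context)
  {
    void* dev_buffer = nullptr;
    Allen::malloc(&dev_buffer, size);                                                       // instead of cudaMalloc
    Allen::memcpy_async(dev_buffer, host_input, size, Allen::memcpyHostToDevice, context);  // instead of cudaMemcpyAsync
    Allen::synchronize(context);                                                            // instead of cudaStreamSynchronize
    Allen::free(dev_buffer);                                                                // instead of cudaFree
  }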
  • An Allen::Context acts as the replacement for cudaStream_t and cudaEvent_t, and holds whatever is necessary for the backend to execute. In the case of SYCL, it uses a global sycl::queue object.
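A minimal sketch of what such a Context could look like for the SYCL backend, assuming only what is stated above (a queue() accessor, as used in the snippet further down, wrapping a global sycl::queue); this is not the actual Allen definition:

  // Hypothetical sketch of a SYCL-backend Context wrapping a global queue.
  // The real Allen::Context may hold additional backend state.
  #include <CL/sycl.hpp>

  namespace Allen {
    inline sycl::queue& global_queue()
    {
      static sycl::queue queue; // default device selection
      return queue;
    }

    struct Context {
      sycl::queue& queue() const { return global_queue(); }
    };
  } // namespace Allen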
  • A database of arguments has been created to store the values of arguments throughout the codebase. This was needed to remove the use of virtual in arguments, which made the Parameters objects non trivially-copyable; trivially copyable Parameters are a requirement for kernel invocation in SYCL. The database of arguments acts like a map of type -> {pointer, size}. However, since the parameter order is fixed, it is implemented as an std::array, which is more efficient than a map; a sketch follows below.
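As an illustration (the type and function names below are made up, not the actual Allen implementation), such a store boils down to an std::array of {pointer, size} entries indexed by the compile-time position of the argument type:

  // Hypothetical sketch of the argument database: conceptually a map from
  // argument type to {pointer, size}, stored as a std::array because the
  // position of every argument type is known at compile time.
  #include <array>
  #include <cstddef>
  #include <tuple>
  #include <type_traits>

  // Compile-time index of T within std::tuple<Ts...>
  template<typename T, typename Tuple>
  struct index_of;

  template<typename T, typename... Ts>
  struct index_of<T, std::tuple<T, Ts...>> : std::integral_constant<std::size_t, 0> {};

  template<typename T, typename U, typename... Ts>
  struct index_of<T, std::tuple<U, Ts...>>
    : std::integral_constant<std::size_t, 1 + index_of<T, std::tuple<Ts...>>::value> {};

  struct ArgumentEntry {
    void* pointer = nullptr;
    std::size_t size = 0;
  };

  template<typename... Arguments>
  struct ArgumentDatabase {
    std::array<ArgumentEntry, sizeof...(Arguments)> entries;

    template<typename T>
    ArgumentEntry& get() { return entries[index_of<T, std::tuple<Arguments...>>::value]; }
  };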
  • Accesses to the threadIdx, blockIdx, blockDim and gridDim objects in kernels have been replaced with calls to a single configuration object of type KernelInvocationConfiguration that is optionally passed with the parameters. KernelInvocationConfiguration must provide the following functions:
  struct KernelInvocationConfiguration {
    template<unsigned>
    unsigned threadIdx() const;

    template<unsigned>
    unsigned blockIdx() const;

    template<unsigned>
    unsigned blockDim() const;

    template<unsigned>
    unsigned gridDim() const;

    void syncthreads() const;

    void syncwarp() const;
  };

The implementation of those functions is backend-dependent, and each backend may or may not rely on a private member of the KernelInvocationConfiguration class to provide them.
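The concrete per-backend implementations are not reproduced here. As a rough illustration only, a SYCL-backend version could hold the current nd_item and forward all queries to it (the member name and the dimension mapping are assumptions; the barrier call matches the translation mentioned in the syncthreads item further down):

  // Hypothetical sketch of a SYCL-backend KernelInvocationConfiguration.
  // It keeps a pointer to the current work-item and forwards all queries to it.
  // Mapping template index I directly to the SYCL dimension is a simplification;
  // the CUDA backend would instead forward to threadIdx/blockIdx/blockDim/gridDim.
  #include <CL/sycl.hpp>

  struct KernelInvocationConfiguration {
    const sycl::nd_item<3>* m_item = nullptr;

    template<unsigned I>
    unsigned threadIdx() const { return static_cast<unsigned>(m_item->get_local_id(I)); }

    template<unsigned I>
    unsigned blockIdx() const { return static_cast<unsigned>(m_item->get_group(I)); }

    template<unsigned I>
    unsigned blockDim() const { return static_cast<unsigned>(m_item->get_local_range(I)); }

    template<unsigned I>
    unsigned gridDim() const { return static_cast<unsigned>(m_item->get_group_range(I)); }

    // Matches the translation quoted later in this description.
    void syncthreads() const { m_item->barrier(sycl::access::fence_space::local_space); }

    // Assumption: a full local barrier is used in place of a warp-level sync.
    void syncwarp() const { m_item->barrier(sycl::access::fence_space::local_space); }
  };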

Algorithm changes

  • Boost::HANA is not required anymore. Instead, structured bindings are used to obtain a tuple from the Parameters struct, which makes the Parameters definition more readable. E.g. (a sketch of the struct-to-tuple idiom follows the example):
  struct Parameters {
    Allen::KernelInvocationConfiguration config;
    HOST_INPUT(host_number_of_selected_events_t, unsigned) host_number_of_selected_events;
    HOST_INPUT(host_number_of_cluster_candidates_t, unsigned) host_number_of_cluster_candidates;
    DEVICE_INPUT(dev_event_list_t, unsigned) dev_event_list;
    DEVICE_INPUT(dev_candidates_offsets_t, unsigned) dev_candidates_offsets;
    DEVICE_INPUT(dev_velo_raw_input_t, char) dev_velo_raw_input;
    DEVICE_INPUT(dev_velo_raw_input_offsets_t, unsigned) dev_velo_raw_input_offsets;
    DEVICE_OUTPUT(dev_estimated_input_size_t, unsigned) dev_estimated_input_size;
    DEVICE_OUTPUT(dev_module_candidate_num_t, unsigned) dev_module_candidate_num;
    DEVICE_OUTPUT(dev_cluster_candidates_t, unsigned) dev_cluster_candidates;
    PROPERTY(block_dim_t, "block_dim", "block dimensions", DeviceDimensions) block_dim;
  };
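For reference, the struct-to-tuple idiom boils down to decomposing the aggregate with structured bindings and rebuilding a tuple. The sketch below shows a single fixed arity with made-up names; the real code dispatches on the number of members (cf. the godbolt link in the known issues):

  // Illustrative sketch of struct-to-tuple via structured bindings.
  // The real implementation handles arbitrary member counts; one arity is shown.
  #include <tuple>

  struct ThreeMembers {
    int a;
    float b;
    char c;
  };

  auto to_tuple(const ThreeMembers& s)
  {
    const auto& [a, b, c] = s;       // structured bindings decompose the aggregate
    return std::make_tuple(a, b, c); // and a tuple is rebuilt from the bindings
  }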
  • The Allen::KernelInvocationConfiguration config parameter is optional and is automatically populated only if it is present in the Parameters struct.
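A hypothetical way to achieve "populated only if it exists" is a member-detection trait combined with if constexpr (the names below are illustrative, not the actual Allen machinery):

  // Hypothetical sketch: detect at compile time whether Parameters has a
  // `config` member, and populate it only in that case.
  #include <type_traits>
  #include <utility>

  template<typename T, typename = void>
  struct has_config_member : std::false_type {};

  template<typename T>
  struct has_config_member<T, std::void_t<decltype(std::declval<T&>().config)>> : std::true_type {};

  template<typename Parameters, typename Configuration>
  void populate_config_if_present(Parameters& parameters, const Configuration& configuration)
  {
    if constexpr (has_config_member<Parameters>::value) {
      parameters.config = configuration; // only instantiated when the member exists
    }
  }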

  • SYCL prevents invoking any device functionality from a function pointer (it issues a compilation error). This limitation has several consequences in our code:

    • It is necessary to invoke dispatch with the lambda_wrap macro applied to each function. The lambda wrapper is defined simply as:

      #define lambda_wrap(fn) [](auto&&... args) { fn(args...); }

      However, nvcc complains about this definition. Therefore, the above macro is SYCL-specific; in the other backends the definition is:

      #define lambda_wrap(fn) fn
    • The invoke function requires a lambda instead of a function pointer. Passing a lambda in the global_function call would make the call unnecessarily verbose, so instead global_function is now a macro that does that behind the scenes:

      #define global_function(function_name) global_function_impl(function_name, lambda_wrap(function_name))
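      As an illustration only (the real signature of global_function_impl is not shown in this description), the impl can receive both the kernel handle and the wrapped invocable, so that backends which need the lambda (SYCL) and backends which keep the function itself can both be served:

      // Hypothetical sketch of global_function_impl: it keeps the original kernel
      // handle and the wrapped lambda, and hands them to the backend-specific
      // invoke machinery; here the invocable is simply forwarded.
      #include <utility>

      template<typename Kernel, typename Invocable>
      auto global_function_impl(Kernel kernel, Invocable invocable)
      {
        return [kernel, invocable](auto&&... arguments) {
          static_cast<void>(kernel); // kept for backends that launch via the function itself
          invocable(std::forward<decltype(arguments)>(arguments)...);
        };
      }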
  • The invoke function has been extended to support SYCL. A nice side effect of having the KernelInvocationConfiguration is that the CPU backend also benefits: it no longer needs thread_local variables.

  • SYCL does not support __shared__ memory with a syntax akin to CUDA's, and using SYCL local buffers would mean completely changing how algorithms are defined. Therefore, __shared__ buffers are "not supported" and global memory is used as a workaround, as sketched below.
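As a purely hypothetical sketch of that workaround (the names below are made up), a kernel that would declare a __shared__ scratch buffer in CUDA instead indexes a per-block slice of a global-memory buffer:

  // Hypothetical sketch: replacing a __shared__ scratch buffer with global memory.
  // dev_scratch would be an ordinary device buffer sized as grid_dim * block_size
  // elements, and each block indexes its own slice instead of declaring
  // __shared__ storage inside the kernel.
  void process_block(
    const Allen::KernelInvocationConfiguration& config,
    unsigned* dev_scratch,
    const unsigned block_size)
  {
    // With CUDA shared memory this would have been: __shared__ unsigned scratch[256];
    unsigned* scratch = dev_scratch + config.blockIdx<0>() * block_size;

    scratch[config.threadIdx<0>()] = config.threadIdx<0>();
    config.syncthreads(); // same synchronization point as with the shared buffer
  }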

  • The invoke function requires an unused argument to work. This is rather puzzling and took a very long time to track down... Below, the unused argument and the relevant section of the code:

  context.queue().submit([&](sycl::handler& h) {
    sycl::stream sycl_stream(1024, 256, h); // Commenting this line messes up the stack and produces a different result.
    h.parallel_for(
    ...
  • printf is not supported in SYCL device code. The (very few) device functions containing printf statements have been surrounded by ifdef guards so that the SYCL compiler does not see that code; a hypothetical sketch follows below.
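For illustration, the guards look along these lines; the macro name identifying the SYCL target and the printf itself are assumptions, not code from this branch:

  // Hypothetical sketch of guarding device-side printf from the SYCL compiler.
  // TARGET_DEVICE_SYCL is an assumed name for the target-selection macro.
  #ifndef TARGET_DEVICE_SYCL
    printf("illustrative debug output: %u\n", value);
  #endif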

  • There is one instance of syncthreads (translated to item->barrier(sycl::access::fence_space::local_space);) that fails with the error below. It has been ifdef'ed out for the SYCL target.

terminate called after throwing an instance of 'cl::sycl::runtime_error'
 what():  OpenCL API failed. OpenCL API returns: -5 (CL_OUT_OF_RESOURCES) -5 (CL_OUT_OF_RESOURCES)

Known issues

  • The SYCL CI job currently fails:

  $ if [ "${TARGET}" = "SYCL" ]; then source /cvmfs/projects.cern.ch/intelsw/oneAPI/linux/x86_64/2021/beta9/setvars.sh; fi
  ERROR: Job failed: exit code 1
  • The idiom found online for struct-to-tuple (https://godbolt.org/z/8xWh9E) does not work with the latest nvcc version (nvcc 11). This has been reported to NVIDIA. @nnolte wrote an nvcc-compatible version of that code, which is included in this branch.
Edited by Daniel Hugo Campora Perez
