WIP: SYCL support
This MR attempts to add SYCL 2020 support for Allen.
- Current status: Allen VELO subsequence compiles, runs and gets the expected efficiency.
## Core changes
- A common interface has been defined for memory-related operations. This common interface is used instead of raw CUDA calls when reserving or freeing memory; no calls with "cuda" in their names are used across the framework. The methods provided by the common interface are:
```cpp
namespace Allen {
  // Memcpy kind used in memory transfers, analogous to cudaMemcpyKind
  enum memcpy_kind {
    memcpyHostToHost,
    memcpyHostToDevice,
    memcpyDeviceToHost,
    memcpyDeviceToDevice,
    memcpyDefault
  };

  enum host_register_kind {
    hostRegisterDefault,
    hostRegisterPortable,
    hostRegisterMapped
  };

  enum class error {
    success,
    errorMemoryAllocation
  };

  void malloc(void** devPtr, size_t size);
  void malloc_host(void** ptr, size_t size);
  void memcpy(void* dst, const void* src, size_t count, enum memcpy_kind kind);
  void memcpy_async(void* dst, const void* src, size_t count, enum memcpy_kind kind, const Context& context);
  void memset(void* devPtr, int value, size_t count);
  void memset_async(void* ptr, int value, size_t count, const Context& context);
  void free_host(void* ptr);
  void free(void* ptr);
  void synchronize(const Context& context);
  void device_reset();
  void peek_at_last_error();
  void host_unregister(void* ptr);
  void host_register(void* ptr, size_t size, enum host_register_kind flags);
}
```
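For illustration, a minimal sketch of how the CUDA backend could implement a few of these entry points (the exact mapping shown here is an assumption, not a copy of the Allen code):

```cpp
#include <cuda_runtime.h>

namespace Allen {
  // Hypothetical CUDA implementation: each entry point forwards to the
  // corresponding CUDA runtime call.
  void malloc(void** devPtr, size_t size) { cudaMalloc(devPtr, size); }

  void memcpy(void* dst, const void* src, size_t count, enum memcpy_kind kind)
  {
    // Translate the backend-agnostic kind into the CUDA one.
    const cudaMemcpyKind cuda_kind = [kind] {
      switch (kind) {
      case memcpyHostToHost: return cudaMemcpyHostToHost;
      case memcpyHostToDevice: return cudaMemcpyHostToDevice;
      case memcpyDeviceToHost: return cudaMemcpyDeviceToHost;
      case memcpyDeviceToDevice: return cudaMemcpyDeviceToDevice;
      default: return cudaMemcpyDefault;
      }
    }();
    cudaMemcpy(dst, src, count, cuda_kind);
  }

  void free(void* ptr) { cudaFree(ptr); }
}
```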
- An `Allen::Context` acts as the replacement for `cudaStream_t` and `cudaEvent_t`, and holds whatever is necessary for the backend to execute. In the case of SYCL, it uses a global `sycl::queue` object.
- A database of arguments has been created to store the values of arguments throughout the codebase. This was needed to remove the use of `virtual` in arguments, which made the `Parameters` objects non-trivially-copyable; trivial copyability is required for kernel invocation in SYCL. The database of arguments acts like a map of `type -> {pointer, size}`. However, since the parameter order is guaranteed, it is implemented as an `std::array`, which is more efficient than a map.
- Kernel calls to the `threadIdx`, `blockIdx`, `blockDim` and `gridDim` objects have been replaced with calls to a singleton object of type `KernelInvocationConfiguration`, which is optionally passed with the parameters. `KernelInvocationConfiguration` must provide the following functions:
```cpp
struct KernelInvocationConfiguration {
  template<unsigned>
  unsigned threadIdx() const;

  template<unsigned>
  unsigned blockIdx() const;

  template<unsigned>
  unsigned blockDim() const;

  template<unsigned>
  unsigned gridDim() const;

  void syncthreads() const;
  void syncwarp() const;
};
```
The implementation of these functions is backend-dependent; they may or may not rely on a private member of the `KernelInvocationConfiguration` class. The concrete implementations are:
- CPU: https://gitlab.cern.ch/lhcb/Allen/-/blob/dcampora_sycl_2020_dev/backend/include/CPUBackend.h#L259
- CUDA, HIP: https://gitlab.cern.ch/lhcb/Allen/-/blob/dcampora_sycl_2020_dev/backend/include/CUDAHIPBackend.h#L52
- SYCL: https://gitlab.cern.ch/lhcb/Allen/-/blob/dcampora_sycl_2020_dev/backend/include/SYCLBackend.h#L79
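For illustration, a device function would use the configuration object roughly as follows (hypothetical function and argument names, not taken from the codebase):

```cpp
// Thread indexing goes through the configuration object instead of the CUDA
// built-ins, so the same source compiles for CUDA, HIP, CPU and SYCL.
void mark_events(Parameters parameters)
{
  const auto& config = parameters.config;
  // Equivalent of blockIdx.x * blockDim.x + threadIdx.x in CUDA:
  const unsigned index =
    config.blockIdx<0>() * config.blockDim<0>() + config.threadIdx<0>();
  parameters.dev_output[index] = index; // dev_output is a hypothetical DEVICE_OUTPUT
  config.syncthreads();
}
```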
## Algorithm changes
- Boost::HANA is not required anymore. Instead, structured bindings are used to obtain a tuple from the `Parameters` struct. This increases readability in the `Parameters` definition, e.g.:
```cpp
struct Parameters {
  Allen::KernelInvocationConfiguration config;
  HOST_INPUT(host_number_of_selected_events_t, unsigned) host_number_of_selected_events;
  HOST_INPUT(host_number_of_cluster_candidates_t, unsigned) host_number_of_cluster_candidates;
  DEVICE_INPUT(dev_event_list_t, unsigned) dev_event_list;
  DEVICE_INPUT(dev_candidates_offsets_t, unsigned) dev_candidates_offsets;
  DEVICE_INPUT(dev_velo_raw_input_t, char) dev_velo_raw_input;
  DEVICE_INPUT(dev_velo_raw_input_offsets_t, unsigned) dev_velo_raw_input_offsets;
  DEVICE_OUTPUT(dev_estimated_input_size_t, unsigned) dev_estimated_input_size;
  DEVICE_OUTPUT(dev_module_candidate_num_t, unsigned) dev_module_candidate_num;
  DEVICE_OUTPUT(dev_cluster_candidates_t, unsigned) dev_cluster_candidates;
  PROPERTY(block_dim_t, "block_dim", "block dimensions", DeviceDimensions) block_dim;
};
```
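For reference, the struct-to-tuple idiom reduces to something like the following sketch (fixed arity for brevity; a full implementation dispatches on the number of members):

```cpp
#include <tuple>

// Convert an aggregate with exactly three members into a tuple of references
// via structured bindings. A complete helper provides one such case per
// supported member count.
template<typename T>
auto struct_to_tuple(T& s)
{
  auto& [a, b, c] = s;
  return std::tie(a, b, c);
}
```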
- The `Allen::KernelInvocationConfiguration config` parameter is optional; it will be automatically populated only if it exists.
- SYCL prevents invoking any device functionality from a function pointer (it issues a compilation error). This limitation has several consequences in our code:
  - It is necessary to invoke `dispatch` with a `lambda_wrap` macro around each function. The lambda wrapper is defined simply as `#define lambda_wrap(fn) [](auto&&... args) { fn(args...); }`. Conversely, `nvcc` complains about this definition, so the above macro is SYCL-specific; in the other backends the definition is `#define lambda_wrap(fn) fn`.
  - The `invoke` function requires a lambda instead of a function pointer. Passing a lambda in the `global_function` call would make the call unnecessarily verbose, so instead `global_function` is now a macro that does that behind the scenes: `#define global_function(function_name) global_function_impl(function_name, lambda_wrap(function_name))`. A sketch of the resulting pattern is shown below.
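A hedged sketch of how an `invoke`-style function can consume the wrapped lambda in SYCL (simplified signature, not the actual Allen code):

```cpp
#include <CL/sycl.hpp>

// Fn is the lambda produced by lambda_wrap. Since every lambda has a unique
// type, the call inside the kernel is resolved statically and no function
// pointer is ever formed, which satisfies the SYCL restriction.
template<typename Fn, typename... Args>
void invoke(Fn kernel_fn, sycl::queue& queue, Args... args)
{
  queue.submit([=](sycl::handler& h) {
    h.parallel_for(sycl::range<1> {1}, [=](sycl::id<1>) { kernel_fn(args...); });
  });
}
```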
- The `invoke` function has been extended to support SYCL. A nice side effect of having the `KernelInvocationConfiguration` is that the CPU backend will also benefit from no longer using `thread_local` variables.
- SYCL doesn't support `__shared__` memory with a syntax akin to CUDA's, and using SYCL local buffers would mean completely changing how algorithms are defined. Therefore, `__shared__` buffers are "not supported" and workarounds that use global memory are employed instead, as sketched below.
- The invoke function requires an unused argument to work. This is rather puzzling and took very long to find out... Below are the unused argument and the relevant section of the code:
```cpp
context.queue().submit([&](sycl::handler& h) {
  sycl::stream sycl_stream(1024, 256, h); // Commenting out this line messes up the stack and produces a different result.
  h.parallel_for(
    ...
```
- `printf` is not supported in SYCL device code. The (very few) device functions with `printf` statements have been surrounded by `#ifdef` guards so that the SYCL compiler does not see that code.
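The guard looks roughly like this (the macro name is an assumption for illustration, not necessarily the one used in the branch):

```cpp
// Hide device-side printf from the SYCL compiler (TARGET_SYCL is hypothetical).
#ifndef TARGET_SYCL
  printf("Number of cluster candidates: %u\n", number_of_candidates);
#endif
```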
- There is one instance of `syncthreads` (translated to `item->barrier(sycl::access::fence_space::local_space);`) that fails with the error below. It has been `#ifdef`'ed out for the SYCL target.
```
terminate called after throwing an instance of 'cl::sycl::runtime_error'
  what():  OpenCL API failed. OpenCL API returns: -5 (CL_OUT_OF_RESOURCES) -5 (CL_OUT_OF_RESOURCES)
```
## Known issues
- Only the `velo` subsequence has been "ported" to the use of `KernelInvocationConfiguration`.
. -
The design of setting the store of each argument in a separate container (needed here to make Parameters trivially copiable) clashes with the design of the Allen-Gaudi conversion, where the store of each element is its own
std::vector
and functions that interact are overriden. Ie. https://gitlab.cern.ch/lhcb/Allen/-/blob/dcampora_nnolte_gaudi_conversion/Rec/Allen/Allen/AlgorithmConversionTools.h#L35 vs. https://gitlab.cern.ch/lhcb/Allen/-/blob/dcampora_sycl_2020_dev/stream/gear/include/ArgumentManager.cuh#L19 -
The gitlab CI job for the SYCL backend unfortunately results in a "job failed" (https://gitlab.cern.ch/lhcb/Allen/-/jobs/10482551). This is funny, since clicking on "Debug" while the CI job is running lets the user successfully source the
setvars.sh
file.
```
$ if [ "${TARGET}" = "SYCL" ]; then source /cvmfs/projects.cern.ch/intelsw/oneAPI/linux/x86_64/2021/beta9/setvars.sh; fi
ERROR: Job failed: exit code 1
```
- The idiom found online for struct-to-tuple (https://godbolt.org/z/8xWh9E) does not work with the latest `nvcc` version (`nvcc 11`). This has been reported to NVIDIA. @nnolte wrote an `nvcc`-compatible version of that code, which is included in this branch.