# Backend improvements

This MR heavily refactors the backend of Allen.

## Algorithm changes
- Use C++17's ability to destructure structs into tuples (via structured bindings) instead of Boost::HANA. This means that `Parameters` should now be written as follows:
Old:

```cpp
DEFINE_PARAMETERS(
  Parameters,
  (HOST_INPUT(host_number_of_events_t, unsigned), host_number_of_events),
  (HOST_INPUT(host_number_of_cluster_candidates_t, unsigned), host_number_of_cluster_candidates),
  (DEVICE_INPUT(dev_event_list_t, unsigned), dev_event_list),
  (DEVICE_INPUT(dev_candidates_offsets_t, unsigned), dev_candidates_offsets),
  (DEVICE_INPUT(dev_velo_raw_input_t, char), dev_velo_raw_input),
  (DEVICE_INPUT(dev_velo_raw_input_offsets_t, unsigned), dev_velo_raw_input_offsets),
  (DEVICE_OUTPUT(dev_estimated_input_size_t, unsigned), dev_estimated_input_size),
  (DEVICE_OUTPUT(dev_module_candidate_num_t, unsigned), dev_module_candidate_num),
  (DEVICE_OUTPUT(dev_cluster_candidates_t, unsigned), dev_cluster_candidates),
  (PROPERTY(block_dim_t, "block_dim", "block dimensions", DeviceDimensions), block_dim))
```
New:

```cpp
struct Parameters {
  HOST_INPUT(host_number_of_events_t, unsigned) host_number_of_events;
  HOST_INPUT(host_number_of_cluster_candidates_t, unsigned) host_number_of_cluster_candidates;
  DEVICE_INPUT(dev_event_list_t, unsigned) dev_event_list;
  DEVICE_INPUT(dev_candidates_offsets_t, unsigned) dev_candidates_offsets;
  DEVICE_INPUT(dev_velo_raw_input_t, char) dev_velo_raw_input;
  DEVICE_INPUT(dev_velo_raw_input_offsets_t, unsigned) dev_velo_raw_input_offsets;
  DEVICE_OUTPUT(dev_estimated_input_size_t, unsigned) dev_estimated_input_size;
  DEVICE_OUTPUT(dev_module_candidate_num_t, unsigned) dev_module_candidate_num;
  DEVICE_OUTPUT(dev_cluster_candidates_t, unsigned) dev_cluster_candidates;
  PROPERTY(block_dim_t, "block_dim", "block dimensions", DeviceDimensions) block_dim;
};
```
- Apart from the syntax change to a more C++-like one, `PROPERTY`s no longer have to have instances. That means that a `PROPERTY` may now be defined as below (note the absence of `block_dim`, an instance of that property, before the semicolon). It is therefore now possible for `PROPERTY`s that are not going to be used in a kernel to not be passed by value to the kernel.
```cpp
struct Parameters {
  ...
  PROPERTY(block_dim_t, "block_dim", "block dimensions", DeviceDimensions);
};
```
- The above has allowed all `std::string` properties to become instance-less; having `std::string` instances was not desirable for GPU kernels.
## Framework changes
The above transformation also required some framework changes. The recent developments done for the SYCL branch have been refactored, and good practices that are independent of the target code have been adopted. These developments, in turn, make the framework better designed and more easily translatable to SYCL in the future, should we decide to do so.
- `BackendCommonInterface` defines a common interface instead of using raw CUDA calls in the framework. Note that this does not affect the kernels, which are still written in CUDA. Here is the new interface:
```cpp
namespace Allen {
  // Holds an execution context. An execution context allows kernels
  // to be executed in parallel, and provides a way for execution
  // to be stopped.
  struct Context;

  // Memcpy kind used in memory transfers, analogous to cudaMemcpyKind
  enum memcpy_kind {
    memcpyHostToHost,
    memcpyHostToDevice,
    memcpyDeviceToHost,
    memcpyDeviceToDevice,
    memcpyDefault
  };

  enum host_register_kind {
    hostRegisterDefault,
    hostRegisterPortable,
    hostRegisterMapped
  };

  enum class error {
    success,
    errorMemoryAllocation
  };

  void malloc(void** devPtr, size_t size);
  void malloc_host(void** ptr, size_t size);
  void memcpy(void* dst, const void* src, size_t count, enum memcpy_kind kind);
  void memcpy_async(void* dst, const void* src, size_t count, enum memcpy_kind kind, const Context& context);
  void memset(void* devPtr, int value, size_t count);
  void memset_async(void* ptr, int value, size_t count, const Context& context);
  void free_host(void* ptr);
  void free(void* ptr);
  void synchronize(const Context& context);
  void device_reset();
  void peek_at_last_error();
  void host_unregister(void* ptr);
  void host_register(void* ptr, size_t size, enum host_register_kind flags);
} // namespace Allen
```
- An `Allen::Context` is used in all algorithms instead of a stream and an event. The `Context` holds a stream and an event behind the scenes when compiling for CUDA/HIP.
- The store of Allen has been moved from the parameters to an `std::array<ArgumentData, N>`. The indexing of parameters into this store is based on the parameter order, which is guaranteed and kept in the parsing / code generation steps. This has the benefit that `Parameters` objects are now trivially copyable, that the default implementation of the store makes sense (as opposed to `host_datatype` and `device_datatype` having virtual placeholders like before), and that the generated sequences are much shorter, since parameters don't have to override the behaviour of their stores. `ArgumentData` is a type-erased store for any datatype, and it looks as follows:
```cpp
struct ArgumentData {
private:
  char* m_base_pointer = nullptr;
  size_t m_size = 0;
  std::string m_name = "";

public:
  virtual char* pointer() const { return m_base_pointer; }
  virtual size_t size() const { return m_size; }
  virtual std::string name() const { return m_name; }
  virtual void set_pointer(char* pointer) { m_base_pointer = pointer; }
  virtual void set_size(size_t size) { m_size = size; }
  virtual void set_name(const std::string& name) { m_name = name; }
  virtual ~ArgumentData() {}
};
```
- `ArgumentData` is `virtual` in preparation for the Allen-Gaudi automatic algorithm conversion.
- Names of parameters are set at runtime, similarly to how algorithm names are set.
- The store functions and function names have been refactored: struct types have been given explicit visibility (private, public), and methods have more meaningful names (eg. `pointer` when it refers to a pointer).
- `HostFunction` and `GlobalFunction` have been refactored into `TargetFunction` and `TransformParameters`, which deal with the target functions and implement the functionality to transform arguments into parameters, respectively.
- General improvements to the logic of SchedulerMachinery, and cleanup of repeated functionality in TupleTools (`index_of` was implemented the same as `TupleContains`).
- Lines no longer pass `this` to the global function call. Instead, all calls to functions have been converted to static calls, using the `Derived` datatype.
- Global and host function calls no longer pass `this`. Instead, they pass their `m_properties` object, since the object instance was only needed to access its properties.
## Rework of memory managers
- Memory managers now have proper visibility in their members.
- An `m_name` member better identifies each memory manager and is used when printing its state (eg. when invoking `./Allen -p 1`).
- Memory managers now dispatch pointers when reserving (as opposed to an offset like before). This better decouples the functionality from Argument Managers, which used to store the base pointers. Now, memory managers store (if relevant) the base pointers.
- Memory managers have become templated: one can choose between a `Host` or `Device` memory manager, and between a `SingleAlloc` or `MultiAlloc` memory manager.
  - `SingleAlloc` memory managers act like before.
  - `MultiAlloc` memory managers dispatch malloc or free calls to the underlying memory manager implementation. This is rather slow, but it is much better for finding out-of-bound writes. The `MultiAlloc` memory managers can be enabled with the new CMake option `MALLOC_ENGINE`: configuring with `-DMALLOC_ENGINE=MULTI_ALLOC` will trigger using the new memory manager.