WIP: New algorithm definition
This MR changes (almost completely) the way to define Allen algorithms. It introduces the following functionality:
Arguments:
- `Argument` files are (will be) gone.
- Algorithm arguments are specified alongside the algorithm definition.
- Host and device arguments are supported and managed by separate memory manager instances.
- An argument is defined as a struct (type): it has an inner type, it is either host or device, and it is either input or output.
- The argument itself can be used in a function definition, removing the need to specify the argument's type repeatedly.
- Arguments are safer: the type and whether it is input or output are carried over. An input argument is by definition `const`.

I.e.:
```cpp
namespace velo_estimate_input_size {
  // Arguments
  HOST_INPUT(host_number_of_selected_events_t, uint)
  DEVICE_INPUT(dev_event_list_t, uint)
  DEVICE_OUTPUT(dev_velo_raw_input_t, char)
  DEVICE_OUTPUT(dev_velo_raw_input_offsets_t, uint)
  DEVICE_OUTPUT(dev_estimated_input_size_t, uint)
  DEVICE_OUTPUT(dev_module_candidate_num_t, uint)
  DEVICE_OUTPUT(dev_cluster_candidates_t, uint)

  // Global function
  __global__ void velo_estimate_input_size(
    dev_velo_raw_input_t dev_velo_raw_input,
    dev_velo_raw_input_offsets_t dev_velo_raw_input_offsets,
    dev_estimated_input_size_t dev_estimated_input_size,
    dev_module_candidate_num_t dev_module_candidate_num,
    dev_cluster_candidates_t dev_cluster_candidates,
    dev_event_list_t dev_event_list,
    uint8_t* candidate_ks);

  ...
```
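For orientation, the sketch below shows roughly what such an argument macro could expand to; the tag names (`device_datatype`, `output_datatype`) and the exact macro body are assumptions for illustration, not the actual implementation. The point is that each argument is a plain struct carrying its inner type plus host/device and input/output properties, which the framework can later recover by inspecting what the struct inherits from.

```cpp
// Illustrative sketch only: tag names and macro body are assumptions,
// not the actual Allen definitions.
struct device_datatype {};
struct output_datatype {};

#define DEVICE_OUTPUT(ARGUMENT_NAME, ARGUMENT_TYPE)                        \
  struct ARGUMENT_NAME : device_datatype, output_datatype {                \
    using type = ARGUMENT_TYPE;                                            \
    ARGUMENT_TYPE* m_data;                                                 \
    __host__ __device__ operator ARGUMENT_TYPE*() const { return m_data; } \
  };
```

Because the struct converts to a pointer of its inner type, it can be used directly as a parameter type of the `__global__` function above, which is what removes the need to repeat the inner type.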
Algorithms:
- Handlers are gone. Visitors are gone.
- Algorithms are now defined as a struct that inherits from either `DeviceAlgorithm` or `HostAlgorithm`. The only difference is that `DeviceAlgorithm` has a built-in configurable `grid_dimension` and `block_dimension` that can be used (similarly to `Handlers` versus `CpuHandlers` before).
- Algorithms are expected to have a name (`constexpr auto name {"some name"}`).
- Algorithms can store any number of function wrappers. Each function wrapper can either be a `host_function` or a `global_function`.
- Algorithms are expected to have a `set_arguments_size` and an `operator()`. Similarly to the visitors before them, `set_arguments_size` is expected to set the size of the output arguments (see the sketch after the example below). `operator()` is where the algorithm's body is expected (a name similar to Gaudi). Function wrappers can be used to invoke the functions (more later).
- Algorithms are now templated. They expect an `std::tuple` with argument types. These argument types are accessed by checking for the elements they inherit from (more later).

I.e.:
```cpp
// Algorithm
template<typename Arguments>
struct velo_estimate_input_size_t : public DeviceAlgorithm {
  constexpr static auto name {"velo_estimate_input_size_t"};
  decltype(global_function(velo_estimate_input_size)) function {velo_estimate_input_size};

  void set_arguments_size(
    ArgumentRefManager<Arguments> arguments,
    const RuntimeOptions& runtime_options,
    const Constants& constants,
    const HostBuffers& host_buffers) const
  {
    ...
  }

  void operator()(
    const ArgumentRefManager<Arguments>& arguments,
    const RuntimeOptions& runtime_options,
    const Constants& constants,
    HostBuffers& host_buffers,
    cudaStream_t& cuda_stream,
    cudaEvent_t& cuda_generic_event) const
  {
    ...
  }
};
```
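As a rough illustration of what `set_arguments_size` is expected to do, the body simply sets the number of elements of each output argument. The `set_size` helper name and the sizes below are assumptions used for illustration; only `offset<T>(arguments)` appears elsewhere in this MR.

```cpp
// Illustrative sketch only: set_size<T> is assumed to be the helper that
// records an output argument's element count; the real name may differ.
void set_arguments_size(
  ArgumentRefManager<Arguments> arguments,
  const RuntimeOptions& runtime_options,
  const Constants& constants,
  const HostBuffers& host_buffers) const
{
  const auto number_of_events = offset<host_number_of_selected_events_t>(arguments)[0];
  const uint modules_per_event = 52; // illustrative constant

  // Sizes are purely illustrative.
  set_size<dev_module_candidate_num_t>(arguments, number_of_events);
  set_size<dev_estimated_input_size_t>(arguments, number_of_events * modules_per_event + 1);
}
```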
Function wrappers:
- Function wrappers have been simplified. There are no more `set_arguments` nor `set_opts`.
- `CpuFunction`s are (will be) now called `HostFunction`s (which is more correct).
- `GpuFunction`s are (will be) now called `GlobalFunction`s (`DeviceFunction` unfortunately has another meaning in CUDA terminology).
- In order to make a function wrapper, the helpers `host_function` and `global_function` can be used.
- Function wrappers only provide `operator()`. In the case of a `HostFunction`, this invokes the function (see the sketch after the example below).
- In the case of a `GlobalFunction`, `operator()` accepts the grid dimension, block dimension and CUDA stream, and returns a closure to the actual function. This mimics the syntax of CUDA kernel launches.

I.e.:
```cpp
// Algorithm
template<typename Arguments>
struct velo_estimate_input_size_t : public DeviceAlgorithm {
  constexpr static auto name {"velo_estimate_input_size_t"};
  decltype(global_function(velo_estimate_input_size)) function {velo_estimate_input_size};

  // [...]

  void operator()(
    const ArgumentRefManager<Arguments>& arguments,
    const RuntimeOptions& runtime_options,
    const Constants& constants,
    HostBuffers& host_buffers,
    cudaStream_t& cuda_stream,
    cudaEvent_t& cuda_generic_event) const
  {
    // [...]

    // Invoke kernel
    function(dim3(offset<host_number_of_selected_events_t>(arguments)[0]), block_dimension(), cuda_stream)(
      offset<dev_velo_raw_input_t>(arguments),
      offset<dev_velo_raw_input_offsets_t>(arguments),
      offset<dev_estimated_input_size_t>(arguments),
      offset<dev_module_candidate_num_t>(arguments),
      offset<dev_cluster_candidates_t>(arguments),
      offset<dev_event_list_t>(arguments),
      constants.dev_velo_candidate_ks.data());
  }
};
```
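For contrast, a `HostFunction` wrapper is invoked directly, without the (grid, block, stream) closure. A minimal sketch, where the wrapped function and the argument names are hypothetical; only the `host_function` helper itself is part of this MR:

```cpp
// Illustrative sketch only: copy_size_to_host and the argument names are hypothetical.
void copy_size_to_host(uint* host_total_size, const uint* dev_offsets, size_t n);

decltype(host_function(copy_size_to_host)) function {copy_size_to_host};

// Direct invocation: no grid dimension, block dimension or stream closure,
// unlike the GlobalFunction call shown above.
function(
  offset<host_total_size_t>(arguments),
  offset<dev_offsets_t>(arguments),
  1);
```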
Configurable parameters to algorithms:
- Algorithms must now be configured with parameters.
- Parameters can be organized in any manner desired.
- Host and device arguments can be used.
- Parameters are picked up according to whether they inherit from the expected type.
- For instance, as a consequence, it is now possible to have only one prefix sum algorithm and reuse it with different parameters (see the sketch after the example below).

I.e.:
```cpp
ARG(host_event_list_t, host_global_event_cut::host_event_list_t)
ARG(host_number_of_selected_events_t,
  host_global_event_cut::host_number_of_selected_events_t,
  velo_estimate_input_size::host_number_of_selected_events_t,
  velo_consolidate_tracks::host_number_of_selected_events_t)

...

SEQUENCE_T(
  host_global_event_cut::host_global_event_cut_t<
    std::tuple<dev_event_list_t, host_event_list_t, host_number_of_selected_events_t>>,
  velo_estimate_input_size::velo_estimate_input_size_t<std::tuple<
    host_number_of_selected_events_t,
    dev_velo_raw_input_t,
    dev_velo_raw_input_offsets_t,
    dev_estimated_input_size_t,
    dev_module_candidate_num_t,
    dev_cluster_candidates_t,
    dev_event_list_t>>,
  ...
```
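To illustrate the reuse point above, the same templated prefix sum algorithm could appear several times in the sequence with different argument tuples. The algorithm and argument names below are hypothetical:

```cpp
// Illustrative sketch only: algorithm and argument names are hypothetical.
SEQUENCE_T(
  ...,
  host_prefix_sum::host_prefix_sum_t<std::tuple<
    host_total_number_of_velo_clusters_t,
    dev_estimated_input_size_t>>,
  ...,
  host_prefix_sum::host_prefix_sum_t<std::tuple<
    host_total_number_of_ut_hits_t,
    dev_ut_hit_offsets_t>>,
  ...)
```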
Host and device memory managers:
- Two memory managers are now initialized. By default, the host memory manager has 10 MB.
- So far, the host memory manager is only used for single `uint`s which store the size of values (i.e. the total size of each prefix sum in the sequence).
- Only a single object of type `ArgumentRefManager<Arguments>` is passed to set sizes and to read offsets or sizes.
- The choice of memory manager is performed through SFINAE (see the sketch below).
- The current design is extensible and allows more memory managers to be added (i.e. for types used only when validation is on).
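A minimal sketch of the SFINAE-based choice, assuming tag types named `host_datatype` and `device_datatype` (all names here are illustrative, not necessarily the ones in the code): the base class an argument inherits from determines which memory manager serves it.

```cpp
#include <type_traits>

// Illustrative sketch only: tag and manager names are assumptions.
struct host_datatype {};
struct device_datatype {};

struct HostMemoryManager {};    // reserves from the (default 10 MB) host pool
struct DeviceMemoryManager {};  // reserves from the device pool

// Select the memory manager based on the tag an argument type inherits from.
template<typename Argument, typename Enable = void>
struct MemoryManagerFor;

template<typename Argument>
struct MemoryManagerFor<
  Argument,
  std::enable_if_t<std::is_base_of_v<host_datatype, Argument>>> {
  using type = HostMemoryManager;
};

template<typename Argument>
struct MemoryManagerFor<
  Argument,
  std::enable_if_t<std::is_base_of_v<device_datatype, Argument>>> {
  using type = DeviceMemoryManager;
};
```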
Even though the MR is rather big, I don't see how this functionality could be added in smaller chunks without making it more confusing to developers.
Thanks a lot to @clemenci, @frankm and @nnolte for the discussions on various design aspects.
TODO before merging:
- Move all algorithms to templated syntax.
- Generate a default sequence configuration.
- Verify the immutability condition and adapt the necessary algorithms (i.e. any parameter is an output parameter only once).
- Check at compile time that all outputs precede any input use.
Separate items that could spin off from this MR:
- Expose the configuration to the developer in an easy-to-configure way.