Draft: [RFC] New way to define parameters in Allen
Disclosure: This MR is a proof of principle of an improvement upon the current Allen syntax. It is not intended for merging; rather, it is a draft MR that shows the changes wrt the current codebase and what would be possible in the future.
What is this?
This is a proposal for a new store for Allen, with a number of implications for how Allen parameters would be defined. It only affects how parameters are defined and passed; it does not change the bulk of the Allen code in any other way (e.g. no algorithm code is affected).
Let's consider a simple Allen algorithm like the prefix sum. With this MR, its declaration becomes:
```cpp
namespace host_prefix_sum {
  struct host_prefix_sum_t : public HostAlgorithm {
    std::tuple<
      host_buffer<unsigned, "host_total_sum_holder_t">,
      host_buffer<unsigned, "host_output_buffer_t">,
      device_buffer<unsigned, "dev_output_buffer_t">> operator()(
      const device_buffer<unsigned, "dev_input_buffer_t">&,
      Allen::Context&) const;
  };
} // namespace host_prefix_sum
```
This algorithm gets a `device_buffer` as an input, which holds `unsigned`s and has key `dev_input_buffer_t`. The algorithm outputs three buffers, namely: a `host_buffer` of type `unsigned` and key `host_total_sum_holder_t`, another `host_buffer` of type `unsigned` and key `host_output_buffer_t`, and a `device_buffer` of type `unsigned` and key `dev_output_buffer_t`.
The definition of that algorithm becomes:
```cpp
std::tuple<
  host_buffer<unsigned, "host_total_sum_holder_t">,
  host_buffer<unsigned, "host_output_buffer_t">,
  device_buffer<unsigned, "dev_output_buffer_t">>
host_prefix_sum::host_prefix_sum_t::operator()(
  const device_buffer<unsigned, "dev_input_buffer_t">& dev_input_buffer,
  Allen::Context& context) const
{
  auto host_total_sum_holder = make_host_buffer<unsigned, "host_total_sum_holder_t">(context, 1);
  auto host_output_buffer = make_host_buffer<unsigned, "host_output_buffer_t">(context, dev_input_buffer.size() + 1);
  auto dev_output_buffer = make_device_buffer<unsigned, "dev_output_buffer_t">(context, dev_input_buffer.size() + 1);

  // Copy data over to the host
  host_output_buffer[0] = 0;
  Allen::copy(
    host_output_buffer.get(), dev_input_buffer.get(), context, Allen::memcpyDeviceToHost, dev_input_buffer.size(), 1, 0);

  host_prefix_sum_impl(host_output_buffer.data(), host_output_buffer.size(), host_total_sum_holder.data());

  // Copy prefix summed data to the device
  Allen::copy_async(dev_output_buffer.get(), host_output_buffer.get(), context, Allen::memcpyHostToDevice);

  return {std::move(host_total_sum_holder), std::move(host_output_buffer), std::move(dev_output_buffer)};
}
```
A concept similar to Gaudi's store is employed here: one defines variables that are moved into the store by the end of the algorithm's execution. This is achieved by declaring the required host and device buffers:
```cpp
auto host_total_sum_holder = make_host_buffer<unsigned, "host_total_sum_holder_t">(context, 1);
auto host_output_buffer = make_host_buffer<unsigned, "host_output_buffer_t">(context, dev_input_buffer.size() + 1);
auto dev_output_buffer = make_device_buffer<unsigned, "dev_output_buffer_t">(context, dev_input_buffer.size() + 1);
```
and move-return them at the end of the algorithm's execution:
```cpp
return {std::move(host_total_sum_holder), std::move(host_output_buffer), std::move(dev_output_buffer)};
```
Each datatype is of type `named_buffer<SCOPE, T, KEY>`, which contains information about the scope (host or device), the type and the key that the user wants. Adding the key as part of the buffer datatype makes the return statement safer, as incompatible strings will not match one another.
`named_buffer`:
The `named_buffer<SCOPE, T, KEY>` has three typedefs, given for simplicity (see the sketch after this list):

- `device_buffer<T, KEY>` -- a `named_buffer` with `SCOPE=device`
- `host_buffer<T, KEY>` -- a `named_buffer` with `SCOPE=host`
- `mask_buffer<KEY>` -- a `named_buffer` with `SCOPE=device` and `T=mask_t`. It has similar restrictions and functionality as `MASK`, that is: there can be at most one `mask_t` object in the input list and one in the output list, and the scheduler will use this object to store the calculated control flow, e.g. which events are still being processed.
`named_buffer`s are an evolution of the existing buffer class, which has the following features:
- It can be constructed or move-constructed, but it cannot be copied.
- Upon deletion, it deletes its counterpart in the respective memory manager.
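A minimal sketch of that move-only, RAII behaviour, with a deliberately simplified, hypothetical memory manager (the real one manages a preallocated memory pool):

```cpp
#include <cstddef>
#include <utility>

// Hypothetical stand-in for the memory manager, simplified for illustration.
struct MemoryManager {
  void* reserve(std::size_t bytes) { return ::operator new(bytes); }
  void free(void* ptr) { ::operator delete(ptr); }
};

template <typename T>
class buffer {
  MemoryManager* m_manager = nullptr;
  T* m_data = nullptr;
  std::size_t m_size = 0;

public:
  buffer(MemoryManager& manager, std::size_t size) :
    m_manager {&manager}, m_data {static_cast<T*>(manager.reserve(size * sizeof(T)))}, m_size {size}
  {}

  // It can be constructed or move-constructed, but it cannot be copied.
  buffer(const buffer&) = delete;
  buffer& operator=(const buffer&) = delete;
  buffer(buffer&& other) noexcept :
    m_manager {std::exchange(other.m_manager, nullptr)},
    m_data {std::exchange(other.m_data, nullptr)},
    m_size {std::exchange(other.m_size, 0)}
  {}

  // Upon deletion, it deletes its counterpart in the memory manager.
  ~buffer()
  {
    if (m_manager != nullptr) m_manager->free(m_data);
  }
};
```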
One can easily predict the tag a `named_buffer` will have inside the `SCOPE` memory manager. The tag is of the form `<algorithm name>__<name>`. One of the (good) implications of this is that the user is required to always give a different name to the datatypes inside an algorithm. Another (good) implication is that the tag can never clash with another algorithm's datatypes, due to the prefix.
E.g. the following code would throw a memory manager exception:

```cpp
void MyAlgo::operator()(Allen::Context& context) const
{
  auto a = make_host_buffer<unsigned, "a">(context, 1);
  auto b = make_device_buffer<unsigned, "a">(context, 1); // key "a" is already taken
}
```
Note that even if one defines a `host` and a `device` buffer, keys must always be unique within the context of the algorithm.
For simplicity, two functions are provided to create buffers:
- `make_host_buffer<T, KEY>(Context& context, size_t size)` -- creates a host buffer with type `T`, key `KEY` and size `size`.
- `make_device_buffer<T, KEY>(Context& context, size_t size)` -- creates a device buffer with type `T`, key `KEY` and size `size`.
`named_buffer<SCOPE, T, KEY>` provides several functions that are relevant when operating with it:

- `void resize(size_t size)` -- resizes the buffer (triggers a `free` and a `reserve`).
- `void reduce_size(size_t size)` -- reduces the size of the current buffer (triggers neither `free` nor `reserve`).
- `std::span<T> get()` and `std::span<const T> get() const` -- convert the buffer to a span.

Other than that, a buffer behaves similarly to a `span` (it provides `begin()`, `end()`, `data()`, `size()`, `size_bytes()`, `subspan()`, etc.). A short usage sketch follows.
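A hypothetical usage sketch of this API (the buffer key is illustrative):

```cpp
// Assuming a valid Allen::Context& context is in scope.
auto buf = make_device_buffer<unsigned, "dev_example_t">(context, 100);
buf.resize(200);      // frees the old allocation and reserves a new one
buf.reduce_size(50);  // shrinks the visible size; no reallocation happens
std::span<unsigned> view = buf.get();  // span over the current 50 elements
```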
Kernel invocations:
Kernel invocations (i.e. invocations of functions marked `__global__`) can be done similarly to before; however, it is now required to pass each of the buffers explicitly. Here is the new declaration of Search by triplet:
```cpp
__global__ void velo_search_by_triplet_kernel(
  std::span<const mask_t> dev_event_list,
  std::span<const unsigned> dev_number_of_events,
  std::span<const char> dev_sorted_velo_cluster_container,
  std::span<const unsigned> dev_offsets_estimated_input_size,
  std::span<const unsigned> dev_module_cluster_num,
  std::span<const Velo::Clusters>,
  std::span<Velo::TrackHits> dev_tracks,
  std::span<Velo::TrackletHits> dev_three_hit_tracks,
  std::span<Velo::TrackletHits> dev_tracklets,
  std::span<unsigned> dev_tracks_to_follow,
  std::span<bool> dev_hit_used,
  std::span<unsigned> dev_atomics_velo,
  std::span<unsigned> dev_number_of_velo_tracks,
  std::span<unsigned short> dev_rel_indices,
  const VeloGeometry* dev_velo_geometry,
  const float phi_tolerance,
  const float max_scatter,
  const unsigned max_skipped_modules)
```
and its invocation:
```cpp
new_global_function(velo_search_by_triplet_kernel)(dev_event_list.size(), m_block_dim_x.value(), context)(
  dev_event_list,
  dev_number_of_events,
  dev_sorted_velo_cluster_container,
  dev_offsets_estimated_input_size,
  dev_module_cluster_num,
  dev_velo_clusters,
  dev_tracks,
  dev_three_hit_tracks,
  dev_tracklets,
  dev_tracks_to_follow,
  dev_hit_used,
  dev_atomics_velo,
  dev_number_of_velo_tracks,
  dev_rel_indices,
  constants.dev_velo_geometry,
  m_tolerance,
  m_max_scatter,
  m_max_skipped_modules);
```
It is worth noting that the outputs decay automatically to `std::span<T>`, whereas the inputs decay to `std::span<const T>`.
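One way this decay could be implemented (a hypothetical sketch, not necessarily how this MR does it, reusing the `Scope` and `fixed_string` helpers from the sketch above) is with a pair of conversion operators on `named_buffer`:

```cpp
#include <cstddef>
#include <span>

template <Scope S, typename T, fixed_string KEY>
struct named_buffer {
  // Outputs (non-const buffers) decay to a mutable span...
  operator std::span<T>() { return {m_data, m_size}; }
  // ...while inputs (const buffers) decay to a read-only span.
  operator std::span<const T>() const { return {m_data, m_size}; }

private:
  T* m_data = nullptr;
  std::size_t m_size = 0;
};
```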
Lines:
A proof of principle to support lines is also available with this MR. Defining lines now requires:

- Defining a `struct DeviceParameters` that encapsulates the parameters required on the device.
- Defining an `operator()` with the inputs and outputs required in the line, which creates the encapsulated object and passes control on to `Line<Derived>::operator()` (see the sketch after this list).
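To make the shape concrete, here is a rough, hypothetical sketch of such a line; the names, the buffer keys and the exact `Line<Derived>::operator()` signature are illustrative assumptions, not the actual interface in this MR:

```cpp
namespace example_line {
  struct example_line_t : public Line<example_line_t> {
    // Parameters required on the device, gathered in one struct.
    struct DeviceParameters {
      std::span<const unsigned> dev_number_of_events;
      float threshold;
    };

    std::tuple<mask_buffer<"dev_decisions_t">> operator()(
      const device_buffer<unsigned, "dev_number_of_events_t">& dev_number_of_events,
      Allen::Context& context) const
    {
      // Create the encapsulated object and pass control on to the base class.
      // m_threshold is an assumed algorithm property, for illustration only.
      DeviceParameters parameters {dev_number_of_events.get(), m_threshold.value()};
      return Line<example_line_t>::operator()(parameters, context);
    }
  };
} // namespace example_line
```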
GatherSelections still requires support for InputAggregates, but conceptually that should work.
Changes wrt current Allen:
One of the defining features of Allen is its separation of `set_arguments_size` and `operator()`, whereby the reserve / free occurs behind the scenes. This would no longer be the case with this MR, as the entire algorithm now happens in the `operator()`. It is worth pointing out that it is already possible to resize buffers with the buffer class, which is used in a few algorithms.
The restriction that buffers cannot be resized still applies to kernel code; this is good practice in GPU programming anyway and is unlikely to change. Memory is expensive, after all.
The definition of the `operator()` accepts a few special datatypes, which now become optional:

- `Context&` -- it used to contain only the execution stream, but it is now extended to also contain the host and device memory managers (hence not `const`)
- `const RuntimeOptions&`
- `const Constants&`
Separately, all free-standing functions used previously in Allen (such as `size<T>(arguments)`, `set_size<T>(arguments, size)`, `data<T>(arguments)`, etc.) become unnecessary, as shown in the comparison below.
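An illustrative before/after comparison (the buffer key is taken from the prefix sum example above; the "before" lines use the free-standing accessors just mentioned):

```cpp
// Before: free-standing accessors that look up the datatype in an
// external arguments pack.
set_size<dev_output_buffer_t>(arguments, size);
unsigned* out_before = data<dev_output_buffer_t>(arguments);

// After: the buffer object carries its own size and data.
auto dev_output_buffer = make_device_buffer<unsigned, "dev_output_buffer_t">(context, size);
unsigned* out_after = dev_output_buffer.data();
```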
A few other design decisions would still have to be made for the following points, but I don't see any roadblocks:

- Datatype dependencies are not supported, but support could easily be added by optionally specifying a list of dependencies identified by the keys in datatypes (e.g. `host_buffer<unsigned, "host_total_sum_holder_t", depends_on("host_a_t", "host_b_t", "dev_a_t")>`).
- Dynamic shared memory (which used a `config` object to make it compatible with CPUs) needs to be explicitly passed in kernel invocations.
- Monitoring / tupling support should be added back into the lines. There is conceptually no impediment here; rather, a documented, repeatable pattern is needed.
Given the refactoring effort this would require, this MR only implements a demonstrator for the VELO decoding subsequences. Both `velo` and `velo_validation` are available for testing. An example with a line is included with `velo_line`.