contributing.md 15.4 KB
Newer Older
1
2
Allen: Adding a new device algorithm
====================================
Daniel Campora's avatar
Daniel Campora committed
3

4
This tutorial will guide you through adding a new device algorithm to the `Allen` framework.
Daniel Campora's avatar
Daniel Campora committed
5
6
7
8

SAXPY
-----

9
Writing functions to be executed in the device in `Allen` is literally the same as writing a CUDA kernel. Therefore, you may use any existing tutorial or documentation on how to write good CUDA code.
Daniel Campora's avatar
Daniel Campora committed
10

11
Writing a device algorithm in the `Allen` framework has been made to resemble the Gaudi syntax, where possible.
Daniel Campora's avatar
Daniel Campora committed
12

13
14
15
Let's assume that we want to run the following classic `SAXPY` CUDA kernel, taken out from this website https://devblogs.nvidia.com/easy-introduction-cuda-c-and-c/ :

```c++
Daniel Campora's avatar
Daniel Campora committed
16
17
18
19
20
21
__global__ void saxpy(float *x, float *y, int n, float a) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}
```

22
### Adding the device algorithm
Daniel Campora's avatar
Daniel Campora committed
23

24
We want to add the algorithm to a specific folder inside the `device` folder:
Daniel Campora's avatar
Daniel Campora committed
25

Daniel Campora's avatar
Daniel Campora committed
26
```console
27
28
device
├── associate
Daniel Campora's avatar
Daniel Campora committed
29
│   ├── CMakeLists.txt
30
31
32
33
34
35
36
37
38
39
40
41
42
43
│   ├── include
│   │   ├── AssociateConstants.cuh
│   │   └── VeloPVIP.cuh
│   └── src
│       └── VeloPVIP.cu
├── CMakeLists.txt
├── event_model
│   ├── associate
│   │   └── include
│   │       └── AssociateConsolidated.cuh
│   ├── common
│   │   └── include
│   │       ├── ConsolidatedTypes.cuh
│   │       └── States.cuh
Daniel Campora's avatar
Daniel Campora committed
44
45
46
...
```

47
Let's create a new folder inside the `device` directory named `example`. We need to modify `device/CMakeLists.txt` to reflect this:
Daniel Campora's avatar
Daniel Campora committed
48

Daniel Campora's avatar
Daniel Campora committed
49
```cmake
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
50
51
add_subdirectory(raw_banks)
add_subdirectory(example)
Daniel Campora's avatar
Daniel Campora committed
52
53
```

54
Inside the `example` folder we will create the following structure:
Daniel Campora's avatar
Daniel Campora committed
55

Daniel Campora's avatar
Daniel Campora committed
56
```console
57
├── example
Daniel Campora's avatar
Daniel Campora committed
58
│   ├── CMakeLists.txt
59
60
61
62
│   ├── include
│   │   └── Saxpy_example.cuh
│   └── src
│       └── Saxpy_example.cu
Daniel Campora's avatar
Daniel Campora committed
63
64
```

65
The newly created `example/CMakeLists.txt` file should reflect the project we are creating. We can do that by populating it like so:
Daniel Campora's avatar
Daniel Campora committed
66

Daniel Campora's avatar
Daniel Campora committed
67
```cmake
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
68
69
70
file(GLOB saxpy_sources "src/*cu")

include_directories(include)
71
72
73
include_directories(${CMAKE_SOURCE_DIR}/device/velo/common/include)
include_directories(${CMAKE_SOURCE_DIR}/device/event_model/common/include)
include_directories(${CMAKE_SOURCE_DIR}/device/event_model/velo/include)
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
74
include_directories(${CMAKE_SOURCE_DIR}/main/include)
75
include_directories(${CMAKE_SOURCE_DIR}/stream/gear/include)
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
76
include_directories(${CMAKE_SOURCE_DIR}/stream/sequence/include)
Daniel Campora's avatar
Daniel Campora committed
77

78
allen_add_device_library(Examples STATIC
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
79
  ${saxpy_sources}
Daniel Campora's avatar
Daniel Campora committed
80
81
)
```
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
82
83
84
85
The includes of Velo and event model files are only necessary because we will use the number of Velo tracks per event as an input to the saxpy algorithm.
If your new algorithm does not use any Velo related objects, this is not necessary.

The includes of main, gear and sequence are required for any new algorithm in Allen.
Daniel Campora's avatar
Daniel Campora committed
86

87
Link the new library "Examples" to the stream library in `stream/CMakeLists.txt`:
Daniel Campora's avatar
Daniel Campora committed
88
```cmake
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
89
90
91
92
93
94
95
96
97
98
99
100
target_link_libraries(Stream PRIVATE
  CudaCommon
  Associate
  Velo
  AllenPatPV
  PV_beamline
  HostClustering
  HostPrefixSum
  UT
  Kalman
  VertexFitter
  RawBanks
101
  Selections
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
102
103
104
105
  SciFi
  HostGEC
  Muon
  Utils
106
107
108
  Examples
  HostDataProvider
  HostInitEventList)
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
109
110
```

Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
111
Next, we create the header file for our algorithm `SAXPY_example.cuh`, which is similar to an algorithm definition in Gaudi: inputs, outputs and properties are defined, as well as the algorithm function itself and an operator calling the function.
112

Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
113
There are slight differences to Gaudi, since we want to be able to run the algorithm on a GPU.
114
The full file is under `device/example/include/SAXPY_example.cuh`. Let's take a look at the components:
Daniel Campora's avatar
Daniel Campora committed
115

116
```c++
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
117
#pragma once
Daniel Campora's avatar
Daniel Campora committed
118

Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
119
#include "VeloConsolidated.cuh"
120
#include "AlgorithmTypes.cuh"
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
121
```
Daniel Campora's avatar
Daniel Campora committed
122

123
The Velo include is only required if Velo objects are used in the algorithm. `DeviceAlgorithm.cuh` defines class `DeviceAlgorithm` and some other handy resources.
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
124

125
#### Parameters and properties
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
126

127
128
```c++
namespace saxpy {
129
130
131
132
133
134
135
136
137
  struct Parameters {
    HOST_INPUT(host_number_of_events_t, unsigned) host_number_of_events;
    DEVICE_INPUT(dev_number_of_events_t, unsigned) dev_number_of_events;
    DEVICE_INPUT(dev_offsets_all_velo_tracks_t, unsigned) dev_atomics_velo;
    DEVICE_INPUT(dev_offsets_velo_track_hit_number_t, unsigned) dev_velo_track_hit_number;
    DEVICE_OUTPUT(dev_saxpy_output_t, float) dev_saxpy_output;
    PROPERTY(saxpy_scale_factor_t, "saxpy_scale_factor", "scale factor a used in a*x + y", float) saxpy_scale_factor;
    PROPERTY(block_dim_t, "block_dim", "block dimensions", DeviceDimensions) block_dim;
  };
Daniel Campora's avatar
Daniel Campora committed
138
139
```

140
In the `saxpy` namespace the parameters and properties are specified. Parameters _scope_ can either be the host or the device, and they can either be inputs or outputs. Parameters should be defined with the following convention:
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
141

142
    <scope>_<io>(<name>, <type>) <identifier>;
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
143

144
Some parameter examples:
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
145

146
147
148
149
* `DEVICE_INPUT(dev_offsets_all_velo_tracks_t, unsigned) dev_atomics_velo;`: Defines an input on the _device memory_. It has a name `dev_offsets_all_velo_tracks_t`, which can be later used to identify this argument. It is of type _unsigned_, which means the memory location named `dev_offsets_all_velo_tracks_t` holds `unsigned`s. The _io_ and the _type_ define the underlying type of the instance to be `<io> <type> *` -- in this case, since it is an input type, `const unsigned*`. Its identifier is `dev_atomics_velo`.
* `DEVICE_OUTPUT(dev_saxpy_output_t, float) dev_saxpy_output;`: Defines an output parameter on _device memory_, with name `dev_saxpy_output_t` and identifier `dev_saxpy_output`. Its underlying type is `float*`.
* `HOST_INPUT(host_number_of_events_t, unsigned) host_number_of_events;`: Defines an input parameter on _host memory_, with name `host_number_of_events_t` and identifier `host_number_of_events`. Its underlying type is `const unsigned*`.
* `DEVICE_INPUT(dev_number_of_events_t, unsigned) dev_number_of_events;`: Defines an input parameter on _device memory_, with name `dev_number_of_events_t` and identifier `dev_number_of_events`. Its underlying type is `const unsigned*`.
150
151
152

Properties of algorithms define constants can be configured prior to running the application. They are defined in two parts. First, they should be defined in the `DEFINE_PARAMETERS` macro following the convention:

153
    PROPERTY(<name>, <key>, <description>, <type>) <identifier>;
154

155
* `PROPERTY(saxpy_scale_factor_t, "saxpy_scale_factor", "scale factor a used in a*x + y", float) saxpy_scale_factor`: Property with name `saxpy_scale_factor_t` is of type `float`. It will be accessible through key `"saxpy_scale_factor"` in a python configuration file, and it has description `"scale factor a used in a*x + y"`. Its identifier is `saxpy_scale_factor`. Properties _underlying type_ is always the same as their type, so in this case `float`.
Daniel Campora's avatar
Daniel Campora committed
156

157
158
And second, properties should be defined inside the algorithm struct as follows:

159
    Property<_name_> _internal_name_ {this, _default_value_};
160
161
162
163
164
165
166

In the case of saxpy:

```c++
  private:
    Property<saxpy_scale_factor_t> m_saxpy_factor {this, 2.f};
    Property<block_dim_t> m_block_dim {this, {{32, 1, 1}}};
Daniel Campora's avatar
Daniel Campora committed
167
168
```

169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
#### Views

A view is a parameter that is linked to other parameters. It extends the lifetime of the parameters it is linked to, ensuring that the data it links to will not be freed.

A view can be defined like a parameter with additional types:

    <scope>_<io>(<name>, <type>, <linked_lifetime_type_1>, <linked_lifetime_type_2>, ...) <identifier>;

Here is a working example:

```c++
    DEVICE_OUTPUT(
      dev_velo_clusters_t,
      Velo::Clusters, dev_velo_cluster_container_t, dev_module_cluster_num_t, dev_number_of_events_t)
    dev_velo_clusters;
```

The type `dev_velo_clusters_t` is defined to be of type `Velo::Clusters`, with its lifetime linked to types `dev_velo_cluster_container_t, dev_module_cluster_num_t, dev_number_of_events_t`. That is, if `dev_velo_clusters_t` is used in a subsequent algorithm as an input, the parameters `dev_velo_cluster_container_t, dev_module_cluster_num_t, dev_number_of_events_t` are guaranteed to be in memory.

This type can be used just like any other type:

```c++
  auto velo_cluster_container = Velo::Clusters {parameters.dev_velo_cluster_container, estimated_number_of_clusters};
  parameters.dev_velo_clusters[event_number] = velo_cluster_container;
```

And subsequent algorithms can request it with no need to specify it as a view anymore:

```
  DEVICE_INPUT(dev_velo_clusters_t, Velo::Clusters) dev_velo_clusters;
```

The reason these two types are compatible is because the `Allen underlying type` of both the view and non-view parameter is `Velo::Clusters`.

203
204
205
206
207
#### Defining an algorithm

##### SAXPY_example.cuh

An algorithm is defined by a `struct` (or `class`) that inherits from either `HostAlgorithm` or `DeviceAlgorithm`. In addition, it is convenient to also inherit from `Parameters`, to be able to easily access _identifiers_ of parameters and properties. The struct identifier is the name of the algorithm.
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
208

209
210
211
212
An algorithm must define **two methods**: `set_arguments_size` and `operator()`. Their signatures are as follows:

```c++
  struct saxpy_t : public DeviceAlgorithm, Parameters {
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
213
    void set_arguments_size(
214
      ArgumentReferences<Parameters>,
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
215
216
      const RuntimeOptions&,
      const Constants&,
217
      const HostBuffers&) const;
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
218
219

    void operator()(
220
      const ArgumentReferences<Parameters>&,
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
221
      const RuntimeOptions&,
222
      const Constants&,
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
223
      HostBuffers&,
224
      const Allen::Context& context) const;
225
226
227
228

  private:
    Property<saxpy_scale_factor_t> m_saxpy_factor {this, 2.f};
    Property<block_dim_t> m_block_dim {this, {{32, 1, 1}}};
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
229
230
  };
```
Daniel Campora's avatar
Daniel Campora committed
231

232
233
234
An algorithm `saxpy_t` has been declared. It is a `DeviceAlgorithm`, and for convenience it inherits from the previously defined `Parameters`. It defines two methods, `set_arguments_size` and `operator()` with the above predefined signatures. The algorithm declaration ends with the `private:` block for the properties mentioned before.

Since this is a DeviceAlgorithm, one would like the work to actually be done on the device. In order to run code on the device, a _global kernel_ has to be defined. The syntax used is standard CUDA:
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
235

236
```
237
__global__ void saxpy(Parameters);
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
238
239
```

240
##### SAXPY_example.cu
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
241

242
The source file of SAXPY should define `set_arguments_size`, `operator()` and the previously mentioned _global kernel_ `saxpy`:
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
243

244
245
* `set_arguments_size`: Sets the `size` of output parameters.
* `operator()`: The actual algorithm runs (similar to Gaudi).
Daniel Campora's avatar
Daniel Campora committed
246

247
In Allen, it is not recommended to use _dynamic memory allocations_. Therefore, types such as `std::vector` are "forbidden", and instead sizes of output arguments must be set in the `set_arguments_size` method of algorithms.
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
248

249
```c++
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
250
251
#include "SAXPY_example.cuh"

252
253
254
255
256
257
void saxpy::saxpy_t::set_arguments_size(
  ArgumentReferences<Parameters> arguments,
  const RuntimeOptions&,
  const Constants&,
  const HostBuffers&) const
{
258
  set_size<dev_saxpy_output_t>(arguments, first<host_number_of_events_t>(arguments));
259
260
261
}
```

Daniel Campora's avatar
Daniel Campora committed
262
To do that, one may use the following functions:
263

Daniel Campora's avatar
Daniel Campora committed
264
265
266
267
* `void set_size<T>(arguments, const size_t)`: Sets the size of _name_ `T`. The `sizeof(T)` is implicit, so eg. `set_size<T>(10)` will actually allocate space for `10 * sizeof(T)`.
* `size_t size<T>(arguments)`: Gets the size of _name_ `T`.
* `T* data<T>(arguments)`: Gets the pointer to the beginning of `T`.
* `T first<T>(arguments)`: Gets the first element of `T`.
268
269
270
271
272
273
274
275
276

Next, `operator()` should be defined:

```c++
void saxpy::saxpy_t::operator()(
  const ArgumentReferences<Parameters>& arguments,
  const RuntimeOptions&,
  const Constants&,
  HostBuffers&,
277
  const Allen::Context& context) const
278
279
{
  global_function(saxpy)(
Daniel Campora's avatar
Daniel Campora committed
280
    dim3(1),
281
    property<block_dim_t>(), context)(arguments);
282
283
284
285
286
}
```

In order to invoke host and global functions, wrapper methods `host_function` and `global_function` should be used. The syntax is as follows:

Daniel Campora's avatar
Daniel Campora committed
287
    host_function(<host_function_identifier>)(<parameters of function>)
288
    global_function(<global_function_identifier>)(<grid_size>, <block_size>, context)(<parameters of function>)
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305

`global_function` wraps a function identifier, such as `saxpy`. The object it returns can be used to invoke a _global kernel_ following a syntax that is similar to [CUDA's kernel invocation syntax](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#kernels). It expects:

* `grid_size`: Number of blocks of kernel invocation (passed as 3-dimensional object of type `dim3`).
* `block_size`: Number of threads in each block (passed as 3-dimensional object of type `dim3`).
* `stream`: Stream where to run.
* `parameters of function`: Parameters of the _global kernel_ being invoked.

In this case, the kernel `saxpy` accepts only one parameter of type `Parameters`. The global_function and host_function wrappers automatically detect and transform `const ArgumentReferences<Parameters>&` into `Parameters`. Therefore, we can safely pass `arguments` to our kernel invocation.

Finally, the kernel is defined:

```c++
/**
 * @brief SAXPY example algorithm
 * @detail Calculates for every event y = a*x + x, where x is the number of velo tracks in one event
 */
306
__global__ void saxpy::saxpy(saxpy::Parameters parameters)
307
{
308
  const auto number_of_events = parameters.dev_number_of_events[0];
Daniel Campora's avatar
Daniel Campora committed
309
  for (unsigned event_number = threadIdx.x; event_number < number_of_events; event_number += blockDim.x) {
Daniel Campora's avatar
Daniel Campora committed
310
311
    Velo::Consolidated::ConstTracks velo_tracks {
      parameters.dev_atomics_velo, parameters.dev_velo_track_hit_number, event_number, number_of_events};
Daniel Campora's avatar
Daniel Campora committed
312
    const unsigned number_of_tracks_event = velo_tracks.number_of_tracks(event_number);
Daniel Campora's avatar
Daniel Campora committed
313
    
314
    parameters.dev_saxpy_output[event_number] =
Daniel Campora's avatar
Daniel Campora committed
315
        parameters.saxpy_scale_factor * number_of_tracks_event + number_of_tracks_event;
Daniel Campora's avatar
Daniel Campora committed
316
  }
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
317
318
}
```
Daniel Campora's avatar
Daniel Campora committed
319

320
321
322
323
324
325
326
327
328
329
The kernel accepts a single parameter of type `saxpy::Parameters`. It is now possible to access all previously defined parameters by their _identifier_. Things to remember here:

* A parameter or property is accessed with its _identifier_.
* Parameters decays to _underlying type_ (eg. formed from its _scope_ and its _type_).
* Properties decay to _type_.
* If explicit access to the _underlying type_ of parameters is required, `get()` can be used.
* One should not access `host` parameters inside a function to be executed on the `device`, and viceversa.

In other words, in the code above:

330
* `parameters.dev_number_of_events` decays to `const unsigned*`.
331
332
* `parameters.dev_atomics_velo` decays to `const unsigned*`.
* `parameters.dev_velo_track_hit_number` decays to `const unsigned*`.
333
334
335
* `parameters.dev_saxpy_output` decays to `float*`.
* `parameters.saxpy_scale_factor` decays to `float`, and has default value `2.f`.

336
337
338
339
340
341
342
343
How to access processing event in other algorithms
--------------------------------------------------

Typically, events are processed by independent blocks of execution. When that's the case, the invocation of the global function happens with as many blocks as events in the event list. Eg.

```c++
  global_function(kernel)(
    size<dev_event_list_t>(),
344
    property<block_dim_t>(), context)(arguments);
345
346
347
348
349
350
351
352
353
354
355
356
```

Then, in the kernel itself, in order to access the event under execution, the following idiom is used:

```c++
__global__ void kernel(namespace::Parameters parameters) {
  const unsigned event_number = parameters.dev_event_list[blockIdx.x];
```

Configuring the algorithm in a sequence
---------------------------------------

357
358
The last thing remaining is to add the algorithm to a sequence, and run it.

Daniel Campora's avatar
Daniel Campora committed
359
* [This readme](configuration/readme.md) explains how to configure the algorithms in an HLT1 sequence.