contributing.md 13.9 KB
Newer Older
1
2
Allen: Adding a new device algorithm
====================================
Daniel Campora's avatar
Daniel Campora committed
3

4
This tutorial will guide you through adding a new device algorithm to the `Allen` framework.
Daniel Campora's avatar
Daniel Campora committed
5
6
7
8

SAXPY
-----

9
Writing functions to be executed in the device in `Allen` is literally the same as writing a CUDA kernel. Therefore, you may use any existing tutorial or documentation on how to write good CUDA code.
Daniel Campora's avatar
Daniel Campora committed
10

11
Writing a device algorithm in the `Allen` framework has been made to resemble the Gaudi syntax, where possible.
Daniel Campora's avatar
Daniel Campora committed
12

13
14
15
Let's assume that we want to run the following classic `SAXPY` CUDA kernel, taken out from this website https://devblogs.nvidia.com/easy-introduction-cuda-c-and-c/ :

```c++
Daniel Campora's avatar
Daniel Campora committed
16
17
18
19
20
21
__global__ void saxpy(float *x, float *y, int n, float a) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}
```

22
### Adding the device algorithm
Daniel Campora's avatar
Daniel Campora committed
23

24
We want to add the algorithm to a specific folder inside the `device` folder:
Daniel Campora's avatar
Daniel Campora committed
25

Daniel Campora's avatar
Daniel Campora committed
26
```console
27
28
device
├── associate
Daniel Campora's avatar
Daniel Campora committed
29
│   ├── CMakeLists.txt
30
31
32
33
34
35
36
37
38
39
40
41
42
43
│   ├── include
│   │   ├── AssociateConstants.cuh
│   │   └── VeloPVIP.cuh
│   └── src
│       └── VeloPVIP.cu
├── CMakeLists.txt
├── event_model
│   ├── associate
│   │   └── include
│   │       └── AssociateConsolidated.cuh
│   ├── common
│   │   └── include
│   │       ├── ConsolidatedTypes.cuh
│   │       └── States.cuh
Daniel Campora's avatar
Daniel Campora committed
44
45
46
...
```

47
Let's create a new folder inside the `device` directory named `example`. We need to modify `device/CMakeLists.txt` to reflect this:
Daniel Campora's avatar
Daniel Campora committed
48

Daniel Campora's avatar
Daniel Campora committed
49
```cmake
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
50
51
add_subdirectory(raw_banks)
add_subdirectory(example)
Daniel Campora's avatar
Daniel Campora committed
52
53
```

54
Inside the `example` folder we will create the following structure:
Daniel Campora's avatar
Daniel Campora committed
55

Daniel Campora's avatar
Daniel Campora committed
56
```console
57
├── example
Daniel Campora's avatar
Daniel Campora committed
58
│   ├── CMakeLists.txt
59
60
61
62
│   ├── include
│   │   └── Saxpy_example.cuh
│   └── src
│       └── Saxpy_example.cu
Daniel Campora's avatar
Daniel Campora committed
63
64
```

65
The newly created `example/CMakeLists.txt` file should reflect the project we are creating. We can do that by populating it like so:
Daniel Campora's avatar
Daniel Campora committed
66

Daniel Campora's avatar
Daniel Campora committed
67
```cmake
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
68
69
70
file(GLOB saxpy_sources "src/*cu")

include_directories(include)
71
72
73
include_directories(${CMAKE_SOURCE_DIR}/device/velo/common/include)
include_directories(${CMAKE_SOURCE_DIR}/device/event_model/common/include)
include_directories(${CMAKE_SOURCE_DIR}/device/event_model/velo/include)
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
74
include_directories(${CMAKE_SOURCE_DIR}/main/include)
75
include_directories(${CMAKE_SOURCE_DIR}/stream/gear/include)
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
76
include_directories(${CMAKE_SOURCE_DIR}/stream/sequence/include)
Daniel Campora's avatar
Daniel Campora committed
77

78
allen_add_device_library(Examples STATIC
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
79
  ${saxpy_sources}
Daniel Campora's avatar
Daniel Campora committed
80
81
)
```
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
82
83
84
85
The includes of Velo and event model files are only necessary because we will use the number of Velo tracks per event as an input to the saxpy algorithm.
If your new algorithm does not use any Velo related objects, this is not necessary.

The includes of main, gear and sequence are required for any new algorithm in Allen.
Daniel Campora's avatar
Daniel Campora committed
86

87
Link the new library "Examples" to the stream library in `stream/CMakeLists.txt`:
Daniel Campora's avatar
Daniel Campora committed
88
```cmake
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
89
90
91
92
93
94
95
96
97
98
99
100
target_link_libraries(Stream PRIVATE
  CudaCommon
  Associate
  Velo
  AllenPatPV
  PV_beamline
  HostClustering
  HostPrefixSum
  UT
  Kalman
  VertexFitter
  RawBanks
101
  Selections
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
102
103
104
105
  SciFi
  HostGEC
  Muon
  Utils
106
107
108
  Examples
  HostDataProvider
  HostInitEventList)
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
109
110
```

Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
111
Next, we create the header file for our algorithm `SAXPY_example.cuh`, which is similar to an algorithm definition in Gaudi: inputs, outputs and properties are defined, as well as the algorithm function itself and an operator calling the function.
112

Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
113
There are slight differences to Gaudi, since we want to be able to run the algorithm on a GPU.
114
The full file is under `device/example/include/SAXPY_example.cuh`. Let's take a look at the components:
Daniel Campora's avatar
Daniel Campora committed
115

116
```c++
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
117
#pragma once
Daniel Campora's avatar
Daniel Campora committed
118

Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
119
120
121
#include "VeloConsolidated.cuh"
#include "DeviceAlgorithm.cuh"
```
Daniel Campora's avatar
Daniel Campora committed
122

123
The Velo include is only required if Velo objects are used in the algorithm. `DeviceAlgorithm.cuh` defines class `DeviceAlgorithm` and some other handy resources.
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
124

125
#### Parameters and properties
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
126

127
128
```c++
namespace saxpy {
129
130
131
132
133
134
135
136
137
  struct Parameters {
    HOST_INPUT(host_number_of_events_t, unsigned) host_number_of_events;
    DEVICE_INPUT(dev_number_of_events_t, unsigned) dev_number_of_events;
    DEVICE_INPUT(dev_offsets_all_velo_tracks_t, unsigned) dev_atomics_velo;
    DEVICE_INPUT(dev_offsets_velo_track_hit_number_t, unsigned) dev_velo_track_hit_number;
    DEVICE_OUTPUT(dev_saxpy_output_t, float) dev_saxpy_output;
    PROPERTY(saxpy_scale_factor_t, "saxpy_scale_factor", "scale factor a used in a*x + y", float) saxpy_scale_factor;
    PROPERTY(block_dim_t, "block_dim", "block dimensions", DeviceDimensions) block_dim;
  };
Daniel Campora's avatar
Daniel Campora committed
138
139
```

140
In the `saxpy` namespace the parameters and properties are specified. Parameters _scope_ can either be the host or the device, and they can either be inputs or outputs. Parameters should be defined with the following convention:
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
141

142
    <scope>_<io>(<name>, <type>) <identifier>;
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
143

144
Some parameter examples:
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
145

146
147
148
149
* `DEVICE_INPUT(dev_offsets_all_velo_tracks_t, unsigned) dev_atomics_velo;`: Defines an input on the _device memory_. It has a name `dev_offsets_all_velo_tracks_t`, which can be later used to identify this argument. It is of type _unsigned_, which means the memory location named `dev_offsets_all_velo_tracks_t` holds `unsigned`s. The _io_ and the _type_ define the underlying type of the instance to be `<io> <type> *` -- in this case, since it is an input type, `const unsigned*`. Its identifier is `dev_atomics_velo`.
* `DEVICE_OUTPUT(dev_saxpy_output_t, float) dev_saxpy_output;`: Defines an output parameter on _device memory_, with name `dev_saxpy_output_t` and identifier `dev_saxpy_output`. Its underlying type is `float*`.
* `HOST_INPUT(host_number_of_events_t, unsigned) host_number_of_events;`: Defines an input parameter on _host memory_, with name `host_number_of_events_t` and identifier `host_number_of_events`. Its underlying type is `const unsigned*`.
* `DEVICE_INPUT(dev_number_of_events_t, unsigned) dev_number_of_events;`: Defines an input parameter on _device memory_, with name `dev_number_of_events_t` and identifier `dev_number_of_events`. Its underlying type is `const unsigned*`.
150
151
152

Properties of algorithms define constants can be configured prior to running the application. They are defined in two parts. First, they should be defined in the `DEFINE_PARAMETERS` macro following the convention:

153
    PROPERTY(<name>, <key>, <description>, <type>) <identifier>;
154

155
* `PROPERTY(saxpy_scale_factor_t, "saxpy_scale_factor", "scale factor a used in a*x + y", float) saxpy_scale_factor`: Property with name `saxpy_scale_factor_t` is of type `float`. It will be accessible through key `"saxpy_scale_factor"` in a python configuration file, and it has description `"scale factor a used in a*x + y"`. Its identifier is `saxpy_scale_factor`. Properties _underlying type_ is always the same as their type, so in this case `float`.
Daniel Campora's avatar
Daniel Campora committed
156

157
158
And second, properties should be defined inside the algorithm struct as follows:

159
    Property<_name_> _internal_name_ {this, _default_value_};
160
161
162
163
164
165
166

In the case of saxpy:

```c++
  private:
    Property<saxpy_scale_factor_t> m_saxpy_factor {this, 2.f};
    Property<block_dim_t> m_block_dim {this, {{32, 1, 1}}};
Daniel Campora's avatar
Daniel Campora committed
167
168
```

169
170
171
172
173
#### Defining an algorithm

##### SAXPY_example.cuh

An algorithm is defined by a `struct` (or `class`) that inherits from either `HostAlgorithm` or `DeviceAlgorithm`. In addition, it is convenient to also inherit from `Parameters`, to be able to easily access _identifiers_ of parameters and properties. The struct identifier is the name of the algorithm.
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
174

175
176
177
178
An algorithm must define **two methods**: `set_arguments_size` and `operator()`. Their signatures are as follows:

```c++
  struct saxpy_t : public DeviceAlgorithm, Parameters {
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
179
    void set_arguments_size(
180
      ArgumentReferences<Parameters>,
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
181
182
      const RuntimeOptions&,
      const Constants&,
183
      const HostBuffers&) const;
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
184
185

    void operator()(
186
      const ArgumentReferences<Parameters>&,
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
187
      const RuntimeOptions&,
188
      const Constants&,
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
189
      HostBuffers&,
190
      const Allen::Context& context) const;
191
192
193
194

  private:
    Property<saxpy_scale_factor_t> m_saxpy_factor {this, 2.f};
    Property<block_dim_t> m_block_dim {this, {{32, 1, 1}}};
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
195
196
  };
```
Daniel Campora's avatar
Daniel Campora committed
197

198
199
200
An algorithm `saxpy_t` has been declared. It is a `DeviceAlgorithm`, and for convenience it inherits from the previously defined `Parameters`. It defines two methods, `set_arguments_size` and `operator()` with the above predefined signatures. The algorithm declaration ends with the `private:` block for the properties mentioned before.

Since this is a DeviceAlgorithm, one would like the work to actually be done on the device. In order to run code on the device, a _global kernel_ has to be defined. The syntax used is standard CUDA:
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
201

202
```
203
__global__ void saxpy(Parameters);
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
204
205
```

206
##### SAXPY_example.cu
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
207

208
The source file of SAXPY should define `set_arguments_size`, `operator()` and the previously mentioned _global kernel_ `saxpy`:
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
209

210
211
* `set_arguments_size`: Sets the `size` of output parameters.
* `operator()`: The actual algorithm runs (similar to Gaudi).
Daniel Campora's avatar
Daniel Campora committed
212

213
In Allen, it is not recommended to use _dynamic memory allocations_. Therefore, types such as `std::vector` are "forbidden", and instead sizes of output arguments must be set in the `set_arguments_size` method of algorithms.
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
214

215
```c++
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
216
217
#include "SAXPY_example.cuh"

218
219
220
221
222
223
void saxpy::saxpy_t::set_arguments_size(
  ArgumentReferences<Parameters> arguments,
  const RuntimeOptions&,
  const Constants&,
  const HostBuffers&) const
{
224
  set_size<dev_saxpy_output_t>(arguments, first<host_number_of_events_t>(arguments));
225
226
227
}
```

Daniel Campora's avatar
Daniel Campora committed
228
To do that, one may use the following functions:
229

Daniel Campora's avatar
Daniel Campora committed
230
231
232
233
* `void set_size<T>(arguments, const size_t)`: Sets the size of _name_ `T`. The `sizeof(T)` is implicit, so eg. `set_size<T>(10)` will actually allocate space for `10 * sizeof(T)`.
* `size_t size<T>(arguments)`: Gets the size of _name_ `T`.
* `T* data<T>(arguments)`: Gets the pointer to the beginning of `T`.
* `T first<T>(arguments)`: Gets the first element of `T`.
234
235
236
237
238
239
240
241
242

Next, `operator()` should be defined:

```c++
void saxpy::saxpy_t::operator()(
  const ArgumentReferences<Parameters>& arguments,
  const RuntimeOptions&,
  const Constants&,
  HostBuffers&,
243
  const Allen::Context& context) const
244
245
{
  global_function(saxpy)(
Daniel Campora's avatar
Daniel Campora committed
246
    dim3(1),
247
    property<block_dim_t>(), context)(arguments);
248
249
250
251
252
}
```

In order to invoke host and global functions, wrapper methods `host_function` and `global_function` should be used. The syntax is as follows:

Daniel Campora's avatar
Daniel Campora committed
253
    host_function(<host_function_identifier>)(<parameters of function>)
254
    global_function(<global_function_identifier>)(<grid_size>, <block_size>, context)(<parameters of function>)
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271

`global_function` wraps a function identifier, such as `saxpy`. The object it returns can be used to invoke a _global kernel_ following a syntax that is similar to [CUDA's kernel invocation syntax](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#kernels). It expects:

* `grid_size`: Number of blocks of kernel invocation (passed as 3-dimensional object of type `dim3`).
* `block_size`: Number of threads in each block (passed as 3-dimensional object of type `dim3`).
* `stream`: Stream where to run.
* `parameters of function`: Parameters of the _global kernel_ being invoked.

In this case, the kernel `saxpy` accepts only one parameter of type `Parameters`. The global_function and host_function wrappers automatically detect and transform `const ArgumentReferences<Parameters>&` into `Parameters`. Therefore, we can safely pass `arguments` to our kernel invocation.

Finally, the kernel is defined:

```c++
/**
 * @brief SAXPY example algorithm
 * @detail Calculates for every event y = a*x + x, where x is the number of velo tracks in one event
 */
272
__global__ void saxpy::saxpy(saxpy::Parameters parameters)
273
{
274
  const auto number_of_events = parameters.dev_number_of_events[0];
Daniel Campora's avatar
Daniel Campora committed
275
  for (unsigned event_number = threadIdx.x; event_number < number_of_events; event_number += blockDim.x) {
Daniel Campora's avatar
Daniel Campora committed
276
277
    Velo::Consolidated::ConstTracks velo_tracks {
      parameters.dev_atomics_velo, parameters.dev_velo_track_hit_number, event_number, number_of_events};
Daniel Campora's avatar
Daniel Campora committed
278
    const unsigned number_of_tracks_event = velo_tracks.number_of_tracks(event_number);
Daniel Campora's avatar
Daniel Campora committed
279
    
280
    parameters.dev_saxpy_output[event_number] =
Daniel Campora's avatar
Daniel Campora committed
281
        parameters.saxpy_scale_factor * number_of_tracks_event + number_of_tracks_event;
Daniel Campora's avatar
Daniel Campora committed
282
  }
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
283
284
}
```
Daniel Campora's avatar
Daniel Campora committed
285

286
287
288
289
290
291
292
293
294
295
The kernel accepts a single parameter of type `saxpy::Parameters`. It is now possible to access all previously defined parameters by their _identifier_. Things to remember here:

* A parameter or property is accessed with its _identifier_.
* Parameters decays to _underlying type_ (eg. formed from its _scope_ and its _type_).
* Properties decay to _type_.
* If explicit access to the _underlying type_ of parameters is required, `get()` can be used.
* One should not access `host` parameters inside a function to be executed on the `device`, and viceversa.

In other words, in the code above:

296
* `parameters.dev_number_of_events` decays to `const unsigned*`.
297
298
* `parameters.dev_atomics_velo` decays to `const unsigned*`.
* `parameters.dev_velo_track_hit_number` decays to `const unsigned*`.
299
300
301
* `parameters.dev_saxpy_output` decays to `float*`.
* `parameters.saxpy_scale_factor` decays to `float`, and has default value `2.f`.

302
303
304
305
306
307
308
309
How to access processing event in other algorithms
--------------------------------------------------

Typically, events are processed by independent blocks of execution. When that's the case, the invocation of the global function happens with as many blocks as events in the event list. Eg.

```c++
  global_function(kernel)(
    size<dev_event_list_t>(),
310
    property<block_dim_t>(), context)(arguments);
311
312
313
314
315
316
317
318
319
320
321
322
```

Then, in the kernel itself, in order to access the event under execution, the following idiom is used:

```c++
__global__ void kernel(namespace::Parameters parameters) {
  const unsigned event_number = parameters.dev_event_list[blockIdx.x];
```

Configuring the algorithm in a sequence
---------------------------------------

323
324
The last thing remaining is to add the algorithm to a sequence, and run it.

Daniel Campora's avatar
Daniel Campora committed
325
* [This readme](configuration/readme.md) explains how to configure the algorithms in an HLT1 sequence.