contributing.md 14 KB
Newer Older
1
2
Allen: Adding a new device algorithm
====================================
Daniel Campora's avatar
Daniel Campora committed
3

4
This tutorial will guide you through adding a new device algorithm to the `Allen` framework.
Daniel Campora's avatar
Daniel Campora committed
5
6
7
8

SAXPY
-----

9
Writing functions to be executed in the device in `Allen` is literally the same as writing a CUDA kernel. Therefore, you may use any existing tutorial or documentation on how to write good CUDA code.
Daniel Campora's avatar
Daniel Campora committed
10

11
Writing a device algorithm in the `Allen` framework has been made to resemble the Gaudi syntax, where possible.
Daniel Campora's avatar
Daniel Campora committed
12

13
14
15
Let's assume that we want to run the following classic `SAXPY` CUDA kernel, taken out from this website https://devblogs.nvidia.com/easy-introduction-cuda-c-and-c/ :

```c++
Daniel Campora's avatar
Daniel Campora committed
16
17
18
19
20
21
__global__ void saxpy(float *x, float *y, int n, float a) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}
```

22
### Adding the device algorithm
Daniel Campora's avatar
Daniel Campora committed
23

24
We want to add the algorithm to a specific folder inside the `device` folder:
Daniel Campora's avatar
Daniel Campora committed
25

Daniel Campora's avatar
Daniel Campora committed
26
```console
27
28
device
├── associate
Daniel Campora's avatar
Daniel Campora committed
29
│   ├── CMakeLists.txt
30
31
32
33
34
35
36
37
38
39
40
41
42
43
│   ├── include
│   │   ├── AssociateConstants.cuh
│   │   └── VeloPVIP.cuh
│   └── src
│       └── VeloPVIP.cu
├── CMakeLists.txt
├── event_model
│   ├── associate
│   │   └── include
│   │       └── AssociateConsolidated.cuh
│   ├── common
│   │   └── include
│   │       ├── ConsolidatedTypes.cuh
│   │       └── States.cuh
Daniel Campora's avatar
Daniel Campora committed
44
45
46
...
```

47
Let's create a new folder inside the `device` directory named `example`. We need to modify `device/CMakeLists.txt` to reflect this:
Daniel Campora's avatar
Daniel Campora committed
48

Daniel Campora's avatar
Daniel Campora committed
49
```cmake
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
50
51
add_subdirectory(raw_banks)
add_subdirectory(example)
Daniel Campora's avatar
Daniel Campora committed
52
53
```

54
Inside the `example` folder we will create the following structure:
Daniel Campora's avatar
Daniel Campora committed
55

Daniel Campora's avatar
Daniel Campora committed
56
```console
57
├── example
Daniel Campora's avatar
Daniel Campora committed
58
│   ├── CMakeLists.txt
59
60
61
62
│   ├── include
│   │   └── Saxpy_example.cuh
│   └── src
│       └── Saxpy_example.cu
Daniel Campora's avatar
Daniel Campora committed
63
64
```

65
The newly created `example/CMakeLists.txt` file should reflect the project we are creating. We can do that by populating it like so:
Daniel Campora's avatar
Daniel Campora committed
66

Daniel Campora's avatar
Daniel Campora committed
67
```cmake
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
68
69
70
file(GLOB saxpy_sources "src/*cu")

include_directories(include)
71
72
73
include_directories(${CMAKE_SOURCE_DIR}/device/velo/common/include)
include_directories(${CMAKE_SOURCE_DIR}/device/event_model/common/include)
include_directories(${CMAKE_SOURCE_DIR}/device/event_model/velo/include)
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
74
include_directories(${CMAKE_SOURCE_DIR}/main/include)
75
include_directories(${CMAKE_SOURCE_DIR}/stream/gear/include)
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
76
include_directories(${CMAKE_SOURCE_DIR}/stream/sequence/include)
Daniel Campora's avatar
Daniel Campora committed
77

78
allen_add_device_library(Examples STATIC
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
79
  ${saxpy_sources}
Daniel Campora's avatar
Daniel Campora committed
80
81
)
```
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
82
83
84
85
The includes of Velo and event model files are only necessary because we will use the number of Velo tracks per event as an input to the saxpy algorithm.
If your new algorithm does not use any Velo related objects, this is not necessary.

The includes of main, gear and sequence are required for any new algorithm in Allen.
Daniel Campora's avatar
Daniel Campora committed
86

87
Link the new library "Examples" to the stream library in `stream/CMakeLists.txt`:
Daniel Campora's avatar
Daniel Campora committed
88
```cmake
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
89
90
91
92
93
94
95
96
97
98
99
100
target_link_libraries(Stream PRIVATE
  CudaCommon
  Associate
  Velo
  AllenPatPV
  PV_beamline
  HostClustering
  HostPrefixSum
  UT
  Kalman
  VertexFitter
  RawBanks
101
  Selections
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
102
103
104
105
  SciFi
  HostGEC
  Muon
  Utils
106
107
108
  Examples
  HostDataProvider
  HostInitEventList)
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
109
110
```

Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
111
Next, we create the header file for our algorithm `SAXPY_example.cuh`, which is similar to an algorithm definition in Gaudi: inputs, outputs and properties are defined, as well as the algorithm function itself and an operator calling the function.
112

Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
113
There are slight differences to Gaudi, since we want to be able to run the algorithm on a GPU.
114
The full file is under `device/example/include/SAXPY_example.cuh`. Let's take a look at the components:
Daniel Campora's avatar
Daniel Campora committed
115

116
```c++
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
117
#pragma once
Daniel Campora's avatar
Daniel Campora committed
118

Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
119
120
121
#include "VeloConsolidated.cuh"
#include "DeviceAlgorithm.cuh"
```
Daniel Campora's avatar
Daniel Campora committed
122

123
The Velo include is only required if Velo objects are used in the algorithm. `DeviceAlgorithm.cuh` defines class `DeviceAlgorithm` and some other handy resources.
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
124

125
#### Parameters and properties
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
126

127
128
129
130
```c++
namespace saxpy {
  DEFINE_PARAMETERS(
    Parameters,
131
    (HOST_INPUT(host_number_of_events_t, unsigned), host_number_of_events),
132
    (DEVICE_INPUT(dev_number_of_events_t, unsigned), dev_number_of_events),
133
134
    (DEVICE_INPUT(dev_offsets_all_velo_tracks_t, unsigned), dev_atomics_velo),
    (DEVICE_INPUT(dev_offsets_velo_track_hit_number_t, unsigned), dev_velo_track_hit_number),
135
136
137
    (DEVICE_OUTPUT(dev_saxpy_output_t, float), dev_saxpy_output),
    (PROPERTY(saxpy_scale_factor_t, "saxpy_scale_factor", "scale factor a used in a*x + y", float), saxpy_scale_factor),
    (PROPERTY(block_dim_t, "block_dim", "block dimensions", DeviceDimensions), block_dim))
Daniel Campora's avatar
Daniel Campora committed
138
139
```

140
In the `saxpy` namespace the parameters and properties are specified. Parameters _scope_ can either be the host or the device, and they can either be inputs or outputs. Parameters should be defined with the following convention:
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
141

142
    (<scope>_<io>(<name>, <type>), <identifier>)
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
143

144
Some parameter examples:
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
145

Daniel Campora's avatar
Daniel Campora committed
146
* `(DEVICE_INPUT(dev_offsets_all_velo_tracks_t, unsigned), dev_atomics_velo)`: Defines an input on the _device memory_. It has a name `dev_offsets_all_velo_tracks_t`, which can be later used to identify this argument. It is of type _unsigned_, which means the memory location named `dev_offsets_all_velo_tracks_t` holds `unsigned`s. The _io_ and the _type_ define the underlying type of the instance to be `<io> <type> *` -- in this case, since it is an input type, `const unsigned*`. Its identifier is `dev_atomics_velo`.
147
* `(DEVICE_OUTPUT(dev_saxpy_output_t, float), dev_saxpy_output)`: Defines an output parameter on _device memory_, with name `dev_saxpy_output_t` and identifier `dev_saxpy_output`. Its underlying type is `float*`.
148
* `(HOST_INPUT(host_number_of_events_t, unsigned), host_number_of_events)`: Defines an input parameter on _host memory_, with name `host_number_of_events_t` and identifier `host_number_of_events`. Its underlying type is `const unsigned*`.
149
* `(DEVICE_INPUT(dev_number_of_events_t, unsigned), dev_number_of_events)`: Defines an input parameter on _device memory_, with name `dev_number_of_events_t` and identifier `dev_number_of_events`. Its underlying type is `const unsigned*`.
150
151
152
153
154
155

Properties of algorithms define constants can be configured prior to running the application. They are defined in two parts. First, they should be defined in the `DEFINE_PARAMETERS` macro following the convention:

    (PROPERTY(<name>, <key>, <description>, <type>), <identifier>)

* `(PROPERTY(saxpy_scale_factor_t, "saxpy_scale_factor", "scale factor a used in a*x + y", float), saxpy_scale_factor)`: Property with name `saxpy_scale_factor_t` is of type `float`. It will be accessible through key `"saxpy_scale_factor"` in a python configuration file, and it has description `"scale factor a used in a*x + y"`. Its identifier is `saxpy_scale_factor`. Properties _underlying type_ is always the same as their type, so in this case `float`.
Daniel Campora's avatar
Daniel Campora committed
156

157
158
159
160
161
162
163
164
165
166
And second, properties should be defined inside the algorithm struct as follows:

    Property<_name_> _internal_name_ {this, _default_value_}

In the case of saxpy:

```c++
  private:
    Property<saxpy_scale_factor_t> m_saxpy_factor {this, 2.f};
    Property<block_dim_t> m_block_dim {this, {{32, 1, 1}}};
Daniel Campora's avatar
Daniel Campora committed
167
168
```

169
170
171
172
173
#### Defining an algorithm

##### SAXPY_example.cuh

An algorithm is defined by a `struct` (or `class`) that inherits from either `HostAlgorithm` or `DeviceAlgorithm`. In addition, it is convenient to also inherit from `Parameters`, to be able to easily access _identifiers_ of parameters and properties. The struct identifier is the name of the algorithm.
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
174

175
176
177
178
An algorithm must define **two methods**: `set_arguments_size` and `operator()`. Their signatures are as follows:

```c++
  struct saxpy_t : public DeviceAlgorithm, Parameters {
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
179
    void set_arguments_size(
180
      ArgumentReferences<Parameters>,
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
181
182
      const RuntimeOptions&,
      const Constants&,
183
      const HostBuffers&) const;
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
184
185

    void operator()(
186
      const ArgumentReferences<Parameters>&,
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
187
      const RuntimeOptions&,
188
      const Constants&,
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
189
      HostBuffers&,
190
191
192
193
194
195
      cudaStream_t&,
      cudaEvent_t&) const;

  private:
    Property<saxpy_scale_factor_t> m_saxpy_factor {this, 2.f};
    Property<block_dim_t> m_block_dim {this, {{32, 1, 1}}};
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
196
197
  };
```
Daniel Campora's avatar
Daniel Campora committed
198

199
200
201
An algorithm `saxpy_t` has been declared. It is a `DeviceAlgorithm`, and for convenience it inherits from the previously defined `Parameters`. It defines two methods, `set_arguments_size` and `operator()` with the above predefined signatures. The algorithm declaration ends with the `private:` block for the properties mentioned before.

Since this is a DeviceAlgorithm, one would like the work to actually be done on the device. In order to run code on the device, a _global kernel_ has to be defined. The syntax used is standard CUDA:
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
202

203
```
204
__global__ void saxpy(Parameters);
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
205
206
```

207
##### SAXPY_example.cu
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
208

209
The source file of SAXPY should define `set_arguments_size`, `operator()` and the previously mentioned _global kernel_ `saxpy`:
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
210

211
212
* `set_arguments_size`: Sets the `size` of output parameters.
* `operator()`: The actual algorithm runs (similar to Gaudi).
Daniel Campora's avatar
Daniel Campora committed
213

214
In Allen, it is not recommended to use _dynamic memory allocations_. Therefore, types such as `std::vector` are "forbidden", and instead sizes of output arguments must be set in the `set_arguments_size` method of algorithms.
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
215

216
```c++
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
217
218
#include "SAXPY_example.cuh"

219
220
221
222
223
224
void saxpy::saxpy_t::set_arguments_size(
  ArgumentReferences<Parameters> arguments,
  const RuntimeOptions&,
  const Constants&,
  const HostBuffers&) const
{
225
  set_size<dev_saxpy_output_t>(arguments, first<host_number_of_events_t>(arguments));
226
227
228
}
```

Daniel Campora's avatar
Daniel Campora committed
229
To do that, one may use the following functions:
230

Daniel Campora's avatar
Daniel Campora committed
231
232
233
234
* `void set_size<T>(arguments, const size_t)`: Sets the size of _name_ `T`. The `sizeof(T)` is implicit, so eg. `set_size<T>(10)` will actually allocate space for `10 * sizeof(T)`.
* `size_t size<T>(arguments)`: Gets the size of _name_ `T`.
* `T* data<T>(arguments)`: Gets the pointer to the beginning of `T`.
* `T first<T>(arguments)`: Gets the first element of `T`.
235
236
237
238
239
240
241
242
243

Next, `operator()` should be defined:

```c++
void saxpy::saxpy_t::operator()(
  const ArgumentReferences<Parameters>& arguments,
  const RuntimeOptions&,
  const Constants&,
  HostBuffers&,
244
  cudaStream_t& stream,
245
246
247
  cudaEvent_t&) const
{
  global_function(saxpy)(
Daniel Campora's avatar
Daniel Campora committed
248
    dim3(1),
249
    property<block_dim_t>(),
250
    stream)(arguments);
251
252
253
254
255
}
```

In order to invoke host and global functions, wrapper methods `host_function` and `global_function` should be used. The syntax is as follows:

Daniel Campora's avatar
Daniel Campora committed
256
    host_function(<host_function_identifier>)(<parameters of function>)
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
    global_function(<global_function_identifier>)(<grid_size>, <block_size>, <stream>)(<parameters of function>)

`global_function` wraps a function identifier, such as `saxpy`. The object it returns can be used to invoke a _global kernel_ following a syntax that is similar to [CUDA's kernel invocation syntax](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#kernels). It expects:

* `grid_size`: Number of blocks of kernel invocation (passed as 3-dimensional object of type `dim3`).
* `block_size`: Number of threads in each block (passed as 3-dimensional object of type `dim3`).
* `stream`: Stream where to run.
* `parameters of function`: Parameters of the _global kernel_ being invoked.

In this case, the kernel `saxpy` accepts only one parameter of type `Parameters`. The global_function and host_function wrappers automatically detect and transform `const ArgumentReferences<Parameters>&` into `Parameters`. Therefore, we can safely pass `arguments` to our kernel invocation.

Finally, the kernel is defined:

```c++
/**
 * @brief SAXPY example algorithm
 * @detail Calculates for every event y = a*x + x, where x is the number of velo tracks in one event
 */
275
__global__ void saxpy::saxpy(saxpy::Parameters parameters)
276
{
277
  const auto number_of_events = parameters.dev_number_of_events[0];
Daniel Campora's avatar
Daniel Campora committed
278
  for (unsigned event_number = threadIdx.x; event_number < number_of_events; event_number += blockDim.x) {
Daniel Campora's avatar
Daniel Campora committed
279
280
    Velo::Consolidated::ConstTracks velo_tracks {
      parameters.dev_atomics_velo, parameters.dev_velo_track_hit_number, event_number, number_of_events};
Daniel Campora's avatar
Daniel Campora committed
281
    const unsigned number_of_tracks_event = velo_tracks.number_of_tracks(event_number);
Daniel Campora's avatar
Daniel Campora committed
282
    
283
    parameters.dev_saxpy_output[event_number] =
Daniel Campora's avatar
Daniel Campora committed
284
        parameters.saxpy_scale_factor * number_of_tracks_event + number_of_tracks_event;
Daniel Campora's avatar
Daniel Campora committed
285
  }
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
286
287
}
```
Daniel Campora's avatar
Daniel Campora committed
288

289
290
291
292
293
294
295
296
297
298
The kernel accepts a single parameter of type `saxpy::Parameters`. It is now possible to access all previously defined parameters by their _identifier_. Things to remember here:

* A parameter or property is accessed with its _identifier_.
* Parameters decays to _underlying type_ (eg. formed from its _scope_ and its _type_).
* Properties decay to _type_.
* If explicit access to the _underlying type_ of parameters is required, `get()` can be used.
* One should not access `host` parameters inside a function to be executed on the `device`, and viceversa.

In other words, in the code above:

299
* `parameters.dev_number_of_events` decays to `const unsigned*`.
300
301
* `parameters.dev_atomics_velo` decays to `const unsigned*`.
* `parameters.dev_velo_track_hit_number` decays to `const unsigned*`.
302
303
304
* `parameters.dev_saxpy_output` decays to `float*`.
* `parameters.saxpy_scale_factor` decays to `float`, and has default value `2.f`.

305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
How to access processing event in other algorithms
--------------------------------------------------

Typically, events are processed by independent blocks of execution. When that's the case, the invocation of the global function happens with as many blocks as events in the event list. Eg.

```c++
  global_function(kernel)(
    size<dev_event_list_t>(),
    property<block_dim_t>(),
    stream)(arguments);
```

Then, in the kernel itself, in order to access the event under execution, the following idiom is used:

```c++
__global__ void kernel(namespace::Parameters parameters) {
  const unsigned event_number = parameters.dev_event_list[blockIdx.x];
```

Configuring the algorithm in a sequence
---------------------------------------

327
328
The last thing remaining is to add the algorithm to a sequence, and run it.

Daniel Campora's avatar
Daniel Campora committed
329
* [This readme](configuration/readme.md) explains how to configure the algorithms in an HLT1 sequence.