contributing.md 13 KB
Newer Older
1
2
Allen: Adding a new device algorithm
====================================
Daniel Campora's avatar
Daniel Campora committed
3

4
This tutorial will guide you through adding a new device algorithm to the `Allen` framework.
Daniel Campora's avatar
Daniel Campora committed
5
6
7
8

SAXPY
-----

9
Writing functions to be executed in the device in `Allen` is literally the same as writing a CUDA kernel. Therefore, you may use any existing tutorial or documentation on how to write good CUDA code.
Daniel Campora's avatar
Daniel Campora committed
10

11
Writing a device algorithm in the `Allen` framework has been made to resemble the Gaudi syntax, where possible.
Daniel Campora's avatar
Daniel Campora committed
12

13
14
15
Let's assume that we want to run the following classic `SAXPY` CUDA kernel, taken out from this website https://devblogs.nvidia.com/easy-introduction-cuda-c-and-c/ :

```c++
Daniel Campora's avatar
Daniel Campora committed
16
17
18
19
20
21
__global__ void saxpy(float *x, float *y, int n, float a) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}
```

22
### Adding the device algorithm
Daniel Campora's avatar
Daniel Campora committed
23

24
We want to add the algorithm to a specific folder inside the `device` folder:
Daniel Campora's avatar
Daniel Campora committed
25

Daniel Campora's avatar
Daniel Campora committed
26
```console
27
28
device
├── associate
Daniel Campora's avatar
Daniel Campora committed
29
│   ├── CMakeLists.txt
30
31
32
33
34
35
36
37
38
39
40
41
42
43
│   ├── include
│   │   ├── AssociateConstants.cuh
│   │   └── VeloPVIP.cuh
│   └── src
│       └── VeloPVIP.cu
├── CMakeLists.txt
├── event_model
│   ├── associate
│   │   └── include
│   │       └── AssociateConsolidated.cuh
│   ├── common
│   │   └── include
│   │       ├── ConsolidatedTypes.cuh
│   │       └── States.cuh
Daniel Campora's avatar
Daniel Campora committed
44
45
46
...
```

47
Let's create a new folder inside the `device` directory named `example`. We need to modify `device/CMakeLists.txt` to reflect this:
Daniel Campora's avatar
Daniel Campora committed
48

Daniel Campora's avatar
Daniel Campora committed
49
```cmake
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
50
51
add_subdirectory(raw_banks)
add_subdirectory(example)
Daniel Campora's avatar
Daniel Campora committed
52
53
54
55
```

Inside the `test` folder we will create the following structure:

Daniel Campora's avatar
Daniel Campora committed
56
```console
Daniel Campora's avatar
Daniel Campora committed
57
58
├── test
│   ├── CMakeLists.txt
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
59
│   └── example
Daniel Campora's avatar
Daniel Campora committed
60
│       ├── include
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
61
│       │   └── Saxpy_example.cuh
Daniel Campora's avatar
Daniel Campora committed
62
│       └── src
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
63
│           └── Saxpy_example.cu
Daniel Campora's avatar
Daniel Campora committed
64
65
66
67
```

The newly created `test/CMakeLists.txt` file should reflect the project we are creating. We can do that by populating it like so:

Daniel Campora's avatar
Daniel Campora committed
68
```cmake
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
69
70
71
file(GLOB saxpy_sources "src/*cu")

include_directories(include)
72
73
74
include_directories(${CMAKE_SOURCE_DIR}/device/velo/common/include)
include_directories(${CMAKE_SOURCE_DIR}/device/event_model/common/include)
include_directories(${CMAKE_SOURCE_DIR}/device/event_model/velo/include)
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
75
include_directories(${CMAKE_SOURCE_DIR}/main/include)
76
include_directories(${CMAKE_SOURCE_DIR}/stream/gear/include)
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
77
include_directories(${CMAKE_SOURCE_DIR}/stream/sequence/include)
Daniel Campora's avatar
Daniel Campora committed
78

79
allen_add_device_library(Examples STATIC
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
80
  ${saxpy_sources}
Daniel Campora's avatar
Daniel Campora committed
81
82
)
```
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
83
84
85
86
The includes of Velo and event model files are only necessary because we will use the number of Velo tracks per event as an input to the saxpy algorithm.
If your new algorithm does not use any Velo related objects, this is not necessary.

The includes of main, gear and sequence are required for any new algorithm in Allen.
Daniel Campora's avatar
Daniel Campora committed
87

88
Link the new library "Examples" to the stream library in `stream/CMakeLists.txt`:
Daniel Campora's avatar
Daniel Campora committed
89
```cmake
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
90
91
92
93
94
95
96
97
98
99
100
101
target_link_libraries(Stream PRIVATE
  CudaCommon
  Associate
  Velo
  AllenPatPV
  PV_beamline
  HostClustering
  HostPrefixSum
  UT
  Kalman
  VertexFitter
  RawBanks
102
  Selections
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
103
104
105
106
  SciFi
  HostGEC
  Muon
  Utils
107
108
109
  Examples
  HostDataProvider
  HostInitEventList)
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
110
111
```

Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
112
Next, we create the header file for our algorithm `SAXPY_example.cuh`, which is similar to an algorithm definition in Gaudi: inputs, outputs and properties are defined, as well as the algorithm function itself and an operator calling the function.
113

Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
114
There are slight differences to Gaudi, since we want to be able to run the algorithm on a GPU.
115
The full file is under `device/example/include/SAXPY_example.cuh`. Let's take a look at the components:
Daniel Campora's avatar
Daniel Campora committed
116

117
```c++
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
118
#pragma once
Daniel Campora's avatar
Daniel Campora committed
119

Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
120
121
122
#include "VeloConsolidated.cuh"
#include "DeviceAlgorithm.cuh"
```
Daniel Campora's avatar
Daniel Campora committed
123

124
The Velo include is only required if Velo objects are used in the algorithm. `DeviceAlgorithm.cuh` defines class `DeviceAlgorithm` and some other handy resources.
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
125

126
#### Parameters and properties
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
127

128
129
130
131
132
133
134
135
136
137
```c++
namespace saxpy {
  DEFINE_PARAMETERS(
    Parameters,
    (HOST_INPUT(host_number_of_selected_events_t, uint), host_number_of_selected_events),
    (DEVICE_INPUT(dev_offsets_all_velo_tracks_t, uint), dev_atomics_velo),
    (DEVICE_INPUT(dev_offsets_velo_track_hit_number_t, uint), dev_velo_track_hit_number),
    (DEVICE_OUTPUT(dev_saxpy_output_t, float), dev_saxpy_output),
    (PROPERTY(saxpy_scale_factor_t, "saxpy_scale_factor", "scale factor a used in a*x + y", float), saxpy_scale_factor),
    (PROPERTY(block_dim_t, "block_dim", "block dimensions", DeviceDimensions), block_dim))
Daniel Campora's avatar
Daniel Campora committed
138
139
```

140
In the `saxpy` namespace the parameters and properties are specified. Parameters _scope_ can either be the host or the device, and they can either be inputs or outputs. Parameters should be defined with the following convention:
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
141

142
    (<scope>_<io>(<name>, <type>), <identifier>)
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
143

144
Some parameter examples:
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
145

146
147
148
149
150
151
152
153
154
* `(DEVICE_INPUT(dev_offsets_all_velo_tracks_t, uint), dev_atomics_velo)`: Defines an input on the _device memory_. It has a name `dev_offsets_all_velo_tracks_t`, which can be later used to identify this argument. It is of type _uint_, which means the memory location named `dev_offsets_all_velo_tracks_t` holds `uint`s. The _io_ and the _type_ define the underlying type of the instance to be `<io> <type> *` -- in this case, since it is an input type, `const uint*`. Its identifier is `dev_atomics_velo`.
* `(DEVICE_OUTPUT(dev_saxpy_output_t, float), dev_saxpy_output)`: Defines an output parameter on _device memory_, with name `dev_saxpy_output_t` and identifier `dev_saxpy_output`. Its underlying type is `float*`.
* `(HOST_INPUT(host_number_of_selected_events_t, uint), host_number_of_selected_events)`: Defines an input parameter on _host memory_, with name `host_number_of_selected_events_t` and identifier `host_number_of_selected_events`. Its underlying type is `const uint*`.

Properties of algorithms define constants can be configured prior to running the application. They are defined in two parts. First, they should be defined in the `DEFINE_PARAMETERS` macro following the convention:

    (PROPERTY(<name>, <key>, <description>, <type>), <identifier>)

* `(PROPERTY(saxpy_scale_factor_t, "saxpy_scale_factor", "scale factor a used in a*x + y", float), saxpy_scale_factor)`: Property with name `saxpy_scale_factor_t` is of type `float`. It will be accessible through key `"saxpy_scale_factor"` in a python configuration file, and it has description `"scale factor a used in a*x + y"`. Its identifier is `saxpy_scale_factor`. Properties _underlying type_ is always the same as their type, so in this case `float`.
Daniel Campora's avatar
Daniel Campora committed
155

156
157
158
159
160
161
162
163
164
165
And second, properties should be defined inside the algorithm struct as follows:

    Property<_name_> _internal_name_ {this, _default_value_}

In the case of saxpy:

```c++
  private:
    Property<saxpy_scale_factor_t> m_saxpy_factor {this, 2.f};
    Property<block_dim_t> m_block_dim {this, {{32, 1, 1}}};
Daniel Campora's avatar
Daniel Campora committed
166
167
```

168
169
170
171
172
#### Defining an algorithm

##### SAXPY_example.cuh

An algorithm is defined by a `struct` (or `class`) that inherits from either `HostAlgorithm` or `DeviceAlgorithm`. In addition, it is convenient to also inherit from `Parameters`, to be able to easily access _identifiers_ of parameters and properties. The struct identifier is the name of the algorithm.
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
173

174
175
176
177
An algorithm must define **two methods**: `set_arguments_size` and `operator()`. Their signatures are as follows:

```c++
  struct saxpy_t : public DeviceAlgorithm, Parameters {
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
178
    void set_arguments_size(
179
      ArgumentReferences<Parameters>,
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
180
181
      const RuntimeOptions&,
      const Constants&,
182
      const HostBuffers&) const;
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
183
184

    void operator()(
185
      const ArgumentReferences<Parameters>&,
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
186
      const RuntimeOptions&,
187
      const Constants&,
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
188
      HostBuffers&,
189
190
191
192
193
194
      cudaStream_t&,
      cudaEvent_t&) const;

  private:
    Property<saxpy_scale_factor_t> m_saxpy_factor {this, 2.f};
    Property<block_dim_t> m_block_dim {this, {{32, 1, 1}}};
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
195
196
  };
```
Daniel Campora's avatar
Daniel Campora committed
197

198
199
200
An algorithm `saxpy_t` has been declared. It is a `DeviceAlgorithm`, and for convenience it inherits from the previously defined `Parameters`. It defines two methods, `set_arguments_size` and `operator()` with the above predefined signatures. The algorithm declaration ends with the `private:` block for the properties mentioned before.

Since this is a DeviceAlgorithm, one would like the work to actually be done on the device. In order to run code on the device, a _global kernel_ has to be defined. The syntax used is standard CUDA:
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
201

202
```
Daniel Campora's avatar
Daniel Campora committed
203
__global__ void saxpy(Parameters, const uint number_of_events);
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
204
205
```

206
##### SAXPY_example.cu
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
207

208
The source file of SAXPY should define `set_arguments_size`, `operator()` and the previously mentioned _global kernel_ `saxpy`:
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
209

210
211
* `set_arguments_size`: Sets the `size` of output parameters.
* `operator()`: The actual algorithm runs (similar to Gaudi).
Daniel Campora's avatar
Daniel Campora committed
212

213
In Allen, it is not recommended to use _dynamic memory allocations_. Therefore, types such as `std::vector` are "forbidden", and instead sizes of output arguments must be set in the `set_arguments_size` method of algorithms.
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
214

215
```c++
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
216
217
#include "SAXPY_example.cuh"

218
219
220
221
222
223
void saxpy::saxpy_t::set_arguments_size(
  ArgumentReferences<Parameters> arguments,
  const RuntimeOptions&,
  const Constants&,
  const HostBuffers&) const
{
Daniel Campora's avatar
Daniel Campora committed
224
  set_size<dev_saxpy_output_t>(arguments, first<host_number_of_selected_events_t>(arguments));
225
226
227
}
```

Daniel Campora's avatar
Daniel Campora committed
228
To do that, one may use the following functions:
229

Daniel Campora's avatar
Daniel Campora committed
230
231
232
233
* `void set_size<T>(arguments, const size_t)`: Sets the size of _name_ `T`. The `sizeof(T)` is implicit, so eg. `set_size<T>(10)` will actually allocate space for `10 * sizeof(T)`.
* `size_t size<T>(arguments)`: Gets the size of _name_ `T`.
* `T* data<T>(arguments)`: Gets the pointer to the beginning of `T`.
* `T first<T>(arguments)`: Gets the first element of `T`.
234
235
236
237
238
239
240
241
242
243
244
245
246

Next, `operator()` should be defined:

```c++
void saxpy::saxpy_t::operator()(
  const ArgumentReferences<Parameters>& arguments,
  const RuntimeOptions&,
  const Constants&,
  HostBuffers&,
  cudaStream_t& cuda_stream,
  cudaEvent_t&) const
{
  global_function(saxpy)(
Daniel Campora's avatar
Daniel Campora committed
247
    dim3(1),
248
    property<block_dim_t>(),
Daniel Campora's avatar
Daniel Campora committed
249
    cuda_stream)(arguments, first<host_number_of_selected_events_t>(arguments));
250
251
252
253
254
}
```

In order to invoke host and global functions, wrapper methods `host_function` and `global_function` should be used. The syntax is as follows:

Daniel Campora's avatar
Daniel Campora committed
255
    host_function(<host_function_identifier>)(<parameters of function>)
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
    global_function(<global_function_identifier>)(<grid_size>, <block_size>, <stream>)(<parameters of function>)

`global_function` wraps a function identifier, such as `saxpy`. The object it returns can be used to invoke a _global kernel_ following a syntax that is similar to [CUDA's kernel invocation syntax](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#kernels). It expects:

* `grid_size`: Number of blocks of kernel invocation (passed as 3-dimensional object of type `dim3`).
* `block_size`: Number of threads in each block (passed as 3-dimensional object of type `dim3`).
* `stream`: Stream where to run.
* `parameters of function`: Parameters of the _global kernel_ being invoked.

In this case, the kernel `saxpy` accepts only one parameter of type `Parameters`. The global_function and host_function wrappers automatically detect and transform `const ArgumentReferences<Parameters>&` into `Parameters`. Therefore, we can safely pass `arguments` to our kernel invocation.

Finally, the kernel is defined:

```c++
/**
 * @brief SAXPY example algorithm
 * @detail Calculates for every event y = a*x + x, where x is the number of velo tracks in one event
 */
Daniel Campora's avatar
Daniel Campora committed
274
__global__ void saxpy::saxpy(saxpy::Parameters parameters, const uint number_of_events)
275
276
277
278
279
{
  Velo::Consolidated::ConstTracks velo_tracks {
    parameters.dev_atomics_velo, parameters.dev_velo_track_hit_number, event_number, number_of_events};
  const uint number_of_tracks_event = velo_tracks.number_of_tracks(event_number);

Daniel Campora's avatar
Daniel Campora committed
280
  for (uint event_number = threadIdx.x; event_number < number_of_events; event_number += blockDim.x) {
281
282
    parameters.dev_saxpy_output[event_number] =
      parameters.saxpy_scale_factor * number_of_tracks_event + number_of_tracks_event;
Daniel Campora's avatar
Daniel Campora committed
283
  }
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
284
285
}
```
Daniel Campora's avatar
Daniel Campora committed
286

287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
The kernel accepts a single parameter of type `saxpy::Parameters`. It is now possible to access all previously defined parameters by their _identifier_. Things to remember here:

* A parameter or property is accessed with its _identifier_.
* Parameters decays to _underlying type_ (eg. formed from its _scope_ and its _type_).
* Properties decay to _type_.
* If explicit access to the _underlying type_ of parameters is required, `get()` can be used.
* One should not access `host` parameters inside a function to be executed on the `device`, and viceversa.

In other words, in the code above:

* `parameters.dev_atomics_velo` decays to `const uint*`.
* `parameters.dev_velo_track_hit_number` decays to `const uint*`.
* `parameters.dev_saxpy_output` decays to `float*`.
* `parameters.saxpy_scale_factor` decays to `float`, and has default value `2.f`.

The last thing remaining is to add the algorithm to a sequence, and run it.

Daniel Campora's avatar
Daniel Campora committed
304
* [This readme](configuration/readme.md) explains how to configure the algorithms in an HLT1 sequence.