contributing.md 12.5 KB
Newer Older
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
1
Allen: Adding a new CUDA algorithm
Daniel Campora's avatar
Daniel Campora committed
2
3
=====================================

Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
4
This tutorial will guide you through adding a new CUDA algorithm to the `Allen` project.
Daniel Campora's avatar
Daniel Campora committed
5
6
7
8

SAXPY
-----

Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
9
Writing an algorithm in CUDA in the `Allen` project is no different than writing it on any other GPU project. The differences are in how to invoke that program, and how to setup the options, arguments, and so on.
Daniel Campora's avatar
Daniel Campora committed
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69

So let's assume that we have the following simple `SAXPY` algorithm, taken out from this website https://devblogs.nvidia.com/easy-introduction-cuda-c-and-c/

```clike=
__global__ void saxpy(float *x, float *y, int n, float a) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}
```

### Adding the CUDA algorithm

We want to add the algorithm to a specific folder inside the `cuda` folder:

```
├── cuda
│   ├── CMakeLists.txt
│   └── velo
│       ├── CMakeLists.txt
│       ├── calculate_phi_and_sort
│       │   ├── include
│       │   │   └── CalculatePhiAndSort.cuh
│       │   └── src
│       │       ├── CalculatePhiAndSort.cu
│       │       ├── CalculatePhi.cu
│       │       └── SortByPhi.cu
│       ├── common
│       │   ├── include
│       │   │   ├── ClusteringDefinitions.cuh
│       │   │   └── VeloDefinitions.cuh
│       │   └── src
│       │       ├── ClusteringDefinitions.cu
│       │       └── Definitions.cu
...
```

Let's create a new folder inside the `cuda` directory named `test`. We need to modify `cuda/CMakeLists.txt` to reflect this:

```cmake=
add_subdirectory(velo)
add_subdirectory(test)
```

Inside the `test` folder we will create the following structure:

```
├── test
│   ├── CMakeLists.txt
│   └── saxpy
│       ├── include
│       │   └── Saxpy.cuh
│       └── src
│           └── Saxpy.cu
```

The newly created `test/CMakeLists.txt` file should reflect the project we are creating. We can do that by populating it like so:

```cmake=
file(GLOB test_saxpy "saxpy/src/*cu")
include_directories(saxpy/include)
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
70
include_directories(${CMAKE_SOURCE_DIR}/stream/handlers/include)
Daniel Campora's avatar
Daniel Campora committed
71

Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
72
add_library(Test STATIC
Daniel Campora's avatar
Daniel Campora committed
73
74
75
76
  ${test_saxpy}
)
```

Daniel Campora's avatar
Daniel Campora committed
77
Our CUDA algorithm `Saxpy.cuh` and `Saxpy.cu` will be as follows. Note we need to specify the required arguments in the `ALGORITHM`, let's give the arguments names that won't collide, like `dev_x` and `dev_y`:
Daniel Campora's avatar
Daniel Campora committed
78
79

```clike=
Daniel Campora's avatar
Daniel Campora committed
80
#include "Handler.cuh"
Daniel Campora's avatar
Daniel Campora committed
81
#include "ArgumentsCommon.cuh"
Daniel Campora's avatar
Daniel Campora committed
82

Daniel Campora's avatar
Daniel Campora committed
83
__global__ void saxpy(float *x, float *y, int n, float a);
Daniel Campora's avatar
Daniel Campora committed
84

Daniel Campora's avatar
Daniel Campora committed
85
86
87
88
89
ALGORITHM(saxpy, saxpy_t,
  ARGUMENTS(
    dev_x,
    dev_y
))
Daniel Campora's avatar
Daniel Campora committed
90
91
92
93
94
95
96
97
98
99
100
```

```clike=
#include "Saxpy.cuh"

__global__ void saxpy(float *x, float *y, int n, float a) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}
```

Daniel Campora's avatar
Daniel Campora committed
101
The line with `ALGORITHM` encapsulates our algorithm `saxpy` into a class with name `saxpy_t`. We will use this class from now on to be able to refer to our algorithm.
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
102
Therefore, when developing algorithms for the HLT1 chain, please add the sub-detector that your algorithm belongs to in the name so that it can be easily identified within a sequence. For example: `velo_masked_clustering_t` or `ut_pre_decode_t`.
Daniel Campora's avatar
Daniel Campora committed
103

Daniel Campora's avatar
Daniel Campora committed
104
105
106
107
108
109
110
111
112
113
Lastly, edit `stream/CMakeLists.txt` and modify `target_link_libraries`:

```cmake
target_link_libraries(Stream Velo Test)
```

Ready to move on.

### Integrating the algorithm in the sequence

Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
114
`Allen` centers around the idea of running a __sequence of algorithms__ on input events. This sequence is predefined and will always be executed in the same order.
Daniel Campora's avatar
Daniel Campora committed
115
116
117

Some events from the input will be discarded throughout the execution, and only a fraction of them will be kept for further processing. That is conceptually the idea behind the _High Level Trigger 1_ stage of LHCb, and is what is intended to achieve with this project.

118
Therefore, we need to add our algorithm to the sequence of algorithms. First, make the folder visible to CMake by editing the file `stream/CMakeLists.txt` and adding:
Daniel Campora's avatar
Daniel Campora committed
119

120
```clike
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
121
include_directories(${CMAKE_SOURCE_DIR}/cuda/test/saxpy/include)
122
123
```

Daniel Campora's avatar
Daniel Campora committed
124
Then, add the following include to `stream/setup/include/ConfiguredSequence.cuh`:
Daniel Campora's avatar
Daniel Campora committed
125
126

```clike
127
128
#include "Saxpy.cuh"
```
Daniel Campora's avatar
Daniel Campora committed
129

Daniel Campora's avatar
Daniel Campora committed
130
Now, we are ready to add our algorithm to a sequence. All available sequences live in the folder `configuration/sequences/`. The sequence to execute can be chosen at compile time, by appending the name of the desired sequence to the cmake call: `cmake -DSEQUENCE=DefaultSequence ..`. For now, let's just edit the `DefaultSequence`. Add the algorithm to `configuration/sequences/DefaultSequence.h` as follows:
131
132

```clike
Daniel Campora's avatar
Daniel Campora committed
133
/**
134
 * Specify here the algorithms to be executed in the sequence,
Daniel Campora's avatar
Daniel Campora committed
135
 * in the expected order of execution.
Daniel Campora's avatar
Daniel Campora committed
136
 */
137
SEQUENCE_T(
Daniel Campora's avatar
Daniel Campora committed
138
139
140
141
142
143
144
145
  ...
  prefix_sum_reduce_velo_track_hit_number_t,
  prefix_sum_single_block_velo_track_hit_number_t,
  prefix_sum_scan_velo_track_hit_number_t,
  consolidate_tracks_t,
  saxpy_t,
  ...
)
Daniel Campora's avatar
Daniel Campora committed
146
147
```

Daniel Campora's avatar
Daniel Campora committed
148
149
Keep in mind the order matters, and will define when your algorithm is scheduled. In this case, we have chosen to add it after the algorithm identified by `consolidate_tracks_t`.

Daniel Campora's avatar
Daniel Campora committed
150
151
152
153
Next, we need to define the arguments to be passed to our function. We need to define them in order for the dynamic scheduling machinery to properly work - that is, allocate what is needed only when it's needed, and manage the memory for us.

We will distinguish arguments just passed by value from pointers to device memory. We don't need to schedule those simply passed by value like `n` and `a`. We care however about `x` and `y`, since they require some reserving and freeing in memory.

Daniel Campora's avatar
Daniel Campora committed
154
In the algorithm definition we used the arguments `dev_x` and `dev_y`. We need to define the arguments, to make them available to our algorithm. Let's add these types to the common arguments, in `stream/setup/include/ArgumentsCommon.cuh`:
Daniel Campora's avatar
Daniel Campora committed
155
156

```clike
Daniel Campora's avatar
Daniel Campora committed
157
...
158
159
ARGUMENT(dev_x, float)
ARGUMENT(dev_y, float)
Daniel Campora's avatar
Daniel Campora committed
160
161
```

162
Optionally, some types are required to live throughout the whole sequence since its creation. An argument can be specified to be persistent in memory by adding it to the `output_arguments_t` tuple, in `AlgorithmDependencies.cuh`:
Daniel Campora's avatar
Daniel Campora committed
163
164

```clike
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
/**
 * @brief Output arguments, ie. that cannot be freed.
 * @details The arguments specified in this type will
 *          be kept allocated since their first appearance
 *          until the end of the sequence.
 */
typedef std::tuple<
  dev_atomics_storage,
  dev_velo_track_hit_number,
  dev_velo_track_hits,
  dev_atomics_veloUT,
  dev_veloUT_tracks,
  dev_scifi_tracks,
  dev_n_scifi_tracks
> output_arguments_t;
Daniel Campora's avatar
Daniel Campora committed
180
181
182
183
184
185
```

### Preparing and invoking the algorithms in the sequence

Now all the pieces are in place, we are ready to prepare the algorithm and do the actual invocation.

Daniel Campora's avatar
Daniel Campora committed
186
First go to `stream/sequence/include/HostBuffers.cuh` and add the saxpy host memory pointer:
Daniel Campora's avatar
Daniel Campora committed
187
188
189
190
191

```clike
  ...
    
  // Pinned host datatypes
Daniel Campora's avatar
Daniel Campora committed
192
  uint* host_velo_tracks_atomics;
Daniel Campora's avatar
Daniel Campora committed
193
  uint* host_velo_track_hit_number;
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
194
  uint* host_velo_track_hits;
Daniel Campora's avatar
Daniel Campora committed
195
196
197
  uint* host_total_number_of_velo_clusters;
  uint* host_number_of_reconstructed_velo_tracks;
  uint* host_accumulated_number_of_hits_in_velo_tracks;
Daniel Campora's avatar
Daniel Campora committed
198
  uint* host_accumulated_number_of_ut_hits;
Daniel Campora's avatar
Daniel Campora committed
199
200
201
202
203
204
205
206

  // Saxpy
  int saxpy_N = 1<<20;
  float *host_x, *host_y;

  ...
```

Daniel Campora's avatar
Daniel Campora committed
207
Reserve that host memory in `stream/sequence/src/HostBuffers.cu`:
Daniel Campora's avatar
Daniel Campora committed
208
209
210
211

```clike
  ...
    
Daniel Campora's avatar
Daniel Campora committed
212
  cudaCheck(cudaMallocHost((void**)&host_velo_tracks_atomics, (2 * max_number_of_events + 1) * sizeof(int)));
Daniel Campora's avatar
Daniel Campora committed
213
  cudaCheck(cudaMallocHost((void**)&host_velo_track_hit_number, max_number_of_events * VeloTracking::max_tracks * sizeof(uint)));
Daniel Campora's avatar
Daniel Campora committed
214
  cudaCheck(cudaMallocHost((void**)&host_velo_track_hits, max_number_of_events * VeloTracking::max_tracks * VeloTracking::max_track_size * sizeof(Velo::Hit)));
Daniel Campora's avatar
Daniel Campora committed
215
216
217
  cudaCheck(cudaMallocHost((void**)&host_total_number_of_velo_clusters, sizeof(uint)));
  cudaCheck(cudaMallocHost((void**)&host_number_of_reconstructed_velo_tracks, sizeof(uint)));
  cudaCheck(cudaMallocHost((void**)&host_accumulated_number_of_hits_in_velo_tracks, sizeof(uint)));
Daniel Campora's avatar
Daniel Campora committed
218
219
220
221
  cudaCheck(cudaMallocHost((void**)&host_veloUT_tracks, max_number_of_events * VeloUTTracking::max_num_tracks * sizeof(VeloUTTracking::TrackUT)));
  cudaCheck(cudaMallocHost((void**)&host_atomics_veloUT, VeloUTTracking::num_atomics * max_number_of_events * sizeof(int)));
  cudaCheck(cudaMallocHost((void**)&host_accumulated_number_of_ut_hits, sizeof(uint)));
  cudaCheck(cudaMallocHost((void**)&host_accumulated_number_of_scifi_hits, sizeof(uint)));
Daniel Campora's avatar
Daniel Campora committed
222
223
224
225
226
227
228
229
  
  // Saxpy memory allocations
  cudaCheck(cudaMallocHost((void**)&host_x, saxpy_N * sizeof(float)));
  cudaCheck(cudaMallocHost((void**)&host_y, saxpy_N * sizeof(float)));

  ...
```

Daniel Campora's avatar
Daniel Campora committed
230
Finally, create a visitor for your newly created algorithm. Create a containing folder structure for it in `stream/visitors/test/src/`, and a new file inside named `SaxpyVisitor.cu`. Insert the following code inside:
Daniel Campora's avatar
Daniel Campora committed
231
232

```clike
Daniel Campora's avatar
Daniel Campora committed
233
234
#include "StreamVisitor.cuh"
#include "Saxpy.cuh"
Daniel Campora's avatar
Daniel Campora committed
235

Daniel Campora's avatar
Daniel Campora committed
236
template<>
237
238
239
240
241
242
243
244
245
246
247
248
249
250
void SequenceVisitor::set_arguments_size<saxpy_t>(
  const RuntimeOptions& runtime_options,
  const Constants& constants,
  const HostBuffers& host_buffers,
  argument_manager_t& arguments)
{
  // Set arguments size
  int saxpy_N = 1<<20;
  arguments.set_size<dev_x>(saxpy_N);
  arguments.set_size<dev_y>(saxpy_N);
}

template<>
void SequenceVisitor::visit<saxpy_t>(
Daniel Campora's avatar
Daniel Campora committed
251
252
253
  saxpy_t& state,
  const RuntimeOptions& runtime_options,
  const Constants& constants,
254
  argument_manager_t& arguments,
Daniel Campora's avatar
Daniel Campora committed
255
256
257
258
  HostBuffers& host_buffers,
  cudaStream_t& cuda_stream,
  cudaEvent_t& cuda_generic_event)
{
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
  // Saxpy test
  int saxpy_N = 1<<20;
  for (int i = 0; i < saxpy_N; i++) {
    host_buffers.host_x[i] = 1.0f;
    host_buffers.host_y[i] = 2.0f;
  }

  // Copy memory from host to device
  cudaCheck(cudaMemcpyAsync(
    arguments.offset<dev_x>(),
    host_buffers.host_x,
    saxpy_N * sizeof(float),
    cudaMemcpyHostToDevice,
    cuda_stream
  ));

  cudaCheck(cudaMemcpyAsync(
    arguments.offset<dev_y>(),
    host_buffers.host_y,
    saxpy_N * sizeof(float),
    cudaMemcpyHostToDevice,
    cuda_stream
  ));

  // Setup opts for kernel call
  state.set_opts(dim3((saxpy_N+255)/256), dim3(256), cuda_stream);
  
  // Setup arguments for kernel call
  state.set_arguments(
    arguments.offset<dev_x>(),
    arguments.offset<dev_y>(),
    saxpy_N,
    2.0f
  );

  // Kernel call
  state.invoke();

  // Retrieve result
  cudaCheck(cudaMemcpyAsync(
    host_buffers.host_y,
    arguments.offset<dev_y>(),
    arguments.size<dev_y>(),
    cudaMemcpyDeviceToHost,
    cuda_stream
  ));

  // Wait to receive the result
  cudaEventRecord(cuda_generic_event, cuda_stream);
  cudaEventSynchronize(cuda_generic_event);

  // Check the output
  float maxError = 0.0f;
  for (int i=0; i<saxpy_N; i++) {
    maxError = std::max(maxError, abs(host_buffers.host_y[i]-4.0f));
  }
  info_cout << "Saxpy max error: " << maxError << std::endl << std::endl;
}
Daniel Campora's avatar
Daniel Campora committed
317
318
```

Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
319
320
321
322
323
324
As a last step, add the visitor to `stream/CMakeLists.txt`:

```clike
...
file(GLOB stream_visitors_test "visitors/test/src/*cu")
...
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
325
add_library(Stream STATIC
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
326
327
328
329
330
${stream_visitors_test}
...
```

We can compile the code and run the program `./Allen`. If everything went well, the following text should appear:
Daniel Campora's avatar
Daniel Campora committed
331
332
333
334
335

```
Saxpy max error: 0.00
```

Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
336
The cool thing is your algorithm is now part of the sequence. You can see how memory is managed, taking into account your algorithm, and how it changes on every step by appending the `-p` option: `./Allen -p`
Daniel Campora's avatar
Daniel Campora committed
337
338

```
Daniel Campora's avatar
Daniel Campora committed
339
Sequence step 13 "saxpy_t" memory segments (MiB):
Daniel Campora's avatar
Daniel Campora committed
340
341
342
343
dev_velo_track_hit_number (0.01), unused (0.05), dev_atomics_storage (0.00), unused (1.30), dev_velo_track_hits (0.26), dev_x (4.00), dev_y (4.00), unused (1014.39), 
Max memory required: 9.61 MiB
```

Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
344
345
346
347
348
349
350
351
352
353

Before placing a merge request
==============================
Before starting to edit files, please ensure that your editor produces spaces, not tabs!

Before placing a merge request, please go through the following list and check that BOTH compilation and running work after your changes:
   * Release and debug mode `cmake -DCMAKE_BUILD_TYPE=release ..` and `cmake -DCMAKE_BUILD_TYPE=debug ..`
   * Different sequences:
      * Default sequence: `cmake -DSEQUENCE=DefaultSequence ..`
      * CPU SciFi tracking sequence: `cmake -DSEQUENCE=CPUSciFi ..`
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
354
      * CPU PV finding sequence: `cmake -DSEQUENCE=CPUPVSequence ..`
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
355
356
357
  * Compilation with ROOT (if you have a ROOT installation available): `cmake -DUSE_ROOT=TRUE ..` If you don't have ROOT available, please mention this in the merge request, then we will test it.
  

Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
358
Check that you can run `./Allen` after every compilation. 
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
359
360
361
  

Now you are ready to take over!
Daniel Campora's avatar
Daniel Campora committed
362
363

Good luck!