contributing.md 9.37 KB
Newer Older
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
1
Allen: Adding a new CUDA algorithm
Daniel Campora's avatar
Daniel Campora committed
2
3
=====================================

Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
4
This tutorial will guide you through adding a new CUDA algorithm to the `Allen` project.
Daniel Campora's avatar
Daniel Campora committed
5
6
7
8

SAXPY
-----

Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
9
Writing an algorithm in CUDA in the `Allen` project is no different than writing it on any other GPU project. The differences are in how to invoke that program, and how to setup the options, arguments, and so on.
Daniel Campora's avatar
Daniel Campora committed
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45

So let's assume that we have the following simple `SAXPY` algorithm, taken out from this website https://devblogs.nvidia.com/easy-introduction-cuda-c-and-c/

```clike=
__global__ void saxpy(float *x, float *y, int n, float a) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}
```

### Adding the CUDA algorithm

We want to add the algorithm to a specific folder inside the `cuda` folder:

```
├── cuda
│   ├── CMakeLists.txt
│   └── velo
│       ├── CMakeLists.txt
│       ├── calculate_phi_and_sort
│       │   ├── include
│       │   │   └── CalculatePhiAndSort.cuh
│       │   └── src
│       │       ├── CalculatePhiAndSort.cu
│       │       ├── CalculatePhi.cu
│       │       └── SortByPhi.cu
│       ├── common
│       │   ├── include
│       │   │   ├── ClusteringDefinitions.cuh
│       │   │   └── VeloDefinitions.cuh
│       │   └── src
│       │       ├── ClusteringDefinitions.cu
│       │       └── Definitions.cu
...
```

Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
46
Let's create a new folder inside the `cuda` directory named `example`. We need to modify `cuda/CMakeLists.txt` to reflect this:
Daniel Campora's avatar
Daniel Campora committed
47
48

```cmake=
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
49
50
add_subdirectory(raw_banks)
add_subdirectory(example)
Daniel Campora's avatar
Daniel Campora committed
51
52
53
54
55
56
57
```

Inside the `test` folder we will create the following structure:

```
├── test
│   ├── CMakeLists.txt
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
58
│   └── example
Daniel Campora's avatar
Daniel Campora committed
59
│       ├── include
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
60
│       │   └── Saxpy_example.cuh
Daniel Campora's avatar
Daniel Campora committed
61
│       └── src
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
62
│           └── Saxpy_example.cu
Daniel Campora's avatar
Daniel Campora committed
63
64
65
66
67
```

The newly created `test/CMakeLists.txt` file should reflect the project we are creating. We can do that by populating it like so:

```cmake=
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
68
69
70
71
72
73
74
75
file(GLOB saxpy_sources "src/*cu")

include_directories(include)
include_directories(../velo/common/include)
include_directories(../event_model/common/include)
include_directories(../event_model/velo/include)

include_directories(${CMAKE_SOURCE_DIR}/main/include)
76
include_directories(${CMAKE_SOURCE_DIR}/stream/gear/include)
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
77
include_directories(${CMAKE_SOURCE_DIR}/stream/sequence/include)
Daniel Campora's avatar
Daniel Campora committed
78

Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
79
80
81
82
83
84
85
86
if(TARGET_DEVICE STREQUAL "CPU" OR TARGET_DEVICE STREQUAL "CUDACLANG")
  foreach(source_file ${saxpy_sources})
    set_source_files_properties(${source_file} PROPERTIES LANGUAGE CXX)
  endforeach(source_file)
endif()

allen_add_device_library(Saxpy STATIC
  ${saxpy_sources}
Daniel Campora's avatar
Daniel Campora committed
87
88
)
```
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
89
90
91
92
The includes of Velo and event model files are only necessary because we will use the number of Velo tracks per event as an input to the saxpy algorithm.
If your new algorithm does not use any Velo related objects, this is not necessary.

The includes of main, gear and sequence are required for any new algorithm in Allen.
Daniel Campora's avatar
Daniel Campora committed
93

Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
Link the new library "Saxpy" to the stream librariy in `stream/CMakeLists.txt`:
```cmake=
target_link_libraries(Stream PRIVATE
  HostStream
  CudaCommon
  Associate
  Velo
  AllenPatPV
  PV_beamline
  HostClustering
  HostPrefixSum
  UT
  Kalman
  VertexFitter
  RawBanks
  SciFi
  HostGEC
  Muon
  Utils
  Saxpy)
```

Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
116
117
118
Next, we create the header file for our algorithm `SAXPY_example.cuh`, which is similar to an algorithm definition in Gaudi: inputs, outputs and properties are defined, as well as the algorithm function itself and an operator calling the function.
There are slight differences to Gaudi, since we want to be able to run the algorithm on a GPU.
The full file can be viewed [here](https://gitlab.cern.ch/lhcb/Allen/blob/dovombru_update_documentation/cuda/example/include/SAXPY_example.cuh). Let's take a look at the components:
Daniel Campora's avatar
Daniel Campora committed
119
120

```clike=
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
121
#pragma once
Daniel Campora's avatar
Daniel Campora committed
122

Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
123
124
125
126
#include "VeloConsolidated.cuh"
#include "DeviceAlgorithm.cuh"
```
The Velo include is only required if Velo objects are used in the algorithm. `DeviceAlgorithm.cuh` needs to be included for every device algorithm.
Daniel Campora's avatar
Daniel Campora committed
127

Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
128
129
130
131
132
133
134
135
136
137
```clike=
namespace saxpy {
  struct Parameters {
    HOST_INPUT(host_number_of_selected_events_t, uint);

    DEVICE_INPUT(dev_offsets_all_velo_tracks_t, uint) dev_atomics_velo; 
    DEVICE_INPUT(dev_offsets_velo_track_hit_number_t, uint) dev_velo_track_hit_number;
        
    DEVICE_OUTPUT(dev_saxpy_output_t, float) dev_saxpy_output;

138
139
    PROPERTY(saxpy_scale_factor_t, "saxpy_scale_factor", "scale factor a used in a*x + y", float) saxpy_scale_factor;
    PROPERTY(block_dim_t, "block_dim", "block dimensions", DeviceDimensions);
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
140
  };
Daniel Campora's avatar
Daniel Campora committed
141
142
```

Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
143
144

In the `saxpy` namespace the inputs and outputs are specified. These can refer to data either on the host or on the device. So one can choose between `HOST_INPUT`, `HOST_OUTPUT`, `DEVICE_INPUT` and `DEVICE_OUTPUT`.
145
146
In all cases the name and type are defined, e.g. ` DEVICE_INPUT(dev_offsets_all_velo_tracks_t, uint)`. An algorithm can be called several times in a sequence with different inputs, outputs and properties.
The default input name to be used can be set by ` DEVICE_INPUT(dev_offsets_all_velo_tracks_t, uint) dev_atomics_velo;`. A `PROPERTY` describes a constant used in an algorithm, the default value of a property is set when declaring it, as described below.
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
147
148


Daniel Campora's avatar
Daniel Campora committed
149
```clike=
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
150
__global__ void saxpy(Parameters);
Daniel Campora's avatar
Daniel Campora committed
151
152

```
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
153
The function of the algorithm is defined. 
Daniel Campora's avatar
Daniel Campora committed
154

Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
```clike=
  template<typename T, char... S>
  struct saxpy_t : public DeviceAlgorithm, Parameters {
    constexpr static auto name = Name<S...>::s;
    decltype(global_function(saxpy)) function {saxpy};

    void set_arguments_size(
      ArgumentRefManager<T> arguments,
      const RuntimeOptions&,
      const Constants&,
      const HostBuffers&) const
    {
      set_size<dev_saxpy_output_t>(
                                   arguments, value<host_number_of_selected_events_t>(arguments));
    }

    void operator()(
      const ArgumentRefManager<T>& arguments,
      const RuntimeOptions&,
      const Constants& constants,
      HostBuffers&,
      cudaStream_t& cuda_stream,
      cudaEvent_t&) const
    {
      function(dim3(value<host_number_of_selected_events_t>(arguments) / property<block_dim_t>()), property<block_dim_t>(), cuda_stream)(
        Parameters {begin<dev_saxpy_output_t>(arguments),
            property<saxpy_scale_factor_t>()});
    }

    private:
185
186
        Property<saxpy_scale_factor_t> m_saxpy_factor {this, 2.f};
        Property<block_dim_t> m_block_dim {this, {32, 1, 1}};
Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
187
188
189
190
191
192
193
194
  };
} // namespace saxpy
```
In ` struct saxpy_t`, the function call is handled. Note that the name of the struct must match the function name (`saxpy`), followed by `_t`.
In `set_arguments_size`, the sizes of `DEVICE_OUTPUT` parameters are defined. The actual memory allocation is handled by the memory manager. In our case, for `dev_saxpy_output_t` we reserve `<host_number_of_selected_events_t * sizeof(float)` bytes of memory.
The `sizeof(float)` is implicit, because we set the type of `dev_saxpy_output_t` to float in the `Parameters` struct.
In the call to `function` the first two arguments are the number of blocks per grid (`dim3(value<host_number_of_selected_events_t>(arguments) / property<block_dim_t>())`) and the number
of threads per block (`property<block_dim_t>()`). The struct `Parameters` contains the pointers to all `DEVICE_INPUT` and `DEVICE_OUTPUT` which were defined in the `Parameters` struct above, as well as the `PROPERTY`s.
195
Finally, the properties belonging to the algorithm are defined as private members of the `saxpy_t` struct together with their default values.
Daniel Campora's avatar
Daniel Campora committed
196

Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
If a new variable is required in host memory, allocate its memory like so:
Go to `stream/sequence/include/HostBuffers.cuh` and add the new host memory pointer:

```clike
  // Pinned host datatypes
  uint* host_velo_tracks_atomics;
  uint* host_velo_track_hit_number;
```

Reserve that host memory in `stream/sequence/src/HostBuffers.cu`:

```clike
  cudaCheck(cudaMallocHost((void**)&host_velo_tracks_atomics, (2 * max_number_of_events + 1) * sizeof(int)));
  cudaCheck(cudaMallocHost((void**)&host_velo_track_hit_number, max_number_of_events * VeloTracking::max_tracks * sizeof(uint)));
 
```

Daniel Campora's avatar
Daniel Campora committed
214

Dorothea Vom Bruch's avatar
Dorothea Vom Bruch committed
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
Next, we add the [source file](https://gitlab.cern.ch/lhcb/Allen/blob/dovombru_update_documentation/cuda/example/src/SAXPY_example.cu):

```clike
#include "SAXPY_example.cuh"

__global__ void saxpy::saxpy(
  saxpy::Parameters parameters)
  {
    const uint number_of_events = gridDim.x;
    const uint event_number = blockIdx.x * blockDim.x + threadIdx.x;

    Velo::Consolidated::ConstTracks velo_tracks {
      parameters.dev_atomics_velo, parameters.dev_velo_track_hit_number, event_number, number_of_evnts};
    const uint number_of_tracks_event = velo_tracks.number_of_tracks(event_number);
    
    if (event_number < number_of_events)
      parameters.dev_saxpy_output[event_number] = parameters.saxpy_scale_factor * number_of_tracks_event + number_of_tracks_event;
}
```
The source code looks like any other CUDA function, with the only difference being that Allen inputs and outputs, as well as properties are passed via the `saxpy::Parameters` struct. 
The are accessed as in `parameters.dev_atomics_velo` (DEVICE_INPUT) or `parametres.saxpy_scale_factor` (PROPERTY).
Daniel Campora's avatar
Daniel Campora committed
236

237
To integrate the new algorithm into a sequence, please follow [this]() readme.