Following up on our discussion on Monday, I finally spent some time updating the code that I wrote as part of akraszna/asyncgaudi, so that it can be included in this repository.
The MR introduces multiple new packages:
- `AthCUDAInterfaces`: A bit unintuitively this package sits at the bottom of the dependency tree (and not `AthCUDACore`!). It introduces the "helper types":
  - `AthCUDA::StreamHolder` is a pure C++ class that can be used to pass around `cudaStream_t` variables in non-CUDA code;
  - `AthCUDA::IStreamPoolSvc` is a pure C++ interface for providing such "holder" objects to client code;
  - `AthCUDA::IKernelTask` provides an abstract interface for GPU "tasks" that may be scheduled by a service to run either on an NVidia GPU or on the CPU. It is designed to work together with the `AthCUDA::IKernelRunnerSvc` interface of course.
- `AthCUDACore`: A collection of widely used macros and memory handling helper code.
- `AthCUDAServices`: This package provides implementations of the interfaces described above (`CUDAStreamPoolSvc` and `CUDAKernelRunnerSvc`, which show up in the example log below). The implementation of these services is a bit tricky, but only because of how we need to shield CUDA code from practically all parts of Gaudi.
- `AthCUDAKernel`: Now, this is the really difficult part of this whole thing... I considered leaving it out of this MR and introducing it separately to make the review a little easier, but then I thought that reviewing what the interfaces and services are there for would be a lot harder without an actual example of how they are all meant to work together. So this package holds a bunch of code for setting up dual-compile (CPU/GPU) code based on functors that the user sets up; functors that just operate on simple arrays.
Finally, the MR also updates AthExCUDA to make use of the newly introduced code. This is probably what any reviewer will want to look at first. I introduced a new algorithm (`AthCUDAExamples::LinearTransformTaskExampleAlg`) that uses this infrastructure to do exactly the same thing as the old algorithm (now renamed to `AthCUDAExamples::LinearTransformStandaloneExampleAlg`), but in a way that should scale much better to larger codebases. The example jobO for the new algorithm (`LinearTransformTaskExample_jobOptions.py`) also provides a recipe for launching the GPU tasks from "blocking" algorithms, which the avalanche scheduler handles in a special way.
So... this code is fairly involved... Let me just note here that the example job is meant to run like this:
```
...
AthenaHiveEventLoopMgr        INFO Initializing AthenaHiveEventLoopMgr - package version AthenaServices-00-00-00
CUDAStreamPoolSvc      0      INFO Allocated 5 stream(s)
CUDAKernelRunnerSvc    0      INFO Started service for running 5 GPU kernel(s) in parallel on device(s):
/-- Device ID 0 -------------------------------\
| Name: GeForce GTX 960                        |
| Max. threads per block: 1024                 |
| Concurrent kernels: true                     |
| Total memory: 1999.81 MB                     |
\----------------------------------------------/
ThreadPoolSvc          0      INFO no thread init tools attached
AvalancheSchedulerSvc  0      INFO Activating scheduler in a separate thread
AvalancheSchedulerSvc  0      INFO Found 6 algorithms
AvalancheSchedulerSvc  0      INFO No unmet INPUT data dependencies were found
PrecedenceSvc          0      INFO Assembling CF and DF task precedence rules
PrecedenceSvc          0      INFO PrecedenceSvc initialized successfully
AvalancheSchedulerSvc  0      INFO Concurrency level information:
AvalancheSchedulerSvc  0      INFO  o Number of events in flight: 4
AvalancheSchedulerSvc  0      INFO  o TBB thread pool size: 'ThreadPoolSize':4
AvalancheSchedulerSvc  0      INFO Task scheduling settings:
AvalancheSchedulerSvc  0      INFO  o Avalanche generation mode: disabled
AvalancheSchedulerSvc  0      INFO  o Preemptive scheduling of CPU-blocking tasks: enabled (max. 5 concurrent tasks)
AvalancheSchedulerSvc  0      INFO  o Scheduling of condition tasks: disabled
ApplicationMgr         0      INFO Application Manager Initialized successfully
ApplicationMgr         0      INFO Application Manager Started successfully
AthenaHiveEventLoopMgr 0      INFO Starting loop on events
AthenaHiveEventLoopMgr 0    0 INFO ===>>> start of run 1 <<<===
AthenaHiveEventLoopMgr 0    0 INFO ===>>> start processing event #1, run #1 on slot 0, 0 events processed so far <<<===
AthenaHiveEventLoopMgr 1000 0 INFO ===>>> start processing event #1001, run #1 on slot 0, 997 events processed so far <<<===
...
AthenaHiveEventLoopMgr 8999 3 INFO ===>>> done processing event #9000, run #1 on slot 3, 8998 events processed so far <<<===
AthenaHiveEventLoopMgr 9999 3 INFO ---> Loop Finished (seconds): 2.3911
ApplicationMgr                INFO Application Manager Stopped successfully
IncidentProcAlg1              INFO Finalize
SGInputLoader                 INFO Finalizing SGInputLoader...
IncidentProcAlg2              INFO Finalize
AvalancheSchedulerSvc         INFO Joining Scheduler thread
CUDAKernelRunnerSvc           INFO  o All task(s) executed: 10000
CUDAKernelRunnerSvc           INFO  o GPU task(s) executed: 10000 (100%)
CUDAStreamPoolSvc             INFO Destroyed all streams
EventDataSvc                  INFO Finalizing EventDataSvc - package version StoreGate-00-00-00
...
```