AthCUDA Core Code Additions, master branch (2020.06.19.)
Following up on our discussion from Monday, I finally spent a bit of time updating the code that I wrote as part of akraszna/asyncgaudi, so that it could be included in this repository.
The MR introduces multiple new packages:
- AthCUDAInterfaces: Somewhat unintuitively, this package sits at the bottom of the dependency tree (and not AthCUDACore!). It introduces the AthCUDA::IKernelTask and AthCUDA::StreamHolder "helper types", and the AthCUDA::IStreamPoolSvc and AthCUDA::IKernelRunnerSvc service interfaces.
  - AthCUDA::IKernelTask provides an abstract interface for GPU "tasks" that a service may schedule to run either on an NVidia GPU or on the CPU. It is of course designed to work together with the AthCUDA::IKernelRunnerSvc interface.
  - AthCUDA::StreamHolder is a pure C++ class for passing around cudaStream_t variables in non-CUDA code, and AthCUDA::IStreamPoolSvc is a pure C++ interface for providing such "holder" objects to client code.
- AthCUDACore: A collection of widely used macros and memory handling helper code.
- AthCUDAServices: This package implements the AthCUDA::IStreamPoolSvc and AthCUDA::IKernelRunnerSvc interfaces with AthCUDA::StreamPoolSvc and AthCUDA::KernelRunnerSvc. The implementation of these services is a bit tricky, but only because of how thoroughly we need to shield CUDA code from practically every part of Gaudi these days.
- AthCUDAKernel: This is the really difficult part of the whole thing. I considered leaving it out of this MR and introducing it separately to make the review a little easier, but I decided that reviewing what the interfaces and services are for would be a lot harder without an actual example of how they are all meant to work together. So this package holds the code for setting up dual-compile (CPU/GPU) kernels based on functors that the user sets up, functors that just operate on simple arrays.
Finally, the MR also updates AthExCUDA to make use of the newly introduced code. This is probably what any reviewer will want to look at first. I introduced a new algorithm (AthCUDAExamples::LinearTransformTaskExampleAlg) that uses this infrastructure to do exactly the same thing as the old algorithm (now renamed to AthCUDAExamples::LinearTransformStandaloneExampleAlg), but in a way that should scale much better to larger codebases. The example jobO for the new algorithm (LinearTransformTaskExample_jobOptions.py) also provides a recipe for launching the GPU tasks from "blocking" algorithms, which the avalanche scheduler handles in a special way.
So... This code is fairly involved. For reference, the example job is meant to produce output like this:
...
AthenaHiveEventLoopMgr INFO Initializing AthenaHiveEventLoopMgr - package version AthenaServices-00-00-00
CUDAStreamPoolSvc 0 INFO Allocated 5 stream(s)
CUDAKernelRunnerSvc 0 INFO Started service for running 5 GPU kernel(s) in parallel on device(s):
/-- Device ID 0 -------------------------------\
| Name: GeForce GTX 960 |
| Max. threads per block: 1024 |
| Concurrent kernels: true |
| Total memory: 1999.81 MB |
\----------------------------------------------/
ThreadPoolSvc 0 INFO no thread init tools attached
AvalancheSchedulerSvc 0 INFO Activating scheduler in a separate thread
AvalancheSchedulerSvc 0 INFO Found 6 algorithms
AvalancheSchedulerSvc 0 INFO No unmet INPUT data dependencies were found
PrecedenceSvc 0 INFO Assembling CF and DF task precedence rules
PrecedenceSvc 0 INFO PrecedenceSvc initialized successfully
AvalancheSchedulerSvc 0 INFO Concurrency level information:
AvalancheSchedulerSvc 0 INFO o Number of events in flight: 4
AvalancheSchedulerSvc 0 INFO o TBB thread pool size: 'ThreadPoolSize':4
AvalancheSchedulerSvc 0 INFO Task scheduling settings:
AvalancheSchedulerSvc 0 INFO o Avalanche generation mode: disabled
AvalancheSchedulerSvc 0 INFO o Preemptive scheduling of CPU-blocking tasks: enabled (max. 5 concurrent tasks)
AvalancheSchedulerSvc 0 INFO o Scheduling of condition tasks: disabled
ApplicationMgr 0 INFO Application Manager Initialized successfully
ApplicationMgr 0 INFO Application Manager Started successfully
AthenaHiveEventLoopMgr 0 INFO Starting loop on events
AthenaHiveEventLoopMgr 0 0 INFO ===>>> start of run 1 <<<===
AthenaHiveEventLoopMgr 0 0 INFO ===>>> start processing event #1, run #1 on slot 0, 0 events processed so far <<<===
AthenaHiveEventLoopMgr 1000 0 INFO ===>>> start processing event #1001, run #1 on slot 0, 997 events processed so far <<<===
...
AthenaHiveEventLoopMgr 8999 3 INFO ===>>> done processing event #9000, run #1 on slot 3, 8998 events processed so far <<<===
AthenaHiveEventLoopMgr 9999 3 INFO ---> Loop Finished (seconds): 2.3911
ApplicationMgr INFO Application Manager Stopped successfully
IncidentProcAlg1 INFO Finalize
SGInputLoader INFO Finalizing SGInputLoader...
IncidentProcAlg2 INFO Finalize
AvalancheSchedulerSvc INFO Joining Scheduler thread
CUDAKernelRunnerSvc INFO o All task(s) executed: 10000
CUDAKernelRunnerSvc INFO o GPU task(s) executed: 10000 (100%)
CUDAStreamPoolSvc INFO Destroyed all streams
EventDataSvc INFO Finalizing EventDataSvc - package version StoreGate-00-00-00
...