Following up on our discussion on Monday, I finally spent some time updating the code that I wrote as part of akraszna/asyncgaudi, so that it can be included in this repository.
The MR introduces multiple new packages:
- `AthCUDAInterfaces`: A bit unintuitively this package sits at the bottom of the dependency tree (and not `AthCUDACore`!). It introduces the "helper types":
  - `AthCUDA::StreamHolder` is a pure C++ class that can be used to pass around `cudaStream_t` variables in non-CUDA code;
  - `AthCUDA::IStreamPoolSvc` is a pure C++ interface for providing such "holder" objects to client code;
  - `AthCUDA::IKernelTask` provides an abstract interface for GPU "tasks" that may be scheduled by a service to run either on an NVidia GPU or on the CPU. It is designed to work together with the `AthCUDA::IKernelRunnerSvc` interface of course.
- `AthCUDACore`: A collection of widely used macros and memory handling helper code.
- `AthCUDAServices`: This package provides implementations of the interfaces described above (`CUDAStreamPoolSvc` and `CUDAKernelRunnerSvc`, which show up in the example log below). The implementation of these services is a bit tricky, but only because of how we need to shield CUDA code from practically all parts of Gaudi.
- `AthCUDAKernel`: Now, this is the really difficult part of this whole thing... I considered leaving it out of this MR and introducing it separately to make the review a little easier, but then I thought that reviewing what the interfaces and services are there for would be a lot harder without an actual example of how they are all meant to work together. So this package holds a bunch of code for setting up dual-compile (CPU/GPU) code based on functors that the user sets up; functors that just operate on simple arrays.
Finally, the MR also updates AthExCUDA to make use of the newly introduced code. This is probably what any reviewer will want to look at first. I introduced a new algorithm (`AthCUDAExamples::LinearTransformTaskExampleAlg`) that uses this infrastructure to do exactly the same thing as the old algorithm (now renamed to `AthCUDAExamples::LinearTransformStandaloneExampleAlg`), but in a way that should scale much better to larger codebases. The example jobO for the new algorithm (`LinearTransformTaskExample_jobOptions.py`) also provides a recipe for launching the GPU tasks from "blocking" algorithms, which the avalanche scheduler handles in a special way.
So... this code is fairly involved... Let me just note here that the example job is meant to run like this:
```
...
AthenaHiveEventLoopMgr        INFO Initializing AthenaHiveEventLoopMgr - package version AthenaServices-00-00-00
CUDAStreamPoolSvc      0      INFO Allocated 5 stream(s)
CUDAKernelRunnerSvc    0      INFO Started service for running 5 GPU kernel(s) in parallel on device(s):
/-- Device ID 0 -------------------------------\
| Name: GeForce GTX 960                        |
| Max. threads per block: 1024                 |
| Concurrent kernels: true                     |
| Total memory: 1999.81 MB                     |
\----------------------------------------------/
ThreadPoolSvc          0      INFO no thread init tools attached
AvalancheSchedulerSvc  0      INFO Activating scheduler in a separate thread
AvalancheSchedulerSvc  0      INFO Found 6 algorithms
AvalancheSchedulerSvc  0      INFO No unmet INPUT data dependencies were found
PrecedenceSvc          0      INFO Assembling CF and DF task precedence rules
PrecedenceSvc          0      INFO PrecedenceSvc initialized successfully
AvalancheSchedulerSvc  0      INFO Concurrency level information:
AvalancheSchedulerSvc  0      INFO  o Number of events in flight: 4
AvalancheSchedulerSvc  0      INFO  o TBB thread pool size: 'ThreadPoolSize':4
AvalancheSchedulerSvc  0      INFO Task scheduling settings:
AvalancheSchedulerSvc  0      INFO  o Avalanche generation mode: disabled
AvalancheSchedulerSvc  0      INFO  o Preemptive scheduling of CPU-blocking tasks: enabled (max. 5 concurrent tasks)
AvalancheSchedulerSvc  0      INFO  o Scheduling of condition tasks: disabled
ApplicationMgr         0      INFO Application Manager Initialized successfully
ApplicationMgr         0      INFO Application Manager Started successfully
AthenaHiveEventLoopMgr 0      INFO Starting loop on events
AthenaHiveEventLoopMgr 0    0 INFO ===>>> start of run 1 <<<===
AthenaHiveEventLoopMgr 0    0 INFO ===>>> start processing event #1, run #1 on slot 0, 0 events processed so far <<<===
AthenaHiveEventLoopMgr 1000 0 INFO ===>>> start processing event #1001, run #1 on slot 0, 997 events processed so far <<<===
...
AthenaHiveEventLoopMgr 8999 3 INFO ===>>> done processing event #9000, run #1 on slot 3, 8998 events processed so far <<<===
AthenaHiveEventLoopMgr 9999 3 INFO ---> Loop Finished (seconds): 2.3911
ApplicationMgr                INFO Application Manager Stopped successfully
IncidentProcAlg1              INFO Finalize
SGInputLoader                 INFO Finalizing SGInputLoader...
IncidentProcAlg2              INFO Finalize
AvalancheSchedulerSvc         INFO Joining Scheduler thread
CUDAKernelRunnerSvc           INFO  o All task(s) executed: 10000
CUDAKernelRunnerSvc           INFO  o GPU task(s) executed: 10000 (100%)
CUDAStreamPoolSvc             INFO Destroyed all streams
EventDataSvc                  INFO Finalizing EventDataSvc - package version StoreGate-00-00-00
...
```