Allow CUDA compilation for onnxruntime
This small MR allows onnxruntime to be compiled for use on GPUs via CUDA. As long as CUDA and cuDNN are installed on the compiling machine, a GPU does not need to be physically present. The CUDA and cuDNN install paths are currently hard-coded as:
CUDA_HOME: /usr/local/cuda
CUDNN_HOME: /usr/
which may not be correct on all machines.
This MR adds an extra CMake build flag, "ATLonnxruntime_USE_CUDA", to enable CUDA compilation. The default behavior remains to compile for CPU only, as before.
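A minimal sketch of how the flag might be passed at configure time. The flag name comes from this MR; the value TRUE, the Release build type, and the relative source path (borrowed from the cmake command quoted later in the discussion) are assumptions and may need adjusting for your checkout. The snippet only runs cmake if the source directory is actually present, and otherwise prints the command it would have run.

```shell
# Source directory as used elsewhere in this discussion (assumption: adjust to your checkout).
src_dir="../atlasexternals/Projects/AthenaExternals/"

# ATLonnxruntime_USE_CUDA is the flag added by this MR; TRUE as its value is an assumption.
cmake_args="-DCMAKE_BUILD_TYPE=Release -DATLonnxruntime_USE_CUDA=TRUE"

if command -v cmake >/dev/null 2>&1 && [ -d "$src_dir" ]; then
  # $cmake_args is deliberately unquoted so each -D option becomes its own argument.
  cmake $cmake_args "$src_dir"
else
  echo "would run: cmake $cmake_args $src_dir"
fi
```

Leaving the flag out should give the CPU-only build described above.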
Activity
AE Build SUCCESS
Build logfiles are available at Jenkins [AE-MERGE-REQUEST-CC7 #536]
So... This is not as simple a problem as I first thought.
Currently in our main nightly we set up CUDA "by hand", in the same way in which we set up the compiler: completely separately from whichever LCG version we may want to use, relying on AtlasSetup to do this for us. (Since CMake treats nvcc the same as any C++ compiler these days, this does make sense from our side.)
However, as I had to realise, SFT provides cuDNN, and also CUDA, as "regular packages".
The latter is only provided in this way, which also makes perfect sense.
Now... as you may see from those webpages, these two are only provided in the "CUDA LCG releases" at the moment. So to build onnxruntime with CUDA support turned on, I was using LCG_101cuda. Like:
cmake -DCMAKE_BUILD_TYPE=Release -DLCG_VERSION_NUMBER=101 -DLCG_VERSION_POSTFIX=cuda -DCTEST_USE_LAUNCHERS=TRUE ../atlasexternals/Projects/AthenaExternals/
With the updates that I now added to the MR, the build does succeed like this. But of course in our regular nightly cuDNN will not be available at the moment. Not unless we ask for its inclusion into the ATLAS layers...
@elmsheus, @emoyse, what do you think? Would it be outrageous to ask for the CUDA and cuDNN packages to be included into, let's say, LCG_101_ATLAS_3? Once they are, we may very well want to also re-think how AtlasSetup would handle CUDA. But that will be a separate discussion...
Hi @akraszna,
asking for these 2 packages to be added to a new layer LCG_101_ATLAS_3 sounds fine - I don't know if there might be technical reasons from the SFT side not to include them, though - will you open a SPI jira ticket? N.B. to this new layer valgrind for gcc11 should be added as discussed in https://sft.its.cern.ch/jira/browse/SPI-1992
Cheers, Johannes
AE Build SUCCESS
Build logfiles are available at Jenkins [AE-MERGE-REQUEST-CC7 #541]
Here it is: https://sft.its.cern.ch/jira/browse/SPI-1996
added 24 commits
- 1a90f7a4...f2bbee2e - 20 commits from branch master
- 83198b50 - allow cuda compilation for ort
- 8730cb03 - remove debug line
- 31a52746 - Added FindCUDAToolkit.cmake and FindcuDNN.cmake to AtlasLCG.
- a48bdd97 - Updated onnxruntime to be able to build against LCG_101cuda, with CUDA support turned on.
added 1 commit
- f07ff00d - Updated onnxruntime to be able to build against LCG_101cuda, with CUDA support turned on.
Unfortunately onnxruntime takes bloody forever to build with CUDA support turned on. It churns for O(10 minutes) in building CUDA code in a single-threaded way.
Could you guys check if any newer version addresses this? Since there are many newer versions than 1.5.1 by now.
For the nightly this is not necessarily a dealbreaker. But since on my own machine I can build AthenaExternals in <5 minutes, this is very noticeable.
AE Build SUCCESS
Build logfiles are available at Jenkins [AE-MERGE-REQUEST-CC7 #554]
AE Build SUCCESS
Build logfiles are available at Jenkins [AE-MERGE-REQUEST-CC7 #555]
Hi @akraszna,
I don't see any existing issue about this, so I just created one, https://github.com/microsoft/onnxruntime/issues/9627, and passed it on to some ORT developers I know. Meanwhile, I will look into updating ORT if the issue persists.
Thanks, Debo.
Hi @akraszna
Can we set a CMAKE_BUILD_PARALLEL_LEVEL here https://gitlab.cern.ch/atlas/atlasexternals/-/blob/master/External/onnxruntime/CMakeLists.txt#L23 ?
Thanks, Debo
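For context, CMAKE_BUILD_PARALLEL_LEVEL is an environment variable that CMake (3.12 and later) reads to pick the default number of parallel jobs for `cmake --build` when no explicit parallel option is given. The sketch below shows the variable being set from a shell. It is not how the onnxruntime CMakeLists.txt linked above would wire it in, and nothing in this thread establishes that it speeds up the CUDA part of the build. The job count taken from nproc is an assumption.

```shell
# Use the number of cores reported by nproc, falling back to 1 if nproc is unavailable (assumption).
njobs="$(nproc 2>/dev/null || echo 1)"

# cmake --build reads this variable as its default job count.
export CMAKE_BUILD_PARALLEL_LEVEL="$njobs"

# Only build when run from an already configured build directory.
if command -v cmake >/dev/null 2>&1 && [ -f CMakeCache.txt ]; then
  cmake --build .
else
  echo "CMAKE_BUILD_PARALLEL_LEVEL=$CMAKE_BUILD_PARALLEL_LEVEL (no configured build directory here)"
fi
```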
Okay, let's just go ahead with this one. See my comments in https://github.com/microsoft/onnxruntime/issues/9627 about the not-completely-ideal build performance...
mentioned in commit 1f303f75
mentioned in merge request !885 (merged)