`cuda_check failed` running GPU build from stack
Several people have noticed a problem when running a GPU build (e.g. `x86_64_v3-centos7-clang12+cuda11_4-opt`) from the stack. If `run` is called from `Allen` (i.e. not `MooreOnline` etc.), CUDA reports an error at the end of the program, after the Application Manager has terminated and everything else has otherwise succeeded. An example is given below:
Failed to run cudaFreeHost(ptr)
invalid argument (1) at ../backend/include/CUDABackend.h: 122
terminate called after throwing an instance of 'std::invalid_argument'
what(): cudaCheck failed
*** Break *** abort
This is followed by a stack trace (given at the bottom). The main consequence is that the tests fail. Since the intention is to add CUDA builds to the nightly tests, this will need to be fixed before those tests can pass.
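The log shows a `std::invalid_argument` with the message `cudaCheck failed` being thrown while `cudaFreeHost(ptr)` runs at shutdown, and the `__clang_call_terminate` frame in the trace suggests the exception escaped from a context that may not throw, such as a destructor. The following is a minimal sketch of that failure mode and of one possible mitigation, not Allen's actual code: `status_t`, `check_or_throw`, `check_or_warn` and `HostBuffer` are hypothetical names, and the CUDA call is mocked by a plain integer so the example builds without a GPU. It does not address why `cudaFreeHost` reports `invalid argument` in the first place.

```cpp
#include <cstdio>
#include <stdexcept>

// Hypothetical stand-in for a CUDA runtime status code (0 means success).
using status_t = int;

// Throwing check, modelled on the messages in the log above.
inline void check_or_throw(status_t code, const char* call) {
  if (code != 0) {
    std::fprintf(stderr, "Failed to run %s\n", call);
    throw std::invalid_argument("cudaCheck failed");
  }
}

// Non-throwing check for teardown paths: report the failure and carry on.
inline bool check_or_warn(status_t code, const char* call) noexcept {
  if (code != 0) {
    std::fprintf(stderr, "Warning: %s failed with code %d during teardown\n", call, code);
    return false;
  }
  return true;
}

// Hypothetical owner of a pinned host buffer. Destructors are noexcept by
// default, so calling check_or_throw here would end in std::terminate.
struct HostBuffer {
  status_t free_status;  // value the mocked free call returns
  bool* freed_cleanly;   // records the outcome for the caller
  ~HostBuffer() { *freed_cleanly = check_or_warn(free_status, "cudaFreeHost(ptr)"); }
};
```

Whether downgrading the failure to a warning is acceptable here is a judgment call. It would let the process exit with a normal return code, but it could also hide a real ordering problem between buffer cleanup and CUDA context teardown.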
Full stack trace:
*** Break *** abort
===========================================================
There was a crash.
This is the entire stack trace of all threads:
===========================================================
Thread 7 (Thread 0x7fda51dff700 (LWP 32651) "python"):
#0 0x00007fdab5d79de2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fda5dac912f in ?? () from /lib64/libcuda.so.1
#2 0x00007fda5db44bd8 in ?? () from /lib64/libcuda.so.1
#3 0x00007fdab5d75ea5 in start_thread () from /lib64/libpthread.so.0
#4 0x00007fdab5395b0d in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x7fda515fe700 (LWP 32650) "cuda-EvtHandlr"):
#0 0x00007fdab538addd in poll () from /lib64/libc.so.6
#1 0x00007fda5db49bc9 in ?? () from /lib64/libcuda.so.1
#2 0x00007fda5dbf0d3b in ?? () from /lib64/libcuda.so.1
#3 0x00007fda5db44bd8 in ?? () from /lib64/libcuda.so.1
#4 0x00007fdab5d75ea5 in start_thread () from /lib64/libpthread.so.0
#5 0x00007fdab5395b0d in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x7fda505fc700 (LWP 32602) "python"):
#0 0x00007fdab538fe29 in syscall () from /lib64/libc.so.6
#1 0x00007fda9f7a37ca in tbb::internal::futex_wait (futex=0x7fda7b57112c, comparand=2) at ../../include/tbb/machine/linux_common.h:81
#2 tbb::internal::binary_semaphore::P (this=0x7fda7b57112c) at ../../src/tbb/semaphore.h:205
#3 rml::internal::thread_monitor::commit_wait (this=0x7fda7b571120, c=...) at ../../src/tbb/../rml/server/thread_monitor.h:255
#4 tbb::internal::rml::private_worker::run (this=0x7fda7b571100) at ../../src/tbb/private_server.cpp:273
#5 0x00007fda9f7a35c6 in tbb::internal::rml::private_worker::thread_routine (arg=0x7fda7b57112c) at ../../src/tbb/private_server.cpp:219
#6 0x00007fdab5d75ea5 in start_thread () from /lib64/libpthread.so.0
#7 0x00007fdab5395b0d in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7fda0ffff700 (LWP 32595) "ZMQbg/IO/0"):
#0 0x00007fdab53960e3 in epoll_wait () from /lib64/libc.so.6
#1 0x00007fdaae7af48f in zmq::epoll_t::loop (this=0x2844a930) at src/epoll.cpp:184
#2 0x00007fdaae7ddff7 in thread_routine (arg_=0x2844a988) at src/thread.cpp:257
#3 0x00007fdab5d75ea5 in start_thread () from /lib64/libpthread.so.0
#4 0x00007fdab5395b0d in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7fda50dfd700 (LWP 32594) "ZMQbg/Reaper"):
#0 0x00007fdab53960e3 in epoll_wait () from /lib64/libc.so.6
#1 0x00007fdaae7af48f in zmq::epoll_t::loop (this=0x229c8b70) at src/epoll.cpp:184
#2 0x00007fdaae7ddff7 in thread_routine (arg_=0x229c8bc8) at src/thread.cpp:257
#3 0x00007fdab5d75ea5 in start_thread () from /lib64/libpthread.so.0
#4 0x00007fdab5395b0d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7fda5d8d2700 (LWP 32585) "cuda-EvtHandlr"):
#0 0x00007fdab538addd in poll () from /lib64/libc.so.6
#1 0x00007fda5db49bc9 in ?? () from /lib64/libcuda.so.1
#2 0x00007fda5dbf0d3b in ?? () from /lib64/libcuda.so.1
#3 0x00007fda5db44bd8 in ?? () from /lib64/libcuda.so.1
#4 0x00007fdab5d75ea5 in start_thread () from /lib64/libpthread.so.0
#5 0x00007fdab5395b0d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7fdab6729740 (LWP 32511) "python"):
#0 0x00007fdab535c659 in waitpid () from /lib64/libc.so.6
#1 0x00007fdab52d9f62 in do_system () from /lib64/libc.so.6
#2 0x00007fdab52da311 in system () from /lib64/libc.so.6
#3 0x00007fdaaaad7eca in TUnixSystem::StackTrace() () from /cvmfs/lhcb.cern.ch/lib/lcg/releases/ROOT/6.28.00-536fc/x86_64-centos7-clang12-opt/lib/libCore.so
#4 0x00007fdaaad8b431 in (anonymous namespace)::TExceptionHandlerImp::HandleException(int) () from /cvmfs/lhcb.cern.ch/lib/lcg/releases/ROOT/6.28.00-536fc/x86_64-centos7-clang12-opt/lib/libcppyy_backend3_9.so
#5 0x00007fdaaaadb7d2 in TUnixSystem::DispatchSignals(ESignals) () from /cvmfs/lhcb.cern.ch/lib/lcg/releases/ROOT/6.28.00-536fc/x86_64-centos7-clang12-opt/lib/libCore.so
#6 <signal handler called>
#7 0x00007fdab52cd387 in raise () from /lib64/libc.so.6
#8 0x00007fdab52cea78 in abort () from /lib64/libc.so.6
#9 0x00007fdaae3767ec in __gnu_cxx::__verbose_terminate_handler () at /build/dkonst/gcc-clang/build/contrib/gcc-10.3.0/src/gcc/10.3.0/libstdc++-v3/libsupc++/vterminate.cc:95
#10 0x00007fdaae381a36 in __cxxabiv1::__terminate (handler=<optimized out>) at /build/dkonst/gcc-clang/build/contrib/gcc-10.3.0/src/gcc/10.3.0/libstdc++-v3/libsupc++/eh_terminate.cc:48
#11 0x00007fdaae381aa1 in std::terminate () at /build/dkonst/gcc-clang/build/contrib/gcc-10.3.0/src/gcc/10.3.0/libstdc++-v3/libsupc++/eh_terminate.cc:58
#12 0x00007fda82bff3bb in __clang_call_terminate () from /lhcb5/users/amorris/stack/Allen/build.x86_64_v3-centos7-clang12+cuda11_4-opt/libAllenLib.so
#13 0x0000000000000000 in ?? ()
===========================================================
The lines below might hint at the cause of the crash. If you see question
marks as part of the stack trace, try to recompile with debugging information
enabled and export CLING_DEBUG=1 environment variable before running.
You may get help by asking at the ROOT forum https://root.cern/forum
Only if you are really convinced it is a bug in ROOT then please submit a
report at https://root.cern/bugs Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
===========================================================
#7 0x00007fdab52cd387 in raise () from /lib64/libc.so.6
#8 0x00007fdab52cea78 in abort () from /lib64/libc.so.6
#9 0x00007fdaae3767ec in __gnu_cxx::__verbose_terminate_handler () at /build/dkonst/gcc-clang/build/contrib/gcc-10.3.0/src/gcc/10.3.0/libstdc++-v3/libsupc++/vterminate.cc:95
#10 0x00007fdaae381a36 in __cxxabiv1::__terminate (handler=<optimized out>) at /build/dkonst/gcc-clang/build/contrib/gcc-10.3.0/src/gcc/10.3.0/libstdc++-v3/libsupc++/eh_terminate.cc:48
#11 0x00007fdaae381aa1 in std::terminate () at /build/dkonst/gcc-clang/build/contrib/gcc-10.3.0/src/gcc/10.3.0/libstdc++-v3/libsupc++/eh_terminate.cc:58
#12 0x00007fda82bff3bb in __clang_call_terminate () from /lhcb5/users/amorris/stack/Allen/build.x86_64_v3-centos7-clang12+cuda11_4-opt/libAllenLib.so
#13 0x0000000000000000 in ?? ()
===========================================================
*** Break *** abort
===========================================================
There was a crash.
This is the entire stack trace of all threads:
===========================================================
Thread 7 (Thread 0x7fda51dff700 (LWP 32651) "python"):
#0 0x00007fdab5d79de2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fda5dac912f in ?? () from /lib64/libcuda.so.1
#2 0x00007fda5db44bd8 in ?? () from /lib64/libcuda.so.1
#3 0x00007fdab5d75ea5 in start_thread () from /lib64/libpthread.so.0
#4 0x00007fdab5395b0d in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x7fda515fe700 (LWP 32650) "cuda-EvtHandlr"):
#0 0x00007fdab538addd in poll () from /lib64/libc.so.6
#1 0x00007fda5db49bc9 in ?? () from /lib64/libcuda.so.1
#2 0x00007fda5dbf0d3b in ?? () from /lib64/libcuda.so.1
#3 0x00007fda5db44bd8 in ?? () from /lib64/libcuda.so.1
#4 0x00007fdab5d75ea5 in start_thread () from /lib64/libpthread.so.0
#5 0x00007fdab5395b0d in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x7fda505fc700 (LWP 32602) "python"):
#0 0x00007fdab538fe29 in syscall () from /lib64/libc.so.6
#1 0x00007fda9f7a37ca in tbb::internal::futex_wait (futex=0x7fda7b57112c, comparand=2) at ../../include/tbb/machine/linux_common.h:81
#2 tbb::internal::binary_semaphore::P (this=0x7fda7b57112c) at ../../src/tbb/semaphore.h:205
#3 rml::internal::thread_monitor::commit_wait (this=0x7fda7b571120, c=...) at ../../src/tbb/../rml/server/thread_monitor.h:255
#4 tbb::internal::rml::private_worker::run (this=0x7fda7b571100) at ../../src/tbb/private_server.cpp:273
#5 0x00007fda9f7a35c6 in tbb::internal::rml::private_worker::thread_routine (arg=0x7fda7b57112c) at ../../src/tbb/private_server.cpp:219
#6 0x00007fdab5d75ea5 in start_thread () from /lib64/libpthread.so.0
#7 0x00007fdab5395b0d in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7fda0ffff700 (LWP 32595) "ZMQbg/IO/0"):
#0 0x00007fdab53960e3 in epoll_wait () from /lib64/libc.so.6
#1 0x00007fdaae7af48f in zmq::epoll_t::loop (this=0x2844a930) at src/epoll.cpp:184
#2 0x00007fdaae7ddff7 in thread_routine (arg_=0x2844a988) at src/thread.cpp:257
#3 0x00007fdab5d75ea5 in start_thread () from /lib64/libpthread.so.0
#4 0x00007fdab5395b0d in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7fda50dfd700 (LWP 32594) "ZMQbg/Reaper"):
#0 0x00007fdab53960e3 in epoll_wait () from /lib64/libc.so.6
#1 0x00007fdaae7af48f in zmq::epoll_t::loop (this=0x229c8b70) at src/epoll.cpp:184
#2 0x00007fdaae7ddff7 in thread_routine (arg_=0x229c8bc8) at src/thread.cpp:257
#3 0x00007fdab5d75ea5 in start_thread () from /lib64/libpthread.so.0
#4 0x00007fdab5395b0d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7fda5d8d2700 (LWP 32585) "cuda-EvtHandlr"):
#0 0x00007fdab538addd in poll () from /lib64/libc.so.6
#1 0x00007fda5db49bc9 in ?? () from /lib64/libcuda.so.1
#2 0x00007fda5dbf0d3b in ?? () from /lib64/libcuda.so.1
#3 0x00007fda5db44bd8 in ?? () from /lib64/libcuda.so.1
#4 0x00007fdab5d75ea5 in start_thread () from /lib64/libpthread.so.0
#5 0x00007fdab5395b0d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7fdab6729740 (LWP 32511) "python"):
#0 0x00007fdab535c659 in waitpid () from /lib64/libc.so.6
#1 0x00007fdab52d9f62 in do_system () from /lib64/libc.so.6
#2 0x00007fdab52da311 in system () from /lib64/libc.so.6
#3 0x00007fdaaaad7eca in TUnixSystem::StackTrace() () from /cvmfs/lhcb.cern.ch/lib/lcg/releases/ROOT/6.28.00-536fc/x86_64-centos7-clang12-opt/lib/libCore.so
#4 0x00007fdaaad8b4eb in (anonymous namespace)::TExceptionHandlerImp::HandleException(int) () from /cvmfs/lhcb.cern.ch/lib/lcg/releases/ROOT/6.28.00-536fc/x86_64-centos7-clang12-opt/lib/libcppyy_backend3_9.so
#5 0x00007fdaaaadb7d2 in TUnixSystem::DispatchSignals(ESignals) () from /cvmfs/lhcb.cern.ch/lib/lcg/releases/ROOT/6.28.00-536fc/x86_64-centos7-clang12-opt/lib/libCore.so
#6 <signal handler called>
#7 0x00007fdab52cd387 in raise () from /lib64/libc.so.6
#8 0x00007fdab52cea78 in abort () from /lib64/libc.so.6
#9 0x00007fdaae3767ec in __gnu_cxx::__verbose_terminate_handler () at /build/dkonst/gcc-clang/build/contrib/gcc-10.3.0/src/gcc/10.3.0/libstdc++-v3/libsupc++/vterminate.cc:95
#10 0x00007fdaae381a36 in __cxxabiv1::__terminate (handler=<optimized out>) at /build/dkonst/gcc-clang/build/contrib/gcc-10.3.0/src/gcc/10.3.0/libstdc++-v3/libsupc++/eh_terminate.cc:48
#11 0x00007fdaae381aa1 in std::terminate () at /build/dkonst/gcc-clang/build/contrib/gcc-10.3.0/src/gcc/10.3.0/libstdc++-v3/libsupc++/eh_terminate.cc:58
#12 0x00007fda82bff3bb in __clang_call_terminate () from /lhcb5/users/amorris/stack/Allen/build.x86_64_v3-centos7-clang12+cuda11_4-opt/libAllenLib.so
#13 0x0000000000000000 in ?? ()
===========================================================
The lines below might hint at the cause of the crash. If you see question
marks as part of the stack trace, try to recompile with debugging information
enabled and export CLING_DEBUG=1 environment variable before running.
You may get help by asking at the ROOT forum https://root.cern/forum
Only if you are really convinced it is a bug in ROOT then please submit a
report at https://root.cern/bugs Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
===========================================================
#7 0x00007fdab52cd387 in raise () from /lib64/libc.so.6
#8 0x00007fdab52cea78 in abort () from /lib64/libc.so.6
#9 0x00007fdaae3767ec in __gnu_cxx::__verbose_terminate_handler () at /build/dkonst/gcc-clang/build/contrib/gcc-10.3.0/src/gcc/10.3.0/libstdc++-v3/libsupc++/vterminate.cc:95
#10 0x00007fdaae381a36 in __cxxabiv1::__terminate (handler=<optimized out>) at /build/dkonst/gcc-clang/build/contrib/gcc-10.3.0/src/gcc/10.3.0/libstdc++-v3/libsupc++/eh_terminate.cc:48
#11 0x00007fdaae381aa1 in std::terminate () at /build/dkonst/gcc-clang/build/contrib/gcc-10.3.0/src/gcc/10.3.0/libstdc++-v3/libsupc++/eh_terminate.cc:58
#12 0x00007fda82bff3bb in __clang_call_terminate () from /lhcb5/users/amorris/stack/Allen/build.x86_64_v3-centos7-clang12+cuda11_4-opt/libAllenLib.so
#13 0x0000000000000000 in ?? ()
===========================================================