Draft: Use mutexes instead of atomic ops on shared_ptr in CachedParticlePtr
This is a study of using mutexes instead of shared_ptr atomics in CachedParticlePtr, to reduce the high CPU usage caused by libstdc++'s very suboptimal implementation of those atomic operations.
Using TBB spin_rw_mutex means the additional memory required is 8 bytes per CachedParticlePtr (as opposed to 40 bytes with std::mutex). The mutex implementation shows much lower CPU usage than the atomics implementation, though still higher than I would expect.
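As a rough illustration of the approach (not the actual CachedParticlePtr code in Athena), a shared_ptr member whose reads and writes are guarded by a tbb::spin_rw_mutex instead of std::atomic_load/std::atomic_store might look like the sketch below; the names CachedPtr, get and set are placeholders.

```cpp
// Minimal sketch only: a cached shared_ptr protected by tbb::spin_rw_mutex
// rather than by atomic operations on the shared_ptr itself.
#include <memory>
#include <tbb/spin_rw_mutex.h>

template <typename T>
class CachedPtr {  // hypothetical stand-in for CachedParticlePtr
public:
  // Reader path: take a shared (read) lock while copying the shared_ptr.
  std::shared_ptr<T> get() const {
    tbb::spin_rw_mutex::scoped_lock lock(m_mutex, /*is_writer=*/false);
    return m_ptr;
  }

  // Writer path: take the exclusive (write) lock while replacing the pointer.
  void set(std::shared_ptr<T> p) {
    tbb::spin_rw_mutex::scoped_lock lock(m_mutex, /*is_writer=*/true);
    m_ptr = std::move(p);
  }

private:
  mutable tbb::spin_rw_mutex m_mutex;  // roughly 8 bytes, vs ~40 for std::mutex
  std::shared_ptr<T> m_ptr;
};
```

The atomics-based version would instead use std::atomic_load(&m_ptr) and std::atomic_store(&m_ptr, p), which libstdc++ implements through a small internal pool of locks shared across all shared_ptr objects, hence the contention this MR is trying to avoid.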
Surprisingly, about 13% of CPU time during the execute phase of MT pileup digitization is spent waiting on these mutexes. This is made up almost entirely of about 21.4% of the time in the pixel digitization tool and 14.5% of the time in the ITk strips digitization tool.
Pinging @jchapman @ssnyder @tadej
Related JIRA ticket: ATLASSIM-4814