
Draft: DataObject Parallel Delete, master branch (2021.06.01.)

With many-thread (>32) ATLAS jobs, the event loop manager starts spending a significant amount of time just deleting objects from the event/data store in the main thread of the application. In fact, the event loop manager thread is dominated by the freeing of memory.

DataObject_release_topdown_before

This is something that I wrote up in ATEAM-748. 😉

One could think of solving this bottleneck in a number of different ways, but I thought this would be the one requiring the least amount of work. 😛 Here I updated DataObject::release() to push the deletion of individual objects into separate TBB tasks. This is not a great solution by any means, but it still improved my profiled ATLAS reconstruction job immensely.
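To make the idea concrete for the discussion, here is a minimal, self-contained sketch of the pattern, not the actual diff in this MR. `SketchObject` and `DeletionPool` are hypothetical names invented for illustration, and `std::async` stands in for a TBB task so that the snippet builds without TBB. It assumes that `release()` is only ever called with the pool from a single thread, as the event loop manager does.

```cpp
#include <atomic>
#include <future>
#include <vector>

// Hypothetical helper (not part of Gaudi or TBB): runs each deletion as its
// own asynchronous task. std::async is used here in place of a TBB task.
class DeletionPool {
public:
  // Hand the object over; the delete happens on another thread.
  // Assumption: defer() is called from one thread only (no locking here).
  template <typename T>
  void defer(T* object) {
    m_pending.push_back(
        std::async(std::launch::async, [object] { delete object; }));
  }

  // Block until every deferred deletion has finished.
  void wait() {
    for (auto& f : m_pending) f.get();
    m_pending.clear();
  }

private:
  std::vector<std::future<void>> m_pending;
};

// Hypothetical stand-in for a reference-counted data object.
class SketchObject {
public:
  explicit SketchObject(std::atomic<int>& destroyed) : m_destroyed(destroyed) {}
  virtual ~SketchObject() { ++m_destroyed; }

  void addRef() { ++m_refCount; }

  // Decrement the reference count. When it reaches zero, the (potentially
  // expensive) destruction is deferred to the pool instead of running inline.
  long release(DeletionPool& pool) {
    const long remaining = --m_refCount;
    if (remaining == 0) pool.defer(this);
    return remaining;
  }

private:
  std::atomic<long> m_refCount{0};
  std::atomic<int>& m_destroyed;  // counts finished destructors, for checking
};
```

The caller only pays for scheduling the task, not for the destructor itself, which is the trade-off described above: the scheduling cost stays on the calling thread, while the memory freeing moves elsewhere.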

DataObject_release_topdown_after

The scheduling of the TBB tasks still dominates the event loop manager's thread, but it no longer pegs it at 100%. This allowed the CPU usage of my test job to go from:

DataObject_release_summary_before

To:

DataObject_release_summary_after

It may not look like much, but the event processing rate of my test job more than doubled with this few-line change. I do recognise, though, that this update could negatively impact experiments/jobs that don't struggle with how much time it takes to remove their reconstructed events from memory. So let's discuss whether we want to do anything like this, or whether we would rather do something a bit more elaborate specifically in the ATLAS code, just for our experiment. (@ssnyder is working on such a solution at the moment...)

Also pinging @fwinkl, @leggett, @rbielski, @goetz, @bwynne, @christos.
