Whilst working on the RICH refractive index calibration task, where I need to test multiple run changes, I observed what appeared to be a sizeable memory leak when running a single task over many runs, and thus with multiple run changes and condition updates.
I've since investigated and it appears to have nothing to do with the calibration task itself; rather it looks (to me) like a leak somewhere in the dd4hep condition updates.
To show this I have pushed some minor updates to !3928 (merged) to illustrate the issue using a simple task.
If you use the branch there and run the test by hand, switching the relevant option between True and False changes the data the test uses from a single run to a special set of MDF files I prepared for the refractive index calibration testing, containing 10k events from a large number of runs. The idea is that with these files the task sees a number of run changes and thus condition updates.
Here are the logs for two jobs, one running over a single run and one over multiple runs; otherwise the tasks are identically configured.
Both are running over 1M events. Both have the MemoryAuditor enabled.
For the task running over a single run, the memory usage hits 830MB after the first condition update, and pretty much stays there for the entire job.
For the task running over many runs, you can see the memory usage jumps significantly after each condition update. By the end of the task it is using 6.3GB....
I've looked into things from the RICH side (the derived condition objects I use) and I do not see anything there that would directly cause this. So my suspicion (as yet unproven) is that the issue is on the framework side.
@clemenci @bcouturi What exactly happens on condition updates? Are the previous condition objects kept forever, or eventually purged?
The conditions are held by a DD4hep "ConditionSlice". When requesting the conditions for a specific IOV, the DetectorDataService returns a shared pointer to the matching slice (https://gitlab.cern.ch/lhcb/Detector/-/blob/master/Core/src/DetectorDataService.cpp#L181), which is also cached. When a slice is not needed any more, LHCb calls "drop_slice" to remove it from the cache. We can add some logging to ensure this all happens as it should.
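For illustration, a minimal sketch of the caching pattern just described, using made-up names rather than the actual DetectorDataService API: slices are cached per IOV as shared pointers and only destroyed once drop_slice removes the cache entry and the last client releases its pointer.

```cpp
// Illustrative sketch only -- hypothetical names, not the real DetectorDataService API.
#include <map>
#include <memory>

struct ConditionsSliceStub {};  // stand-in for a dd4hep conditions slice

class SliceCacheSketch {
  std::map<long, std::shared_ptr<ConditionsSliceStub>> m_cache;  // keyed by IOV (e.g. run)

public:
  // Return the cached slice for this IOV, creating it on first request.
  std::shared_ptr<ConditionsSliceStub> get_slice( long iov ) {
    auto it = m_cache.find( iov );
    if ( it != m_cache.end() ) return it->second;
    auto slice = std::make_shared<ConditionsSliceStub>();
    m_cache.emplace( iov, slice );
    return slice;
  }

  // Remove the cache entry; the slice (and its conditions) are destroyed only
  // when the last shared_ptr held by a client also goes away.
  void drop_slice( long iov ) { m_cache.erase( iov ); }
};
```

If drop_slice is never reached for old IOVs, or some other owner keeps a shared pointer alive, every past IOV's conditions stay resident, which would give exactly the stepwise memory growth reported above.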
@bcouturi I think from the memory logging it's clear something is not being dropped when it should be...
I am trying to look into things from the RICH derived condition side, to make sure they are all being properly constructed/destructed. Note, these are not simple data caches; they have allocated storage of their own, so if dd4hep is somehow not properly calling the destructors for these objects, this would lead to a leak...
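To make the ownership concrete, a minimal sketch (made-up names, not the actual RICH classes) of a derived condition holding its own heap storage; only running the destructor releases that storage, so freeing the object's slot without invoking it leaks the internal allocation.

```cpp
// Illustrative only -- names are made up, not the actual RICH derived conditions.
#include <cstddef>
#include <vector>

class DerivedCondSketch {
  std::vector<double> m_lookupTable;  // internally allocated storage owned by this object

public:
  explicit DerivedCondSketch( std::size_t n ) : m_lookupTable( n, 0.0 ) {}
  // The (implicit) destructor frees m_lookupTable -- but only if it is actually called.
};
```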
It would be useful, I think, if you (or @clemenci) could take a look from the core side, independent of the RICH objects, to see if all the condition objects are being properly memory managed on condition changes (if you need an MDF file with a lot of run changes, just use the ones I posted above).
Thanks @clemenci ! I can definitely live with it for a bit; my major concern was getting it squashed before I redeploy the task for the condition updates at the pit, where it would run into problems if it still had such a leak.
Constructing RichX:Rich1 0x7ffc846eb160 : 3 active objects
Move Constructing RichX:Rich1 0x7ffc846f1ec0 (from 0x7ffc846eb160) : 4 active objects
Move Constructing RichX:Rich1 0x4f340040 (from 0x7ffc846f1ec0) : 5 active objects
Destructing RichX:Rich1 0x7ffc846f1ec0 : 4 active objects
Destructing RichX:Rich1 0x7ffc846eb160 : 3 active objects
and then in the following one
Constructing RichX:Rich1 0x7ffc846eb160 : 4 active objects
Move Constructing RichX:Rich1 0x7ffc846f1ec0 (from 0x7ffc846eb160) : 5 active objects
Move Constructing RichX:Rich1 0x4d0a5960 (from 0x7ffc846f1ec0) : 6 active objects
Destructing RichX:Rich1 0x7ffc846f1ec0 : 5 active objects
Destructing RichX:Rich1 0x7ffc846eb160 : 4 active objects
So indeed, on each update what I see missing is any deletion of the previous cached objects. Once I have your update @clemenci I can use this to double-check things.
@clemenci I've updated the log with the new debugging I added, just to make it clearer what is happening.
One small detail: note I also do not see any of the allocated condition objects deleted at the end of the job. So not only are they not dropped on run changes when no longer needed, but there appears to be no clean-up (deletion) of them in general at the end of the job either. Perhaps for the same reason...
"
@clemenci Thanks for Detector!339 (merged). I have just tried it, though I am afraid it causes a segfault when I run over the data set with multiple runs. See the attached log.
RichFutureDecode FATAL Standard std::exception is caught
RichFutureDecode ERROR Error getting condition '/world/BeforeMagnetRegion/Rich1:PDMDBDecodeMapping-Handler-RichFutureDecode' (3D332886AF71AF6F): Type mis-match, Found='UNKNOWN' Expected='Rich::Future::DAQ::PDMDBDecodeMapping'
MemoryAuditor INFO Memory usage has changed after RichFutureDecode Execute virtual size = 2074.74 MB resident set size = 844.809 MB
All FATAL Standard std::exception is caught
All ERROR Error getting condition '/world/BeforeMagnetRegion/Rich1:PDMDBDecodeMapping-Handler-RichFutureDecode' (3D332886AF71AF6F): Type mis-match, Found='UNKNOWN' Expected='Rich::Future::DAQ::PDMDBDecodeMapping'
EventLoopMgr FATAL .executeEvent(): Standard std::exception thrown by All
EventLoopMgr ERROR Error getting condition '/world/BeforeMagnetRegion/Rich1:PDMDBDecodeMapping-Handler-RichFutureDecode' (3D332886AF71AF6F): Type mis-match, Found='UNKNOWN' Expected='Rich::Future::DAQ::PDMDBDecodeMapping'
EventLoopMgr WARNING Execution of algorithm All failed
EventLoopMgr ERROR Error processing event loop.
@clemenci Note you can reproduce the above crash by just running the already existing QMT test RichDetectors/tests/qmtest/test-run-change-2022-data.qmt i.e.
@clemenci Also note, even running over the data without run changes, Detector!339 (merged) seems not to fix everything as I still see a mismatch in the constructor/destructor calls. See
RichX::Rich1::TrackConstructDestruct ERROR Still 1 allocated dervived condition objects !!
RichX::Rich2::TrackConstructDestruct ERROR Still 1 allocated dervived condition objects !!
I've added something to track the allocated objects, and the above is printed if this list is not empty at the end of the job, which it should be if all allocated derived condition objects were correctly destructed. The destructors for the RICH derived condition objects have to be correctly called when the objects are dropped, as otherwise the storage they have internally allocated will not be properly cleaned up and we will still have a major memory leak.
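For reference, the tracking is conceptually as simple as the sketch below (not the exact code added, just the idea): every constructor registers the object, every destructor deregisters it, and anything still registered at the end of the job is reported.

```cpp
// Sketch of the allocation-tracking idea (not the exact code used in the RICH classes).
#include <cstdio>
#include <mutex>
#include <set>

class TrackedCondSketch {
  static inline std::set<const TrackedCondSketch*> s_active;  // currently alive objects
  static inline std::mutex                         s_mutex;

public:
  TrackedCondSketch() {
    std::scoped_lock lock{ s_mutex };
    s_active.insert( this );
    std::printf( "Constructing %p : %zu active objects\n", (void*)this, s_active.size() );
  }
  ~TrackedCondSketch() {
    std::scoped_lock lock{ s_mutex };
    s_active.erase( this );
    std::printf( "Destructing %p : %zu active objects\n", (void*)this, s_active.size() );
  }
  // Call at the end of the job: anything still registered was never destructed.
  static void report() {
    std::scoped_lock lock{ s_mutex };
    if ( !s_active.empty() )
      std::printf( "ERROR Still %zu allocated derived condition objects !!\n", s_active.size() );
  }
};
```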
These conditions do not change on run boundaries in 2022, so I would indeed not expect them to need to be recreated on run changes.
From the errors above it looks like there is some sort of problem in the IOV pools, where these derived condition objects are being dropped in some way but not remade?
I think I understand why you are missing the call to the destructor: we do not properly clean up at finalize. This is easy to fix.
The segfault I think comes from the clean-up I added, which might be too aggressive. To be checked (that code is used only in tests where we drop everything to recreate it just after).
The crash is gone, and it looks like the number of derived condition objects stabilises at three (which seems to be the number of active IOV pools you keep?). Consequently the memory usage is now also stable.
There is still one small problem: the final destruction of the cached objects during finalise() does not seem to be working. I do not see any final destructor calls for the last 3 condition objects, and thus still have
Rich::Detector::Rich1::AllocateCount ERROR Still 3 allocated dervived condition objects !!
Rich::Detector::Rich2::AllocateCount ERROR Still 3 allocated dervived condition objects !!
@clemenci I just did a quick check and it looks like the problem is that the ApplicationMgr never actually calls LHCb::Detector::DetectorDataService::finalize()...
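For context, a minimal sketch (made-up names, not the real LHCb::Detector::DetectorDataService) of why that call matters: caches held as service data members only release their shared pointers when the service's finalize() actually runs, so if it is never called the remaining condition destructors are never observed before the process exits.

```cpp
// Illustrative only -- hypothetical service, not the real DetectorDataService.
#include <map>
#include <memory>

struct ConditionsSliceStub {};

class DataServiceSketch {
  std::map<long, std::shared_ptr<ConditionsSliceStub>> m_cache;  // slices kept for active IOVs

public:
  // If this is never called, the cached shared_ptrs survive until process
  // teardown and no destructor calls are seen for the remaining conditions.
  void finalize() { m_cache.clear(); }
};
```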
Thanks. Running with the above MR I indeed see the clean-up running in finalise. I do get some slightly odd-looking messages though:
Unhandled: 3 : ConditionsPool:+++ Remove Conditions for pool with IOV: run(0):[0-9223372036854775807] age: 0 [ 138 entries]
Unhandled: 3 : ConditionsPool:+++ Remove Conditions for pool with IOV: run(0):[200000-9223372036854775807] age: 0 [10586 entries]
Unhandled: 3 : ConditionsPool:+++ Remove Conditions for pool with IOV: run(0):[251925-251926] age: 10 [ 2 entries]
Unhandled: 3 : ConditionsPool:+++ Remove Conditions for pool with IOV: run(0):[251927-251928] age: 9 [ 2 entries]
Unhandled: 3 : ConditionsPool:+++ Remove Conditions for pool with IOV: run(0):[251929-251931] age: 8 [ 2 entries]
Unhandled: 3 : ConditionsPool:+++ Remove Conditions for pool with IOV: run(0):[251932-251936] age: 7 [ 2 entries]
Unhandled: 3 : ConditionsPool:+++ Remove Conditions for pool with IOV: run(0):[251940-252185] age: 3 [ 2 entries]
Unhandled: 3 : ConditionsPool:+++ Remove Conditions for pool with IOV: run(0):[252186-252186] age: 2 [ 13 entries]
Unhandled: 3 : ConditionsPool:+++ Remove Conditions for pool with IOV: run(0):[252186-255357] age: 2 [ 2 entries]
Unhandled: 3 : ConditionsPool:+++ Remove Conditions for pool with IOV: run(0):[254915-9223372036854775807] age: 0 [ 2 entries]
Unhandled: 3 : ConditionsPool:+++ Remove Conditions for pool with IOV: run(0):[255358-255364] age: 1 [ 13 entries]
Unhandled: 3 : ConditionsPool:+++ Remove Conditions for pool with IOV: run(0):[255365-9223372036854775807] age: 0 [ 13 entries]
ToolSvc INFO Removing all tools created by ToolSvc
ApplicationMgr INFO Application Manager Finalized successfully
ApplicationMgr INFO Application Manager Terminated successfully
Are these Unhandled messages expected? Not a big deal, but they are a little ugly...
The clean-up logic leading to the segfault was too aggressive, but the change I made to fix it is not perfect (it leaks a bit). Moreover, I believe the clean-up in finalize is reporting that we did not do the piecemeal clean-up.