Whilst working on the RICH refractive index calibration task, where I need to test multiple run changes, I observed what appeared to be a sizeable memory leak when running a single task over many runs, and thus with multiple run changes and condition updates.
I've since investigated and it appears to have nothing to do with the calibration task itself; rather it looks (to me) like a leak somewhere in the dd4hep condition updates.
To show this I have pushed some minor updates to !3928 (merged) to illustrate the issue using a simple task.
If you use the branch there and run the test by hand, switching the relevant option between True and False changes the data the test uses from a single run to a special set of MDF files I prepared for the refractive index calibration testing, containing 10k events from a large number of runs. The idea is that with these files the task sees a number of run changes and thus condition updates.
Here are the logs for two jobs, one running over a single run and one over multiple runs; otherwise the tasks are identically configured.
Both are running over 1M events. Both have the MemoryAuditor enabled.
For the task running over a single run, the memory usage hits 830MB after the first condition update, and pretty much stays there for the entire job.
For the task running over many runs, you can see the memory usage jumps significantly after each condition update. By the end of the task it is using 6.3GB....
I've looked into things from the RICH side (the derived condition objects I use) and I do not see anything there that would directly cause this. So my suspicion (as yet unproven) is that the issue is on the framework side.
@clemenci @bcouturi What exactly happens on condition updates? Are the previous condition objects kept forever, or eventually purged?
The conditions are held by a DD4hep "ConditionSlice". When requesting the conditions for a specific IOV, the DetectorDataService returns a shared pointer to the matching slice (https://gitlab.cern.ch/lhcb/Detector/-/blob/master/Core/src/DetectorDataService.cpp#L181), which is also cached. When a slice is not needed any more, LHCb calls "drop_slice" to remove it from the cache. We can add some logging to ensure this all happens as it should.
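For illustration, a minimal sketch of the caching pattern just described, using made-up names rather than the actual DetectorDataService API: slices are cached per IOV as shared pointers and only destroyed once drop_slice removes the cache entry and the last client releases its pointer.

```cpp
// Illustrative sketch only -- hypothetical names, not the real DetectorDataService API.
#include <map>
#include <memory>

struct ConditionsSliceStub {};  // stand-in for a dd4hep conditions slice

class SliceCacheSketch {
  std::map<long, std::shared_ptr<ConditionsSliceStub>> m_cache;  // keyed by IOV (e.g. run)

public:
  // Return the cached slice for this IOV, creating it on first request.
  std::shared_ptr<ConditionsSliceStub> get_slice( long iov ) {
    auto it = m_cache.find( iov );
    if ( it != m_cache.end() ) return it->second;
    auto slice = std::make_shared<ConditionsSliceStub>();
    m_cache.emplace( iov, slice );
    return slice;
  }

  // Remove the cache entry; the slice (and its conditions) are destroyed only
  // when the last shared_ptr held by a client also goes away.
  void drop_slice( long iov ) { m_cache.erase( iov ); }
};
```

If drop_slice is never reached for old IOVs, or some other owner keeps a shared pointer alive, every past IOV's conditions stay resident, which would give exactly the stepwise memory growth reported above.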
@bcouturi I think from the memory logging it's clear something is not being dropped when it should be...
I am trying to look into things from the RICH derived condition side, to make sure they are all being properly constructed/destructed. Note, these are not simple data caches; they have allocated storage of their own, so if dd4hep is somehow not properly calling the destructors for these objects, this would lead to a leak...
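To make the ownership concrete, a minimal sketch (made-up names, not the actual RICH classes) of a derived condition holding its own heap storage; only running the destructor releases that storage, so freeing the object's slot without invoking it leaks the internal allocation.

```cpp
// Illustrative only -- names are made up, not the actual RICH derived conditions.
#include <cstddef>
#include <vector>

class DerivedCondSketch {
  std::vector<double> m_lookupTable;  // internally allocated storage owned by this object

public:
  explicit DerivedCondSketch( std::size_t n ) : m_lookupTable( n, 0.0 ) {}
  // The (implicit) destructor frees m_lookupTable -- but only if it is actually called.
};
```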
It would be useful, I think, if you (or @clemenci) could take a look from the core side, independent of the RICH objects, to see if all the condition objects are being properly memory managed on condition changes (if you need an MDF file with a lot of run changes, just use the ones I posted above).
Thanks @clemenci ! I can definitely live with it for a bit; my major concern was getting it squashed before I redeploy the task for the condition updates at the pit, where it would run into problems if it still had such a leak.
Constructing RichX:Rich1 0x7ffc846eb160 : 3 active objects
Move Constructing RichX:Rich1 0x7ffc846f1ec0 (from 0x7ffc846eb160) : 4 active objects
Move Constructing RichX:Rich1 0x4f340040 (from 0x7ffc846f1ec0) : 5 active objects
Destructing RichX:Rich1 0x7ffc846f1ec0 : 4 active objects
Destructing RichX:Rich1 0x7ffc846eb160 : 3 active objects
and then in the following one
Constructing RichX:Rich1 0x7ffc846eb160 : 4 active objects
Move Constructing RichX:Rich1 0x7ffc846f1ec0 (from 0x7ffc846eb160) : 5 active objects
Move Constructing RichX:Rich1 0x4d0a5960 (from 0x7ffc846f1ec0) : 6 active objects
Destructing RichX:Rich1 0x7ffc846f1ec0 : 5 active objects
Destructing RichX:Rich1 0x7ffc846eb160 : 4 active objects
So indeed, on each update what I see missing is any deletion of the previous cached objects. Once I have your update @clemenci I can use this to double-check things.
@clemenci I've updated the log with the new debugging I added, just to make it clearer what is happening.
One small detail: note I also do not see any of the allocated condition objects deleted at the end of the job. So not only are they not dropped on run changes when no longer needed, but there appears to be no clean-up (deletion) of them in general at the end of the job either. Perhaps for the same reason...
"
@clemenci Thanks for Detector!339 (merged). I have just tried it, though I am afraid it causes a segfault when I run over the data set with multiple runs. See the attached log.
RichFutureDecode FATAL Standard std::exception is caught
RichFutureDecode ERROR Error getting condition '/world/BeforeMagnetRegion/Rich1:PDMDBDecodeMapping-Handler-RichFutureDecode' (3D332886AF71AF6F): Type mis-match, Found='UNKNOWN' Expected='Rich::Future::DAQ::PDMDBDecodeMapping'
MemoryAuditor INFO Memory usage has changed after RichFutureDecode Execute virtual size = 2074.74 MB resident set size = 844.809 MB
All FATAL Standard std::exception is caught
All ERROR Error getting condition '/world/BeforeMagnetRegion/Rich1:PDMDBDecodeMapping-Handler-RichFutureDecode' (3D332886AF71AF6F): Type mis-match, Found='UNKNOWN' Expected='Rich::Future::DAQ::PDMDBDecodeMapping'
EventLoopMgr FATAL .executeEvent(): Standard std::exception thrown by All
EventLoopMgr ERROR Error getting condition '/world/BeforeMagnetRegion/Rich1:PDMDBDecodeMapping-Handler-RichFutureDecode' (3D332886AF71AF6F): Type mis-match, Found='UNKNOWN' Expected='Rich::Future::DAQ::PDMDBDecodeMapping'
EventLoopMgr WARNING Execution of algorithm All failed
EventLoopMgr ERROR Error processing event loop.
@clemenci Note you can reproduce the above crash by just running the already existing QMT test RichDetectors/tests/qmtest/test-run-change-2022-data.qmt i.e.
@clemenci Also note, even running over the data without run changes, Detector!339 (merged) seems not to fix everything as I still see a mismatch in the constructor/destructor calls. See
RichX::Rich1::TrackConstructDestruct ERROR Still 1 allocated dervived condition objects !!
RichX::Rich2::TrackConstructDestruct ERROR Still 1 allocated dervived condition objects !!
I've added something to track the allocated objects, and the above is printed if this list is not empty at the end of the job, which it should be if all allocated derived condition objects were correctly destructed. The destructors for the RICH derived condition objects have to be correctly called when the objects are dropped, as otherwise the storage they have internally allocated will not be properly cleaned up and we will still have a major memory leak.
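For reference, the tracking is conceptually as simple as the sketch below (not the exact code added, just the idea): every constructor registers the object, every destructor deregisters it, and anything still registered at the end of the job is reported.

```cpp
// Sketch of the allocation-tracking idea (not the exact code used in the RICH classes).
#include <cstdio>
#include <mutex>
#include <set>

class TrackedCondSketch {
  static inline std::set<const TrackedCondSketch*> s_active;  // currently alive objects
  static inline std::mutex                         s_mutex;

public:
  TrackedCondSketch() {
    std::scoped_lock lock{ s_mutex };
    s_active.insert( this );
    std::printf( "Constructing %p : %zu active objects\n", (void*)this, s_active.size() );
  }
  ~TrackedCondSketch() {
    std::scoped_lock lock{ s_mutex };
    s_active.erase( this );
    std::printf( "Destructing %p : %zu active objects\n", (void*)this, s_active.size() );
  }
  // Call at the end of the job: anything still registered was never destructed.
  static void report() {
    std::scoped_lock lock{ s_mutex };
    if ( !s_active.empty() )
      std::printf( "ERROR Still %zu allocated derived condition objects !!\n", s_active.size() );
  }
};
```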
These conditions do not change on run boundaries in 2022, so I would indeed not expect them to need to be recreated on run changes.
From the errors above it looks like there is some sort of problem in the IOV pools, where these derived condition objects are being dropped in some way but not remade?
I think I understand why you are missing the call to the destructor: we do not properly clean up at finalize. This is easy to fix.
The segfault I think comes from the clean-up I added, which might be too aggressive. To be checked (that code is used only in tests where we drop everything to recreate it just after).
The crash is gone, and it looks like the number of derived condition objects stabilises at three (which seems to be the number of active IOV pools you keep?). Consequently the memory usage is now also stable.
There is still one small problem: the final destruction of the cached objects during finalise() does not seem to be working. I do not see any final destructor calls for the last 3 condition objects, and thus still have
Rich::Detector::Rich1::AllocateCount ERROR Still 3 allocated dervived condition objects !!
Rich::Detector::Rich2::AllocateCount ERROR Still 3 allocated dervived condition objects !!
@clemenci I just did a quick check and it looks like the problem is that the ApplicationMgr never actually calls LHCb::Detector::DetectorDataService::finalize()...
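For context, a minimal sketch (made-up names, not the real LHCb::Detector::DetectorDataService) of why that call matters: caches held as service data members only release their shared pointers when the service's finalize() actually runs, so if it is never called the remaining condition destructors are never observed before the process exits.

```cpp
// Illustrative only -- hypothetical service, not the real DetectorDataService.
#include <map>
#include <memory>

struct ConditionsSliceStub {};

class DataServiceSketch {
  std::map<long, std::shared_ptr<ConditionsSliceStub>> m_cache;  // slices kept for active IOVs

public:
  // If this is never called, the cached shared_ptrs survive until process
  // teardown and no destructor calls are seen for the remaining conditions.
  void finalize() { m_cache.clear(); }
};
```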
Thanks. Running with the above MR I indeed see the clean-up running in finalise. I do get some slightly odd-looking messages though:
Unhandled: 3 : ConditionsPool:+++ Remove Conditions for pool with IOV: run(0):[0-9223372036854775807] age: 0 [ 138 entries]
Unhandled: 3 : ConditionsPool:+++ Remove Conditions for pool with IOV: run(0):[200000-9223372036854775807] age: 0 [10586 entries]
Unhandled: 3 : ConditionsPool:+++ Remove Conditions for pool with IOV: run(0):[251925-251926] age: 10 [ 2 entries]
Unhandled: 3 : ConditionsPool:+++ Remove Conditions for pool with IOV: run(0):[251927-251928] age: 9 [ 2 entries]
Unhandled: 3 : ConditionsPool:+++ Remove Conditions for pool with IOV: run(0):[251929-251931] age: 8 [ 2 entries]
Unhandled: 3 : ConditionsPool:+++ Remove Conditions for pool with IOV: run(0):[251932-251936] age: 7 [ 2 entries]
Unhandled: 3 : ConditionsPool:+++ Remove Conditions for pool with IOV: run(0):[251940-252185] age: 3 [ 2 entries]
Unhandled: 3 : ConditionsPool:+++ Remove Conditions for pool with IOV: run(0):[252186-252186] age: 2 [ 13 entries]
Unhandled: 3 : ConditionsPool:+++ Remove Conditions for pool with IOV: run(0):[252186-255357] age: 2 [ 2 entries]
Unhandled: 3 : ConditionsPool:+++ Remove Conditions for pool with IOV: run(0):[254915-9223372036854775807] age: 0 [ 2 entries]
Unhandled: 3 : ConditionsPool:+++ Remove Conditions for pool with IOV: run(0):[255358-255364] age: 1 [ 13 entries]
Unhandled: 3 : ConditionsPool:+++ Remove Conditions for pool with IOV: run(0):[255365-9223372036854775807] age: 0 [ 13 entries]
ToolSvc INFO Removing all tools created by ToolSvc
ApplicationMgr INFO Application Manager Finalized successfully
ApplicationMgr INFO Application Manager Terminated successfully
Are these Unhandled messages expected? Not a big deal, but they are a little ugly...
The clean-up logic leading to the segfault was too aggressive, but the change I made to fix it is not perfect (it leaks a bit). Moreover, I believe the clean-up in finalize is reporting that we did not do the piecemeal clean-up.