Skip to content

Fix thread-safety problem in GaudiHandle

Scott Snyder requested to merge ssnyder/Gaudi:mtHandle-20200611 into master

Fix thread-safety problem in GaudiHandle that was observed to cause a rare crash in Athena.

The component retrieval code in GaudiHandle can lead to operator->() returning a null pointer for a valid handle, if it is called simultaneously from multiple threads.

The relevant code in GaudiHandle is:

  T* operator->() {
    assertObject();
    return m_pObject;
  }


  void assertObject() const {
    if ( UNLIKELY( !isValid() ) ) {
       throw GaudiException ...
    }
  }


  bool isValid() const {
    // not really const, because it may update m_pObject
    return m_pObject || retrieve().isSuccess();
  }


  StatusCode retrieve() const {
    // not really const, because it updates m_pObject
    StatusCode sc = StatusCode::SUCCESS;
    if ( m_pObject && release().isFailure() ) { sc = StatusCode::FAILURE; }
    if ( sc && retrieve( m_pObject ).isFailure() ) {
      m_pObject = nullptr;
      sc        = StatusCode::FAILURE;
    }
    return sc;
  }

Note that if the pointer is already set when retrieve() is called, it clears it and redoes the lookup.

So we could have:

  thread 1 calls operator->.  m_pObject is not set, so it executes
  through to retrieve() after the test on m_pObject.

  thread 2 calls operator->.  m_pObject is not set, so it executes
  through to the start of retrieve().

  thread 1 continues, does the lookup, sets m_pObject and returns.

  thread 2 continues, tests m_pObject, finds it set, and calls reset()
  to clear it.

  thread 1 continues, returning through operator->.  m_pObject is now
  null, so it returns null.

  thread 2 continues with the retrieve, and returns a valid pointer
  from operator->.

I've tentatively tried to fix this by making m_pObject atomic, and changing retrieve() so that it will not change m_pObject if it is already set. I think that's consistent with what's implied by retrieve() being const. ATLAS software seems fine with this change, but this is, however, a potential change of behavior, so i point it out here.

Edited by Scott Snyder

Merge request reports