Skip to content
  • Scott Snyder's avatar
    Fix thread-safety problem in GaudiHandle that was observed to cause a rare crash in Athena. · 4f5d771d
    Scott Snyder authored and Marco Clemencic's avatar Marco Clemencic committed
    The component retrieval code in GaudiHandle can lead to operator->()
    returning a null pointer for a valid handle, if it is called
    simultaneously from multiple threads.
    
    The relevant code in GaudiHandle is:
    
    ```
      T* operator->() {
        assertObject();
        return m_pObject;
      }
    
      void assertObject() const {
        if ( UNLIKELY( !isValid() ) ) {
           throw GaudiException ...
        }
      }
    
      bool isValid() const {
        // not really const, because it may update m_pObject
        return m_pObject || retrieve().isSuccess();
      }
    
      StatusCode retrieve() const {
        // not really const, because it updates m_pObject
        StatusCode sc = StatusCode::SUCCESS;
        if ( m_pObject && release().isFailure() ) { sc = StatusCode::FAILURE; }
        if ( sc && retrieve( m_pObject ).isFailure() ) {
          m_pObject = nullptr;
          sc        = StatusCode::FAILURE;
        }
        return sc;
      }
    ```
    
    Note that if the pointer is already set when retrieve() is called, it
    clears it and redoes the lookup.
    
    So we could have:
    
    ```
      thread 1 calls operator->.  m_pObject is not set, so it executes
      through to retrieve() after the test on m_pObject.
    
      thread 2 calls operator->.  m_pObject is not set, so it executes
      through to the start of retrieve().
    
      thread 1 continues, does the lookup, sets m_pObject and returns.
    
      thread 2 continues, tests m_pObject, finds it set, and calls reset()
      to clear it.
    
      thread 1 continues, returning through operator->.  m_pObject is now
      null, so it returns null.
    
      thread 2 continues with the retrieve, and returns a valid pointer
      from operator->.
      ```
    
    I've tentatively tried to fix this by making m_pObject atomic, and
    changing retrieve() so that it will not change m_pObject if it is
    already set.  I think that's consistent with what's implied
    by retrieve() being const.  ATLAS software seems fine with this change,
    but this is, however, a potential change of behavior, so i point
    it out here.
    4f5d771d