
EventView bug with very specific control flow structure.

With the following CF structure (this is the minimum to demonstrate the problem) a crash can occur with EventView scheduling:

    * hltSteps
      - filter_alg
      - filter_alg2
      * view_make_node
        - filter_alg
        - view_make_alg
        * view_test_node
          - view_test_alg

Necessary conditions:

  1. hltSteps must have StopOverride = False (other flags don't alter the problem, even though hltSteps is set as Sequential)
  2. filter_alg must pass
  3. filter_alg2 must fail
  4. view_make_alg must create at least 1 view

@ssnyder noticed that this was due to an algorithm (view_test_alg) running in a view while the scheduler already considered the event finished. See the discussion at https://its.cern.ch/jira/browse/ATR-17940 (assuming it's visible outside of ATLAS).

The re-use of filter_alg allows the graph traversal to "jump past" filter_alg2 and start scheduling other algorithms. Then, when filter_alg2 fails, the event ends, but view scheduling currently has no protection against this possibility, so a crash occurs.
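
To make the failure mode concrete, here is a toy model of the state the scheduler ends up in. It is only a sketch: `ToySlot`, `State`, `looksFinished`, `viewAlgInFlight` and `makeCrashScenario` are hypothetical names invented for illustration, not the Gaudi types, and the real completion condition has more terms (for example the `subSlotAlgsReady` check) that are left out here.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical, simplified stand-ins for the scheduler's bookkeeping (not the Gaudi API).
enum class State { INITIAL, CONTROLREADY, DATAREADY, SCHEDULED, EVTACCEPTED, EVTREJECTED };

struct ToySlot {
  std::vector<State>   algStates;          // one entry per algorithm in this slot
  std::vector<ToySlot> subSlots;           // one entry per view created for the event
  bool                 cfResolved = false; // control flow has reached a decision

  // True if at least one algorithm in this slot is in state s.
  bool anyIn( State s ) const {
    return std::find( algStates.begin(), algStates.end(), s ) != algStates.end();
  }
};

// Completion test that looks only at the top-level slot's algorithm states.
bool looksFinished( const ToySlot& slot ) {
  return slot.cfResolved && !slot.anyIn( State::CONTROLREADY ) &&
         !slot.anyIn( State::DATAREADY ) && !slot.anyIn( State::SCHEDULED );
}

// True if any view still has an algorithm in flight.
bool viewAlgInFlight( const ToySlot& slot ) {
  for ( const ToySlot& ss : slot.subSlots ) {
    if ( ss.anyIn( State::SCHEDULED ) ) return true;
  }
  return false;
}

// The situation described above: filter_alg passed, filter_alg2 failed,
// view_make_alg created one view, and view_test_alg is still running in it.
ToySlot makeCrashScenario() {
  ToySlot view;
  view.algStates = { State::SCHEDULED }; // view_test_alg

  ToySlot top;
  top.algStates = { State::EVTACCEPTED,   // filter_alg
                    State::EVTREJECTED,   // filter_alg2
                    State::EVTACCEPTED }; // view_make_alg
  top.subSlots.push_back( view );
  top.cfResolved = true; // filter_alg2 failing settles the decision for hltSteps
  return top;
}
```

For the slot returned by `makeCrashScenario()`, both `looksFinished` and `viewAlgInFlight` return true: the top-level test reports the event as done while a view algorithm is still scheduled, which is the inconsistent state described above.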

Scott suggested the following fix:

diff --git a/GaudiHive/src/AvalancheSchedulerSvc.cpp b/GaudiHive/src/AvalancheSchedulerSvc.cpp
index bae703cff..35858f2c1 100644
--- a/GaudiHive/src/AvalancheSchedulerSvc.cpp
+++ b/GaudiHive/src/AvalancheSchedulerSvc.cpp
@@ -47,6 +47,17 @@ namespace
     std::sort( v.begin(), v.end(), DataObjIDSorter() );
     return v;
   }
+
+  bool subSlotExecuting (const EventSlot& slot)
+  {
+    for (const EventSlot& ss : slot.allSubSlots) {
+      if (ss.algsStates.algsPresent( AlgsExecutionStates::SCHEDULED )) {
+        return true;
+      }
+    }
+    return false;
+  }
+
 }
 
 //===========================================================================
@@ -762,6 +773,7 @@ StatusCode AvalancheSchedulerSvc::updateStates( int si, const int algo_index, Ev
     // Not complete because this would mean that the slot is already free!
     if ( !thisSlot.complete && m_precSvc->CFRulesResolved( thisSlot ) &&
          thisSlot.subSlotAlgsReady.empty() && // Account for sub-slot algs
+         !subSlotExecuting( thisSlot ) &&
          !thisSlot.algsStates.algsPresent( AlgsExecutionStates::CONTROLREADY ) &&
          !thisSlot.algsStates.algsPresent( AlgsExecutionStates::DATAREADY ) &&
          !thisSlot.algsStates.algsPresent( AlgsExecutionStates::SCHEDULED ) ) {
@@ -1155,6 +1167,7 @@ StatusCode AvalancheSchedulerSvc::scheduleEventView( EventContext const* sourceC
     unsigned int lastIndex = topSlot.allSubSlots.size();
     topSlot.allSubSlots.push_back( EventSlot( m_eventSlots[topSlotIndex], viewContext ) );
     topSlot.allSubSlots.back().entryPoint = nodeName;
+    topSlot.allSubSlots.back().algsStates.reset();
 
     // Store index of the new slot in lookup structures
     topSlot.contextToSlot[viewContext] = lastIndex;

This will fix the symptom, but I wanted to make sure that this is actually the correct behaviour for CF / Graph Traversal. Should it be pre-emptively scheduling algorithms in a Sequential CF structure? Could you take a look please, @ishapova?

Regards, Ben
