EventView bug with very specific control flow structure.
With the following control flow (CF) structure, which is the minimum needed to demonstrate the problem, EventView scheduling can crash:
* hltSteps
- filter_alg
- filter_alg2
* view_make_node
- filter_alg
- view_make_alg
* view_test_node
- view_test_alg
Necessary conditions:
- hltSteps must have StopOverride = False (other flags don't alter the problem, even though hltSteps is set as Sequential)
- filter_alg must pass
- filter_alg2 must fail
- view_make_alg must create at least 1 view
@ssnyder noticed that this was due to an algorithm (view_test_alg) running in a view while the scheduler already considered the event finished. See the discussion at https://its.cern.ch/jira/browse/ATR-17940 (assuming it's visible outside of ATLAS).
The re-use of filter_alg allows the graph traversal to "jump past" filter_alg2 and start scheduling other algorithms. Then, when filter_alg2 fails, the event ends, but view scheduling currently has no protection against this possibility, and a crash occurs.
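To make the ordering concrete, here is a minimal sketch that replays the sequence described above. It is a toy model, not the scheduler: `ToyState`, `ToyEvent` and `replayProblemSequence` are hypothetical names invented for this illustration, and the fixed ordering of the steps is an assumption (in reality it depends on thread timing).

```cpp
#include <cassert>
#include <vector>

// Toy stand-ins for illustration only; none of these are Gaudi types.
enum class ToyState { Initial, Scheduled, Passed, Failed };

struct ToyEvent {
  ToyState filterAlg   = ToyState::Initial;  // shared by hltSteps and view_make_node
  ToyState filterAlg2  = ToyState::Initial;
  ToyState viewMakeAlg = ToyState::Initial;
  std::vector<ToyState> viewTestAlgs;        // one view_test_alg instance per view
  bool finished = false;                     // the scheduler's opinion of the event
};

// Replays the assumed problem sequence and reports whether a view algorithm
// is still scheduled at the moment the event is marked finished.
bool replayProblemSequence( unsigned nViews ) {
  ToyEvent ev;

  // 1. filter_alg passes. Because it is re-used in view_make_node, this
  //    unblocks view_make_alg at the same time as filter_alg2.
  ev.filterAlg   = ToyState::Passed;
  ev.filterAlg2  = ToyState::Scheduled;
  ev.viewMakeAlg = ToyState::Scheduled;

  // 2. view_make_alg finishes first and creates nViews views, each of which
  //    gets a view_test_alg scheduled in it.
  ev.viewMakeAlg = ToyState::Passed;
  ev.viewTestAlgs.assign( nViews, ToyState::Scheduled );

  // 3. filter_alg2 fails, hltSteps is resolved, and the event is marked finished.
  ev.filterAlg2 = ToyState::Failed;
  ev.finished   = true;

  // 4. Is anything still in flight inside a view?
  for ( ToyState s : ev.viewTestAlgs ) {
    if ( s == ToyState::Scheduled ) return ev.finished;
  }
  return false;
}
```

In this toy, `replayProblemSequence( 1 )` reports the bad state, while `replayProblemSequence( 0 )` does not, which matches the condition that view_make_alg must create at least 1 view.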
Scott suggested the following fix:
diff --git a/GaudiHive/src/AvalancheSchedulerSvc.cpp b/GaudiHive/src/AvalancheSchedulerSvc.cpp
index bae703cff..35858f2c1 100644
--- a/GaudiHive/src/AvalancheSchedulerSvc.cpp
+++ b/GaudiHive/src/AvalancheSchedulerSvc.cpp
@@ -47,6 +47,17 @@ namespace
     std::sort( v.begin(), v.end(), DataObjIDSorter() );
     return v;
   }
+
+  bool subSlotExecuting (const EventSlot& slot)
+  {
+    for (const EventSlot& ss : slot.allSubSlots) {
+      if (ss.algsStates.algsPresent( AlgsExecutionStates::SCHEDULED )) {
+        return true;
+      }
+    }
+    return false;
+  }
+
 }

 //===========================================================================
@@ -762,6 +773,7 @@ StatusCode AvalancheSchedulerSvc::updateStates( int si, const int algo_index, Ev
   // Not complete because this would mean that the slot is already free!
   if ( !thisSlot.complete && m_precSvc->CFRulesResolved( thisSlot ) &&
        thisSlot.subSlotAlgsReady.empty() && // Account for sub-slot algs
+       !subSlotExecuting( thisSlot ) &&
        !thisSlot.algsStates.algsPresent( AlgsExecutionStates::CONTROLREADY ) &&
        !thisSlot.algsStates.algsPresent( AlgsExecutionStates::DATAREADY ) &&
        !thisSlot.algsStates.algsPresent( AlgsExecutionStates::SCHEDULED ) ) {
@@ -1155,6 +1167,7 @@ StatusCode AvalancheSchedulerSvc::scheduleEventView( EventContext const* sourceC
       unsigned int lastIndex = topSlot.allSubSlots.size();
       topSlot.allSubSlots.push_back( EventSlot( m_eventSlots[topSlotIndex], viewContext ) );
       topSlot.allSubSlots.back().entryPoint = nodeName;
+      topSlot.allSubSlots.back().algsStates.reset();

       // Store index of the new slot in lookup structures
       topSlot.contextToSlot[viewContext] = lastIndex;
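As a rough, standalone illustration of what the extra condition in the patch is meant to achieve, the sketch below compares a completion check with and without a sub-slot guard. Everything here is a simplified assumption: `ToySlot`, `ToyState`, `anyScheduled`, `canFinish` and `makeProblemSlot` are hypothetical names, and only `subSlotExecuting` mirrors the name of the helper in the patch. The real `EventSlot` and `AlgsExecutionStates` classes are not used.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy stand-ins for illustration only; not the Gaudi types.
enum class ToyState { Initial, Scheduled, Executed };

struct ToySlot {
  std::vector<ToyState> algStates;  // states of the algorithms owned by this slot
  std::vector<ToySlot>  subSlots;   // one per view, playing the role of allSubSlots
};

// True if any algorithm directly owned by this slot is still scheduled.
bool anyScheduled( const ToySlot& slot ) {
  return std::any_of( slot.algStates.begin(), slot.algStates.end(),
                      []( ToyState s ) { return s == ToyState::Scheduled; } );
}

// Same idea as the helper in the patch: is anything in flight in any sub-slot?
bool subSlotExecuting( const ToySlot& slot ) {
  return std::any_of( slot.subSlots.begin(), slot.subSlots.end(),
                      []( const ToySlot& ss ) { return anyScheduled( ss ); } );
}

// Simplified completion check; guardSubSlots switches the extra condition on or off.
bool canFinish( const ToySlot& slot, bool guardSubSlots ) {
  if ( anyScheduled( slot ) ) return false;
  if ( guardSubSlots && subSlotExecuting( slot ) ) return false;
  return true;
}

// Top-level slot with nothing left to run, plus one view still executing an algorithm.
ToySlot makeProblemSlot() {
  ToySlot view;
  view.algStates = { ToyState::Scheduled };
  ToySlot top;
  top.algStates = { ToyState::Executed, ToyState::Executed, ToyState::Executed };
  top.subSlots.push_back( view );
  return top;
}
```

Without the guard, `canFinish( makeProblemSlot(), false )` lets the toy event finish while the view is still busy; with the guard it waits. Note that this only delays completion and says nothing about whether the view algorithm should have been scheduled in the first place.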
This patch will fix the symptom, but I want to make sure that this is actually the correct behaviour for CF / graph traversal. Should it be pre-emptively scheduling algorithms in a Sequential CF structure? Could you take a look please, @ishapova?
Regards, Ben