HltControlFlowMgr retries producers that fail
While investigating an unrelated problem, we stumbled on some odd behaviour of HltControlFlowMgr: when the producer of some data fails, it is executed again for every consumer of the product.
The attached configuration (reproducer.py), to be invoked with gaudirun.py ./reproducer.py:config, consists of one producer and multiple consumers of the product. The producer's operator() just throws an exception, so the execution should stop, but the algorithm is retried multiple times:
$ gaudirun.py ./reproducer.py:config
[...]
producer ERROR producer : failing by construction
producer ERROR Maximum number of errors ( 'ErrorMax':1) reached.
HLTControlFlowMgr FATAL Event failed in Node Gaudi__Examples__IntDataConsumer/consumer0 : Error in algorithm execute
producer ERROR producer : failing by construction
producer ERROR Maximum number of errors ( 'ErrorMax':1) reached.
HLTControlFlowMgr FATAL Event failed in Node Gaudi__Examples__IntDataConsumer/consumer1 : Error in algorithm execute
producer ERROR producer : failing by construction
producer ERROR Maximum number of errors ( 'ErrorMax':1) reached.
HLTControlFlowMgr FATAL Event failed in Node Gaudi__Examples__IntDataConsumer/consumer2 : Error in algorithm execute
producer ERROR producer : failing by construction
producer ERROR Maximum number of errors ( 'ErrorMax':1) reached.
HLTControlFlowMgr FATAL Event failed in Node Gaudi__Examples__IntDataConsumer/consumer3 : Error in algorithm execute
producer ERROR producer : failing by construction
producer ERROR Maximum number of errors ( 'ErrorMax':1) reached.
HLTControlFlowMgr FATAL Event failed in Node Gaudi__Examples__IntDataConsumer/consumer4 : Error in algorithm execute
producer ERROR producer : failing by construction
producer ERROR Maximum number of errors ( 'ErrorMax':1) reached.
HLTControlFlowMgr FATAL Event failed in Node FailingIntProducer/producer : Error in algorithm execute
HLTControlFlowMgr FATAL *** Event 0 on slot 0 failed! ***
[...]
ApplicationMgr ERROR Application Manager Terminated with error code 3
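Our understanding is that in BasicNode::execute, when the algorithm fails, it is not recorded in AlgoStates as "executed, but failed" (partly because AlgoStates can only record two bools, one for executed and the other for filter passed). As a result, every node that requests the producer finds it marked as not yet executed and runs it again.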
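
To make the suspected mechanism concrete, here is a toy model in plain Python. It is not the HLTControlFlowMgr code: ToyScheduler, AlgState and the record_failures switch are hypothetical names invented for this sketch, and the only assumption carried over from the real code is the one described above (a per-algorithm state with two bools that is left untouched when the algorithm fails).

```python
from dataclasses import dataclass


@dataclass
class AlgState:
    # Mirrors the two flags described above; names are illustrative only.
    executed: bool = False
    filter_passed: bool = False


class ToyScheduler:
    """Toy model of a scheduler that runs a producer on demand."""

    def __init__(self, record_failures: bool):
        # record_failures=False models the suspected current behaviour,
        # record_failures=True models marking a failed algorithm as executed.
        self.record_failures = record_failures
        self.producer_state = AlgState()
        self.producer_runs = 0

    def request_producer(self) -> bool:
        """Run the producer unless its state says it already ran."""
        if self.producer_state.executed:
            return self.producer_state.filter_passed
        self.producer_runs += 1
        # The producer fails by construction, like the one in the reproducer.
        if self.record_failures:
            self.producer_state.executed = True
            self.producer_state.filter_passed = False
        return False

    def run_event(self, n_consumers: int) -> int:
        # Each consumer asks for its input, then the producer's own node runs.
        for _ in range(n_consumers):
            self.request_producer()
        self.request_producer()
        return self.producer_runs


if __name__ == "__main__":
    print("failure not recorded:", ToyScheduler(False).run_event(5), "producer runs")
    print("failure recorded:    ", ToyScheduler(True).run_event(5), "producer runs")
```

With five consumers, the first variant runs the producer six times, which matches the six "failing by construction" messages in the log above; the second variant runs it once. Whether the real fix should be a third state in AlgoStates or an early abort of the event is a separate question that this sketch does not try to answer.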