Implement error detection for Hive scheduler/eventloopmgr (!706) · Merge requests · Gaudi / Gaudi

Previously, the AvalancheScheduler and HiveSlimEventLoopMgr aborted the job if an algorithm returned an error (or an event stalled). For online applications this is usually not the correct behavior; one should instead continue with the next event. Now that event-wise stall detection has been implemented in !690 (merged) by @ishapova, this becomes a rather trivial change:

Add an AbortOnFailure property to HiveSlimEventLoopMgr (true by default) to toggle the behavior
Allow an algorithm to be in ERROR state in addition to EVTACCEPTED/EVTREJECTED
Do not immediately abort the event processing on failure, but rather rely on the stall detection to abort the event. This has the benefit that the maximum possible graph is executed.
Add a unit test and a feature to simulate ERRORs in CPUCruncher
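
The behavior described above can be illustrated with a small toy model. This is only a sketch in plain Python, not Gaudi code: `run_event`, `event_loop` and the `abort_on_failure` argument are hypothetical names that mimic the idea of the AbortOnFailure property, and the dependency graph is invented for illustration. A failing algorithm is marked ERROR, the rest of the graph keeps running as far as it can, and the event is only declared failed once no further progress is possible (the stall).

```python
from enum import Enum, auto


class State(Enum):
    INITIAL = auto()
    EVTACCEPTED = auto()
    EVTREJECTED = auto()
    ERROR = auto()


FINISHED_OK = (State.EVTACCEPTED, State.EVTREJECTED)


def run_event(algs, deps, failing):
    """Run one event of a toy dependency graph (hypothetical helper).

    algs: ordered algorithm names; deps: name -> set of prerequisite names;
    failing: names that return an error for this event.
    Returns (event_ok, states).
    """
    states = {name: State.INITIAL for name in algs}
    progressed = True
    while progressed:
        progressed = False
        for name in algs:
            if states[name] is not State.INITIAL:
                continue
            # Runnable only if every prerequisite finished without error.
            if all(states[d] in FINISHED_OK for d in deps.get(name, ())):
                states[name] = State.ERROR if name in failing else State.EVTACCEPTED
                progressed = True
    # No progress is possible any more. If some algorithm is in ERROR or
    # never became runnable, the event stalled and counts as failed.
    ok = all(s in FINISHED_OK for s in states.values())
    return ok, states


def event_loop(events, algs, deps, abort_on_failure=True):
    """Process events; stop at the first failed event only if requested."""
    processed, failed = 0, []
    for evt, failing in enumerate(events):
        ok, _ = run_event(algs, deps, failing)
        processed += 1
        if not ok:
            failed.append(evt)
            if abort_on_failure:
                break  # old behavior: abort the job
    return processed, failed
```

With the graph `A -> B` and `C -> D` and a failure in `A`, the independent branch `C -> D` still completes before the event is flagged, which is the "maximum possible graph" point made above. Calling `event_loop` with `abort_on_failure=False` then moves on to the next event instead of stopping.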
