Implement error detection for Hive scheduler/eventloopmgr
Previously, the AvalancheScheduler and HiveSlimEventLoopMgr aborted the job if an algorithm returned an error (or an event stalled). For online applications this is usually not the desired behavior; one should rather continue with the next event. Now that event-wise stall detection has been implemented in !690 (merged) by @ishapova, this is a rather trivial change:
- Add an `AbortOnFailure` property to HiveSlimEventLoopMgr (true by default) to toggle the behavior
- Allow an algorithm to be in `ERROR` state in addition to `EVTACCEPTED`/`EVTREJECTED`
- Do not immediately abort the event processing on failure, but rather rely on the stall detection to abort the event. This has the benefit that as much of the graph as possible is still executed.
- Add a unit test and a feature to simulate ERRORs in CPUCruncher
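
The behavior described above can be illustrated with a small toy model. This is only a sketch and not the Gaudi implementation: `process_event`, `event_failed`, `run_job`, the `deps` mapping and the example algorithms are hypothetical names invented here, and only the state names and the `abort_on_failure` flag (mirroring `AbortOnFailure`) come from this description. It assumes that an algorithm in `ERROR` state never satisfies its dependents, so the event stalls once nothing else can be scheduled.

```python
from enum import Enum, auto


class AlgState(Enum):
    INITIAL = auto()
    EVTACCEPTED = auto()
    EVTREJECTED = auto()
    ERROR = auto()


# States that count as "finished without error" for dependency purposes.
DONE = (AlgState.EVTACCEPTED, AlgState.EVTREJECTED)


def process_event(algs, deps):
    """Run every algorithm whose dependencies finished without error.

    algs: dict name -> callable returning True (accept) or False (reject);
          an exception stands in for an algorithm returning an error.
    deps: dict name -> list of algorithm names that must finish first.
    """
    states = {name: AlgState.INITIAL for name in algs}
    progress = True
    while progress:
        progress = False
        for name, run in algs.items():
            if states[name] is not AlgState.INITIAL:
                continue
            if all(states[d] in DONE for d in deps.get(name, ())):
                try:
                    accepted = run()
                    states[name] = (AlgState.EVTACCEPTED if accepted
                                    else AlgState.EVTREJECTED)
                except Exception:
                    # Record the failure but keep scheduling the rest.
                    states[name] = AlgState.ERROR
                progress = True
    return states


def event_failed(states):
    """An event failed if an algorithm errored or never ran (stall)."""
    return any(s in (AlgState.ERROR, AlgState.INITIAL) for s in states.values())


def run_job(events, deps, abort_on_failure=True):
    """Return (completed event numbers, failed event numbers)."""
    completed, failed = [], []
    for number, algs in enumerate(events):
        if event_failed(process_event(algs, deps)):
            failed.append(number)
            if abort_on_failure:
                break  # old behavior: stop the whole job
        else:
            completed.append(number)
    return completed, failed


def ok():
    return True


def boom():
    raise RuntimeError("simulated algorithm failure")


deps = {"B": ["A"]}                      # B consumes the output of A
good = {"A": ok, "B": ok, "C": ok}
bad = {"A": boom, "B": ok, "C": ok}      # A fails, so B can never run
```

In this model, calling `run_job([good, bad, good], deps, abort_on_failure=False)` skips the failed event and continues with the next one, while the default stops at the first failure. In the failing event, the independent algorithm `C` still runs even though `A` is in `ERROR` state, which is the point of deferring the abort to the stall detection.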