"The bigger problem is the ‘exceptional’ case, where eg. decoding algorithms run
into ’this should never happen’ type of problems, such as data corruption. Basically
things where one would naturally throw exceptions. In that case, these should
‘bubble up’ into the scheduler, and it should ‘catch’ them instead of effectively
crashing. The fact that we actually run (re-entrant) tasks on separate threads actually implies
that we can ’shut down’ things while still reasonably sure that the rest of the system
is still in a good state. Hence when this happens, the scheduler can actually ’take action’
on something like this happening, and we should discuss how we can configure the
scheduler to ’take alternate action’ (such as run a ’task of last resort’ which flushes the
event into a separate stream) in case of these type of problem."
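To make the 'catch in the scheduler, take alternate action' idea a bit more concrete, here is a minimal C++ sketch; it is not existing Gaudi code, and names such as `runEvent` and `TaskOfLastResort` are purely illustrative assumptions. The point is only that the try/catch lives in the scheduler's per-event task wrapper, so a throwing algorithm terminates one event, not the process.

```cpp
// Minimal sketch (not existing Gaudi code): the scheduler's per-event wrapper
// catches whatever escapes the algorithm chain and hands the event context
// to a configurable 'task of last resort'.
#include <exception>
#include <functional>
#include <iostream>
#include <utility>

struct EventContext { int slot = -1; };

// Example alternate action: flush the event into a separate stream.
struct TaskOfLastResort {
  void operator()(const EventContext& ctx) const {
    std::cerr << "flushing event in slot " << ctx.slot << " to the 'last resort' stream\n";
  }
};

class Scheduler {
public:
  explicit Scheduler(std::function<void(const EventContext&)> lastResort)
      : m_lastResort(std::move(lastResort)) {}

  // Called on a worker thread for each event.
  void runEvent(const EventContext& ctx,
                const std::function<void(const EventContext&)>& executeEvent) {
    try {
      executeEvent(ctx);   // normal processing
    } catch (const std::exception& e) {
      std::cerr << "caught '" << e.what() << "', taking alternate action\n";
      m_lastResort(ctx);   // configured alternate action instead of crashing
    } catch (...) {
      m_lastResort(ctx);   // unknown failure: same path
    }
  }

private:
  std::function<void(const EventContext&)> m_lastResort;
};
```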
Apart from the fact that exceptions never bubble up higher than the first sysExecute (a poor design choice, I believe), I'm not sure I'm ready to trust our applications to be able to continue processing after an exception as if nothing happened. Such behaviour is technically possible, but it requires discipline we lack.
If we assume the application will produce sensible results after an exception or a StatusCode::FAILURE, it should be enough to teach the scheduler to divert failed events to a dedicated filter/stream and continue business as usual.
I think in the scenario where an exception is thrown we don't have to rely on the application producing sensible results, as all we need in this case is to save the basic raw event object, i.e. we should just reject whatever else is in the TES for that event. The idea is just to save the raw event to whatever stream we use for these events, such that the processing can be rerun offline by experts to investigate what happened.
So all (I think ?) we need the scheduler to do in these cases is capture the event where this happens, send it to a special stream (thread ?) in the scheduler which is configured just to save the hardware raw events, and nothing else.
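A rough sketch of what that special handling could amount to, assuming hypothetical types `RawEvent`, `EventStore` and `QuarantineWriter` (none of these are the real LHCb/Gaudi interfaces): drop everything derived for that event and persist only the raw event, so the processing can be rerun offline.

```cpp
// Hypothetical sketch: on failure, reject all derived data and save only the
// hardware raw event to a dedicated stream for offline investigation.
#include <memory>

struct RawEvent {};

struct EventStore {                      // stand-in for the TES of one event
  std::shared_ptr<RawEvent> rawEvent;    // the raw banks read from the DAQ
  void clearDerivedData() { /* drop everything except rawEvent */ }
};

struct QuarantineWriter {                // stand-in for the special stream
  void write(const RawEvent&) { /* append to the 'bad events' output */ }
};

void handleFailedEvent(EventStore& tes, QuarantineWriter& quarantine) {
  tes.clearDerivedData();                             // reject whatever else is in the TES
  if (tes.rawEvent) quarantine.write(*tes.rawEvent);  // keep only the raw event
}
```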
In the run-up to Run 1 this was discussed as well, and the conclusion then was indeed, for the same reasons, that we cannot trust the state of the executable anymore, and should just kill it off and restart it... The difference now is that, due to multi-threading, one process is a larger fraction of the resources, which makes it a bit more painful to kill & restart; on the other hand, there is much less mutating state, so we should be in a better position than back then to continue after an exception. Now, the problem is of course that 'better' is a relative statement, and there is no easy way to figure out whether 'better' is actually 'good enough'... (as the previous state was clearly far from good enough).
But even if we do not trust things to be OK, then at least this should still bubble up to the scheduler, it should flag all events it is processing, and notify the DAQ about the fact that it should no longer continue, and then shut down processing. So even then, in the end, the only real difference is 'what does the scheduler do next' and it still has to recognise it got into trouble...
I think even if we ultimately decide we do not trust the state of the application enough to want to continue processing, and thus need to shut down the process (and thus allow the online systems to restart it), unless we have a way of capturing and saving the events that caused this we will never really be able to investigate what happened.
To be discussed with @frankm, but it should be feasible to send raw events that resulted in a StatusCode::FAILURE to a quarantine storage, and with gaudi/Gaudi!966 (merged) we do not even have to touch the scheduler, as the status code of the event is passed to the online application (unless the scheduler decides to do something silly such as terminate).
About how to treat the other events processed by the same process, we have to decide if we want to believe that the failing event did not interfere with them or we want to play it safe and flag them in some way. In any case the process should be drained and restarted.
Looking at the scheduler code, I was expecting to see something which pops 'things' (probably EventContext) from an input queue, processes them, and then pushes them to an output queue. Now, I do recognise it 'popping' things from the input queue in ExecuteEvent as expected, but I don't quite see how it pushes things to the output queue. Also, would the IQueueingEventProcessor not need some functionality beyond push, pop and empty to notify 'stop trying to pop, I'm in trouble, and will never respond to any pop request with anything other than std::nullopt' -- i.e. some 'out of band' communication to indicate failure on the other side of the queue? Similarly, push is a blocking operation, which is fine, but perhaps it should not be void but allow for a 'failure' return which says: 'thank you for pushing, but I am not in a state to do anything -- please stop pushing!'
Hmm -- of course the alternative is that the scheduler is the queue (as opposed to moving things from one queue to another) and that one pushes things to the scheduler, which at its leisure consumes them and, once done, makes them available for pop again. So push is the input queue, and pop is the output queue. And then in case of failure, the scheduler returns a 'failed' status code together with the event context from pop. So in this setup, everything 'pushed' will be 'popped' later, but it may come with a failure attached to it... (internally, the scheduler thus has to deal with two queues to be able to serve the external 'push' requests and populate the 'pop' queue, but there is no need to expose that to the outside world).
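As a sketch of the shape that model could take (simplified, and not the verbatim IQueueingEventProcessor header, so treat the names and signatures as assumptions):

```cpp
// Sketch of the 'scheduler is the queue' model: push() is the input side,
// pop() is the output side, and every pushed context eventually comes back
// from pop() together with a status that may carry a failure.
#include <optional>
#include <tuple>

struct EventContext {};
struct StatusCode { bool success = true; };

struct QueueingEventProcessorSketch {
  using ResultType = std::tuple<StatusCode, EventContext>;

  // Blocking: may wait when the scheduler has no free slots/resources.
  virtual void push(EventContext&& ctx) = 0;

  // True when no finished event is currently available.
  virtual bool empty() const = 0;

  // Non-blocking ('try_pop'): std::nullopt when nothing is ready yet,
  // otherwise the finished context plus its (possibly failed) status.
  virtual std::optional<ResultType> pop() = 0;

  virtual ~QueueingEventProcessorSketch() = default;
};
```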
Now that you say it, I remember mentioning that the scheduler doesn't need an (explicit) input queue, as it is itself a queue. It needs a queue to store the results to, so that the main process can get the results from it.
About blocking and non-blocking, I remember that at some point I realized at least one of the two sides had to be non-blocking, so I decided that the push side could be blocking (as you may run out of resources) and made the pop side always non-blocking (probably I should have called it try_pop to make it more explicit).
Talking about a side channel for hard problems, I would say that it's implementation specific and exceptions seem a valid option in that case, for example if the system is down I can expect push to throw.
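For illustration, a driver loop built on the interface sketched above (all helpers here, `nextEvent` and `routeToQuarantine`, are hypothetical stand-ins, not Online code), with blocking push, non-blocking pop, and an exception from push acting as the 'side channel' for hard problems:

```cpp
// Hypothetical driver: push may block (and throw if the system is down),
// pop never blocks, and failed events are routed to a quarantine path.
#include <exception>
#include <iostream>
#include <optional>
#include <utility>

std::optional<EventContext> nextEvent();      // hypothetical event source
void routeToQuarantine(const EventContext&);  // hypothetical failure handling

void drive(QueueingEventProcessorSketch& scheduler) {
  try {
    while (auto ctx = nextEvent()) {
      scheduler.push(std::move(*ctx));         // blocking; throws if the system is down
      while (auto result = scheduler.pop()) {  // drain whatever is ready, without blocking
        auto& [sc, done] = *result;
        if (!sc.success) routeToQuarantine(done);
      }
    }
  } catch (const std::exception& e) {
    std::cerr << "scheduler reported a hard problem: " << e.what()
              << "; stop pushing, drain and restart\n";
  }
}
```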
We can discuss extensions and changes to IQueueingEventProcessor, but I think it's a bit off-topic here.
Just to come back to this, @graven @clemenci: do you think it is possible here to come up with some concrete tasks that can be worked on to start adding the functionality needed to support this, even if at this point we cannot say exactly what the correct response of the scheduler to an exception is? It would be good to have some explicit tasks which people could in principle start to work on.
It seems to me that regardless of exactly what we want the scheduler to do when an exception is thrown, the first step is to see whether the scheduler can get to see them at all (i.e. that they aren't caught elsewhere, such as in sysExecute), and then to start adding options for what to do when they occur?
Just to come back to this, are there any specific tasks we could get people to do right now? E.g. improve how the scheduler handles exceptions when they are thrown (which might include preventing sysExecute() getting in the way of them)?
A couple of comments from the RTA coordination meeting this morning.
There appear to be two (or more?) somewhat different tasks related to this:
1. Changes to the scheduler/Gaudi code to correctly intercept and handle exceptions. As @graven put it, 'probably about a week's work for someone familiar with Gaudi and the scheduler, a lot more for someone who is not'.
2. Changes on the online side to correctly handle the new stream there.
The discussion here has mostly been around HLT2 so far, but HLT1 also needs something similar, and it is perhaps (@gligorov) more critical at this point for the upcoming beam tests.