Fix raw data event offsets to fix slice splitting

Roel Aaij requested to merge fix_slice_splitting into master

When a slice is split in response to a failure to reserve device memory, the event interval in the runtime options is used to indicate which events in the new batch to run over. It is implemented this way because the contents of a slice cannot be modified: in production, the memory containing the incoming MEPs is owned by the BufferManager, and there is no spare memory bandwidth to change anything on the Allen side.
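
A minimal sketch of that idea, with illustrative names (`RuntimeOptions`, `event_interval`, `split_batch`) that are assumptions and need not match the actual Allen types: the slice itself is never touched, only the interval carried by the resubmitted runtime options changes.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative stand-in for the runtime options of one batch.
struct RuntimeOptions {
  // First and one-past-last event index within the slice to process.
  std::pair<std::size_t, std::size_t> event_interval;
};

// Split a batch that failed to fit in device memory into two
// non-overlapping halves over the same, unmodified slice.
std::vector<RuntimeOptions> split_batch(RuntimeOptions const& opts) {
  auto const [first, last] = opts.event_interval;
  auto const mid = first + (last - first) / 2;
  return {RuntimeOptions{{first, mid}}, RuntimeOptions{{mid, last}}};
}
```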

When a slice is split, the same slice is resubmitted, but with different, non-overlapping intervals. The event interval only matters when "addressing" raw data fragments; the rest of the sequence is self-consistent by design. The event interval was not properly taken into account in any of the decoding algorithms; this MR addresses that.
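
To illustrate the kind of correction involved, here is a hedged sketch with made-up names (`RawBanks`, `event_fragment`, `interval_first`) rather than the real Allen decoding interface: the batch-local event number has to be offset by the start of the event interval before a fragment offset is looked up in the slice.

```cpp
#include <cstddef>
#include <cstdint>

// Raw bank data and offsets for all events in a slice; a resubmitted
// batch only covers [interval_first, interval_last) of them.
struct RawBanks {
  char const* data = nullptr;         // slice payload, owned elsewhere
  uint32_t const* offsets = nullptr;  // one entry per event in the slice
};

// Wrong: indexing with the batch-local event number ignores the interval,
// so a resubmitted batch would re-read the first events of the slice:
//   char const* fragment = banks.data + banks.offsets[event_in_batch];

// Correct: offset the batch-local event number by the interval start so
// that each resubmitted batch addresses its own events.
inline char const* event_fragment(RawBanks const& banks,
                                  std::size_t interval_first,
                                  std::size_t event_in_batch) {
  return banks.data + banks.offsets[interval_first + event_in_batch];
}
```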

A side effect of the slice splitting is that any validation or monitoring algorithm sequenced before an algorithm that fails to reserve memory will double count that batch, because it has already run by the time the exception is thrown.

To resolve this for the validators, all validation algorithms should be sequenced after any algorithm that may fail to allocate device memory. Looking through the configuration, they are part of their own CompositeNode, but in the sequence they seem to be constrained only by their data flow, even when ForceOrder=True is set on the parent node.

To avoid double counting in the monitoring, any filling of (Gaudi) histograms would have to be done at the end of the sequence. A partial workaround may be to flag a batch as "aborted" once the exception is caught, so that the aggregator can skip that batch; since the aggregator runs on a timer, this may not fully cover all cases. A rough sketch of the flag idea follows.
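
This sketch assumes the memory-reservation failure surfaces as an exception (shown here as `std::bad_alloc`; the actual exception type in Allen may differ) and uses hypothetical names (`BatchStatus`, `run_batch`) that are not part of the existing code.

```cpp
#include <atomic>
#include <new>

// Per-batch bookkeeping; the "aborted" flag is a hypothetical marker the
// monitoring aggregator could check before consuming the batch's counters.
struct BatchStatus {
  std::atomic<bool> aborted{false};
};

template <typename Sequence>
void run_batch(Sequence& sequence, BatchStatus& status) {
  try {
    sequence.run();  // may throw if device memory cannot be reserved
  } catch (std::bad_alloc const&) {
    // Flag the batch so the aggregator skips its partially filled
    // monitoring counters; the batch is then split and resubmitted.
    status.aborted.store(true, std::memory_order_relaxed);
    throw;
  }
}
```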

I would propose to address the double counting in the validators in this MR, but to leave the monitoring issue for a later improvement.

/cc @gligorov @cagapopo
