Rework Queue Cleanup Runner implementation

This issue is motivated by a series of incidents we started seeing in production, in which the entire system becomes unresponsive: https://gitlab.cern.ch/cta/operations/-/issues/719. The problem is now understood and we have a stable solution. All details of the investigation are here: https://codimd.web.cern.ch/fWvlG7_fSsKi8uAyXd7H4g

To simplify the review process I have split the MR into several smaller ones, each isolating a different concern. List of MRs associated with this ticket:

  • Logging Improvements.
  • Agent Queue reservation.
  • Job batching.
  • FIFO Job insertion.

After seeing the stress test results, we have decided to make no modifications to the job insertion logic and not to create a dedicated method for FIFO-style insertion.

Results without FIFO insertion for ~70k jobs (818 s, 13.63 min): [plot: newplot_14_]

Results with FIFO insertion for ~70k jobs (785 s, 13.08 min): [plot: image]

Overall, the performance is the same. The spikes of 20+ seconds are not yet understood. We cannot conclude from two runs that FIFO insertion makes the spikes less frequent, but it is worth keeping in mind if this is investigated in the future.
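
For reference, a quick sanity check of the two run times quoted above. This is only a back-of-the-envelope sketch: it assumes exactly 70,000 jobs (the issue only says "~70k"), and the helper name `summarize` is made up for illustration and is not part of the CTA code base.

```python
# Rough comparison of the two stress-test runs quoted in this issue.
# ASSUMPTION: "~70k jobs" is taken as exactly 70,000 for illustration.
JOBS = 70_000

def summarize(seconds: float, jobs: int = JOBS) -> dict:
    """Convert a total run time into minutes and average throughput."""
    return {
        "minutes": round(seconds / 60, 2),
        "jobs_per_sec": round(jobs / seconds, 1),
    }

without_fifo = summarize(818)  # run without FIFO insertion
with_fifo = summarize(785)     # run with FIFO insertion

# Relative difference in total run time between the two runs, in percent.
relative_diff_pct = round((818 - 785) / 818 * 100, 1)

print("without FIFO:", without_fifo)
print("with FIFO:   ", with_fifo)
print("relative difference (%):", relative_diff_pct)
```

Under that assumption the two runs differ by roughly 4% in total time, which is consistent with treating the performance as the same given only one run of each.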

Edited by Pablo Oliver Cortes