Possible race-conditions in HTCHandler.submit and HTCHandler.resubmit
I found a few problems/bugs while debugging the PPS Automation, some of which are prone to 'race-conditions' (job status changed by HTCondor while the JobCtrl is acting):
- In TaskHandlers.py#L712 we allow the submission of jobs for tasks in the re-processing status even if previous jobs are still in the 'idle' state (we only forbid it if they are running). This check seems error-prone/bugged to me for the following reasons:
  - The jobs associated with the pre-reprocessing state are not killed. E.g. if the job with jid=0 was 'idle' when the task was marked for reprocessing, a new job with jid=0 will be submitted to condor. The previous one might then run and interact with the job record in InfluxDB, altering the processing of the one submitted after it.
  - The check for running jobs is performed sequentially with `condor_q` commands. This takes time, and jobs that are not running might start running after they have passed the check while the check is still ongoing, defeating its purpose.
  - At TaskHandlers.py#L712 the check_running_job function is called with the 'jid', but its definition at TaskHandlers.py#L684 expects the HTCondor ID.
  - I think we should forbid the submission unless all the jobs are done or failed. This can be achieved by editing TaskHandlers.py#L711-712 as follows:
    `if len(jobs['idle'] + jobs['running']) > 0: allow_repr = False`
- In TaskHandlers.py#L776-778 there's an unintended behaviour. When the job is submitted for the first time (submit method), it has no htc-id field. If the resubmit method is called, the linked code will check (as the comment says) if a job marked as running or idle in the db is indeed running in condor.
  - TaskHandlers.py#L777 returns `False` if the job is in the `idle` state. This causes the job to be marked as failed even if it has not run (breaking the check).
  - In the following `if` statement at TaskHandlers.py#L781, these jobs will be re-submitted, causing the number of jobs to increase at every 'resubmit' call.
I'm preparing (and testing) a commit to fix these two issues.
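To make the two proposed changes concrete, here is a minimal sketch. It is not the actual TaskHandlers.py code: the function names `allow_reprocessing` and `is_active`, the shape of the `jobs` dict, and the `snapshot` mapping are all assumptions made for illustration. The only HTCondor-specific facts used are the JobStatus codes 1 (idle) and 2 (running). The idea is to take the queue state once, and then judge every job against that same snapshot, treating idle jobs as still alive.

```python
# HTCondor JobStatus codes: 1 = idle, 2 = running.
# Every other code (or a job missing from the queue) counts as "not active" here.
IDLE, RUNNING = 1, 2
ACTIVE = {IDLE, RUNNING}


def allow_reprocessing(jobs):
    """Hypothetical stand-in for the check at TaskHandlers.py#L711-712.

    `jobs` is assumed to map a status name to the list of job ids in that
    status. Reprocessing is allowed only if nothing is idle or running.
    """
    return len(jobs.get("idle", []) + jobs.get("running", [])) == 0


def is_active(htc_id, snapshot):
    """Return True if the job is idle or running in the given snapshot.

    `snapshot` is assumed to map HTCondor ids to JobStatus codes and to be
    built from a single queue query, so that all jobs are compared against
    the same point in time instead of one query per job.
    """
    return snapshot.get(htc_id) in ACTIVE


if __name__ == "__main__":
    # Illustrative data only; the ids and statuses are made up.
    snapshot = {"1234.0": IDLE, "1234.1": RUNNING, "1234.2": 5}
    for htc_id in ("1234.0", "1234.1", "1234.2", "1234.3"):
        print(htc_id, "active" if is_active(htc_id, snapshot) else "not active")
```

With a check like `is_active`, an idle job would no longer be marked as failed and re-submitted, which is what makes the number of jobs grow on each 'resubmit' call. Note that the lookup has to use the HTCondor ID, not the 'jid'.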