
Fix issue with procid overwriting ProcId during the HTCHandler submission

Fix an issue where procid overwrites ProcId during the submission, because the variable names there do not seem to be case sensitive.

ProcId is the condor ProcId. procid is the JobCtrl job ID.

For the first submission they are identical, but for resubmissions, when only a fraction of the jobs failed and need resubmitting, they are different.

The observed behaviour is that the htc-id field starts, for example, as 5666427.0 when the job is resubmitted and is in idle status. After the status changes to running, the htc-id also changes, e.g. to 5666427.36, which is not expected.

So the real condor ClusterID.ProcId combination for the running resubmitted job is 5666427.0 while the htc-id field in the DB is set to 5666427.36.
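The collision can be illustrated with a minimal sketch. It does not use the HTCHandler or condor code; `CaseInsensitiveVars` and `build_htc_id` are hypothetical names, and the class is only a toy stand-in for a namespace that ignores case in variable names, which is what the submission appears to do.

```python
class CaseInsensitiveVars:
    """Toy namespace whose keys ignore case (hypothetical, for illustration only)."""

    def __init__(self):
        self._data = {}

    def set(self, name, value):
        # "ProcId" and "procid" end up under the same lowercase key.
        self._data[name.lower()] = value

    def get(self, name):
        return self._data[name.lower()]


def build_htc_id(variables, cluster_id):
    """Build the ClusterID.ProcId string from whatever is stored as ProcId."""
    return f"{cluster_id}.{variables.get('ProcId')}"


# Resubmission of a single failed job: condor assigns ProcId 0,
# while the JobCtrl job ID is 36.
colliding = CaseInsensitiveVars()
colliding.set("ProcId", 0)
colliding.set("procid", 36)  # silently overwrites ProcId

renamed = CaseInsensitiveVars()
renamed.set("ProcId", 0)
renamed.set("jobid", 36)  # distinct name, ProcId stays intact

colliding_id = build_htc_id(colliding, 5666427)
renamed_id = build_htc_id(renamed, 5666427)
print(colliding_id)
print(renamed_id)
```

With the colliding name the stored ID takes the JobCtrl job ID as its second component; with a distinct name it keeps the condor ProcId.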

This is not a problem as long as the job finishes before the next resubmit check. That check calls check_running_job(jid = jctrl.getJob(jid=jid, last=True)[-1]['htc-id']), which looks for a job with that htc-id on condor and, if there is none, sets the job in the DB to failed.

So one ends up with a running job on condor that will finish or fail at some point. And in addition resubmit submits a new job to condor because it thinks the original one is not on condor anymore.
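A rough sketch of how the stale htc-id leads to a duplicate submission is given below. The functions `is_on_condor` and `resubmit_check` are hypothetical simplifications, not the actual check_running_job or resubmit code, and the condor queue is modelled as a plain set of ID strings.

```python
def is_on_condor(htc_id, condor_queue):
    """Hypothetical stand-in for the lookup: is this htc-id in the queue?"""
    return htc_id in condor_queue


def resubmit_check(db_job, condor_queue):
    """Mark the DB job as failed if its htc-id is not found on condor.

    Returns True when the job would be submitted again.
    """
    if not is_on_condor(db_job["htc-id"], condor_queue):
        db_job["status"] = "failed"
        return True
    return False


# The resubmitted job really runs on condor as 5666427.0.
condor_queue = {"5666427.0"}

stale_job = {"htc-id": "5666427.36", "status": "running"}  # overwritten ID
correct_job = {"htc-id": "5666427.0", "status": "running"}  # expected ID

stale_resubmitted = resubmit_check(stale_job, condor_queue)
correct_resubmitted = resubmit_check(correct_job, condor_queue)
print(stale_resubmitted, stale_job["status"])
print(correct_resubmitted, correct_job["status"])
```

In this toy model the job with the overwritten ID is marked failed and queued for another submission even though its real counterpart is still in the queue.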

This MR fixes this by renaming procid to jobid.
