Better handling of failed jobs (for HTCondor)
When only a few batch jobs fail (e.g. because of a problem accessing a file on a slow filesystem), it can be quite tedious to
- find out which have failed,
- resubmit only those, and
- hadd the results manually before re-running the postprocessing step (already nice that it can be run separately!).
The slurm_resubmit script of CP3SlurmUtils helps a lot with the first two steps. Perhaps it could be integrated with the batch driver, to automatically resubmit failed jobs a given number of times before moving on to the hadd-ing step.
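A rough sketch of what such a resubmission loop could look like, just to illustrate the idea. The names run_with_resubmission, submit, wait_and_get_failed and max_resubmissions are made up for this example and are not part of the batch driver or of CP3SlurmUtils; the two callables stand in for whatever the backend (slurm or HTCondor) provides.

```python
from typing import Callable, Iterable, List, Set, Tuple


def run_with_resubmission(
    job_ids: Iterable[int],
    submit: Callable[[List[int]], None],
    wait_and_get_failed: Callable[[List[int]], Set[int]],
    max_resubmissions: int = 2,
) -> Tuple[List[int], List[int]]:
    """Submit jobs and resubmit only the failed ones, a limited number of times.

    `submit` and `wait_and_get_failed` are hypothetical backend hooks:
    the first submits the given job ids, the second blocks until they
    have finished and returns the subset that failed.
    Returns (succeeded, still_failed), both sorted.
    """
    all_jobs = sorted(set(job_ids))
    pending = list(all_jobs)
    # One initial submission plus at most `max_resubmissions` retries.
    for _ in range(max_resubmissions + 1):
        if not pending:
            break
        submit(pending)
        failed = wait_and_get_failed(pending)
        # Only the jobs that failed in this round are tried again.
        pending = sorted(j for j in pending if j in failed)
    still_failed = pending
    succeeded = [j for j in all_jobs if j not in still_failed]
    return succeeded, still_failed
```

The driver would then move on to the hadd-ing step only if the second list is empty, and otherwise report which jobs are still failing.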
Also needed: catching Ctrl+C when the driver is monitoring jobs, to cancel the running jobs.
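For the Ctrl+C part, something along these lines might be enough (a minimal sketch, not the actual driver code). monitor_jobs, poll and cancel are hypothetical names; cancel would presumably wrap scancel for slurm or condor_rm for HTCondor.

```python
import time
from typing import Callable, Iterable, List


def monitor_jobs(
    poll: Callable[[], Iterable[int]],
    cancel: Callable[[List[int]], None],
    interval: float = 30.0,
) -> bool:
    """Poll until no job is running; on Ctrl+C cancel what is still running.

    `poll` and `cancel` are hypothetical backend hooks: `poll` returns the
    ids of the jobs still running, `cancel` removes the given jobs from
    the queue. Returns True if all jobs finished, False if interrupted.
    """
    running: List[int] = []
    try:
        while True:
            running = list(poll())
            if not running:
                return True
            time.sleep(interval)
    except KeyboardInterrupt:
        # Ctrl+C raises KeyboardInterrupt in the main thread; cancel the
        # jobs that were running at the last successful poll.
        cancel(running)
        return False
```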
(related to #38 (closed))
EDIT: done for slurm, needs to be done for HTCondor
Edited by Sebastien Wertz