Catch failed jobs, output recovery actions
Addresses #40 (closed)
I've only made the changes for the Slurm backend (because I could test it easily); they will have to be implemented for HTCondor later.
In the batch part:
- Collect tasks for which some jobs fail
- Do not perform task finalization action if some jobs failed
- When all jobs are done running (completed or failed), print the failed commands and the paths to their log files.
- Print a suggested command that the user can run to resubmit the failed jobs.
- Configure the monitoring waiting time in the `ini` file.
- Do `hadd` with the `-f` option: when running once with `--maxFiles 1` to check that everything is fine, and then running again using `--distributed driver`, the finalization would otherwise fail because the result files already exist (this avoids having to remove them every time).
- Catch SIGINT (CTRL+C) while jobs are being monitored, and ask the user whether they want to cancel the running jobs.
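
As an illustration of the batch-side behaviour listed above, here is a minimal sketch and not the code in this merge request. The names `summarize`, `monitor`, `poll`, `cancel` and `confirm`, as well as the state strings `"RUNNING"`, `"COMPLETED"` and `"FAILED"`, are assumptions made for the example. It relies on Python's default SIGINT handling, which raises `KeyboardInterrupt` in the main thread.

```python
import time
from collections import defaultdict


def summarize(job_states):
    """Split tasks into those safe to finalize and those with failed jobs.

    job_states maps (task_name, job_index) -> state string (hypothetical
    layout). Returns (tasks_to_finalize, failed_jobs_by_task).
    """
    failed = defaultdict(list)
    tasks = set()
    for (task, index), state in job_states.items():
        tasks.add(task)
        if state == "FAILED":
            failed[task].append(index)
    # A task is finalized only if none of its jobs failed.
    to_finalize = sorted(tasks - set(failed))
    return to_finalize, {task: sorted(idx) for task, idx in failed.items()}


def monitor(poll, cancel, wait=0.0, confirm=input):
    """Poll until no job is running; on CTRL+C, offer to cancel the jobs.

    poll() returns the current job_states dict, cancel() cancels the
    running jobs, and wait is the (configurable) time between polls.
    """
    while True:
        try:
            states = poll()
            if all(state != "RUNNING" for state in states.values()):
                return states
            time.sleep(wait)
        except KeyboardInterrupt:
            answer = confirm("Cancel the running jobs? [y/N] ")
            if answer.strip().lower() == "y":
                cancel()
                raise
            # Otherwise keep monitoring.
```

The result of `monitor` can be passed to `summarize` to decide which tasks to finalize and which failed jobs to report for resubmission.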
In the analysis module:
- Tell the user what finalization commands (i.e. hadd) to run once all jobs have finished successfully.
- Do not run postprocessing if any task failed.
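
The analysis-module side could look roughly like the sketch below. The function `finalization_commands` and the layout of its `outputs` argument are hypothetical and only serve the example; the only external detail relied on is that `hadd -f <target> <sources...>` overwrites an existing target file.

```python
import shlex


def finalization_commands(outputs, failed_tasks=()):
    """Build the hadd commands to show the user and decide on postprocessing.

    outputs maps task name -> (target_file, [partial_files]) (assumed
    layout). Tasks listed in failed_tasks are skipped. Returns
    (commands, run_postprocessing).
    """
    commands = []
    for task, (target, parts) in sorted(outputs.items()):
        if task in failed_tasks:
            continue
        # -f lets hadd overwrite a target left over from a previous run.
        commands.append(shlex.join(["hadd", "-f", target, *parts]))
    # Postprocessing only runs when no task failed.
    return commands, not failed_tasks
```

For example, with one failed task the caller would print the returned commands for the successful tasks and skip postprocessing.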
Edited by Sebastien Wertz