Catch failed jobs, output recovery actions (!42) · Merge requests · cms-analysis / General / bamboo

I've only made the changes for the Slurm backend (because I could test it easily); they will have to be implemented for HTCondor later.

In the batch part:

Collect tasks for which some jobs fail
Do not perform task finalization action if some jobs failed
When everything is done running (completed or failed), print the failed commands and give the path to their log files.
Print a suggested command that the user can run to resubmit the failed jobs.
Make the monitoring waiting time configurable in the ini file.
Run hadd with the -f option: when running once with --maxFiles 1 to check that everything is fine, and then running again using --distributed driver, the finalization would otherwise fail because the result files already exist (this avoids having to remove them every time).
Catch SIGINT (CTRL+C) while jobs are being monitored, and ask the user whether they want to cancel the running jobs.
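
The batch-side behaviour listed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this merge request: the `Task` class, the status strings, the `[batch]` section and `monitorWaitTime` key, and the `poll`/`cancel` callbacks are all hypothetical names chosen for the example. It relies on the fact that Python raises `KeyboardInterrupt` in the main thread when SIGINT is received.

```python
import configparser
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Task:
    """Hypothetical task: a group of jobs plus an optional finalization action."""
    name: str
    job_status: Dict[str, str]  # command -> "running" | "completed" | "failed"
    log_dir: str
    finalize: Optional[Callable[[], None]] = None

    def done(self):
        return all(st != "running" for st in self.job_status.values())

    def failed_commands(self):
        return [cmd for cmd, st in self.job_status.items() if st == "failed"]


def read_wait_time(ini_text, default=30.0):
    """Read the monitoring waiting time in seconds (section/key are assumptions)."""
    cfg = configparser.ConfigParser()
    cfg.read_string(ini_text)
    return cfg.getfloat("batch", "monitorWaitTime", fallback=default)


def finish_tasks(tasks):
    """Finalize tasks without failed jobs; report failed commands for the others."""
    report = []
    for task in tasks:
        failed = task.failed_commands()
        if failed:
            # Do not finalize: list what failed and where the logs are.
            report.append(f"Task {task.name}: {len(failed)} failed job(s), logs in {task.log_dir}")
            report.extend(f"  {cmd}" for cmd in failed)
        elif task.finalize is not None:
            task.finalize()
    return report


def monitor(tasks, poll, wait_time, cancel, ask=input, sleep=time.sleep):
    """Poll until every job has completed or failed; on CTRL+C offer to cancel."""
    try:
        while not all(t.done() for t in tasks):
            poll(tasks)       # hypothetical callback that updates job_status
            sleep(wait_time)
    except KeyboardInterrupt:
        if ask("Cancel the running jobs? [y/N] ").strip().lower() == "y":
            cancel(tasks)     # hypothetical callback, e.g. wrapping the batch system's cancel command
        return None           # monitoring was interrupted
    return finish_tasks(tasks)
```

Passing `ask` and `sleep` as parameters is a choice made here so that the sketch can be exercised without a terminal or real waiting; the actual implementation may well be structured differently.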

In the analysis module:

Tell the user which finalization commands (i.e. hadd) to run once all jobs have finished successfully.
Do not run postprocessing if any task failed.
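
As a rough illustration of the analysis-module side, the sketch below builds `hadd -f` commands and skips postprocessing when any task failed. The function names, the `outputs` mapping (merged file to per-job files) and the `postprocess` callback are assumptions made for this example, not the module's actual interface; only `hadd` and its `-f` (overwrite the target) option come from ROOT itself.

```python
import shlex


def hadd_command(target, sources, force=True):
    """Build the hadd argument list; -f overwrites an existing target file."""
    cmd = ["hadd"]
    if force:
        cmd.append("-f")
    cmd.append(target)
    cmd.extend(sources)
    return cmd


def after_jobs(outputs, failed_tasks, postprocess):
    """Run postprocessing only if nothing failed.

    Returns the finalization commands the user still has to run by hand
    (an empty list when postprocessing was run).
    """
    if failed_tasks:
        commands = [shlex.join(hadd_command(t, s)) for t, s in outputs.items()]
        print("Some tasks failed, postprocessing was skipped.")
        print("Once all jobs have finished successfully, run:")
        for command in commands:
            print("  " + command)
        return commands
    postprocess()
    return []
```

For example, `after_jobs({"out.root": ["a.root", "b.root"]}, ["taskX"], run_plots)` (with a hypothetical `run_plots` callback) would print and return the single command `hadd -f out.root a.root b.root` without calling `run_plots`.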

Edited Sep 25, 2019 by Sebastien Wertz
