
Catch failed jobs, output recovery actions

Sebastien Wertz requested to merge swertz/bamboo:catchFailedJobs into master

Addresses #40 (closed)

I've only made the changes for the Slurm backend (because I could test it easily); the same will have to be implemented for HTCondor later.

In the batch part:

  • Collect tasks for which some jobs fail
  • Do not perform task finalization action if some jobs failed
  • When all jobs are done running (completed or failed), print the failed commands and the paths to their log files.
  • Print a suggested command that the user can run to resubmit the failed jobs.
  • Configure monitoring waiting time in the ini file
  • Run hadd with the -f option: after a first run with --maxFiles 1 to check that everything is fine, a second run using the --distributed driver would otherwise fail at finalization because the result files already exist (this avoids having to remove them every time).
  • Catch SIGINT (CTRL+C) while jobs are being monitored, and ask the user whether they want to cancel the running jobs.
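
For reviewers, the monitoring behaviour described above could look roughly like the sketch below. It is not the code in this merge request: the ini section and option names (`batch`, `monitorWaitTime`), the `poll`/`cancel` callbacks and the status strings are all hypothetical, and it relies on Python raising `KeyboardInterrupt` on SIGINT by default rather than on a custom signal handler.

```python
import configparser
import time


def read_wait_time(ini_text, default=10.0):
    """Read the monitoring interval (seconds) from ini text.

    The section and option names are assumptions made for this sketch.
    """
    cfg = configparser.ConfigParser()
    cfg.read_string(ini_text)
    return cfg.getfloat("batch", "monitorWaitTime", fallback=default)


def monitor(poll, cancel, confirm=input, wait_time=10.0, sleep=time.sleep):
    """Poll until no job is running; on CTRL+C, offer to cancel running jobs.

    poll() returns {job_id: status} and cancel(job_ids) cancels jobs;
    both are hypothetical callbacks standing in for the batch backend.
    """
    statuses = {}
    while True:
        try:
            statuses = poll()
            if all(st in ("COMPLETED", "FAILED") for st in statuses.values()):
                return statuses
            sleep(wait_time)
        except KeyboardInterrupt:
            # Only jobs that have not finished yet are candidates for cancellation.
            running = [j for j, st in statuses.items()
                       if st not in ("COMPLETED", "FAILED")]
            answer = confirm("Cancel the running jobs? [y/N] ")
            if answer.strip().lower() == "y":
                cancel(running)
                return statuses
            # Otherwise keep monitoring.
```

Passing `confirm` and `sleep` as parameters keeps the loop testable without a terminal or real waiting.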

In the analysis module:

  • Tell the user which finalization commands (i.e. hadd) to run once all jobs have finished successfully.
  • Do not run postprocessing if any task failed.
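
The two points above can be illustrated with a minimal sketch, again under assumptions: the task layout (`{output_file: [(input_file, status), ...]}`), the status strings and the function names are invented for illustration and are not taken from bamboo. Only `hadd -f <target> <sources...>` is the actual ROOT command line.

```python
import shlex


def finalization_commands(tasks):
    """Build one hadd command per task.

    tasks maps an output file to a list of (input_file, status) pairs
    (hypothetical layout). Returns (ready, pending): commands for tasks whose
    jobs all completed, and commands the user should run by hand once the
    failed jobs have been resubmitted and finished successfully.
    """
    ready, pending = [], []
    for output, jobs in tasks.items():
        # -f overwrites an existing target file instead of failing.
        cmd = shlex.join(["hadd", "-f", output] + [inp for inp, _ in jobs])
        if all(status == "COMPLETED" for _, status in jobs):
            ready.append(cmd)
        else:
            pending.append(cmd)
    return ready, pending


def run_postprocessing_if_clean(tasks, postprocess):
    """Call postprocess() only if no task has a failed job."""
    _, pending = finalization_commands(tasks)
    if pending:
        print("Some jobs failed; once they have succeeded, run:")
        for cmd in pending:
            print("  " + cmd)
        return False
    postprocess()
    return True
```
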
Edited by Sebastien Wertz
