HTCondor updates

Pieter David requested to merge piedavid/bamboo:htcondor_update_cmsdas into master

Closes #40 (closed)

The upcoming CMSDAS was a good reason to test this properly on the CERN batch system (hence the update to the example config file). The most annoying part was reliably detecting failed jobs: it turns out that success_exit_code needs to be set explicitly, to its default value of 0, for this to work. I also set the number of HTCondor retries to 0, because resubmitting with the same settings almost never solves the problem (the default is 3; if the jobs get queued, the retries only add unnecessary delay before a manual resubmit).
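
To make the two settings concrete, here is a minimal sketch, not the code in this merge request. It assumes the settings end up as the HTCondor submit commands success_exit_code and max_retries; the helper names render_submit_lines and failed_job_ids and the example exit codes are hypothetical, and the way bamboo actually passes these options through its config file may differ.

```python
# Sketch only: the two submit commands discussed above, plus a helper that
# applies the same success criterion to a set of finished jobs.
# All function names and the example data are hypothetical.

SUCCESS_EXIT_CODE = 0  # the default value, but set explicitly

SUBMIT_SETTINGS = {
    "success_exit_code": SUCCESS_EXIT_CODE,  # exit code that counts as success
    "max_retries": 0,                        # no automatic resubmission
}


def render_submit_lines(settings):
    """Format settings as 'key = value' lines of a submit description."""
    return "\n".join(f"{key} = {value}" for key, value in settings.items())


def failed_job_ids(exit_codes, success_exit_code=SUCCESS_EXIT_CODE):
    """Return the sorted IDs of jobs whose exit code is not the success code."""
    return sorted(
        job_id for job_id, code in exit_codes.items() if code != success_exit_code
    )


if __name__ == "__main__":
    print(render_submit_lines(SUBMIT_SETTINGS))
    # Hypothetical mapping of job ID to exit code for four finished jobs
    print(failed_job_ids({1: 0, 2: 1, 3: 0, 4: 137}))
```

With retries disabled, a job that exits with a non-zero code is reported as failed right away, so the list of failing IDs can be handed to a manual resubmit after the underlying problem is fixed.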

There's no easy way to resubmit a few jobs with "raw" HTCondor commands, so there's now a small command-line script bambooHTCondorResubmit to do that (stealing the solution from https://github.com/cp3-llbb/CommonTools/blob/master/scripts/checkJobsAndResubmit.py ;-) ).

I tested this with a random failure in definePlots: I got the list of failing job IDs, ran the resubmit command after removing the failure, and recovered with --distributed=finalize. I think that covers the main parts.