Gracefully handle broken worker node images.

Created by: bbockelm

If the worker node breaks between the last run of the validation script and the current job startup, the job fails almost immediately. This adds a 20-minute sleep to the job startup in that case; the job still fails, but the slot cannot start the subsequent job until the sleep ends.

I also include reasonable log messages in the job's stdout.
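As a rough illustration of the behavior described above (not the actual wrapper code), the check-log-sleep-fail sequence might look like the sketch below. The names `HOLD_SECONDS`, `image_is_usable`, and `guard_image` are hypothetical, and the usability test itself (a directory containing an executable `bin/sh`) is only an assumption about what "broken" could mean.

```shell
# Hypothetical sketch: none of these names come from the real wrapper script.

# Seconds to hold the slot after a failed check (20 minutes, per the description).
: "${HOLD_SECONDS:=1200}"

# Assumed check: the image counts as usable if it is a directory that
# contains an executable bin/sh.
image_is_usable() {
    [ -d "$1" ] && [ -x "$1/bin/sh" ]
}

# Log to stdout, hold the slot for HOLD_SECONDS, then return non-zero so
# the job still fails.
guard_image() {
    image="$1"
    if image_is_usable "$image"; then
        return 0
    fi
    echo "$(date) ERROR: worker node image '$image' appears to be broken."
    echo "$(date) Sleeping ${HOLD_SECONDS}s so this slot does not immediately start another job."
    sleep "$HOLD_SECONDS"
    return 1
}
```

A wrapper could call `guard_image "$image" || exit 1` before launching the payload. The sleep does not rescue the failing job; it only keeps the slot from quickly failing one job after another.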

Finally, we make the wrapper script start inside CVMFS; this helps prevent autofs from becoming confused and applying an idle-timeout to the Singularity container.
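A minimal sketch of the idea, assuming the point is that a process whose working directory is inside the mount keeps it from looking idle to autofs. The function name `enter_start_dir` is hypothetical, and the directory to use is left as a parameter because the description does not name one.

```shell
# Hypothetical sketch: enter_start_dir is an illustrative name, not taken
# from the real wrapper script.

# Change into the given directory so the wrapper's working directory lives
# inside that mount. On failure, report on stdout and stay where we are.
enter_start_dir() {
    if cd "$1" 2>/dev/null; then
        echo "Working directory is now $(pwd)"
        return 0
    fi
    echo "WARNING: could not cd into '$1'; staying in $(pwd)"
    return 1
}
```

The wrapper would call this once, early, with a path under the CVMFS mount point, and continue from there.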

I view this PR as a useful improvement, but not critical. I don't want to tie it to the other Singularity-related items in the queue. Let's get it reviewed and queued up -- but it doesn't need to go in right away.
