Handling of condor jobs with (ExitBySignal == false) && (ExitCode != 0)
While running a sample production with HiggsDNA, I noticed that some of my jobs ended up in the "DONE" state. For those jobs I got neither an output file (.parquet) nor the log files (.err, .out). Looking into why this happens, I found that the jobs have the following lines in their configuration:
Err = ifThenElse(JobStatus == 4,"/dev/null",SubmittedErr)
Out = ifThenElse(JobStatus == 4,"/dev/null",SubmittedOut)
OutputDestination = ifThenElse(JobStatus == 4,undefined,SubmittedOutputDestination)
where JobStatus == 4 means "Completed", i.e. the job is labeled as "DONE". This explains why I get no output files for those jobs. The .sub configuration that HiggsDNA builds has the line:
output_destination = root://eosuser.cern.ch//eos/user/...
so the extra condition based on the status of the job must be added directly by condor, and I couldn't find a way to change it.
I then tried to understand why only some of the jobs go to "DONE", instead of either going on "HOLD" (and then being removed after failing 3 times, returning the log files) or being removed (if they finish successfully, returning both the output file and the log files), according to this line:
OnExitRemove = NumJobCompletions > JobMaxRetries || ExitCode =?= 0 || (ExitBySignal == false) && (ExitCode == 0)
It seems to me like the line above, in combination with the following line in the HiggsDNA .sub config:
OnExitHold = (ExitBySignal == true) && (ExitCode != 0)
fails to catch the case where (ExitBySignal == false) && (ExitCode != 0), so those jobs go to the "Completed" state without returning anything. This PR changes this behavior so that this case also goes on "HOLD" and then ends the normal way, returning the log files.
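
The gap can be checked by enumerating the four exit cases. The snippet below is only a rough sketch in Python: it treats the ClassAd attributes as plain Python values (ignoring ClassAd "undefined" semantics), assumes JobMaxRetries is 3 and that the job has not yet used up its retries, and the function names are made up for illustration. The last function is a hypothetical broader hold condition, not necessarily the exact expression used in this PR.

```python
from itertools import product


def on_exit_remove(exit_by_signal, exit_code, num_completions=0, max_retries=3):
    # Mirrors the OnExitRemove line quoted above (&& binds tighter than ||).
    # max_retries=3 is an assumed value for JobMaxRetries.
    return (
        num_completions > max_retries
        or exit_code == 0
        or (not exit_by_signal and exit_code == 0)
    )


def on_exit_hold_current(exit_by_signal, exit_code):
    # Mirrors the OnExitHold line from the HiggsDNA .sub config quoted above.
    return exit_by_signal and exit_code != 0


def on_exit_hold_any_failure(exit_by_signal, exit_code):
    # Hypothetical alternative: hold on any non-zero exit code.
    return exit_code != 0


def uncovered_cases(hold_condition):
    # Return the (ExitBySignal, ExitCode) pairs matched by neither expression.
    # ExitCode 1 stands in for any non-zero exit code.
    cases = []
    for exit_by_signal, exit_code in product([False, True], [0, 1]):
        held = hold_condition(exit_by_signal, exit_code)
        removed = on_exit_remove(exit_by_signal, exit_code)
        if not held and not removed:
            cases.append((exit_by_signal, exit_code))
    return cases


print(uncovered_cases(on_exit_hold_current))      # [(False, 1)]
print(uncovered_cases(on_exit_hold_any_failure))  # []
```

With the current hold condition, the only pair that matches neither expression is ExitBySignal == false with a non-zero ExitCode, which is the case described above.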