Batch Submission to HTCondor and the Like
Description
When running extensive simulations, it is desirable to be able to submit them as jobs to a batch system. Several batch systems are used in our ecosystem, most notably "the grid" with job submission through e.g. DIRAC, the NAF system at DESY (using Sun's N1 Grid Engine), and the HTCondor batch system on CERN's LXPLUS, to name a few.
Although submitting jobs to e.g. HTCondor is simple (one call to `condor_submit` with a job description file containing a few parameters), it would be nice if Allpix Squared supported the submission in some way to make it easier to use.
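To make the "job description file with a few parameters" concrete, here is a minimal Python sketch that writes such a file and optionally passes it to `condor_submit`. The helper names `build_submit_description` and `submit`, the file name `allpix.sub`, and the chosen output file names are hypothetical. The sketch assumes that `allpix` is invoked with `-c <config>` and uses only generic submit commands, so site-specific settings (e.g. runtime limits at CERN) are left out.

```python
import subprocess
from pathlib import Path


def build_submit_description(executable, arguments, request_cpus=1, n_jobs=1):
    """Return the text of a minimal HTCondor submit description.

    Hypothetical helper: only generic submit commands are used here.
    """
    lines = [
        f"executable   = {executable}",
        f"arguments    = {arguments}",
        # $(ClusterId) and $(ProcId) are expanded by HTCondor per job
        "output       = job.$(ClusterId).$(ProcId).out",
        "error        = job.$(ClusterId).$(ProcId).err",
        "log          = job.$(ClusterId).log",
        f"request_cpus = {request_cpus}",
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


def submit(description, path="allpix.sub", dry_run=True):
    """Write the description to disk and optionally call condor_submit."""
    Path(path).write_text(description)
    if not dry_run:
        # Requires an HTCondor installation providing condor_submit
        subprocess.run(["condor_submit", path], check=True)
    return path


if __name__ == "__main__":
    # Assumed invocation of the regular executable: allpix -c <config>
    print(build_submit_description("allpix", "-c simulation.conf", request_cpus=2))
```

With `dry_run=True` (the default in this sketch) the file is only written, so it can be inspected before anything is sent to the batch system.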
Proposal
We should discuss with an expert on the best route to take. I would probably be in favor of an implementation that goes a bit beyond providing a shell script to call. I would love to see a separate executable, e.g. `condor_allpix`, which would take the same arguments as the regular `allpix` executable but would submit the jobs directly to the batch system.
This could even go one step further and for example figure out optimal job parameters such as the number of CPUs to request based on the number of detectors simulated, the total job time based on the number of events simulated, etc.
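As a rough illustration of such an estimate, the sketch below derives a CPU request from the number of detectors and a runtime request from the number of events. The function name `estimate_resources`, the default values, and the linear scaling are all assumptions made for illustration. In particular, `seconds_per_event` would have to be measured in a low-statistics test run of the same configuration.

```python
import math


def estimate_resources(n_detectors, n_events, seconds_per_event=0.5,
                       max_cpus=8, safety_factor=1.5):
    """Guess batch job parameters from the size of the simulation.

    Assumptions (not taken from Allpix Squared itself):
    - one CPU per detector is useful, capped at max_cpus
    - wall-clock time scales linearly with the number of events
    """
    cpus = max(1, min(n_detectors, max_cpus))
    # Pad the runtime so the job is not killed for a small overrun
    runtime_s = math.ceil(n_events * seconds_per_event * safety_factor)
    return {"request_cpus": cpus, "max_runtime_s": runtime_s}


if __name__ == "__main__":
    print(estimate_resources(n_detectors=3, n_events=1000))
```

A real implementation would probably need per-module timing information rather than a single number per event, but even a crude estimate like this avoids requesting far too little or far too much.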
DRMAA (Distributed Resource Management Application API) appears to be a sort-of standard for interfacing with several batch submission systems and could be worth looking at.
Links / references
- HTCondor @ CERN - http://batchdocs.web.cern.ch/batchdocs/index.html
- HTCondor Job Submission - http://batchdocs.web.cern.ch/batchdocs/local/submit.html
- NAF @ DESY - https://it.desy.de/dienste/computing_infrastruktur/bird_cluster_allgemeine_batch_farm_englisch/index_ger.html
- DRMAA - http://www.drmaa.org/
- DIRAC - http://diracgrid.org/
- iLCDirac - https://twiki.cern.ch/twiki/bin/view/CLIC/DiracForUsers , https://twiki.cern.ch/twiki/bin/view/CLIC/DiracUsage
Other (small) frameworks mostly provide wrapper tools for the submission:
- EUTelescope's jobsub: https://github.com/eutelescope/eutelescope/tree/master/jobsub (implemented by me, submission to NAF and the old LXPLUS batch)
Use cases
Large simulations, mostly when going into "production" after testing settings with low-statistics runs. Also very elaborate simulations which require lots of CPU power (they run slowly, so the jobs are split and the results/output data merged afterwards).
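The "split jobs" part of this use case could be sketched as follows: the total number of events is divided into chunks, and each chunk gets its own random seed so that the sub-jobs do not simulate identical events. The function name `split_events` and the dictionary layout are hypothetical. The keys `number_of_events` and `random_seed` are assumed to correspond to global framework parameters that each sub-job would override. Merging the output is not covered here.

```python
def split_events(total_events, events_per_job, base_seed=1):
    """Split a simulation into sub-jobs with distinct random seeds.

    Returns one dictionary per sub-job. The last job receives the
    remainder if total_events is not a multiple of events_per_job.
    """
    if total_events < 0 or events_per_job <= 0:
        raise ValueError("need total_events >= 0 and events_per_job > 0")

    jobs = []
    remaining = total_events
    index = 0
    while remaining > 0:
        n = min(events_per_job, remaining)
        jobs.append({
            "job": index,
            "number_of_events": n,
            # A different seed per job avoids duplicated events
            "random_seed": base_seed + index,
        })
        remaining -= n
        index += 1
    return jobs


if __name__ == "__main__":
    for job in split_events(2500, 1000):
        print(job)
```

Each entry could then be turned into one batch job, for example by passing the two values as command-line overrides to the executable.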
Feature checklist
Make sure these are completed before closing the issue, with a link to the relevant commit.
- Evaluate what the best method of supporting batch systems and grid submission would be
- Implement selected method for submitting jobs
- Add description how to submit jobs to README.md
- Documentation
- Covered by test cases