WIP: Current status of the Calibration system.
Current status
Works:
- General workflow: the user starts a calibration, the service submits the jobs, and the workers run a step and then poll until the next step starts
- Running Marlin locally doesn't throw an error any more
- Most of the tests
Does not work:
- The service-side part of the algorithm is largely missing. Large parts of CalibrationScript.py (which can be deleted afterwards) need to move into the service. For reference, check the original Calibrate.sh (or Dev_calibrate.sh): the blocks that call condorSupervisor*.sh on the Marlin Runfile should be executed on the worker nodes, the rest on the service. Replace the condor calls with DIRAC equivalents: submit_job just means either increasing the step counter by one, or increasing the phase counter by one and setting the step counter to 0; transferring files can be done by uploading them to the grid and referring to them via LFN; ...
- Some strategies are not chosen yet or have completely guessed defaults (e.g. when to resubmit)
- Failure recovery is not always as desired. E.g. if submitting one initial job fails, what should the service do? The others will still run
- Some of the tests have not been updated to the newer implementation or are expected to fail for now (e.g. the change that reconstruction is over after 1 step instead of 15 breaks tests)
- It is not yet clear how the user can change minor parameters such as digitsationAccuracy, kaonLEnergies and other default values. Maybe give the Client a readConfiguration() method that takes a user-written file with all the parameters and must be called before starting a calibration?
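A rough sketch of what the readConfiguration() idea from the last point could look like. Everything here is an assumption, not existing code: the file is taken to be JSON, the function is a free function rather than a Client method, and the default values are made-up placeholders (only the parameter names come from the notes above).

```python
import json

# Placeholder defaults: the names come from the notes, the values are invented.
DEFAULT_PARAMETERS = {
    "digitsationAccuracy": 0.05,
    "kaonLEnergies": [1, 2, 5, 10],
}


def read_configuration(path, defaults=DEFAULT_PARAMETERS):
    """Return a copy of the defaults, overridden by a user-written JSON file."""
    with open(path) as config_file:
        user_values = json.load(config_file)
    if not isinstance(user_values, dict):
        raise ValueError("configuration file must contain a JSON object")
    # Reject misspelled or unsupported names instead of silently ignoring them.
    unknown = sorted(set(user_values) - set(defaults))
    if unknown:
        raise KeyError("unknown calibration parameters: %s" % ", ".join(unknown))
    merged = dict(defaults)
    merged.update(user_values)
    return merged
```

Rejecting unknown keys is a deliberate choice in this sketch: a typo in a parameter name would otherwise leave the default in place without the user noticing.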
The most important next step is getting the Service to distribute the parameters and input files; then one can start testing on the grid.
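For the submit_job replacement mentioned in the list above, a minimal sketch of the counter logic on the service side. The class name, method names and the steps-per-phase list are all hypothetical; the actual number of steps in each phase is not specified here.

```python
class CalibrationProgress(object):
    """Tracks the current phase and step of one calibration (hypothetical names)."""

    def __init__(self, steps_per_phase):
        # steps_per_phase[i] is the number of steps in phase i (assumed input).
        self.steps_per_phase = list(steps_per_phase)
        self.phase = 0
        self.step = 0

    def finished(self):
        return self.phase >= len(self.steps_per_phase)

    def advance(self):
        """Stand-in for a condor submit: move on to the next step or phase."""
        if self.finished():
            raise RuntimeError("calibration already finished")
        self.step += 1
        if self.step >= self.steps_per_phase[self.phase]:
            # Phase complete: increase the phase counter, reset the step counter.
            self.phase += 1
            self.step = 0


progress = CalibrationProgress([2, 1])
progress.advance()  # phase 0, step 1
progress.advance()  # phase 1, step 0
```

Polling workers would then only need to compare the (phase, step) pair they last processed with the one the service reports.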