WIP: Current status of the Calibration system.

Current status

Works:

  • General workflow: the user starts a calibration, the service submits the jobs, and the workers do their work and then poll until the next step starts
  • Running Marlin locally no longer throws an error
  • Most of the tests

Does not work:

  • The service-side part of the algorithm is largely missing. Large parts of CalibrationScript.py (which can be deleted afterwards) need to move into the service. Check the original Calibrate.sh (or Dev_calibrate.sh): the blocks that call condorSupervisor*.sh on the Marlin Runfile should be executed on the worker nodes, the rest on the service. Replace the condor calls with DIRAC equivalents (submit_job just means either increasing the step counter by one, or increasing the phase counter by one and setting the step counter to 0; transferring files can be done by uploading them to the grid and referring to them via lfn, ...)
  • Some strategies have not been chosen yet or have completely guessed defaults (e.g. when to resubmit)
  • Failure recovery does not always behave as desired. E.g. if submitting one of the initial jobs fails, what should the service do? The other jobs will still run
  • Some of the tests have not been updated to the newer implementation, or are expected to fail for now (e.g. reconstruction is now over after 1 step instead of 15, which breaks tests)
  • It is not yet clear how the user can change minor parameters such as digitsationAccuracy, kaonLEnergies, and some other default values. One option: give the Client a readConfiguration() method that takes a user-written file with all the parameters and must be called before starting a calibration
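
A rough sketch of the readConfiguration() idea from the last item, written as a standalone function rather than a Client method. Everything here is an assumption for illustration: the INI-style format, the [calibration] section name, the function name read_configuration, and the default values (which are placeholders, not the real defaults).

```python
import configparser

# Placeholder defaults for illustration only; the real default values
# live in the calibration code and are NOT reproduced here.
DEFAULT_PARAMETERS = {
    "digitsationAccuracy": "0.05",
    "kaonLEnergies": "1,2,5,10",
}


def read_configuration(text, defaults=DEFAULT_PARAMETERS):
    """Return the defaults, overridden by the [calibration] section of text.

    text is the content of a user-written INI-style file (hypothetical format).
    Unknown parameter names raise KeyError so typos are caught before a
    calibration is started.
    """
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep parameter names case-sensitive
    parser.read_string(text)

    parameters = dict(defaults)
    if parser.has_section("calibration"):
        for name, value in parser.items("calibration"):
            if name not in defaults:
                raise KeyError("Unknown calibration parameter: %s" % name)
            parameters[name] = value
    return parameters
```

Values are kept as strings; converting them to floats or lists would be up to whoever consumes the parameters. Rejecting unknown names is a design choice so that a misspelled parameter does not silently fall back to its default.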

The most important next step is getting the Service to distribute the parameters and input files; after that, testing on the grid can start.
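
For the counter-based replacement of submit_job mentioned in the list above, a minimal sketch could look like the following. The class and method names are hypothetical and not taken from the existing code; the sketch only shows the bookkeeping (step counter, phase counter, and the check a polling worker would make), not any DIRAC calls.

```python
class CalibrationProgress:
    """Tracks the (phase, step) a calibration is currently in.

    Hypothetical helper: instead of submitting a new condor job, the
    service advances these counters and the polling workers react.
    """

    def __init__(self):
        self.phase = 0
        self.step = 0

    def next_step(self):
        """Move to the next step within the current phase."""
        self.step += 1

    def next_phase(self):
        """Move to the next phase and restart the step counter at 0."""
        self.phase += 1
        self.step = 0

    def worker_should_start(self, worker_phase, worker_step):
        """True if the service has moved past what the worker last finished."""
        return (self.phase, self.step) > (worker_phase, worker_step)
```

A worker that last finished (phase 0, step 0) would keep polling until worker_should_start(0, 0) returns True, then fetch the new parameters and run. Comparing the (phase, step) tuples means a phase change is noticed even though the step counter was reset to 0.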
