
Draft: Refactoring Code Base and adding Features

This merge request aims to fix several issues, make the code more intuitive, and add new features.

Features:

  • Adding new datasets and samples when the analysis is restarted
    • [DONE] The task DatasetsStatus compares the contents of datasets_status with the datasets.yaml configured by the user
  • Deletion of ntuples that are no longer used because of changes in the config; they are then submitted again with the updated config
    • [DONE] Samples/datasets are marked for deletion and deleted in a following task
  • Added the wrapper task SkimDatasets (to be renamed soon)
    • [DONE] This task runs the data preparation (NTuple) tasks of the analysis
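
The comparison done by DatasetsStatus could look roughly like the sketch below. It is not the code from this merge request: the function name `diff_datasets`, the dict layout, and the `"config"` key stored per dataset are assumptions made for illustration only.

```python
def diff_datasets(configured, status):
    """Sort dataset names into new, changed, removed and unchanged.

    configured: dict, dataset name -> config as read from datasets.yaml
    status:     dict, dataset name -> record from datasets_status; each record
                is assumed (hypothetically) to keep the config it was produced
                with under the key "config"
    """
    new, changed, unchanged = [], [], []
    for name, cfg in configured.items():
        if name not in status:
            new.append(name)            # has to be submitted for the first time
        elif status[name].get("config") != cfg:
            changed.append(name)        # old ntuples are stale: delete and resubmit
        else:
            unchanged.append(name)
    # Datasets that vanished from datasets.yaml are only marked here;
    # the actual deletion would happen in a later task.
    removed = [name for name in status if name not in configured]
    return {
        "new": sorted(new),
        "changed": sorted(changed),
        "removed": sorted(removed),
        "unchanged": sorted(unchanged),
    }
```

Returning plain lists keeps the "mark for deletion" step separate from the deletion itself, which matches the two-step behaviour described above.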

Fixed Issues:

  • The task CheckSample is no longer called a seemingly infinite number of times
    • [DONE] Solved through refactoring CheckSample
  • Safeguard for the datasets_status file
    • [WIP] The task DatasetsStatus is now an external task, so law will not delete its output
  • Job status information becoming inaccessible for long runs (longer than 2 weeks) broke the analysis
    • [WIP] Fixed by checking the datasets_status file to see whether a sample is already done before searching for its status in the `RetrieveJobStatus` log, which is no longer accessible after 2 weeks. Edge cases for jobs with long processing times still remain.
  • Race conditions in checkloop. The UpdateDatasetsStatus task rewrites the datasets_status file while the sample updates try to merge into the same file at the same time. Previously the problem was dodged by generating temporary files and then renaming them, but this should not have been necessary: the workflow can be linearized so that no such problems arise.
    • [DONE] Solved through refactoring CheckSample and linearizing the tasks instead of running UpdateDatasetsStatus and CheckSample in parallel
  • Tasks counted as complete although they are actually in a failed state
    • [WIP] Solved through refactoring
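
The lookup order used for the long-run fix (datasets_status first, job log second) can be sketched as follows. This only illustrates the idea: `resolve_sample_state`, the `"status"` key, the state strings, and the `query_job_log` callable are hypothetical stand-ins, not names from the code base.

```python
def resolve_sample_state(sample, datasets_status, query_job_log):
    """Return the state of a sample, preferring the datasets_status record.

    datasets_status: dict, sample name -> {"status": ...} (assumed layout)
    query_job_log:   callable returning a state string, or None when the job
                     log no longer holds information about the sample
    """
    recorded = datasets_status.get(sample, {}).get("status")
    if recorded == "done":
        # Finished samples never need the (possibly expired) job log.
        return "done"
    from_log = query_job_log(sample)
    if from_log is not None:
        return from_log
    # The log has expired and nothing final was recorded: report what is known.
    return recorded or "unknown"
```

The last branch is where the remaining edge case sits: a job that runs longer than the log retention period ends up with only its last recorded state.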

Convenience changes:

  • [WIP] Terminal printouts are reduced and configured manually, with an individual logging level per task.
  • [DONE] Analysis parameters no longer need to be specified on the CLI for task execution; all tasks can now access the config and read the parameters from it.
  • [DONE] The task CheckSample is now split into a loop task, PeriodicSampleUpdating, and a chain of tasks called by the loop task: RetrieveJobStatus, UpdateSampleStatusAndProgress, RetryJob, DownloadSample, UpdateDatasetsStatusFile. This should make the code more readable and resolve the badly interacting parallel loops of CheckSample and UpdateDatasetsStatusFile.
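
A minimal, framework-free sketch of the linearized loop is given below. It does not use law or luigi and is not the implementation in this merge request; `run_chain`, `periodic_update`, and the `is_finished` callback are hypothetical names. Each step stands in for one of the chained tasks and receives the state returned by the previous one, so only one step touches the status at any time.

```python
def run_chain(state, steps):
    """Run the steps strictly one after another, passing the state along."""
    for step in steps:
        state = step(state)
    return state


def periodic_update(state, steps, is_finished, max_iterations=10):
    """Repeat the whole chain until is_finished(state) is true.

    Returns the final state and the number of iterations that were run.
    max_iterations is a safety limit (an assumption of this sketch) so the
    loop cannot run forever.
    """
    for iteration in range(1, max_iterations + 1):
        state = run_chain(state, steps)
        if is_finished(state):
            return state, iteration
    return state, max_iterations
```

Because the status update is the last step of the same chain, there is no second loop that could rewrite the status file concurrently, which is what made the temporary-file workaround unnecessary.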

Renaming:

  • checkloop -> data_preparation
  • processloop -> processing_data
  • CreateStatusFile -> DatasetsStatus
Edited by Malte Hoja
