
OOP revamp of task list management, timeline output

Karol Krizka requested to merge revamp into master

Rewrite the tasklist management functions using object-oriented programming. This should make the different states easier to understand and fix the intermittent locking bugs. The main changes are:

  • TaskList class represents all files related to a task list (toprocess, finished, failed, lock).
  • TaskList has clearly defined process-safe functions for common tasks:
    • Getting the next available task.
    • Recording finished tasks.
    • Returning a task whose processing was terminated early.
  • A single lock file per TaskList controls access to all files.
  • Each executor is responsible for cleaning itself up on termination. This includes both updating the toprocess list and terminating the actual task.
  • The top-level process listens for SIGUSR1 and SIGINT signals, and its only reaction is to terminate the workers.
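To illustrate the single-lock design, here is a minimal sketch, not the code in this merge request. The class name `TaskListSketch`, the method names, the file suffixes and the use of `fcntl.flock` are all assumptions made for the example; the real TaskList may differ in every one of these.

```python
import fcntl
import os
from contextlib import contextmanager


class TaskListSketch:
    """Hypothetical stand-in for the TaskList class described above."""

    def __init__(self, base):
        # File suffixes are assumptions that mirror the states listed above.
        self.toprocess = base + ".toprocess"
        self.finished = base + ".finished"
        self.failed = base + ".failed"
        self.lockfile = base + ".lock"

    @contextmanager
    def _locked(self):
        # One lock file guards every file that belongs to this task list.
        with open(self.lockfile, "w") as lock:
            fcntl.flock(lock, fcntl.LOCK_EX)
            try:
                yield
            finally:
                fcntl.flock(lock, fcntl.LOCK_UN)

    def _read(self, path):
        if not os.path.exists(path):
            return []
        with open(path) as f:
            return [line.rstrip("\n") for line in f if line.strip()]

    def _write(self, path, tasks):
        with open(path, "w") as f:
            f.writelines(task + "\n" for task in tasks)

    def next_task(self):
        """Pop the first pending task, or return None if none are left."""
        with self._locked():
            tasks = self._read(self.toprocess)
            if not tasks:
                return None
            self._write(self.toprocess, tasks[1:])
            return tasks[0]

    def record_finished(self, task, ok=True):
        """Append the task to the finished (or failed) file."""
        with self._locked():
            with open(self.finished if ok else self.failed, "a") as f:
                f.write(task + "\n")

    def return_task(self, task):
        """Put back a task whose processing was terminated early."""
        with self._locked():
            self._write(self.toprocess, [task] + self._read(self.toprocess))
```

Because every method takes the same lock before touching any file, two workers cannot pop the same task, and a returned task cannot be lost to a concurrent write.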

Also added a timeline output that can be visualized using chrome://tracing. This involves the following changes:

  • A tasklist_timeline.json file is saved after each task completes. Access to it is guarded by the TaskList locks.
    • The name is the full command.
    • The pid is the PID of the top-level process.
    • The tid is the PID of the worker process.
  • Use Pool's imap_unordered instead of map_async, so that the top-level process is notified as each individual task completes.
  • Remove the timeout, since it is not compatible with imap_unordered (and I do not understand its purpose).
  • Remove Pool's maxtasksperchild, as it messes up the visualization of the timeline by adding "extra workers". Also, I don't understand the argument that it frees up memory. Completing the task process should do that.
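A rough sketch of how the timeline could be produced with imap_unordered follows. The function names `run_task`, `save_timeline` and `run_all` are hypothetical, the TaskList locking is left out for brevity, and the use of complete ("X") events with microsecond timestamps in the Trace Event JSON array format is my assumption about what the file contains.

```python
import json
import os
import subprocess
import time
from multiprocessing import Pool


def run_task(command):
    """Run one command in a worker and return a complete ("X") trace event."""
    start = time.time()
    subprocess.run(command, shell=True)
    end = time.time()
    return {
        "name": command,              # the full command
        "ph": "X",                    # complete event: start plus duration
        "ts": start * 1e6,            # microseconds
        "dur": (end - start) * 1e6,   # microseconds
        "tid": os.getpid(),           # PID of the worker process
    }


def save_timeline(events, path="tasklist_timeline.json"):
    # The real code would hold the TaskList lock here; omitted in this sketch.
    with open(path, "w") as f:
        json.dump(events, f)


def run_all(commands, workers=2, path="tasklist_timeline.json"):
    events = []
    with Pool(workers) as pool:
        # imap_unordered yields each result as soon as its task completes,
        # so the timeline file can be rewritten after every task.
        for event in pool.imap_unordered(run_task, commands):
            event["pid"] = os.getpid()  # PID of the top-level process
            events.append(event)
            save_timeline(events, path)
    return events
```

With map_async the results would only become available once all tasks had finished, so the file could not be updated incrementally.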

There are now two examples included in the code:

  • Running using array jobs
  • Running in multi-node setups using srun

Both examples implement the correct clean-up procedure before a job is killed, so that progress is saved. This is also a stepping stone to checkpointing.
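The clean-up on termination could look roughly like the sketch below. It is an illustration under assumptions, not the executor code from this merge request: `run_with_cleanup`, `Terminated` and the `on_abort` callback are hypothetical names, and the callback stands in for returning the task to the toprocess list.

```python
import shlex
import signal
import subprocess


class Terminated(Exception):
    """Raised by the signal handler to unwind out of the blocking wait."""


def _on_signal(signum, frame):
    raise Terminated(signum)


def run_with_cleanup(command, on_abort):
    """Run command; on SIGUSR1 or SIGINT, stop it and call on_abort(command)."""
    watched = (signal.SIGUSR1, signal.SIGINT)
    previous = {sig: signal.signal(sig, _on_signal) for sig in watched}
    proc = None
    try:
        proc = subprocess.Popen(shlex.split(command))
        return proc.wait()
    except Terminated:
        if proc is not None:
            proc.terminate()   # terminate the actual task
            proc.wait()
        on_abort(command)      # e.g. return the task to the toprocess list
        return None
    finally:
        # Restore whatever handlers were installed before.
        for sig, handler in previous.items():
            signal.signal(sig, handler)
```

In a batch job, the scheduler would need to deliver one of these signals some time before the hard kill so that the handler has time to run.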
Edited by Karol Krizka
