OOP revamp of task list management, timeline output
Rewrite the task list management functions using object-oriented programming. This should make the different states easier to understand and fix the random locking bugs. The main changes are:
- `TaskList` class represents all files related to a task list (toprocess, finished, failed, lock).
- `TaskList` has clearly defined process-safe functions for common tasks:
  - Getting the next available task.
  - Recording finished tasks.
  - Returning a task that was terminated early while being processed.
- A single lock file per `TaskList` controls access to all files.
- Each `executor` is responsible for cleaning itself up on termination. This includes both updating the toprocess list and terminating the actual task.
- The top-level process listens for `SIGUSR1` and `SIGINT` signals only to terminate workers.
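To illustrate the single-lock design, here is a minimal sketch of how one lock file can guard every file of a task list. It is not the code in this merge request: the method names (`next_task`, `record_finished`, `return_task`), the file-name suffixes, and the use of `fcntl.flock` (Unix only) are all assumptions made for the example.

```python
import fcntl
import os
from contextlib import contextmanager


class TaskList:
    """Hypothetical sketch: one lock file guards every file of a task list."""

    def __init__(self, path):
        # File names are illustrative assumptions derived from `path`.
        self.toprocess = path
        self.finished = path + ".finished"
        self.failed = path + ".failed"
        self.lockfile = path + ".lock"

    @contextmanager
    def _locked(self):
        # Exclusive advisory lock; blocks until other processes release it.
        with open(self.lockfile, "w") as lock:
            fcntl.flock(lock, fcntl.LOCK_EX)
            try:
                yield
            finally:
                fcntl.flock(lock, fcntl.LOCK_UN)

    @staticmethod
    def _read(filename):
        if not os.path.exists(filename):
            return []
        with open(filename) as f:
            return [line.rstrip("\n") for line in f if line.strip()]

    @staticmethod
    def _write(filename, tasks):
        with open(filename, "w") as f:
            f.writelines(task + "\n" for task in tasks)

    def next_task(self):
        """Pop the first pending task, or return None if there is none."""
        with self._locked():
            tasks = self._read(self.toprocess)
            if not tasks:
                return None
            self._write(self.toprocess, tasks[1:])
            return tasks[0]

    def record_finished(self, task, success=True):
        """Append a completed task to the finished or failed file."""
        with self._locked():
            target = self.finished if success else self.failed
            with open(target, "a") as f:
                f.write(task + "\n")

    def return_task(self, task):
        """Put a task that was terminated early back at the front of the queue."""
        with self._locked():
            self._write(self.toprocess, [task] + self._read(self.toprocess))
```

Because every public method takes the same lock before touching any file, two workers cannot pop the same task, and a returned task cannot be lost to a concurrent rewrite of the toprocess file.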
Also added a timeline plot that can be visualized using chrome://tracing. This has the following changes:
- A `tasklist_timeline.json` file is saved after each task completes. Uses the `TaskList` locks.
  - The name is the full command.
  - The pid is the PID of the top-level process.
  - The tid is the PID of the worker process.
- Use Pool's `imap_unordered` instead of `map_async` to be notified of each task's completion independently.
- Remove the timeout, since it is not compatible with `imap_unordered` (and I do not understand its purpose).
- Remove Pool's `maxtasksperchild` as it messes up the visualization of the timeline by adding "extra workers". Also I don't understand the argument that it frees up memory. Completing the task process should do that.
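The timeline changes can be sketched as follows. This is an illustration under assumptions, not the merged implementation: the helper names (`make_event`, `run_task`, `run_all`) are hypothetical, the events use the "complete" (`"ph": "X"`) form of the Chrome trace event format with microsecond timestamps, and the `TaskList` locking around the file write is omitted for brevity.

```python
import json
import os
import subprocess
import time
from multiprocessing import Pool


def make_event(name, start, end, pid, tid):
    """Build one 'complete' event ("ph": "X") in the Chrome trace event format."""
    return {
        "name": name,                # the full command
        "ph": "X",                   # complete event: a start plus a duration
        "ts": start * 1e6,           # trace timestamps are in microseconds
        "dur": (end - start) * 1e6,
        "pid": pid,                  # PID of the top-level process
        "tid": tid,                  # PID of the worker process
    }


def run_task(command):
    """Worker side: run a shell command and report timing plus the worker PID."""
    start = time.time()
    subprocess.run(command, shell=True, check=False)
    return command, start, time.time(), os.getpid()


def run_all(commands, workers=2, filename="tasklist_timeline.json"):
    """Run commands in a pool, rewriting the timeline after each completion."""
    events = []
    with Pool(workers) as pool:
        # imap_unordered yields each result as soon as its task finishes.
        for command, start, end, tid in pool.imap_unordered(run_task, commands):
            events.append(make_event(command, start, end, os.getpid(), tid))
            with open(filename, "w") as f:
                json.dump(events, f)
    return events
```

Calling `run_all(["sleep 1", "sleep 2"])` from a script's main guard would produce a JSON array that chrome://tracing can load, with one row per worker PID.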
There are now two examples included in the code:
- Running using array jobs
- Running in multi-node setups using `srun`
Both examples implement the correct clean-up procedure before a job is killed, so that progress is saved. This is also a stepping stone to checkpointing.
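The clean-up relies on the signal handling described earlier: the top-level process reacts to `SIGUSR1` or `SIGINT` by terminating its workers, and each worker cleans up after itself. A minimal sketch of that split is shown below. The class and function names are hypothetical, and the sketch assumes workers are stopped with `Pool.terminate()`, which on POSIX systems delivers `SIGTERM` to them.

```python
import signal


class WorkerTerminator:
    """Hypothetical handler: terminate the pool's workers on SIGUSR1 or SIGINT."""

    def __init__(self, pool):
        self.pool = pool
        self.received = None  # last signal number seen, if any

    def __call__(self, signum, frame):
        self.received = signum
        # Pool.terminate() stops the worker processes immediately.
        self.pool.terminate()

    def install(self):
        # Must be called from the main thread of the top-level process.
        for sig in (signal.SIGUSR1, signal.SIGINT):
            signal.signal(sig, self)


def install_worker_cleanup(cleanup):
    """Worker side: on SIGTERM, run `cleanup` (e.g. return the task), then exit."""

    def handler(signum, frame):
        cleanup()
        raise SystemExit(1)

    signal.signal(signal.SIGTERM, handler)
    return handler
```

In this arrangement the top-level process never edits the task list itself. It only asks the workers to stop, and each worker's `cleanup` callback is the place to return its unfinished task to the toprocess list and stop the underlying command.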
Edited by Karol Krizka