Extend AlgExecState with an "executing" state
The AlgExecState, accessible via the IAlgExecStateSvc, now records whether an algorithm is currently being executed. This required some interface changes to the AlgExecState, i.e. replacing the boolean [set/is]Executed() with an enum-based [set]execState(). Clients have been updated accordingly. This also opens the door to further algorithm states if they are needed later.
cc @leggett
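For readers unfamiliar with the change, here is a minimal sketch of what moving from a boolean flag to an enum-based state could look like. It is not the actual Gaudi class: `AlgExecStateSketch` and the `None` enumerator are assumptions made for illustration. Only `Executing`, `Done`, `state()` and `setState()` are names that appear in this merge request.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified stand-in for AlgExecState (NOT the real Gaudi class).
class AlgExecStateSketch {
public:
  // "None" is an assumed name for "not yet run"; Executing/Done are taken from the MR.
  enum class State { None, Executing, Done };

  State state() const { return m_state; }
  void setState(State s) { m_state = s; }

  // The information the old boolean isExecuted() carried can still be derived.
  bool isExecuted() const { return m_state == State::Done; }

private:
  State m_state = State::None;
};

// Small helper for printing a state, e.g. in a crash report.
inline std::string toString(AlgExecStateSketch::State s) {
  switch (s) {
    case AlgExecStateSketch::State::None:      return "None";
    case AlgExecStateSketch::State::Executing: return "Executing";
    case AlgExecStateSketch::State::Done:      return "Done";
  }
  return "Unknown";
}
```

With an enum, adding a further state later only requires a new enumerator, whereas a boolean would need another flag and another pair of accessors.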
Activity
- Resolved by Marco Clemencic
- Resolved by Frank Winklmeier
- [2017-08-04 18:26] Validation started with lhcb-gaudi-merge#185
- [2017-08-05 00:02] Validation started with lhcb-future-clang#272
- [2017-08-05 00:03] Validation started with lhcb-future#492
- [2017-08-06 00:02] Validation started with lhcb-future#493
- [2017-08-06 00:03] Validation started with lhcb-future-clang#273
- [2017-08-07 00:02] Validation started with lhcb-future-clang#274
- [2017-08-07 00:03] Validation started with lhcb-future#494
- [2017-08-08 00:02] Validation started with lhcb-future-clang#275
- [2017-08-08 00:03] Validation started with lhcb-future#495
- [2017-08-09 00:03] Validation started with lhcb-future-clang#276
- [2017-08-09 00:03] Validation started with lhcb-future#496
- [2017-08-10 00:02] Validation started with lhcb-future-clang#277
- [2017-08-10 00:02] Validation started with lhcb-future#497
- [2017-08-11 00:03] Validation started with lhcb-future-clang#278
- [2017-08-11 00:03] Validation started with lhcb-future#498
- [2017-08-12 00:02] Validation started with lhcb-future-clang#279
- [2017-08-12 00:03] Validation started with lhcb-future#499
- [2017-08-13 00:03] Validation started with lhcb-future#500
- [2017-08-13 00:03] Validation started with lhcb-future-clang#280
- [2017-08-14 00:03] Validation started with lhcb-future#501
- [2017-08-14 00:03] Validation started with lhcb-future-clang#281
- [2017-08-14 10:01] Validation started with lhcb-future-clang#281
- [2017-08-14 10:01] Validation started with lhcb-future#501
- [2017-08-15 00:02] Validation started with lhcb-future#502
- [2017-08-15 00:03] Validation started with lhcb-future-clang#282
- [2017-08-16 00:02] Validation started with lhcb-future#503
- [2017-08-16 00:03] Validation started with lhcb-future-clang#283
- [2017-08-17 00:02] Validation started with lhcb-future-clang#284
- [2017-08-17 00:03] Validation started with lhcb-future#504
- [2017-08-18 00:03] Validation started with lhcb-future-clang#285
- [2017-08-18 00:03] Validation started with lhcb-future#505
- [2017-08-19 00:02] Validation started with lhcb-future-clang#286
- [2017-08-19 00:03] Validation started with lhcb-future#506
- [2017-08-20 00:02] Validation started with lhcb-future#507
- [2017-08-20 00:03] Validation started with lhcb-future-clang#287
- [2017-08-21 00:03] Validation started with lhcb-future-clang#288
- [2017-08-21 00:03] Validation started with lhcb-future#508
- [2017-08-22 00:03] Validation started with lhcb-future#509
- [2017-08-22 00:03] Validation started with lhcb-future-clang#289
- [2017-08-23 00:02] Validation started with lhcb-future-clang#290
- [2017-08-23 00:02] Validation started with lhcb-future#510
- [2017-08-24 00:03] Validation started with lhcb-future#511
- [2017-08-24 00:03] Validation started with lhcb-future-clang#291
- [2017-08-25 00:03] Validation started with lhcb-future#512
- [2017-08-25 00:04] Validation started with lhcb-future-clang#292
- [2017-08-26 00:03] Validation started with lhcb-future-clang#293
- [2017-08-26 00:04] Validation started with lhcb-future#513
- [2017-08-27 00:03] Validation started with lhcb-future-clang#294
- [2017-08-27 00:03] Validation started with lhcb-future#514
- [2017-08-28 00:03] Validation started with lhcb-future#515
- [2017-08-28 00:03] Validation started with lhcb-future-clang#295
- [2017-08-29 00:03] Validation started with lhcb-future-clang#296
- [2017-08-30 00:04] Validation started with lhcb-future-clang#297
- [2017-08-30 00:04] Validation started with lhcb-future#517
- [2017-08-30 09:40] Validation started with lhcb-future#518
- [2017-08-31 00:03] Validation started with lhcb-future#519
- [2017-08-31 00:03] Validation started with lhcb-future-clang#298
- [2017-09-01 00:03] Validation started with lhcb-future-clang#299
- [2017-09-01 00:03] Validation started with lhcb-future#520
- [2017-09-01 08:49] Validation started with lhcb-future#521
- [2017-09-02 00:03] Validation started with lhcb-future#522
- [2017-09-02 00:03] Validation started with lhcb-future-clang#300
- [2017-09-03 00:03] Validation started with lhcb-future#523
- [2017-09-03 00:03] Validation started with lhcb-future-clang#301
- [2017-09-04 00:03] Validation started with lhcb-future-clang#302
- [2017-09-04 00:04] Validation started with lhcb-future#524
- [2017-09-05 00:03] Validation started with lhcb-future#525
- [2017-09-05 00:03] Validation started with lhcb-future-clang#303
- [2017-09-06 00:03] Validation started with lhcb-future#526
- [2017-09-06 00:04] Validation started with lhcb-future-clang#304
- [2017-09-07 00:03] Validation started with lhcb-future-clang#305
- [2017-09-07 00:03] Validation started with lhcb-future#527
- [2017-09-08 00:03] Validation started with lhcb-future#528
- [2017-09-08 00:03] Validation started with lhcb-future-clang#306
- [2017-09-09 00:03] Validation started with lhcb-future-clang#307
- [2017-09-09 00:03] Validation started with lhcb-future#529
- [2017-09-10 00:02] Validation started with lhcb-future#530
- [2017-09-10 00:03] Validation started with lhcb-future-clang#308
- [2017-09-11 00:03] Validation started with lhcb-future#531
- [2017-09-11 00:04] Validation started with lhcb-future-clang#309
- [2017-09-12 00:04] Validation started with lhcb-future#532
- [2017-09-12 00:04] Validation started with lhcb-future-clang#310
- [2017-09-13 00:03] Validation started with lhcb-future#533
- [2017-09-13 00:03] Validation started with lhcb-future-clang#311
- [2017-09-14 00:03] Validation started with lhcb-future#534
- [2017-09-14 00:03] Validation started with lhcb-future-clang#312
- [2017-09-15 00:02] Validation started with lhcb-future#535
- [2017-09-15 00:02] Validation started with lhcb-future-clang#313
- [2017-09-16 00:02] Validation started with lhcb-future#536
- [2017-09-16 00:02] Validation started with lhcb-future-clang#314
- [2017-09-17 00:02] Validation started with lhcb-future-clang#315
- [2017-09-17 00:02] Validation started with lhcb-future#537
- [2017-09-18 00:03] Validation started with lhcb-future#538
- [2017-09-18 00:03] Validation started with lhcb-future-clang#316
Edited by Software for LHCb
- Resolved by Frank Winklmeier
Just for my understanding, what specifically is this new state needed for?
In general, this is going towards a full-fledged algorithm state tracking service, which I believe is the right way to go. Ultimately, the service would have to absorb all the states of GaudiHive/src/AlgsExecutionStates.cpp.
For now, to make it more generic, may I only suggest removing the "exec" prefix in the namings where it is appropriate? E.g., changing setExecState() to setState(), etc.
Edited by Illya Shapoval
added 1 commit
- 10e46de4 - Rename setExecState() to setState() to prepare for furture extension
The immediate need was to identify the algorithms that are running when e.g. the job crashes or when an external timeout signal is raised. In the single-threaded Gaudi this was possible via the AlgContextSvc (see discussion in GAUDI-1305).
I followed your suggestion and removed the "Exec" where appropriate.
@ishapova, @leggett : While testing this code in a toy example, I frequently find two algorithms marked as "executing" for a given slot. Naively I would think that should never happen. Basically I am running the following code on SIGSEGV (which I send via `kill` to the process):
```cpp
for (size_t t = 0; t < Gaudi::Concurrency::ConcurrencyFlags::numConcurrentEvents(); ++t) {
  std::string currentAlg;
  // copy the states to avoid modification by another thread while we examine them
  IAlgExecStateSvc::AlgStateMap_t states = algExecStateSvc->algExecStates(EventContext(0, t));
  for (const auto& kv : states) {
    if (kv.second.state() == AlgExecState::State::Executing) currentAlg += (kv.first + " ");
  }
  if (currentAlg.empty()) currentAlg = "<NONE>";
  std::cout << "Slot " << t << " : Current algorithm = " << currentAlg << std::endl;
}
```
and the result is e.g.:
```
Slot 0 : Current algorithm = <NONE>
Slot 1 : Current algorithm = HiveAlgC
Slot 2 : Current algorithm = HiveAlgF HiveAlgV
Slot 3 : Current algorithm = HiveAlgC
```
The way I instrumented `AlgoExecutionTask`, I thought it's guaranteed that the algorithm state is set to `Done` before the next task is launched in the slot. But I guess I am missing something?
@fwinkl : you can definitely have more than one Alg executing per slot, and in fact we hope to have many! Remember, a slot is basically a single concurrent event. If different Algs can execute concurrently on the same event because they have no data dependencies between them, then the scheduler may choose to execute them simultaneously on the same slot.
@leggett : of course! I keep forgetting that slot != thread. So the above makes sense, as we still have only 4 algorithms executing concurrently in total (and I ran with 4 threads). Thanks for clarifying.
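To make the slot-versus-thread point concrete, here is a small self-contained sketch that mirrors the output reported above: one slot holds two executing algorithms, while the total across all slots stays at 4. The types `State` and `AlgStateMap` and the helpers `executingAlgs`, `totalExecuting` and `exampleSlots` are hypothetical stand-ins, not Gaudi APIs, and the `Done` entry is made-up illustrative data.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins for the per-slot state map (NOT the Gaudi types).
enum class State { None, Executing, Done };
using AlgStateMap = std::map<std::string, State>;

// Names of all algorithms currently marked Executing in one slot.
std::vector<std::string> executingAlgs(const AlgStateMap& states) {
  std::vector<std::string> out;
  for (const auto& kv : states) {
    if (kv.second == State::Executing) out.push_back(kv.first);
  }
  return out;
}

// Total number of executing algorithms over all slots; with N worker threads
// this is the quantity that cannot exceed N, not the count per slot.
std::size_t totalExecuting(const std::vector<AlgStateMap>& slots) {
  std::size_t n = 0;
  for (const auto& s : slots) n += executingAlgs(s).size();
  return n;
}

// Slot contents mirroring the output quoted in the discussion.
std::vector<AlgStateMap> exampleSlots() {
  return {
      AlgStateMap{},                                        // slot 0: <NONE>
      AlgStateMap{{"HiveAlgC", State::Executing}},          // slot 1
      AlgStateMap{{"HiveAlgF", State::Executing},           // slot 2: two algorithms
                  {"HiveAlgV", State::Executing},           //         on the same event
                  {"HiveAlgC", State::Done}},               // illustrative Done entry
      AlgStateMap{{"HiveAlgC", State::Executing}},          // slot 3
  };
}
```

Here `executingAlgs` returns two names for slot 2, yet `totalExecuting` over the four slots gives 4, which matches the 4 threads used in the test run.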
@fwinkl Thanks! I'm fine with the MR.
@clemenci : when I click on the link to see why the build pipeline failed, I get a page with a big empty black box.
@leggett I think I found the problem in my fork of Gaudi. For some reason I had "shared runners" disabled, which probably meant it never ran the CI. I just restarted the pipeline and I see the log appearing...
added 1 commit
- f5198d63 - Fix bug introduced in Algorithm::setExecuted in previous commit
assigned to @clemenci
changed milestone to %v29r0
mentioned in commit 91fdcf00