Extend AlgExecState with an "executing" state

Merged Frank Winklmeier requested to merge fwinkl/Gaudi:algexecstate into master
All threads resolved!

The AlgExecState, accessible via the IAlgExecStateSvc, now records whether an algorithm is currently being executed. This required some interface changes to the AlgExecState, i.e. replacing the boolean [set/is]Executed() with an enum-based [set]execState(). Clients have been updated accordingly. This also opens the door to further algorithm states if they are needed later.
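As a rough illustration of the boolean-to-enum change described above, here is a minimal sketch. It is not the actual Gaudi class: `SketchAlgExecState`, the `None` enumerator, `isDone()` and `toString()` are hypothetical names chosen for this example. Only the `Executing` and `Done` states and the `setState()` name appear in this merge request.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for an enum-based execution state (assumption,
// not the real AlgExecState from Gaudi).
class SketchAlgExecState {
public:
  // "None" is an assumed initial value; Executing and Done are the
  // states mentioned in the discussion.
  enum class State { None, Executing, Done };

  State state() const { return m_state; }
  void setState(State s) { m_state = s; }

  // Covers what a boolean "executed" flag used to answer.
  bool isDone() const { return m_state == State::Done; }

private:
  State m_state = State::None;
};

// Human-readable name of a state, e.g. for log messages.
inline std::string toString(SketchAlgExecState::State s) {
  switch (s) {
    case SketchAlgExecState::State::None:      return "None";
    case SketchAlgExecState::State::Executing: return "Executing";
    case SketchAlgExecState::State::Done:      return "Done";
  }
  return "Unknown";
}
```

With an enum, adding a further state later only means adding an enumerator, whereas a boolean pair would need another setter and getter per state.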

cc @leggett

Activity

  • Frank Winklmeier
  • Edited by Software for LHCb
  • Just for my understanding, what specifically is this new state needed for?

    In general, this is going towards a full-fledged algorithm state tracking service, which I believe is the right way to go. Ultimately, the service would have to absorb all the states of GaudiHive/src/AlgsExecutionStates.cpp.

    For now, to make it more generic, may I suggest dropping the "exec" prefix from names where appropriate, e.g. changing setExecState() to setState()?

    Edited by Illya Shapoval
  • added 1 commit

    • 10e46de4 - Rename setExecState() to setState() to prepare for furture extension


  • Author Maintainer

    The immediate need was to identify the algorithms that are running when e.g. the job crashes or when an external timeout signal is raised. In the single-threaded Gaudi this was possible via the AlgContextSvc (see discussion in GAUDI-1305).

    I followed your suggestion and removed the "Exec" where appropriate.

  • Author Maintainer

    @ishapova, @leggett : While testing this code in a toy example, I frequently find two algorithms marked as "executing" for a given slot. Naively I would think that should never happen. Basically I am running the following code on SIGSEGV (which I send to the process via kill):

    for (size_t t = 0; t < Gaudi::Concurrency::ConcurrencyFlags::numConcurrentEvents(); ++t) {
      std::string currentAlg;
      // copy the states to avoid modification by another thread while we examine them
      IAlgExecStateSvc::AlgStateMap_t states = algExecStateSvc->algExecStates(EventContext(0, t));
      for (const auto& kv : states) {
        if (kv.second.state() == AlgExecState::State::Executing) {
          currentAlg += (kv.first + " ");
        }
      }
      if (currentAlg.empty()) currentAlg = "<NONE>";
      std::cout << "Slot   " << t << " : Current algorithm = " << currentAlg << std::endl;
    }

    and the result is e.g.:

    Slot   0 : Current algorithm = <NONE>
    Slot   1 : Current algorithm = HiveAlgC 
    Slot   2 : Current algorithm = HiveAlgF HiveAlgV 
    Slot   3 : Current algorithm = HiveAlgC 

    The way I instrumented AlgoExecutionTask I thought it's guaranteed that the algorithm state is set to Done before the next task is launched in the slot. But I guess I am missing something?
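The instrumentation described above can be pictured roughly as follows. This is a simplified, single-threaded sketch and not the code in AlgoExecutionTask: `SketchState` and `runInstrumented` are hypothetical names, and a real multi-threaded scheduler would additionally need the state to be safe to read from other threads.

```cpp
#include <cassert>
#include <functional>

// Hypothetical per-algorithm state (assumption for this sketch).
enum class SketchState { None, Executing, Done };

// Marks one algorithm's state around its execute call.
// The state belongs to a single algorithm, so the Executing -> Done
// ordering is guaranteed per algorithm, not per slot.
inline void runInstrumented(SketchState& state,
                            const std::function<void()>& execute) {
  state = SketchState::Executing;  // what a crash or timeout handler would observe
  execute();
  state = SketchState::Done;       // only reached if execute() returns normally
}
```

If the process is interrupted while `execute()` is running, the state is still `Executing`, which is what lets a handler report the algorithms that were active at that moment.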

  • @fwinkl : You can definitely have more than one Alg executing per slot, and in fact we hope to have many! Remember, a slot is basically a single concurrent event. If different Algs can execute concurrently on the same event because they have no data dependencies between them, then the scheduler may choose to execute them simultaneously on the same slot.

  • Author Maintainer

    @leggett : Of course! I keep forgetting that slot != thread. So the above makes sense, as we still have only 4 algorithms executing concurrently in total (and I ran with 4 threads). Thanks for clarifying.
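The consistency check made in this reply (the number of executing algorithms summed over all slots should not exceed the number of threads) can be sketched as below. The types are simplified stand-ins, not Gaudi's: `SlotSketchState`, `SlotStates` and `countExecuting` are hypothetical names, and each slot is modelled as a plain map from algorithm name to state.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins for the per-slot state map (assumption).
enum class SlotSketchState { None, Executing, Done };
using SlotStates = std::map<std::string, SlotSketchState>;

// Counts algorithms marked Executing across all slots.
// A single slot may contribute more than one, since slot != thread.
inline std::size_t countExecuting(const std::vector<SlotStates>& slots) {
  std::size_t n = 0;
  for (const auto& slot : slots) {
    for (const auto& kv : slot) {
      if (kv.second == SlotSketchState::Executing) ++n;
    }
  }
  return n;
}
```

Applied to the output quoted earlier (no algorithm in slot 0, HiveAlgC in slot 1, HiveAlgF and HiveAlgV in slot 2, HiveAlgC in slot 3), the count is 4, matching the 4 threads used in that run.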

  • @fwinkl Thanks! I'm fine with the MR.

  • @clemenci : when I click on the link to see why the build pipeline failed, I get a page with a big empty black box.

  • Author Maintainer

    @leggett I think I found the problem in my fork of Gaudi. For some reason I had "shared runners" disabled, which probably meant it never ran the CI. I just restarted the pipeline and I see the log appearing...

  • added 1 commit

    • f5198d63 - Fix bug introduced in Algorithm::setExecuted in previous commit


  • Hadrien Benjamin Grasland approved this merge request


  • Marco Clemencic changed milestone to %v29r0


  • Marco Clemencic resolved all discussions


  • Marco Clemencic mentioned in commit 91fdcf00

