Skip to content
Snippets Groups Projects

ChronoStatusSvc: Add locks for MT safety.

Merged Scott Snyder requested to merge ssnyder/Gaudi:chrono-20181204 into master

Add a lock around m_chronoEntities to prevent crashes occasionally seen in MT jobs. Note that the timing measurements will still be garbage in MT jobs --- this just tries to avoid the observed crashes.

Merge request reports

Loading
Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
  • If this still produces wrong results, isn't a crash better as it won't be used rather to it being used while the output is wrong?

    Also note that there was an effort for a new timing in !787 (merged), maybe it's worth pushing that forward instead?

  • Edited by Software for LHCb
  • @chasse, I do not think a random crash is any good. It would be better to get a proper error or exception if you try to use the service in a MT job, or have the code run to the end (assuming you are not going to rely on the numbers it produces).

  • Marco Clemencic resolved all discussions

    resolved all discussions

  • Marco Clemencic approved this merge request

    approved this merge request

  • @ssnyder, this is OK for me, but I cannot merge until the pipeline is fixed.

  • @clemenci the proper error or exception approach is, of course, better, but this isn't what's implemented now, right?
    Making something seem like it could be used (passed the it "doesn't crash" test) while producing wrong results seems dangerous...

    Edited by Christoph Hasse
  • Until I have an alternative (error, exception, big warning...) this seems a valid option (or temporary workaround).

  • @ssnyder could one not you just throw an exception or better return a StatusCode::FAILURE when you go into a multithreaded environment with this service activated?

  • Scott Snyder added 1 commit

    added 1 commit

    Compare with previous version

  • Author Developer

    hi -

    The situation is that in Atlas, we have many job configurations, and we to start being able to run them with MT. ChronoStatSvc is enabled in many different places. From one test job, i see that if ChronoStatSvc is enabled in a MT job, i get random crashes due to it in ~ a few percent of runs.

    I know there have been discussions about fixing ChronoStatSvc properly, but as far as i know, there's nothing immediately ready. (If it is just about ready, then by all means we should do that instead.) Nevertheless, i'm assuming that this will in fact get fixed in the not-to-distant future. However, random crashes cause difficultly in validating other things, so one one would like to go and get rid of them if it can be done easily.

    As Christoph says, we could go through the configurations and add a test on MT everywhere ChronoStatSvc is configured. I'm reluctant to do that though since it's probably a bunch of places but more importantly because one would then need to remember to undo those changes once a properly fixed ChronoStatSvc is available.

    We could consider suppressing the ChronoStatSvc output for MT jobs. However, since we are expecting a proper fix, i wanted to keep this change relatively minimal. (My first try was actually to just replace the map with a tbb::concurrent_hash_map. However, this broke one of the Gaudi tests because a std::string was being leaked, and it didn't seem worth the effort to try to understand exactly why.) Similarly, Synced might be nicer, but as i'm expecting this code won't have a long lifetime (famous last words...), i don't think it's really worth changing.

    In summary, this is a (hopefully!) short-term fix to prevent random ChronoStatSvc crashes from interfering with other validation until a proper solution is ready.

  • could you print a warning (from the ChronoStatSvc) when in MT jobs that one can't trust the numbers? This way, stuff doesn't stop running but people (when reading the logs) at least have a clue whats going on..

  • Scott Snyder added 1 commit

    added 1 commit

    • 7916407a - Add a warning to ChonoStatSvc in MT jobs that statistics are unreliable.

    Compare with previous version

  • Scott Snyder added 1 commit

    added 1 commit

    Compare with previous version

  • Marco Clemencic mentioned in merge request !809 (merged)

    mentioned in merge request !809 (merged)

  • Marco Clemencic changed milestone to %v31r0

    changed milestone to %v31r0

  • Marco Clemencic resolved all discussions

    resolved all discussions

  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Please register or sign in to reply
    Loading