Skip to content
Snippets Groups Projects

PerfMonComps: Switch from per-slot to per-thread metric collection in PerfMonMT

Originally we were collecting algorithm execution statistics on a per-slot basis. Each slot had its own map, protected by a mutex to avoid concurrent threads altering a slot's map simultaneously, where the key is basically a pair of the algorithm name and the execution step. This works fine in almost all cases except for algorithms that execute within a view (i.e. trigger). In that scenario, the same algorithm can execute multiple times per event, i.e. from concurrent threads for the same slot. This causes asynchronous auditor calls which mess up the metrics (but not necessarily a threading problem per-se due to locks).

Now we collect metrics on a per-thread basis. This way, we not only work around the afore mentioned limitation, but also can get rid of the locks as threads have their own maps now. In order to accomplish this, we make use of tbb::this_task_arena::current_thread_index() which provides a unique index within a task arena, and use that as the index of the vector (constructed during initialization) that holds the maps. Here we rely on the fact that we use a single task arena to run all algorithms, hence tbb::this_task_arena::current_thread_index() is unique. The new approach is tested in RAWtoALL and RDOtoRDOTrigger in 16 threads and seems to work as expected.

There are other ways to go about this. We could keep the original approach (per-slot) but give a unique identifier to algorithms running in separate views (e.g. view id) but we would still have to lock. If at any point tbb::this_task_arena::current_thread_index() is not unique (multiple parallel arenas), we can think about using the thread ids instead but that might not be as seamless.

In any case, let's see how this goes. Note that none of the official workflows should be affected by this change as we run detailed monitoring only in SPOT daily tests for the time being.

Edited by Alaettin Serhan Mete

Merge request reports

Loading
Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
  • This merge request affects 1 package:

    • Control/PerformanceMonitoring/PerfMonComps

    This merge request affects 6 files:

    • Control/PerformanceMonitoring/PerfMonComps/CMakeLists.txt
    • Control/PerformanceMonitoring/PerfMonComps/python/MTJobOptCfg.py
    • Control/PerformanceMonitoring/PerfMonComps/python/PerfMonCompsConfig.py
    • Control/PerformanceMonitoring/PerfMonComps/share/PerfMonMTSvc_jobOptions.py
    • Control/PerformanceMonitoring/PerfMonComps/src/PerfMonMTSvc.cxx
    • Control/PerformanceMonitoring/PerfMonComps/src/PerfMonMTSvc.h

    Adding @amete ,@ssnyder as watchers

  • :negative_squared_cross_mark: CI Result FAILURE

    Athena AthSimulation AthGeneration AnalysisBase AthAnalysis DetCommon
    externals :white_check_mark: :white_check_mark: :white_check_mark: :o: :white_check_mark: :white_check_mark:
    cmake N/A N/A N/A N/A N/A N/A
    make N/A N/A N/A N/A N/A N/A
    required tests N/A N/A N/A N/A N/A N/A
    optional tests N/A N/A N/A N/A N/A N/A

    :exclamation: Due to problems in externals build or cmake configuration the job is stopped, results are not available on the ATLAS CI monitor :white_check_mark: Athena: number of compilation errors 0, warnings 0
    :white_check_mark: AthSimulation: number of compilation errors 0, warnings 0
    :white_check_mark: AthGeneration: number of compilation errors 0, warnings 0
    :white_check_mark: AnalysisBase: number of compilation errors 0, warnings 0
    :white_check_mark: AthAnalysis: number of compilation errors 0, warnings 0
    :white_check_mark: DetCommon: number of compilation errors 0, warnings 0
    :pencil: For experts only: Jenkins output [CI-MERGE-REQUEST-CC7 41043]

  • Check build config

    Sergei L1

  • I have no idea how this MR can cause the build for the externals to fail and given the same issue is there in other MRs, e.g. !47612 (merged), I think there is something going on w/ the CI. Any ideas @aundrus?

  • Jenkins please retry a build

  • This merge request affects 1 package:

    • Control/PerformanceMonitoring/PerfMonComps

    This merge request affects 6 files:

    • Control/PerformanceMonitoring/PerfMonComps/CMakeLists.txt
    • Control/PerformanceMonitoring/PerfMonComps/python/MTJobOptCfg.py
    • Control/PerformanceMonitoring/PerfMonComps/python/PerfMonCompsConfig.py
    • Control/PerformanceMonitoring/PerfMonComps/share/PerfMonMTSvc_jobOptions.py
    • Control/PerformanceMonitoring/PerfMonComps/src/PerfMonMTSvc.cxx
    • Control/PerformanceMonitoring/PerfMonComps/src/PerfMonMTSvc.h

    Adding @amete ,@ssnyder as watchers

    • Resolved by Frank Winklmeier

      The retry proceeded successfully and built externals successfully.

      As concerns errors of the previous externals builds:

      Apparently @fwinkl aborted build for MR!47611 on aibuild16-006 : https://atlas-sit-ci.cern.ch/job/CI-MERGE-REQUEST-CC7/41036/. This caused corruption of build area and all further externals builds on this machine failed. I always cautioned against killing CI builds because it can leave some binaries corrupted and break next incremental builds.

      I will clean the local build area of aibuild16-006 momentarily, create ATLINFR ticket to record the incidence of build area damage after job interruption

  • Alaettin Serhan Mete resolved all threads

    resolved all threads

  • Hi @fwinkl, I must admit that I also cancel job via Jenkins web interface occasionally. In most cases it works OK. I think the problem can arise when clicking on Jenkins cancel icon does not stop all processes on the machine immediately. Then the next job starts and if there are lingering gcc processes from the previous build then corruption is possible.

    For aibuild16-006 I just paused the CI service, stopped an ongoing build at externals stage from Jenkins, logged on this machine and observed few ongoing gcc processes. I waited couple of minutes and then killed them manually as I needed to clean the disk and bring the machine back in service.

  • :negative_squared_cross_mark: CI Result FAILURE (hash d9aca078)

    Athena AthSimulation AthGeneration AnalysisBase AthAnalysis DetCommon
    externals :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark:
    cmake :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark:
    make :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark:
    required tests :o: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark:
    optional tests :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark:

    Full details available on this CI monitor view
    :white_check_mark: Athena: number of compilation errors 0, warnings 0
    :white_check_mark: AthSimulation: number of compilation errors 0, warnings 0
    :white_check_mark: AthGeneration: number of compilation errors 0, warnings 0
    :white_check_mark: AnalysisBase: number of compilation errors 0, warnings 0
    :white_check_mark: AthAnalysis: number of compilation errors 0, warnings 0
    :white_check_mark: DetCommon: number of compilation errors 0, warnings 0
    :pencil: For experts only: Jenkins output [CI-MERGE-REQUEST-CC7 41046]

  • The failure in Trigger_athena_MC-test is not related to this MR.

  • added 2 commits

    • 7c76a35a - Undo some unnecessary changes
    • 8aa2e1d3 - Switch to this_task_arena namespace

    Compare with previous version

  • This merge request affects 1 package:

    • Control/PerformanceMonitoring/PerfMonComps

    This merge request affects 3 files:

    • Control/PerformanceMonitoring/PerfMonComps/CMakeLists.txt
    • Control/PerformanceMonitoring/PerfMonComps/src/PerfMonMTSvc.cxx
    • Control/PerformanceMonitoring/PerfMonComps/src/PerfMonMTSvc.h

    Adding @amete ,@ssnyder as watchers

  • Alaettin Serhan Mete changed the description

    changed the description

  • :negative_squared_cross_mark: CI Result FAILURE (hash 8aa2e1d3)

    Athena AthSimulation AthGeneration AnalysisBase AthAnalysis DetCommon
    externals :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark:
    cmake :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark:
    make :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark:
    required tests :o: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark:
    optional tests :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark:

    Full details available on this CI monitor view
    :white_check_mark: Athena: number of compilation errors 0, warnings 0
    :white_check_mark: AthSimulation: number of compilation errors 0, warnings 0
    :white_check_mark: AthGeneration: number of compilation errors 0, warnings 0
    :white_check_mark: AnalysisBase: number of compilation errors 0, warnings 0
    :white_check_mark: AthAnalysis: number of compilation errors 0, warnings 0
    :white_check_mark: DetCommon: number of compilation errors 0, warnings 0
    :pencil: For experts only: Jenkins output [CI-MERGE-REQUEST-CC7 41094]

  • Trigger_athena_MC-test failure is unrelated to this MR.

  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
Please register or sign in to reply
Loading