PerfMonComps: Switch from per-slot to per-thread metric collection in PerfMonMT
Originally we were collecting algorithm execution statistics on a per-slot basis. Each slot had its own map, protected by a mutex to avoid concurrent threads altering a slot's map simultaneously, where the key is basically a pair of the algorithm name and the execution step. This works fine in almost all cases except for algorithms that execute within a view (i.e. trigger). In that scenario, the same algorithm can execute multiple times per event, i.e. from concurrent threads for the same slot. This causes asynchronous auditor calls which mess up the metrics (but not necessarily a threading problem per-se due to locks).
Now we collect metrics on a per-thread basis. This way, we not only work around the afore mentioned limitation, but also can get rid of the locks as threads have their own maps now. In order to accomplish this, we make use of tbb::this_task_arena::current_thread_index()
which provides a unique index within a task arena, and use that as the index of the vector (constructed during initialization) that holds the maps. Here we rely on the fact that we use a single task arena to run all algorithms, hence tbb::this_task_arena::current_thread_index()
is unique. The new approach is tested in RAWtoALL
and RDOtoRDOTrigger
in 16 threads and seems to work as expected.
There are other ways to go about this. We could keep the original approach (per-slot) but give a unique identifier to algorithms running in separate views (e.g. view id) but we would still have to lock. If at any point tbb::this_task_arena::current_thread_index()
is not unique (multiple parallel arenas), we can think about using the thread ids instead but that might not be as seamless.
In any case, let's see how this goes. Note that none of the official workflows should be affected by this change as we run detailed monitoring only in SPOT daily tests for the time being.
Merge request reports
Activity
This merge request affects 1 package:
- Control/PerformanceMonitoring/PerfMonComps
This merge request affects 6 files:
- Control/PerformanceMonitoring/PerfMonComps/CMakeLists.txt
- Control/PerformanceMonitoring/PerfMonComps/python/MTJobOptCfg.py
- Control/PerformanceMonitoring/PerfMonComps/python/PerfMonCompsConfig.py
- Control/PerformanceMonitoring/PerfMonComps/share/PerfMonMTSvc_jobOptions.py
- Control/PerformanceMonitoring/PerfMonComps/src/PerfMonMTSvc.cxx
- Control/PerformanceMonitoring/PerfMonComps/src/PerfMonMTSvc.h
added Core master review-pending-level-1 labels
CI Result FAILUREAthena AthSimulation AthGeneration AnalysisBase AthAnalysis DetCommon externals cmake N/A N/A N/A N/A N/A N/A make N/A N/A N/A N/A N/A N/A required tests N/A N/A N/A N/A N/A N/A optional tests N/A N/A N/A N/A N/A N/A Due to problems in externals build or cmake configuration the job is stopped, results are not available on the ATLAS CI monitor Athena: number of compilation errors 0, warnings 0
AthSimulation: number of compilation errors 0, warnings 0
AthGeneration: number of compilation errors 0, warnings 0
AnalysisBase: number of compilation errors 0, warnings 0
AthAnalysis: number of compilation errors 0, warnings 0
DetCommon: number of compilation errors 0, warnings 0
For experts only: Jenkins output [CI-MERGE-REQUEST-CC7 41043]added review-user-action-required label and removed review-pending-level-1 label
I have no idea how this MR can cause the build for the externals to fail and given the same issue is there in other MRs, e.g. !47612 (merged), I think there is something going on w/ the CI. Any ideas @aundrus?
This merge request affects 1 package:
- Control/PerformanceMonitoring/PerfMonComps
This merge request affects 6 files:
- Control/PerformanceMonitoring/PerfMonComps/CMakeLists.txt
- Control/PerformanceMonitoring/PerfMonComps/python/MTJobOptCfg.py
- Control/PerformanceMonitoring/PerfMonComps/python/PerfMonCompsConfig.py
- Control/PerformanceMonitoring/PerfMonComps/share/PerfMonMTSvc_jobOptions.py
- Control/PerformanceMonitoring/PerfMonComps/src/PerfMonMTSvc.cxx
- Control/PerformanceMonitoring/PerfMonComps/src/PerfMonMTSvc.h
added review-pending-level-1 label and removed review-user-action-required label
- Resolved by Frank Winklmeier
The retry proceeded successfully and built externals successfully.
As concerns errors of the previous externals builds:
Apparently @fwinkl aborted build for MR!47611 on aibuild16-006 : https://atlas-sit-ci.cern.ch/job/CI-MERGE-REQUEST-CC7/41036/. This caused corruption of build area and all further externals builds on this machine failed. I always cautioned against killing CI builds because it can leave some binaries corrupted and break next incremental builds.
I will clean the local build area of aibuild16-006 momentarily, create ATLINFR ticket to record the incidence of build area damage after job interruption
Hi @fwinkl, I must admit that I also cancel job via Jenkins web interface occasionally. In most cases it works OK. I think the problem can arise when clicking on Jenkins cancel icon does not stop all processes on the machine immediately. Then the next job starts and if there are lingering gcc processes from the previous build then corruption is possible.
For aibuild16-006 I just paused the CI service, stopped an ongoing build at externals stage from Jenkins, logged on this machine and observed few ongoing gcc processes. I waited couple of minutes and then killed them manually as I needed to clean the disk and bring the machine back in service.
CI Result FAILURE (hash d9aca078)Athena AthSimulation AthGeneration AnalysisBase AthAnalysis DetCommon externals cmake make required tests optional tests Full details available on this CI monitor view
Athena: number of compilation errors 0, warnings 0
AthSimulation: number of compilation errors 0, warnings 0
AthGeneration: number of compilation errors 0, warnings 0
AnalysisBase: number of compilation errors 0, warnings 0
AthAnalysis: number of compilation errors 0, warnings 0
DetCommon: number of compilation errors 0, warnings 0
For experts only: Jenkins output [CI-MERGE-REQUEST-CC7 41046]added 2 commits
This merge request affects 1 package:
- Control/PerformanceMonitoring/PerfMonComps
This merge request affects 3 files:
- Control/PerformanceMonitoring/PerfMonComps/CMakeLists.txt
- Control/PerformanceMonitoring/PerfMonComps/src/PerfMonMTSvc.cxx
- Control/PerformanceMonitoring/PerfMonComps/src/PerfMonMTSvc.h
CI Result FAILURE (hash 8aa2e1d3)Athena AthSimulation AthGeneration AnalysisBase AthAnalysis DetCommon externals cmake make required tests optional tests Full details available on this CI monitor view
Athena: number of compilation errors 0, warnings 0
AthSimulation: number of compilation errors 0, warnings 0
AthGeneration: number of compilation errors 0, warnings 0
AnalysisBase: number of compilation errors 0, warnings 0
AthAnalysis: number of compilation errors 0, warnings 0
DetCommon: number of compilation errors 0, warnings 0
For experts only: Jenkins output [CI-MERGE-REQUEST-CC7 41094]