Assorted CI improvements

- The existing CI jobs can be used to generate reference files for all cards and sequences. They are stored as artifacts.
- Increase the number of events in run_physics_efficiency to 5000. This is now set in a variable at the top of .gitlab-ci.yml.
- Partly address #257.
- Improved throughput reporting #236 (closed):
  - complain about a device-averaged throughput decrease exceeding 2.5%, and fail the job on detection
  - merge all reports into one message (one for the minimal tests, one for the full tests)
  - ping the pipeline triggerer on Mattermost to help increase awareness of the availability of these reports
  - summarise throughput decrease alerts at the top of these messages
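The device-averaged check described above could be sketched as follows. This is a minimal illustration, not the actual CI script: the helper name `throughput_decreased` and the example numbers are hypothetical; only the 2.5% threshold and the fail-on-detection behaviour come from this MR.

```python
THRESHOLD = 0.025  # maximum tolerated relative decrease (2.5%)

def throughput_decreased(reference: float, current: float,
                         threshold: float = THRESHOLD) -> bool:
    """Return True if `current` throughput dropped by more than `threshold`
    relative to the `reference` (device-averaged) throughput."""
    if reference <= 0:
        return False  # no reference available, nothing to compare against
    return (reference - current) / reference > threshold

# A 4% drop against the reference trips the check (the job would fail);
# a 1.5% drop stays within tolerance.
assert throughput_decreased(100.0e3, 96.0e3)
assert not throughput_decreased(100.0e3, 98.5e3)
```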
To-do

- Needs rebase.
- The new throughput report can go over the 16k message limit on Mattermost - try posting separate messages to the same thread to keep things grouped together.
- Check that the GPU_LINK_SPEED is 16x.
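Splitting the report for the to-do item above could look something like this sketch. It only shows the chunking step; the Mattermost posting itself (and the 16k limit as an exact character count) is assumed, and the function name is hypothetical.

```python
MAX_LEN = 16000  # Mattermost message size limit mentioned in the to-do above

def split_message(text: str, max_len: int = MAX_LEN) -> list:
    """Split `text` on line boundaries into chunks of at most `max_len`
    characters, so each chunk can be posted as a reply in the same thread.
    Note: a single line longer than `max_len` is kept whole."""
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        if current and len(current) + len(line) > max_len:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks
```

Posting the first chunk as the message and the rest as replies to it would keep the full report grouped in one thread.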
Activity
added GitLab CI testing labels
@dcampora would this be a good place to add reference files for all cards, with the default and scifiv6 sequence?
Yes, I think so! :)
For the full sequence, are there more physics configurations run?
Edited by Daniel Hugo Campora Perez

(And a few more datasets. SMOG2_pppHe for example. I'll have a more concrete look later/tomorrow.)
Edited by Ryunosuke O'Neil

Here are all the reference files that were created:
minimal: https://gitlab.cern.ch/lhcb/Allen/-/jobs/14188338/artifacts/browse/generated_reference_files/ (All Cards - hlt1_pp_validation sequence only, bsphiphi data only)
full: https://gitlab.cern.ch/lhcb/Allen/-/jobs/14188442/artifacts/browse/generated_reference_files/ (All Cards - SciFiv6 sequence + data only)
Updating the reference files was really easy. I downloaded the artifact archives from each of those jobs and dropped the reference files into test/reference. Then git handles everything else:

git add test/reference/*.txt
git commit -m "updated reference files."
git push
Edited by Ryunosuke O'Neil
added 1 commit
- 1063580a - Add reference files for all cards, hlt1_pp_validation, and hlt1_pp_scifi_v6_validation.
I have updated the reference files a few times now and am seeing differences each time: https://gitlab.cern.ch/lhcb/Allen/-/jobs/14597346 e.g.
--- /builds/lhcb/Allen/test/reference/bsphiphi_mag_down_201907_hlt1_pp_validation_teslav100.txt 2021-06-18 14:54:30.060000000 +0200
+++ efficiency_bsphiphi_mag_down_201907_hlt1_pp_validation_teslav100.txt 2021-06-18 14:54:42.400000000 +0200
@@ -18,12 +18,12 @@
 MC PV is isolated if dz to closest reconstructible MC PV > 10.00 mm
 REC and MC vertices matched by dz distance
-All : 0.928 ( 22165/ 23891)
-Isolated : 0.966 ( 11884/ 12297)
+All : 0.928 ( 22164/ 23891)
+Isolated : 0.966 ( 11883/ 12297)
 Close : 0.887 ( 10281/ 11594)
-False rate : 0.013 ( 286/ 22451)
-Real false rate : 0.013 ( 286/ 22451)
-Clones : 0.000 ( 1/ 22165)
+False rate : 0.013 ( 286/ 22450)
+Real false rate : 0.013 ( 286/ 22450)
+Clones : 0.000 ( 1/ 22164)
Very interesting! I am looking into it, as this could also explain the instability we see in the run change / no run change results.
One recent point that was found while discussing with a student was the use of FP atomicAdd on the PV code. I have created an issue about it: #258 . That may be related to this instability, although clearly there is something in the VELO tracking as well.
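The mechanism behind that suspicion is that floating-point addition is not associative, so the non-deterministic order in which concurrent atomicAdd operations commit on the GPU can change the result in the last bits. A minimal Python stand-in (not the PV code itself) makes the order dependence visible:

```python
# Floating-point addition is not associative: summing the same values in a
# different order, as concurrent atomicAdds may do, can give different results.
vals = [0.1, 1e16, -1e16, 0.2]
forward = sum(vals)             # ((0.1 + 1e16) - 1e16) + 0.2 == 0.2 (the 0.1 is absorbed)
backward = sum(reversed(vals))  # ((0.2 - 1e16) + 1e16) + 0.1 == 0.1 (the 0.2 is absorbed)
assert forward != backward
```

In a real kernel the effect is far smaller per event, but it is enough to move a track or vertex across a matching cut, which is consistent with the off-by-one counts in the diff above.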
I was just writing down a similar comment :-) fully agree on the correlation with the run changes (see this issue #208 (closed)) and that we should look separately into the PV algorithm.
FYI: !578 (merged) should fix the instability observed in the Velo tracks.
Hopefully this thread is resolved now!
If the run changes and efficiency tests are now stable, perhaps we can remove allow_failure: true at this point? (or am I getting ahead of myself?)
Edited by Ryunosuke O'Neil

Actually the run changes tests became unstable again with the use of a new dataset, so it looks like allow_failure: true should live for another day.
Edited by Daniel Hugo Campora Perez
added 145 commits

- 64d9ce0e...244d4861 - 142 commits from branch master
- c752e64b - make sure all generated reference files are archived, not just the ones with...
- 2acf0f19 - Add reference files for all cards, hlt1_pp_validation, and hlt1_pp_scifi_v6_validation.
- 77520aa3 - increase to 5000 events
mentioned in issue #257
added 1 commit
- eba090f9 - don't start full pipeline until "check" jobs pass.
changed title from Add reference files for all cards for hlt1_pp_validation and hlt1_pp_scifi_v6_validation; Ensure all generated reference files are briefly saved as CI artifacts to Add reference files for all cards for hlt1_pp_validation and hlt1_pp_scifi_v6_validation; reference files briefly saved as CI artifacts; fix #257; 1000 --> 5000 events in run_physics_efficiency
added 1 commit
- c57c3be6 - Try to monitor the load average while building
added 1 commit
- e6830639 - Add throughput decrease checker; refactoring of python scripts
added 1 commit
- 9d916898 - This doesn't have the desired behaviour yet.
changed title from Add reference files for all cards for hlt1_pp_validation and hlt1_pp_scifi_v6_validation; reference files briefly saved as CI artifacts; fix #257; 1000 --> 5000 events in run_physics_efficiency to Add reference files for all tested cards and sequences; reference files briefly saved as CI artifacts; fix #257; 5000 events in run_physics_efficiency
added 1 commit
- eefc07cc - fix get_master_throughput and complain if no reference available
added 2 commits
This MR now also relates to #236 (closed) as the throughput job now fails and complains in Mattermost if the average throughput decreases by more than 2.5%.
The threshold may not be ideal, so anyone interested should feel free to suggest a better setting.
In addition to the device-averaged throughput, I could also make it possible to warn about single-device decreases.
Edited by Ryunosuke O'Neil
mentioned in issue #236 (closed)