Assorted CI improvements

- The existing CI jobs can be used to generate reference files for all cards and sequences. They are stored as artifacts.
- Increase the number of events in run_physics_efficiency to 5000. This is now set in a variable at the top of .gitlab-ci.yml.
- Partly address #257.
- Improved throughput reporting #236 (closed):
  - complain about a device-averaged throughput decrease exceeding 2.5%, and fail the job on detection
  - merge all reports into one message (one for the minimal tests, one for the full tests)
  - ping the pipeline triggerer on Mattermost to help increase awareness of the availability of these reports
  - summarise throughput decrease alerts at the top of these messages
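The device-averaged check described above could be sketched as follows. This is a minimal illustration, not the actual CI script: the helper name `throughput_decreased` and the example numbers are hypothetical; only the 2.5% threshold and the fail-on-detection behaviour come from this MR.

```python
THRESHOLD = 0.025  # maximum tolerated relative decrease (2.5%)

def throughput_decreased(reference: float, current: float,
                         threshold: float = THRESHOLD) -> bool:
    """Return True if `current` throughput dropped by more than `threshold`
    relative to the `reference` (device-averaged) throughput."""
    if reference <= 0:
        return False  # no reference available, nothing to compare against
    return (reference - current) / reference > threshold

# A 4% drop against the reference trips the check (the job would fail);
# a 1.5% drop stays within tolerance.
assert throughput_decreased(100.0e3, 96.0e3)
assert not throughput_decreased(100.0e3, 98.5e3)
```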
To-do

- Needs rebase.
- The new throughput report can go over the 16k message limit on Mattermost - try posting separate messages to the same thread to keep things grouped together.
- Check that the GPU_LINK_SPEED is 16x.
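Splitting the report for the to-do item above could look something like this sketch. It only shows the chunking step; the Mattermost posting itself (and the 16k limit as an exact character count) is assumed, and the function name is hypothetical.

```python
MAX_LEN = 16000  # Mattermost message size limit mentioned in the to-do above

def split_message(text: str, max_len: int = MAX_LEN) -> list:
    """Split `text` on line boundaries into chunks of at most `max_len`
    characters, so each chunk can be posted as a reply in the same thread.
    Note: a single line longer than `max_len` is kept whole."""
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        if current and len(current) + len(line) > max_len:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks
```

Posting the first chunk as the message and the rest as replies to it would keep the full report grouped in one thread.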
Activity
added GitLab CI testing labels
@dcampora would this be a good place to add reference files for all cards, with the default and scifiv6 sequence?
Yes, I think so! :)
For the full sequence, are there more physics configurations run?
Edited by Daniel Hugo Campora Perez

(And a few more datasets. SMOG2_pppHe for example. I'll have a more concrete look later/tomorrow.)
Edited by Ryunosuke O'Neil

Here are all the reference files that were created:
minimal: https://gitlab.cern.ch/lhcb/Allen/-/jobs/14188338/artifacts/browse/generated_reference_files/ (All Cards - hlt1_pp_validation sequence only, bsphiphi data only)
full: https://gitlab.cern.ch/lhcb/Allen/-/jobs/14188442/artifacts/browse/generated_reference_files/ (All Cards - SciFiv6 sequence + data only)
Updating the reference files was really easy. I downloaded the artifact archives from each of those jobs and dropped the reference files into test/reference. Then git handles everything else:

git add test/reference/*.txt
git commit -m "updated reference files."
git push
Edited by Ryunosuke O'Neil
added 1 commit
- 1063580a - Add reference files for all cards, hlt1_pp_validation, and hlt1_pp_scifi_v6_validation.
I have updated the reference files a few times now and am seeing differences each time: https://gitlab.cern.ch/lhcb/Allen/-/jobs/14597346 e.g.
--- /builds/lhcb/Allen/test/reference/bsphiphi_mag_down_201907_hlt1_pp_validation_teslav100.txt 2021-06-18 14:54:30.060000000 +0200
+++ efficiency_bsphiphi_mag_down_201907_hlt1_pp_validation_teslav100.txt 2021-06-18 14:54:42.400000000 +0200
@@ -18,12 +18,12 @@
 MC PV is isolated if dz to closest reconstructible MC PV > 10.00 mm
 REC and MC vertices matched by dz distance
-All : 0.928 ( 22165/ 23891)
-Isolated : 0.966 ( 11884/ 12297)
+All : 0.928 ( 22164/ 23891)
+Isolated : 0.966 ( 11883/ 12297)
 Close : 0.887 ( 10281/ 11594)
-False rate : 0.013 ( 286/ 22451)
-Real false rate : 0.013 ( 286/ 22451)
-Clones : 0.000 ( 1/ 22165)
+False rate : 0.013 ( 286/ 22450)
+Real false rate : 0.013 ( 286/ 22450)
+Clones : 0.000 ( 1/ 22164)
Very interesting! I am looking into it, as this could also explain the instability we see in the run change / no run change results.
One recent point that was found while discussing with a student was the use of FP atomicAdd on the PV code. I have created an issue about it: #258 . That may be related to this instability, although clearly there is something in the VELO tracking as well.
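The mechanism behind that suspicion is that floating-point addition is not associative, so the non-deterministic order in which concurrent atomicAdd operations commit on the GPU can change the result in the last bits. A minimal Python stand-in (not the PV code itself) makes the order dependence visible:

```python
# Floating-point addition is not associative: summing the same values in a
# different order, as concurrent atomicAdds may do, can give different results.
vals = [0.1, 1e16, -1e16, 0.2]
forward = sum(vals)             # ((0.1 + 1e16) - 1e16) + 0.2 == 0.2 (the 0.1 is absorbed)
backward = sum(reversed(vals))  # ((0.2 - 1e16) + 1e16) + 0.1 == 0.1 (the 0.2 is absorbed)
assert forward != backward
```

In a real kernel the effect is far smaller per event, but it is enough to move a track or vertex across a matching cut, which is consistent with the off-by-one counts in the diff above.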
I was just writing down a similar comment :-) fully agree on the correlation with the run changes (see this issue #208 (closed)) and that we should look separately into the PV algorithm.
FYI: !578 (merged) should fix the instability observed in the Velo tracks.
Hopefully this thread is resolved now!
If the run changes and efficiency tests are now stable, perhaps we can remove allow_failure: true at this point? (or am I getting ahead of myself?)
Edited by Ryunosuke O'Neil

Actually the run changes tests became unstable again with the use of a new dataset, so it looks like allow_failure: true should live for another day.
Edited by Daniel Hugo Campora Perez
added 145 commits

- 64d9ce0e...244d4861 - 142 commits from branch master
- c752e64b - make sure all generated reference files are archived, not just the ones with...
- 2acf0f19 - Add reference files for all cards, hlt1_pp_validation, and hlt1_pp_scifi_v6_validation.
- 77520aa3 - increase to 5000 events
mentioned in issue #257
added 1 commit
- eba090f9 - don't start full pipeline until "check" jobs pass.
changed title from Add reference files for all cards for hlt1_pp_validation and hlt1_pp_scifi_v6_validation; Ensure all generated reference files are briefly saved as CI artifacts to Add reference files for all cards for hlt1_pp_validation and hlt1_pp_scifi_v6_validation; reference files briefly saved as CI artifacts; fix #257; 1000 --> 5000 events in run_physics_efficiency
added 1 commit
- c57c3be6 - Try to monitor the load average while building
added 1 commit
- e6830639 - Add throughput decrease checker; refactoring of python scripts
added 1 commit
- 9d916898 - This doesn't have the desired behaviour yet.
changed title from Add reference files for all cards for hlt1_pp_validation and hlt1_pp_scifi_v6_validation; reference files briefly saved as CI artifacts; fix #257; 1000 --> 5000 events in run_physics_efficiency to Add reference files for all tested cards and sequences; reference files briefly saved as CI artifacts; fix #257; 5000 events in run_physics_efficiency
added 1 commit
- eefc07cc - fix get_master_throughput and complain if no reference available
added 2 commits
This MR now also relates to #236 (closed) as the throughput job now fails and complains in Mattermost if the average throughput decreases by more than 2.5%.
The threshold may not be ideal, so anyone interested should feel free to suggest a better setting.
In addition to the device-averaged throughput, I could also make it possible to warn about single-device decreases.
Edited by Ryunosuke O'Neil
mentioned in issue #236 (closed)