@lugrazet @rjhunter @shunan @msaur this fixes the problem reported on Mattermost. I haven't compared the change in the report in detail, but at a high level there seems to be little difference (the test on lhcb-master is not finished yet, but one can compare lhcb-head to the MR test).
We still need a few follow-ups:
The following step is failing with KeyError: 'LongProtos'
# 2. Compute line descriptives: persist reco, extra output
echo 'Obtaining line descriptives'
time gaudirun.py $PRCONFIGROOT/python/MooreTests/line-descriptives.py
We need to detect when one of the steps fails and make that clear in the report (and in the feedback comment on GitLab); a minimal sketch of one way to do this is after this list.
2.5-3 hours seems like a rather long runtime to me. I imagine (I didn't check) that we can easily saturate the runners, so we should try to improve this.
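On the failure-detection point, a minimal sketch of what I have in mind, assuming the steps are shell commands driven from a Python wrapper; the step list and the summary handling here are hypothetical, only the line-descriptives command is taken from the step above:

import subprocess

# Hypothetical list of (name, command) pairs for the BW-test steps;
# only the line-descriptives command is the real one quoted above.
STEPS = [
    ("line-descriptives",
     "gaudirun.py $PRCONFIGROOT/python/MooreTests/line-descriptives.py"),
]

def run_steps(steps):
    """Run each step, record its exit code, and return a status summary
    that the report / GitLab feedback code can surface prominently."""
    summary = {}
    for name, cmd in steps:
        result = subprocess.run(cmd, shell=True)
        summary[name] = result.returncode
        if result.returncode != 0:
            print(f"ERROR: step '{name}' failed with exit code {result.returncode}")
    return summary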
From my perspective Point 2 is definitely the priority, but might take some work.
As Sascha says, point 1 seems to have a tractable cause, so a quick fix for that should be possible from one of us.
I'll set aside some time to work on this. First priority is to get the test to work again and fail cleanly. I'll find the bit of code which publishes to gitlab and ensure that that gives a clear message in case of failure. I'll then think about low-hanging fruit to speed the test up. @lugrazet and I have already come across a few obvious things but haven't quantified how much speedup they'll achieve.
I have a fix for point 1 (it's just missing options.persistreco_version = 0.0) and I have ideas for how to improve point 2. Point 3 will take more thought/digging. MR(s) hopefully today.
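For reference, a minimal sketch of where that setting would go, assuming the usual `from Moore import options` pattern; exactly which options file feeds line-descriptives.py is my assumption here:

from Moore import options

# Setting whose absence leads to the KeyError: 'LongProtos' reported above,
# per the fix described in this comment.
options.persistreco_version = 0.0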
I've spent some time looking at logs recently, for example for this BW test, which took around 3 hours to complete. For the Moore_hlt2_bandwidth test there, the test has basically 4 bits of work to complete:
Run all HLT2 lines writing to 1 streamless MDF file,
Run all HLT2 lines with the production (turbo, full, turcal) streaming configuration,
Run all HLT2 lines streaming by WG,
Run a series of python/GaudiPython analysis steps to work out rates, BWs and ultimately compose the html pages.
The log above gives the following timing information on the first job:
Starts at 08:07:12,
Says "Welcome to Moore" at 08:11:25. That means it spent roughly 4 minutes downloading the MDF files it is to run over.
Next message is INFO ---> End of Initialization. This took 245082 ms, i.e. roughly 4 minutes of initialisation.
Next it compiles the functor cache: INFO Compilation of functor library took 664 seconds (roughly 11 minutes).
Starts timing the event loop one-tenth of the way in, at 08:28:14, and finishes the timing nine-tenths of the way through, at 08:40:54, at a rate of 105 Hz. That is 12m40s for 80% of the events, so roughly 16 minutes for the full loop.
The log reports that the job took 35m23s; adding up the previous times, finalisation therefore takes a negligible amount of time.
This equates to:
4m downloading,
4m initializing,
11m building functors,
16m running Moore at 100Hz,
~0 finalization time.
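For what it's worth, here is a rough sketch of how these numbers can be pulled out of a job log; the two log markers are the ones quoted above, everything else about the parser (file handling, structure) is purely illustrative:

import re

# Markers quoted from the Moore job log above.
PATTERNS = {
    "initialization_ms": re.compile(r"End of Initialization\. This took (\d+) ms"),
    "functor_build_s": re.compile(r"Compilation of functor library took (\d+) seconds"),
}

def extract_timings(log_path):
    """Return the timing numbers found in a Moore job log, in seconds."""
    timings = {}
    with open(log_path) as f:
        text = f.read()
    m = PATTERNS["initialization_ms"].search(text)
    if m:
        timings["initialization"] = int(m.group(1)) / 1000.0
    m = PATTERNS["functor_build_s"].search(text)
    if m:
        timings["functor_build"] = float(m.group(1))
    return timings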
If I repeat the same exercise on jobs 2 and 3, the only difference is that the initialisation time increases with the number of streams (@nskidmor tells me there'll be a control flow per stream, so this is expected). Here are the respective times:
HLTControlFlowMgr INFO ---> End of Initialization. This took 245082 ms
HLTControlFlowMgr INFO ---> End of Initialization. This took 417307 ms
HLTControlFlowMgr INFO ---> End of Initialization. This took 1456444 ms
This is reflected in the reported times of jobs 1, 2 and 3, which are 35m23s, 36m58s and 54m20s respectively. Indeed the functor compilation, download and event loop all take around the same amount of time for the three jobs; it is just the initialisation that makes their runtimes differ. In the streamed-by-WG case, initialisation alone takes 24 minutes!
Finally, the analysis scripts sum up to 40 minutes. I'm working on bringing that down - hopefully I can achieve a factor 2-4 just thinking about what is necessary.
What's the sprucing equivalent?
A variety of things mean that the sprucing test is already around a factor of two faster. It doesn't have to do a production-streamed job (job 2), and the two Moore jobs left over take 5 and 17 mins respectively. Initialisation (30 seconds), the functor build (51 seconds) and the event loop (~165Hz) are all a lot faster in sprucing than in Moore, and here the equivalent of job 1 runs over just 10k events rather than 100k.
Why is the functor build so slow?
First of all, one would hope that we could pick up an already-built functor cache, at least for the production-streamed job: it should already be being built in the Moore CI tests upstream. I'm not an expert on the functor cache, so I haven't worked out yet how to do that.
Furthermore, I dug around the CI test logs from a recent lhcb-head and found:
test_hlt2_bandwidth_5streams takes a similar time (11 mins) to build its functors,
test_hlt2_all_lines_with_reco_with_streams_mdf also needs a fresh functor build, but in its case that build takes only 30-40 seconds.
Locally, I can reproduce the order of magnitude difference, although both jobs are a good deal faster on my machine. For me, the functor builds take:
I'm currently working on understanding the difference between these two tests.
What can be done?
I've already shed a lot of the analysis jobs.
Job 1 can be removed. It has two uses: 1) creating a massive .csv similarity matrix between all lines, which no-one is ever going to try to parse (already removed); 2) working out the similarity between the WG or production streams, which can instead be done by comparing the sets of event numbers in the streamed .mdfs - you don't need a streamless MDF for this.
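A sketch of what I mean by the second point, i.e. computing the overlap between streams purely from the per-stream event numbers; the helper that reads the event numbers out of each streamed .mdf is assumed here, not existing code:

from itertools import combinations

def stream_overlaps(event_numbers_per_stream):
    """Given {stream_name: set of (run, event) numbers}, return the pairwise
    overlap fractions, replacing the streamless-MDF similarity matrix."""
    overlaps = {}
    for (a, ev_a), (b, ev_b) in combinations(event_numbers_per_stream.items(), 2):
        union = ev_a | ev_b
        overlaps[(a, b)] = len(ev_a & ev_b) / len(union) if union else 0.0
    return overlaps

# Usage, with event numbers read beforehand from each streamed .mdf
# by some helper (hypothetical):
# overlaps = stream_overlaps({"turbo": ev_turbo, "full": ev_full, "turcal": ev_turcal})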
Can we speed up the functor build, or pick up any already-built functor caches?
If I remove the retina decoding options:

# Use velo retina decoding:
make_VeloClusterTrackingSIMD.global_bind(
    algorithm=VeloRetinaClusterTrackingSIMD)
make_velo_full_clusters.global_bind(
    make_full_cluster=VPRetinaFullClusterDecoder)

from the options files that feed test_hlt2_all_lines_with_reco_with_streams_mdf.qmt, then the FunctorFactory build slows by a factor of 10.
These options are present everywhere across the repo where we run over data from last year, but never when running over old MC. I guess this is a case of needing new MC with retina clusters in order to use them in the HLT2 reconstruction. I don't know whether it is expected that this change alone would slow the FunctorFactory build so much.
The downloading should not happen every time; if it does, then something has changed in the production system. Ideally the file is there from the previous job.
I can try to look into this. The log is currently not verbose enough to really check whether the downloading happens every time - I just infer that it does from the delay between the job start time and the "Welcome to Moore" message.
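One way to both make the log explicit and avoid the re-download would be a guard like this around whatever fetches the inputs; the `fetch_input_file` name and the download call are placeholders, not the actual production-system code:

import os
import urllib.request

def fetch_input_file(url, local_path):
    """Download an input file unless a previous job already left it on disk,
    and say explicitly in the log which of the two happened."""
    if os.path.exists(local_path) and os.path.getsize(local_path) > 0:
        print(f"INFO Reusing cached input file {local_path}")
        return local_path
    print(f"INFO Downloading {url} -> {local_path}")
    urllib.request.urlretrieve(url, local_path)
    return local_path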
At @sstahl's suggestion, I tried adding options.auditors = ['NameAuditor'] to the options file, but this didn't reveal any particular initialize() methods taking a very long time. However, by watching the console I saw that it was here:
HLTControlFlowMgr INFO o TBB thread pool size: 'ThreadPoolSize':20
that the job hangs for a very long time.
I therefore added some more timing statements to HLTControlFlowMgr to see where it spends its time, and then ran this (locally, so quite a bit faster than the nightlies, but still slow) for the WG-streamed HLT2 job, which I've identified as the slowest to initialise. I get this log:
HLTControlFlowMgr INFO Up to this point took 47516 ms
HLTControlFlowMgr INFO Concurrency level information:
HLTControlFlowMgr INFO o Number of events slots: 24
HLTControlFlowMgr INFO o TBB thread pool size: 'ThreadPoolSize':20
HLTControlFlowMgr INFO Building lines took 5144 ms
HLTControlFlowMgr INFO Configuring scheduling took 624975 ms
HLTControlFlowMgr INFO buildNodeStates took 1 ms
HLTControlFlowMgr INFO registerStuff took 19 ms
HLTControlFlowMgr INFO ---> End of Initialization. This took 677658 ms
ApplicationMgr INFO Application Manager Initialized successfully
So it is overwhelmingly the configureScheduling call that takes the time.
If I instead do this for the production-streamed HLT2 job, I get this log:
HLTControlFlowMgr INFO Up to this point took 47137 ms
HLTControlFlowMgr INFO Concurrency level information:
HLTControlFlowMgr INFO o Number of events slots: 24
HLTControlFlowMgr INFO o TBB thread pool size: 'ThreadPoolSize':20
HLTControlFlowMgr INFO Building lines took 4679 ms
HLTControlFlowMgr INFO Configuring scheduling took 134879 ms
HLTControlFlowMgr INFO buildNodeStates took 1 ms
HLTControlFlowMgr INFO registerStuff took 18 ms
HLTControlFlowMgr INFO ---> End of Initialization. This took 186716 ms
ApplicationMgr INFO Application Manager Initialized successfully
So it is the configureScheduling call that grows with the number of streams.
There are 3 streams in the production-streams job and 15 in the WG-streams job. 134879 ms * 15/3 = 674395 ms, close to the 624975 ms observed, so the configureScheduling run-time scales almost exactly with the number of streams.
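The same two measurements give a rough per-stream cost, which is what makes me describe it as linear in the number of streams:

# Configuring-scheduling times quoted above (ms) and stream counts.
production = {"streams": 3, "scheduling_ms": 134879}
wg = {"streams": 15, "scheduling_ms": 624975}

# Per-stream cost comes out at ~42-45 s in both cases, i.e. roughly
# linear in the number of streams.
for job in (production, wg):
    print(job["scheduling_ms"] / job["streams"] / 1000.0, "s per stream")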
If I look under Moore/Hlt/Hlt2Conf/python/Hlt2Conf/sprucing_settings/Sprucing_2023_1_production.py, there are only 8-9 TurboPass streams. So, whilst this would not address the root cause of the slowness, the initialisation time of the WG-streamed job (which takes 24 minutes in the nightlies) could be reduced by around 40% by aligning the streams with what actually comes out of the passthrough sprucing.
The streams it'd remove: dilepton, topo_b, pid, monitoring, c_to_dimuon and trackeff. Some of these go to TurCal, so this change would only be warranted if no-one is that interested in how the TurCal is streamed. dilepton, topo_b and c_to_dimuon all go to FULL. So you could potentially have a job with a full, a TurCal and 8 WG streams, which more accurately represents how Turbo is streamed.
Not a game-changing proposal though and probably a bit of a tangent to this issue.
As of the end of last week, the tests have been moved to a new machine, which had the accidental benefit of uncovering that they were being run at the same time as, and therefore sharing all the cores of the machine with, another set of tests. The new machine plus the fix of that bug has achieved a 40-50% speed-up in all the BW tests.
This is a nice milestone. We haven't yet profited from Rosen's speed-ups discussed above, and I still hope we can pick up a functor cache here and there, so there are still speed-ups to be gained. We're getting closer to being able to close this issue though, I think.