@lugrazet @rjhunter @shunan @msaur this fixes the problem reported on Mattermost. I haven't compared the change in the report in detail, but at a high level there seems to be little difference (the test on lhcb-master is not finished yet, but one can compare lhcb-head to the MR test).
We still need a few follow-ups:
The following step is failing with KeyError: 'LongProtos'
# 2. Compute line descriptives: persist reco, extra output
echo 'Obtaining line descriptives'
time gaudirun.py $PRCONFIGROOT/python/MooreTests/line-descriptives.py
We need to detect when one of the steps fails and make that clear in the report (and in the feedback comment on GitLab); a minimal sketch of one way to do this is after this list.
2.5-3 hours seems like a rather long runtime to me. I imagine (I didn't check) that we can easily saturate the runners, so we should try to improve this.
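On the failure-detection point, a minimal sketch of what I have in mind, assuming the steps are shell commands driven from a Python wrapper; the step list and the summary handling here are hypothetical, only the line-descriptives command is taken from the step above:

import subprocess

# Hypothetical list of (name, command) pairs for the BW-test steps;
# only the line-descriptives command is the real one quoted above.
STEPS = [
    ("line-descriptives",
     "gaudirun.py $PRCONFIGROOT/python/MooreTests/line-descriptives.py"),
]

def run_steps(steps):
    """Run each step, record its exit code, and return a status summary
    that the report / GitLab feedback code can surface prominently."""
    summary = {}
    for name, cmd in steps:
        result = subprocess.run(cmd, shell=True)
        summary[name] = result.returncode
        if result.returncode != 0:
            print(f"ERROR: step '{name}' failed with exit code {result.returncode}")
    return summary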
From my perspective Point 2 is definitely the priority, but might take some work.
As Sascha says, point 1 seems to have a tractable cause, so a quick fix for that should be possible from one of us.
I'll set aside some time to work on this. First priority is to get the test to work again and fail cleanly. I'll find the bit of code which publishes to gitlab and ensure that that gives a clear message in case of failure. I'll then think about low-hanging fruit to speed the test up. @lugrazet and I have already come across a few obvious things but haven't quantified how much speedup they'll achieve.
I have a fix for point 1 (it's just missing options.persistreco_version = 0.0) and I have ideas for how to improve point 2. Point 3 will take more thought/digging. MR(s) hopefully today.
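For reference, a minimal sketch of where that setting would go, assuming the usual `from Moore import options` pattern; exactly which options file feeds line-descriptives.py is my assumption here:

from Moore import options

# Setting whose absence leads to the KeyError: 'LongProtos' reported above,
# per the fix described in this comment.
options.persistreco_version = 0.0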
I've spent some time looking at logs recently, for example for this BW test, which took around 3 hours to complete. For the Moore_hlt2_bandwidth test there, the test has basically 4 bits of work to complete:
Run all HLT2 lines writing to 1 streamless MDF file,
Run all HLT2 lines with the production (turbo, full, turcal) streaming configuration,
Run all HLT2 lines streaming by WG,
Run a series of python/GaudiPython analysis steps to work out rates, BWs and ultimately compose the html pages.
The log above gives the following timing information on the first job:
Starts at 08:07:12,
Says "Welcome to Moore" at 08:11:25. That means it spent roughly 4 minutes downloading the MDF files it is to run over.
Next message is INFO ---> End of Initialization. This took 245082 ms, i.e. roughly 4 minutes of initialisation.
Next it compiles the functor cache: INFO Compilation of functor library took 664 seconds (roughly 11 minutes).
Starts timing the event loop one-tenth of the way in, at 08:28:14, and finishes the timing nine-tenths of the way through, at 08:40:54, at a rate of 105 Hz. That is 12m40s for 80% of the events, so roughly 16 minutes for the full loop.
The log reports that the job took 35m23s; adding up the previous times, finalisation therefore takes a negligible amount of time.
This equates to:
4m downloading,
4m initializing,
11m building functors,
16m running Moore at 100Hz,
~0 finalization time.
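For what it's worth, here is a rough sketch of how these numbers can be pulled out of a job log; the two log markers are the ones quoted above, everything else about the parser (file handling, structure) is purely illustrative:

import re

# Markers quoted from the Moore job log above.
PATTERNS = {
    "initialization_ms": re.compile(r"End of Initialization\. This took (\d+) ms"),
    "functor_build_s": re.compile(r"Compilation of functor library took (\d+) seconds"),
}

def extract_timings(log_path):
    """Return the timing numbers found in a Moore job log, in seconds."""
    timings = {}
    with open(log_path) as f:
        text = f.read()
    m = PATTERNS["initialization_ms"].search(text)
    if m:
        timings["initialization"] = int(m.group(1)) / 1000.0
    m = PATTERNS["functor_build_s"].search(text)
    if m:
        timings["functor_build"] = float(m.group(1))
    return timings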
If I repeat the same exercise on jobs 2 and 3, the only difference is that the initialisation time increases with the number of streams (@nskidmor tells me there'll be a control flow per stream, so this is expected). Here are the respective times:
HLTControlFlowMgr INFO ---> End of Initialization. This took 245082 ms
HLTControlFlowMgr INFO ---> End of Initialization. This took 417307 ms
HLTControlFlowMgr INFO ---> End of Initialization. This took 1456444 ms
This is reflected in the reported times of jobs 1, 2 and 3, which are 35m23s, 36m58s and 54m20s respectively. Indeed the functor compilation, download and event loop all take around the same amount of time for the three jobs; it is just the initialisation that makes their runtimes differ. In the streamed-by-WG case, initialisation alone takes 24 minutes!
Finally, the analysis scripts sum up to 40 minutes. I'm working on bringing that down - hopefully I can achieve a factor 2-4 just thinking about what is necessary.
What's the sprucing equivalent?
A variety of things mean that the sprucing test is already around a factor of two faster. It doesn't have to do a production-streamed job (job 2), and the two Moore jobs left over take 5 and 17 mins respectively. Initialisation (30 seconds), the functor build (51 seconds) and the event loop (~165Hz) are all a lot faster in sprucing than in Moore, and here the equivalent of job 1 runs over just 10k events rather than 100k.
Why is the functor build so slow?
First of all, one would hope that we could pick up an already-built functor cache, at least for the production-streamed job: it should already be being built in the Moore CI tests upstream. I'm not an expert on the functor cache, so I haven't worked out yet how to do that.
Furthermore, I dug around the CI test logs from a recent lhcb-head and found:
test_hlt2_bandwidth_5streams takes a similar time (11 mins) to build its functors,
test_hlt2_all_lines_with_reco_with_streams_mdf also needs a fresh functor build, but in its case that build takes only 30-40 seconds.
Locally, I can reproduce the order of magnitude difference, although both jobs are a good deal faster on my machine. For me, the functor builds take:
I'm currently working on understanding the difference between these two tests.
What can be done?
I've already shed a lot of the analysis jobs.
Job 1 can be removed. It has two uses: 1) creating a massive .csv similarity matrix between all lines, which no-one is ever going to try to parse (already removed); 2) working out the similarity between the WG or production streams, which can instead be done by comparing the sets of event numbers in the streamed .mdfs - you don't need a streamless MDF for this.
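A sketch of what I mean by the second point, i.e. computing the overlap between streams purely from the per-stream event numbers; the helper that reads the event numbers out of each streamed .mdf is assumed here, not existing code:

from itertools import combinations

def stream_overlaps(event_numbers_per_stream):
    """Given {stream_name: set of (run, event) numbers}, return the pairwise
    overlap fractions, replacing the streamless-MDF similarity matrix."""
    overlaps = {}
    for (a, ev_a), (b, ev_b) in combinations(event_numbers_per_stream.items(), 2):
        union = ev_a | ev_b
        overlaps[(a, b)] = len(ev_a & ev_b) / len(union) if union else 0.0
    return overlaps

# Usage, with event numbers read beforehand from each streamed .mdf
# by some helper (hypothetical):
# overlaps = stream_overlaps({"turbo": ev_turbo, "full": ev_full, "turcal": ev_turcal})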
Can we speed up the functor build, or pick up any already-built functor caches?
If I remove the retina decoding options:

# Use velo retina decoding:
make_VeloClusterTrackingSIMD.global_bind(
    algorithm=VeloRetinaClusterTrackingSIMD)
make_velo_full_clusters.global_bind(
    make_full_cluster=VPRetinaFullClusterDecoder)

from the options files that feed test_hlt2_all_lines_with_reco_with_streams_mdf.qmt, then the FunctorFactory build slows by a factor of 10.
These options are present everywhere across the repo where we run over data from last year, but never when running over old MC. I guess this is a case of needing new MC with retina clusters in order to use them in the HLT2 reconstruction. I don't know whether it is expected that this change alone would slow the FunctorFactory build so much.
The downloading should not happen every time; if it does, then something has changed in the production system. Ideally the file is there from the previous job.
I can try to look into this. The log is currently not verbose enough to really check whether the downloading happens every time - I just infer that it does from the delay between the job start time and the "Welcome to Moore" message.
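One way to both make the log explicit and avoid the re-download would be a guard like this around whatever fetches the inputs; the `fetch_input_file` name and the download call are placeholders, not the actual production-system code:

import os
import urllib.request

def fetch_input_file(url, local_path):
    """Download an input file unless a previous job already left it on disk,
    and say explicitly in the log which of the two happened."""
    if os.path.exists(local_path) and os.path.getsize(local_path) > 0:
        print(f"INFO Reusing cached input file {local_path}")
        return local_path
    print(f"INFO Downloading {url} -> {local_path}")
    urllib.request.urlretrieve(url, local_path)
    return local_path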
At @sstahl's suggestion, I tried adding options.auditors = ['NameAuditor'] to the options file, but this didn't reveal any particular initialize() methods taking a very long time. However, by watching the console I saw that it was here:
HLTControlFlowMgr INFO o TBB thread pool size: 'ThreadPoolSize':20
that the job hangs for a very long time.
I therefore added some more timing statements to HLTControlFlowMgr to see where it spends its time, and then ran this (locally, so quite a bit faster than the nightlies, but still slow) for the WG-streamed HLT2 job, which I've identified as the slowest to initialise. I get this log:
HLTControlFlowMgr INFO Up to this point took 47516 ms
HLTControlFlowMgr INFO Concurrency level information:
HLTControlFlowMgr INFO o Number of events slots: 24
HLTControlFlowMgr INFO o TBB thread pool size: 'ThreadPoolSize':20
HLTControlFlowMgr INFO Building lines took 5144 ms
HLTControlFlowMgr INFO Configuring scheduling took 624975 ms
HLTControlFlowMgr INFO buildNodeStates took 1 ms
HLTControlFlowMgr INFO registerStuff took 19 ms
HLTControlFlowMgr INFO ---> End of Initialization. This took 677658 ms
ApplicationMgr INFO Application Manager Initialized successfully
So it is overwhelmingly the configureScheduling call that takes the time.
If I instead do this for the production-streamed HLT2 job, I get this log:
HLTControlFlowMgr INFO Up to this point took 47137 ms
HLTControlFlowMgr INFO Concurrency level information:
HLTControlFlowMgr INFO o Number of events slots: 24
HLTControlFlowMgr INFO o TBB thread pool size: 'ThreadPoolSize':20
HLTControlFlowMgr INFO Building lines took 4679 ms
HLTControlFlowMgr INFO Configuring scheduling took 134879 ms
HLTControlFlowMgr INFO buildNodeStates took 1 ms
HLTControlFlowMgr INFO registerStuff took 18 ms
HLTControlFlowMgr INFO ---> End of Initialization. This took 186716 ms
ApplicationMgr INFO Application Manager Initialized successfully
So it is the configureScheduling call that grows with the number of streams.
There are 3 streams in the production-streams job and 15 in the WG-streams job. 134879 ms * 15/3 = 674395 ms, close to the 624975 ms observed, so the configureScheduling run-time scales almost exactly with the number of streams.
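The same two measurements give a rough per-stream cost, which is what makes me describe it as linear in the number of streams:

# Configuring-scheduling times quoted above (ms) and stream counts.
production = {"streams": 3, "scheduling_ms": 134879}
wg = {"streams": 15, "scheduling_ms": 624975}

# Per-stream cost comes out at ~42-45 s in both cases, i.e. roughly
# linear in the number of streams.
for job in (production, wg):
    print(job["scheduling_ms"] / job["streams"] / 1000.0, "s per stream")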
If I look under Moore/Hlt/Hlt2Conf/python/Hlt2Conf/sprucing_settings/Sprucing_2023_1_production.py, there are only 8-9 TurboPass streams. So, whilst this would not address the root cause of the slowness, the initialisation time of the WG-streamed job (which takes 24 minutes in the nightlies) could be reduced by around 40% by aligning the streams with what actually comes out of the passthrough sprucing.
The streams it'd remove: dilepton, topo_b, pid, monitoring, c_to_dimuon and trackeff. Some of these go to TurCal, so this change would only be warranted if no-one is that interested in how the TurCal is streamed. dilepton, topo_b and c_to_dimuon all go to FULL. So you could potentially have a job with a full, a TurCal and 8 WG streams, which more accurately represents how Turbo is streamed.
Not a game-changing proposal though and probably a bit of a tangent to this issue.
As of the end of last week, the tests have been moved to a new machine, which had the accidental benefit of uncovering that they were being run at the same time as, and therefore sharing all the cores of the machine with, another set of tests. The new machine plus the fix of that bug has achieved a 40-50% speed-up in all the BW tests.
This is a nice milestone. We haven't yet profited from Rosen's speed-ups discussed above, and I still hope we can pick up a functor cache here and there, so there are still speed-ups to be gained. We're getting closer to being able to close this issue though, I think.