B2OC: speed-up in B2OC D->4body and 3body builders (!2163) · Merge requests · LHCb / Moore

Ivan Polyakov requested to merge b2oc_ipolyako_d2kpipipi_speedup into b2oc_upgrade-abertoli Mar 13, 2023

A modification of the 3-body and 4-body builders in B2OC that reduces overall CPU usage at Hlt2 by >4.3% (selection part by >10.2%); see details below. Inspired by a discussion with @mstahl.

I have no personal interest in pushing these changes, but maybe others will find them useful.

According to the hlt2_pp test log, the most CPU-expensive algorithms in the selection are from B2OC. The builders D02KmPimPipPipCombiner_xxx, D02KpPipPimPimCombiner_xxx, D02PipPipPimPimCombiner_xxx, D02KpKmPipPimCombiner_xxx, Ds2KKPiCombiner_xxx, Xic02PKKPiCombiner_xxx and Omegac02PKKPiCombiner_xxx together account for 218.4s/3358s = 6.5% of all CPU usage.

Below is a comparison of the proposed modifications with the default version (master), tested locally on lxplus with 1000 events:

- log for default (master): hlt2_b2oc_speedup_1k.log
- log for the "partial" speed-up, where cuts on F.M are added to the 12 and 123 combination cuts, but the make_threebody/fourbody functions are not split: hlt2_b2oc_speedup-v0_1k.log
- log for the "full" speed-up, where, in addition to the cuts on F.M added to the 12 and 123 combination cuts, the make_threebody/fourbody functions are split into two stages (combination and filtering) to reduce the number of times the combinatorics is done: hlt2_b2oc_speedup-master_1k.log
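
The two ideas above can be illustrated with a toy sketch in plain Python. It does not use the Moore or ThOr API: `inv_mass`, `toy_particles`, `combine_naive`, `combine_early_cut`, `filter_candidates`, the mass windows and the random "tracks" are all hypothetical names and values chosen for illustration. It shows (a) that a mass cut on the 12 sub-combination removes work without changing the selected candidates, since m(123) >= m(12) + m(3), and (b) that combining once with the loosest cuts and then filtering per line avoids repeating the combinatorics.

```python
import itertools
import math
import random


def inv_mass(*particles):
    """Invariant mass of a set of (E, px, py, pz) four-vectors."""
    e, px, py, pz = (sum(p[i] for p in particles) for i in range(4))
    return math.sqrt(max(e * e - px * px - py * py - pz * pz, 0.0))


def toy_particles(n, mass=0.14, seed=1):
    """Random toy tracks with a pion-like mass (GeV); not real data."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        px, py, pz = (rng.uniform(-2.0, 2.0) for _ in range(3))
        out.append((math.sqrt(mass**2 + px**2 + py**2 + pz**2), px, py, pz))
    return out


def combine_naive(particles, m_max):
    """Build every 3-body combination, then cut on the full mass."""
    selected, n_built = [], 0
    for trio in itertools.combinations(particles, 3):
        n_built += 1
        if inv_mass(*trio) < m_max:
            selected.append(trio)
    return selected, n_built


def combine_early_cut(particles, m_max):
    """Reject pairs that already exceed m_max before adding a third track."""
    selected, n_built = [], 0
    n = len(particles)
    for i, j in itertools.combinations(range(n), 2):
        # m(123) >= m(12) + m(3), so a too-heavy pair can never pass.
        if inv_mass(particles[i], particles[j]) >= m_max:
            continue
        for k in range(j + 1, n):
            n_built += 1
            trio = (particles[i], particles[j], particles[k])
            if inv_mass(*trio) < m_max:
                selected.append(trio)
    return selected, n_built


def filter_candidates(candidates, m_min, m_max):
    """Second stage: apply a line-specific mass window to shared candidates."""
    return [c for c in candidates if m_min < inv_mass(*c) < m_max]


particles = toy_particles(30)
naive, n_naive = combine_naive(particles, m_max=2.1)
early, n_early = combine_early_cut(particles, m_max=2.1)

# Combine once with the loosest cuts, then filter per (hypothetical) line.
windows = {"tight": (1.8, 1.95), "loose": (1.7, 2.1)}
per_line = {
    name: filter_candidates(early, lo, hi) for name, (lo, hi) in windows.items()
}

print(f"3-body combinations built: naive={n_naive}, early cut={n_early}")
print({name: len(cands) for name, cands in per_line.items()})
```

In this toy the early cut only changes how many combinations are built, not which ones are selected; how much is saved in the real builders depends on the actual cuts and track multiplicities.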

| combiner | default (master) | "partial" speed-up | "full" speed-up |
|---|---|---|---|
| `D02KmPimPipPipCombiner_xxx` | 2.93+1.56+1.56+1.56+1.55+1.39+0.91 = 11.46s | 1.14+0.67+0.65+0.65+0.61+0.59+0.37 = 4.68s | 1.14+0.65+0.59+0.59+0.41 = 3.38s |
| `D02KpPipPimPimCombiner_xxx` | 2.79+1.52+1.51+1.49+1.48+1.34+0.90 = 11.03s | 1.13+0.68+0.66+0.65+0.60+0.59+0.38 = 4.69s | 1.12+0.65+0.59+0.59+0.37 = 3.32s |
| `Xic02PKKPiCombiner_xxx` | 2.37+0.23 = 2.60s | 0.71+0.08 = 0.79s | 0.71+0.08 = 0.79s |
| `Omegac02PKKPiCombiner_xxx` | 2.34+0.24 = 2.58s | 1.00+0.09 = 1.09s | 0.98+0.10 = 1.08s |
| `Ds2KKPiCombiner_xxx` | 0.83+0.78+0.77 = 2.38s | 0.70+0.65+0.65 = 2.00s | 1.14+0.65 = 1.79s |
| sum | 30.05s | 13.25s | 10.36s |

Thus, the "full" speed-up reduces the CPU usage of the corresponding algorithms by about 66%. Scaling this to the overall CPU usage in Hlt2 gives the 4.3% and 10.2% figures quoted at the top. As more lines may be affected, the actual reduction might be even larger.
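
The arithmetic can be re-checked with a short script. The timings are copied from the comparison table above; the variable names are only illustrative, and applying the same fractional reduction to the full 218.4s B2OC share is an assumption of this sketch (the table lists only five of the seven builders).

```python
# Per-instance timings in seconds, copied from the comparison table.
timings = {
    "D02KmPimPipPipCombiner_xxx": {
        "default": [2.93, 1.56, 1.56, 1.56, 1.55, 1.39, 0.91],
        "partial": [1.14, 0.67, 0.65, 0.65, 0.61, 0.59, 0.37],
        "full": [1.14, 0.65, 0.59, 0.59, 0.41],
    },
    "D02KpPipPimPimCombiner_xxx": {
        "default": [2.79, 1.52, 1.51, 1.49, 1.48, 1.34, 0.90],
        "partial": [1.13, 0.68, 0.66, 0.65, 0.60, 0.59, 0.38],
        "full": [1.12, 0.65, 0.59, 0.59, 0.37],
    },
    "Xic02PKKPiCombiner_xxx": {
        "default": [2.37, 0.23],
        "partial": [0.71, 0.08],
        "full": [0.71, 0.08],
    },
    "Omegac02PKKPiCombiner_xxx": {
        "default": [2.34, 0.24],
        "partial": [1.00, 0.09],
        "full": [0.98, 0.10],
    },
    "Ds2KKPiCombiner_xxx": {
        "default": [0.83, 0.78, 0.77],
        "partial": [0.70, 0.65, 0.65],
        "full": [1.14, 0.65],
    },
}

# Total time per variant, summed over all listed combiners.
totals = {
    variant: sum(sum(t[variant]) for t in timings.values())
    for variant in ("default", "partial", "full")
}

# Fractional reduction of the "full" speed-up relative to master.
reduction = 1.0 - totals["full"] / totals["default"]

# B2OC builders' share of all Hlt2 CPU time, as quoted in the description.
b2oc_share = 218.4 / 3358.0

# Assumption: the same fractional reduction applies to the whole B2OC share.
overall_reduction = reduction * b2oc_share

for variant, total in totals.items():
    print(f"{variant:8s} {total:6.2f} s")
print(f"reduction of listed combiners: {100 * reduction:.1f}%")
print(f"overall Hlt2 reduction:        {100 * overall_reduction:.1f}%")
```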

As a small bonus, suggested by @mstahl (with input from @gligorov), a naive conversion into energy saved can be estimated as follows:

Taking a typical data-taking year as 10 h/day * 165 days/year = 1650 hours,
and taking the typical power consumption either as
- 4000 E5-2630 nodes * 320 W/node = 1.3 MW (according to @gligorov);
- or as (3358 s / 20 threads) / 20k events * 1 MHz * 85 W/node = 0.7 MW; see spec

gives 1160-2150 MWh/year. Taking the energy price for France as 206 EUR/MWh, this results in 240-440k EUR/year.

Thus, reducing consumption by 4.3% saves 10-20k EUR a year, to be multiplied by ~3 for the whole of Run 3.
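
The estimate above can be reproduced as a back-of-the-envelope calculation. All inputs are the numbers quoted in the text; the variable names are illustrative. Using unrounded power values gives slightly different intermediate numbers (roughly 1180-2110 MWh/year) than the rounded 0.7 MW and 1.3 MW used above, but the same 10-20k EUR/year conclusion.

```python
# Inputs quoted in the description.
hours_per_year = 10 * 165            # 10 h/day * 165 days/year
price_eur_per_mwh = 206.0            # energy price taken for France
cpu_reduction = 0.043                # overall Hlt2 CPU reduction

# Estimate A: 4000 nodes at 320 W each.
power_a_w = 4000 * 320.0

# Estimate B: nodes needed to sustain 1 MHz, at 85 W per node.
seconds_per_event = (3358.0 / 20) / 20_000   # wall time per event on one node
nodes_needed = seconds_per_event * 1e6       # 1 MHz input rate
power_b_w = nodes_needed * 85.0

results = {}
for label, power_w in (("A", power_a_w), ("B", power_b_w)):
    energy_mwh = power_w / 1e6 * hours_per_year      # MW * h = MWh
    cost_eur = energy_mwh * price_eur_per_mwh
    results[label] = {
        "power_mw": power_w / 1e6,
        "energy_mwh": energy_mwh,
        "cost_eur": cost_eur,
        "saving_eur": cost_eur * cpu_reduction,
    }

for label, r in results.items():
    print(
        f"estimate {label}: {r['power_mw']:.2f} MW, "
        f"{r['energy_mwh']:.0f} MWh/year, "
        f"{r['cost_eur'] / 1e3:.0f}k EUR/year, "
        f"saving {r['saving_eur'] / 1e3:.1f}k EUR/year"
    )
```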

Edited Mar 13, 2023 by Ivan Polyakov
