B2OC: speed-up in B2OC D->4body and 3body builders

A modification of the 3body and 4body builders in B2OC that reduces overall CPU usage at Hlt2 by >4.3% (and that of the selection part by >10.2%); see details below. Inspired by a discussion with @mstahl.

I have no personal interest in pushing these changes, but maybe others will find them useful.

According to the log of the hlt2_pp tests, the most CPU-expensive algorithms in the selection part are from B2OC. The builders D02KmPimPipPipCombiner_xxx, D02KpPipPimPimCombiner_xxx, D02PipPipPimPimCombiner_xxx, D02KpKmPipPimCombiner_xxx, Ds2KKPiCombiner_xxx, Xic02PKKPiCombiner_xxx and Omegac02PKKPiCombiner_xxx together take 218.4s/3358s = 6.5% of the total CPU usage.

Below is a comparison of the proposed modifications with the default version (master), tested locally on lxplus with 1000 events:

  • log for default (master): hlt2_b2oc_speedup_1k.log
  • log for the "partial" speedup, where cuts on F.M are added to the 12&123 combiner cuts, but the make_threebody/fourbody functions are not split: hlt2_b2oc_speedup-v0_1k.log
  • log for the "full" speedup, where, in addition to the cuts on F.M in the 12&123 combiner cuts, the make_threebody/fourbody functions are split into two stages (combination and filtering) to reduce the number of times the combinatorics is run: hlt2_b2oc_speedup-master_1k.log
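The two ideas above can be illustrated with a toy model in plain Python. This is only a sketch and does not use the Moore/ThOr API: the invariant mass is replaced by a sum of positive numbers (so a partial sum above the upper limit can never be rescued by adding more particles), and all function names and values are hypothetical.

```python
from itertools import combinations


def four_body_naive(values, max_total):
    """Build every 4-combination, then cut on the total only."""
    n_built = 0
    selected = []
    for combo in combinations(values, 4):
        n_built += 1
        if sum(combo) < max_total:
            selected.append(combo)
    return selected, n_built


def four_body_early_cut(values, max_total):
    """Apply the same upper limit already to the 12 and 123 partial sums."""
    n_built = 0
    selected = []
    n = len(values)
    for i in range(n):
        for j in range(i + 1, n):
            s12 = values[i] + values[j]
            if s12 >= max_total:  # toy analogue of the "12" cut
                continue
            for k in range(j + 1, n):
                s123 = s12 + values[k]
                if s123 >= max_total:  # toy analogue of the "123" cut
                    continue
                for m in range(k + 1, n):
                    n_built += 1
                    if s123 + values[m] < max_total:
                        selected.append(
                            (values[i], values[j], values[k], values[m])
                        )
    return selected, n_built


def filter_stage(candidates, max_total):
    """Second stage: tighten the cut without redoing the combinatorics."""
    return [c for c in candidates if sum(c) < max_total]


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # hypothetical positive "masses"
line_cuts = [12, 15, 20]                  # hypothetical per-line limits

# Combination stage runs once with the loosest cut ...
candidates, n_built_once = four_body_early_cut(values, max(line_cuts))
# ... and each line only filters the shared candidates.
per_line = {cut: filter_stage(candidates, cut) for cut in line_cuts}

for cut in line_cuts:
    print(cut, len(per_line[cut]))
```

In this toy the early cuts are lossless because the sum only grows as particles are added; for real invariant masses the partial-combination cuts have to be chosen so that no signal candidates are lost.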
| combiner | default (master) | "partial" speed-up | "full" speed-up |
|---|---|---|---|
| D02KmPimPipPipCombiner_xxx | 2.93+1.56+1.56+1.56+1.55+1.39+0.91 = 11.46s | 1.14+0.67+0.65+0.65+0.61+0.59+0.37 = 4.68s | 1.14+0.65+0.59+0.59+0.41 = 3.38s |
| D02KpPipPimPimCombiner_xxx | 2.79+1.52+1.51+1.49+1.48+1.34+0.90 = 11.03s | 1.13+0.68+0.66+0.65+0.60+0.59+0.38 = 4.69s | 1.12+0.65+0.59+0.59+0.37 = 3.32s |
| Xic02PKKPiCombiner_xxx | 2.37+0.23 = 2.60s | 0.71+0.08 = 0.79s | 0.71+0.08 = 0.79s |
| Omegac02PKKPiCombiner_xxx | 2.34+0.24 = 2.58s | 1.00+0.09 = 1.09s | 0.98+0.10 = 1.08s |
| Ds2KKPiCombiner_xxx | 0.83+0.78+0.77 = 2.38s | 0.70+0.65+0.65 = 2.00s | 1.14+0.65 = 1.79s |
| sum | 30.05s | 13.25s | 10.36s |

Thus, the "full" speedup reduces the CPU usage of the corresponding algorithms by 66% (from 30.05s to 10.36s). Scaling this to the overall CPU usage in Hlt2 (66% of the 6.5% share quoted above) gives the 4.3% quoted at the top, and the same scaling applied to the selection part gives the 10.2%. As more lines are possibly affected, the actual reduction might be even larger.
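As a cross-check, the sums and percentages can be recomputed from the per-instance timings in the table. The sketch below assumes that the reduction measured on the five builders in the table carries over to the full 218.4s B2OC share; the variable names are illustrative only.

```python
# Per-instance timings in seconds, copied from the table above.
timings = {
    "D02KmPimPipPipCombiner_xxx": {
        "default": [2.93, 1.56, 1.56, 1.56, 1.55, 1.39, 0.91],
        "partial": [1.14, 0.67, 0.65, 0.65, 0.61, 0.59, 0.37],
        "full": [1.14, 0.65, 0.59, 0.59, 0.41],
    },
    "D02KpPipPimPimCombiner_xxx": {
        "default": [2.79, 1.52, 1.51, 1.49, 1.48, 1.34, 0.90],
        "partial": [1.13, 0.68, 0.66, 0.65, 0.60, 0.59, 0.38],
        "full": [1.12, 0.65, 0.59, 0.59, 0.37],
    },
    "Xic02PKKPiCombiner_xxx": {
        "default": [2.37, 0.23],
        "partial": [0.71, 0.08],
        "full": [0.71, 0.08],
    },
    "Omegac02PKKPiCombiner_xxx": {
        "default": [2.34, 0.24],
        "partial": [1.00, 0.09],
        "full": [0.98, 0.10],
    },
    "Ds2KKPiCombiner_xxx": {
        "default": [0.83, 0.78, 0.77],
        "partial": [0.70, 0.65, 0.65],
        "full": [1.14, 0.65],
    },
}

versions = ("default", "partial", "full")
totals = {v: sum(sum(row[v]) for row in timings.values()) for v in versions}

# Relative reduction of the listed builders with the "full" speedup.
reduction_full = 1 - totals["full"] / totals["default"]

# Share of the B2OC builders in the total Hlt2 CPU time (from the test log).
b2oc_share = 218.4 / 3358

# Assumption: the same relative reduction applies to the whole B2OC share.
overall_saving = reduction_full * b2oc_share

for v in versions:
    print(f"{v:8s} {totals[v]:6.2f} s")
print(f"reduction of listed builders: {reduction_full:.1%}")
print(f"reduction of overall Hlt2 CPU: {overall_saving:.1%}")
```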

As a small bonus, suggested by @mstahl (with input from @gligorov), a naive conversion into energy saved can be estimated by:

  • taking a typical data-taking year as 10h/day * 165days/year = 1650 hours;
  • taking the typical power consumption either as
    • 4000 E5-2630 nodes * 320W/node = 1.3 MW (according to @gligorov);
    • or as (3358s/20 threads) / 20k events * 1MHz * 85W/node = 0.7MW; see spec.

This gives 1160-2150 MWh/year. Taking the energy price for France as 206 eur/MWh, this corresponds to 240-440k eur/year.

Thus reducing consumption by 4.3% gives 10-20k eur of savings a year, to be multiplied by ~3 for the whole Run3.
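The estimate above can be reproduced with a few lines of Python. This is a sketch of the same naive arithmetic, with all inputs taken from the text; because it uses the unrounded power values (1.28 MW and about 0.71 MW instead of 1.3 MW and 0.7 MW), the results differ slightly from the rounded ranges quoted above.

```python
# Inputs quoted in the text.
hours_per_year = 10 * 165            # 10 h/day * 165 days/year
price_eur_per_mwh = 206              # quoted energy price for France
cpu_saving = 0.043                   # overall Hlt2 CPU reduction

# Estimate 1: number of nodes times power per node.
power_high_mw = 4000 * 320 / 1e6

# Estimate 2: CPU time per event at a 1 MHz input rate, 85 W per unit.
seconds_per_event = (3358 / 20) / 20_000
power_low_mw = seconds_per_event * 1e6 * 85 / 1e6

for label, power_mw in (("low", power_low_mw), ("high", power_high_mw)):
    energy_mwh = power_mw * hours_per_year
    cost_eur = energy_mwh * price_eur_per_mwh
    saving_eur = cost_eur * cpu_saving
    print(
        f"{label:4s} power={power_mw:.2f} MW  energy={energy_mwh:.0f} MWh/year  "
        f"cost={cost_eur / 1e3:.0f}k eur/year  saving={saving_eur / 1e3:.1f}k eur/year"
    )
```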

Edited by Ivan Polyakov