Evaluate possible speedups in LogicalBorderSurface
I am creating this issue to describe a few performance tests I am doing to evaluate possible speedups from alternative implementations of LogicalBorderSurface. This is related to my draft MR !90 (merged).
This work is a follow-up to Vincenzo Innocente's presentations at the HEP-SCORE workshop in September 2022 (https://indico.cern.ch/event/1170924/contributions/4951098/). He had found that in his tests 40% of the time was spent in G4LogicalBorderSurface::GetSurface, and he suggested that the G4 10.6 implementation (using vectors) is inefficient; it was already improved in G4 10.7 (using maps). He also suggested that even the 10.7 implementation can be improved further. Many thanks again to Vincenzo!
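To illustrate the algorithmic difference only (this is a hypothetical sketch, not the actual Geant4 code; all names are invented): a 10.6-style lookup scans a flat list of (volume1, volume2, surface) entries on every call, while a 10.7-style lookup keys a map on the volume pair:

```python
# Hypothetical sketch of the two lookup strategies; names are illustrative,
# not the actual Geant4 classes or signatures.

# 10.6-style: flat table, linear scan -> O(n) per lookup
table_v106 = [("world", "detector", "surfA"),
              ("detector", "sensor", "surfB")]

def get_surface_v106(vol1, vol2):
    for v1, v2, surf in table_v106:  # scans every entry until a match
        if v1 == vol1 and v2 == vol2:
            return surf
    return None

# 10.7-style: map keyed on the ordered volume pair -> O(log n) with a
# std::map-like structure, O(1) average with a hash map (a dict here)
table_v107 = {("world", "detector"): "surfA",
              ("detector", "sensor"): "surfB"}

def get_surface_v107(vol1, vol2):
    return table_v107.get((vol1, vol2))
```

Since GetSurface is queried very frequently during stepping, replacing a linear scan over all border surfaces with a keyed lookup is exactly the kind of change that can account for a large fraction of the 40% hotspot seen in the profile.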
My draft MR !90 (merged) essentially uses G4 10.6 as a baseline for everything in G4, but replaces a few files and functions with the G4 10.7 implementation. This includes some G4 public headers that are used in Gauss, so a Gauss rebuild is needed.
Here I discuss some simple performance tests comparing the official LHCb/Geant4 tag v10r6p2t6 (the same as the current master) with my branch valassi_v10r6patches that is used in my MR.
The test I am using is the one from the latest benchmark container lhcb-sim-run3 for the HEP-SCORE project: https://gitlab.cern.ch/valassi/hep-workloads/-/blob/4c26f8b4765cde729ad7baba35ac83a515f5d656/lhcb/sim-run3/lhcb-sim-run3/lhcb-sim-run3-bmk.sh NB this is a more recent version than the one used in Vincenzo's tests in September (lhcb-gen-sim-2021), so it is possible that some speedups have already materialised elsewhere.
For simplicity, my tests are not based on the current v0.1 HEP-SCORE candidate, which uses the official Gauss release v56r2, but on the slightly modified version that is being prototyped for ARM; see Gauss#87 (closed). This is itself based on Marco's recipes in https://codimd.web.cern.ch/Urg1_B6RQPeXRNQUpiVtGQ?view#
This is what I do on my centos7 pmpe04 node to build the "old" code:
```shell
cd /data/avalassi/LHCb2023
git clone https://gitlab.cern.ch/lhcb/upgrade-hackathon-setup.git workspace
echo platform=x86_64_v2-centos7-gcc11-opt >> workspace/configuration.mk
echo LCG_VERSION=102b >> workspace/configuration.mk
echo WITH_GITCONDDB = 1 >> workspace/configuration.mk
echo PROJECTS += GitCondDB >> workspace/configuration.mk
cd workspace
. /cvmfs/lhcb.cern.ch/lib/LbEnv
lb-set-platform x86_64_v2-centos7-gcc11-opt
export LCG_VERSION=102b
git clone -b add-aarch64 https://gitlab.cern.ch/lhcb-core/lcg-toolchains.git
git clone https://gitlab.cern.ch/lhcb-core/mirrors/Catch2.git -b v2.13.10
cmake -S Catch2 -B Catch2/build.$BINARY_TAG -DCMAKE_TOOLCHAIN_FILE=${PWD}/lcg-toolchains/LCG_${LCG_VERSION}/$BINARY_TAG.cmake -DCATCH_ENABLE_WERROR=NO -DCATCH_BUILD_STATIC_LIBRARY=YES -DCMAKE_INSTALL_PREFIX=${PWD}/Catch2/InstallArea/$BINARY_TAG -DBUILD_TESTING=NO -GNinja
cmake --build Catch2/build.$BINARY_TAG --target install
git clone https://gitlab.cern.ch/lhcb-core/mirrors/yaml-cpp.git -b yaml-cpp-0.7.0
cmake -S yaml-cpp -B yaml-cpp/build.$BINARY_TAG -DCMAKE_TOOLCHAIN_FILE=${PWD}/lcg-toolchains/LCG_${LCG_VERSION}/$BINARY_TAG.cmake -DCMAKE_INSTALL_PREFIX=${PWD}/yaml-cpp/InstallArea/$BINARY_TAG -DYAML_BUILD_SHARED_LIBS=ON -DBUILD_TESTING=NO -GNinja
cmake --build yaml-cpp/build.$BINARY_TAG --target install
git clone https://gitlab.cern.ch/lhcb-core/mirrors/DD4hep.git -b v01-23
(
  . /cvmfs/sft.cern.ch/lcg/views/LCG_${LCG_VERSION}/x86_64-centos7-gcc11-opt/setup.sh
  cmake -S DD4hep -B DD4hep/build.$BINARY_TAG -DCMAKE_INSTALL_PREFIX=${PWD}/DD4hep/InstallArea/$BINARY_TAG -DBUILD_TESTING=NO -GNinja -DCMAKE_CXX_STANDARD=17 -DDD4HEP_USE_XERCESC=ON -DDD4HEP_USE_GEANT4=OFF -DDD4HEP_USE_TBB=ON -DDD4HEP_BUILD_PACKAGES="DDRec DDDetectors DDCond DDAlign"
  cmake --build DD4hep/build.$BINARY_TAG --target install
)
git clone ssh://git@gitlab.cern.ch:7999/gaudi/Gaudi.git
git clone ssh://git@gitlab.cern.ch:7999/lhcb/GitCondDB.git
git clone ssh://git@gitlab.cern.ch:7999/lhcb/Detector.git
git clone ssh://git@gitlab.cern.ch:7999/lhcb/LHCb.git
git clone ssh://git@gitlab.cern.ch:7999/lhcb/Run2Support.git
git clone ssh://git@gitlab.cern.ch:7999/lhcb/Geant4.git
git clone ssh://git@gitlab.cern.ch:7999/lhcb/Gauss.git
cd Gaudi; git checkout master; cd -
cd GitCondDB; git checkout master; cd -
cd Detector; git checkout v1r2; cd -
cd LHCb; git checkout valassi_sim10_aarch64; cd -
cd Run2Support; git checkout master; cd -
cd Geant4; git checkout v10r6p2t6; cd -
cd Gauss; git checkout valassi_sim10_aarch64_lhcb7; cd -
make Gaudi
make GitCondDB
make Detector
make LHCb
make Run2Support
make Geant4
make Gauss
```
Then I run the same test as in HEP-SCORE:

```shell
cd /data/avalassi/LHCb2023/workspace/Gauss
wget https://gitlab.cern.ch/valassi/hep-workloads/-/raw/qa-build-lhcb-sim-run3/lhcb/sim-run3/lhcb-sim-run3/prodConf_Gauss_0bmk2023_00000726_1.py
cat prodConf_Gauss_0bmk2023_00000726_1.py | sed -e "s/NOfEvents=5/NOfEvents=25/" > prodConf025evt.py
PRODCONFROOT=/cvmfs/lhcb.cern.ch/lib/lhcb/DBASE/ProdConf/v3r5 PYTHONPATH=/cvmfs/lhcb.cern.ch/lib/lhcb/DBASE/ProdConf/v3r5/python ./run gaudirun.py -T '$APPCONFIGOPTS/Gauss/Beam6800GeV-mu100-2022-nu3.2.py' '$DECFILESROOT/options/10000000.py' '$LBPYTHIA8ROOT/options/Pythia8.py' '$APPCONFIGOPTS/Gauss/Run3-detector.py' '$APPCONFIGOPTS/Gauss/DataType-Upgrade.py' '$APPCONFIGOPTS/Gauss/G4PL_FTFP_BERT_EmOpt2.py' '$APPCONFIGOPTS/Persistency/Compression-ZLIB-1.py' prodConf025evt.py 2>&1 | tee out025evt_old.log
```
For the "alternative" code, I simply change the Geant4 branch:

```shell
cp -dpr workspace workspaceALT
cd workspaceALT
cd Geant4; git checkout valassi_v10r6patches; cd -
make Geant4
make Gauss
cd Gauss
PRODCONFROOT=/cvmfs/lhcb.cern.ch/lib/lhcb/DBASE/ProdConf/v3r5 PYTHONPATH=/cvmfs/lhcb.cern.ch/lib/lhcb/DBASE/ProdConf/v3r5/python ./run gaudirun.py -T '$APPCONFIGOPTS/Gauss/Beam6800GeV-mu100-2022-nu3.2.py' '$DECFILESROOT/options/10000000.py' '$LBPYTHIA8ROOT/options/Pythia8.py' '$APPCONFIGOPTS/Gauss/Run3-detector.py' '$APPCONFIGOPTS/Gauss/DataType-Upgrade.py' '$APPCONFIGOPTS/Gauss/G4PL_FTFP_BERT_EmOpt2.py' '$APPCONFIGOPTS/Persistency/Compression-ZLIB-1.py' prodConf025evt.py 2>&1 | tee out025evt_alt.log
```
First of all, the physics results are identical. This script reports no physics differences:

```shell
./diffLogs.sh out025evt_old.log out025evt_alt.log
```
Here diffLogs.sh simply masks out all dates and timing information:

```bash
#!/bin/bash
if [ "$2" == "" ] || [ "$3" != "" ]; then
  echo "Usage: $0 <oldlog> <newlog>"
  exit 1
fi
###cat $1 | sed "s/^2023.*UTC/2023-mm-dd hh:mm:ss.fff UTC/" > ${1}.tmp
###cat $2 | sed "s/^2023.*UTC/2023-mm-dd hh:mm:ss.fff UTC/" > ${2}.tmp
###tkdiff ${1}.tmp ${2}.tmp &
cat $1 | sed "s/^2023.*UTC/2023-mm-dd hh:mm:ss.fff UTC/" | grep -v TimingAuditor > ${1}.tmp2
cat $2 | sed "s/^2023.*UTC/2023-mm-dd hh:mm:ss.fff UTC/" | grep -v TimingAuditor > ${2}.tmp2
diff ${1}.tmp2 ${2}.tmp2
```
To analyse the performance I use the HEP-SCORE parser:

```shell
wget https://gitlab.cern.ch/valassi/hep-workloads/-/raw/qa-build-lhcb-sim-run3/lhcb/sim-run3/lhcb-sim-run3/parseResults.py
python -c "from parseResults import *; parseLogTxt('out025evt_old.log')" | \grep '\[SKIP FIRST\] Total time' | tail -1
python -c "from parseResults import *; parseLogTxt('out025evt_alt.log')" | \grep '\[SKIP FIRST\] Total time' | tail -1
```
At face value this gives:

```
[SKIP FIRST] Total time[sec] 1128.4039999999998
[SKIP FIRST] Total time[sec] 1070.5489999999998
```
So the difference is about 58 seconds out of 1128, i.e. an average speedup of 5.1% over 25 events.
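For the record, the 5.1% figure is just the relative difference of the two totals quoted above:

```python
# Relative speedup computed from the two 25-event totals quoted above
old_total = 1128.404  # seconds, v10r6p2t6
alt_total = 1070.549  # seconds, valassi_v10r6patches
speedup_pct = (old_total - alt_total) / old_total * 100
print(f"{speedup_pct:.1f}%")  # 5.1%
```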
In more detail, with a script diffEvts.sh:

```bash
#!/bin/bash
if [ "$2" == "" ] || [ "$3" != "" ]; then
  echo "Usage: $0 <oldlog> <newlog>"
  exit 1
fi
python -c "from parseResults import *; parseLogTxt('$1')" \
  | awk /^iCPU=00.*SIM/'{gsub(",","",$0);printf("%5d %8.3f\n",$4,$3)}' > ${1}.tmp
python -c "from parseResults import *; parseLogTxt('$2')" \
  | awk /^iCPU=00.*SIM/'{gsub(",","",$0);printf("%5d %8.3f\n",$4,$3)}' > ${2}.tmp
paste ${1}.tmp ${2}.tmp | awk '{printf("%5d %8.3f %8.3f %8.3f %5.1f%%\n",$1,$2,$4,$2-$4,($2-$4)/$2*100)}' | tail -n+2
```

(Note the `%%` in the last printf: a literal percent sign must be escaped in awk's printf format.)
This gives:

```shell
./diffEvts.sh out025evt_old.log out025evt_alt.log
 1002   33.352   33.642   -0.290  -0.9%
 1003   67.889   68.117   -0.228  -0.3%
 1004   36.862   35.994    0.868   2.4%
 1005   41.743   36.787    4.956  11.9%
 1006   56.214   47.959    8.255  14.7%
 1007   56.668   43.755   12.913  22.8%
 1008   24.893   19.689    5.204  20.9%
 1009  118.020  105.878   12.142  10.3%
 1010   55.584   54.131    1.453   2.6%
 1011   52.777   50.142    2.635   5.0%
 1012   56.410   56.538   -0.128  -0.2%
 1013   53.080   53.067    0.013   0.0%
 1014   56.097   55.655    0.442   0.8%
 1015   30.779   30.627    0.152   0.5%
 1016   40.514   39.164    1.350   3.3%
 1017   42.963   42.530    0.433   1.0%
 1018   56.348   54.502    1.846   3.3%
 1019   32.976   32.439    0.537   1.6%
 1020   18.341   17.441    0.900   4.9%
 1021   32.318   32.348   -0.030  -0.1%
 1022   15.729   15.657    0.072   0.5%
 1023   28.431   26.143    2.288   8.0%
 1024   72.346   70.982    1.364   1.9%
 1025   48.070   47.362    0.708   1.5%
```
In other words:
- on average, this first test shows a 5% speedup on 25 events
- but the speedup of the new implementation varies a lot from one event to another
- on some events the new implementation is even slightly slower (by up to 1%)
- but in general it is faster, and on some events it is up to 23% faster
Of course, this was only the very first test. I am now rerunning exactly the same 25-event test to see how reproducible this result is.
Next steps:
- I must repeat these tests further to see if the numbers are reproducible!
- Choosing the right number of events, and the right events, for HEP-SCORE can be tricky.
- I should also run a profiler (as Vincenzo did).
- If the speedup is confirmed and there is interest in making a release, one could first try Vincenzo's further suggestions instead of the G4 10.7 map.