Evaluate possible speedups in LogicalBorderSurface
I am creating this issue to describe a few performance tests I am doing to evaluate possible speedups from alternative implementations of LogicalBorderSurface. This is related to my draft MR !90 (merged).
This work is a follow-up to Vincenzo Innocente's presentations at the HEP-SCORE workshop in September 2022 (https://indico.cern.ch/event/1170924/contributions/4951098/). He had found that in his tests 40% of the time was spent in G4LogicalBorderSurface::GetSurface, and he suggested that the G4 10.6 implementation (using vectors) is inefficient; it was already improved in G4 10.7 (using maps). He also suggested that even the 10.7 implementation can be improved further. Many thanks again to Vincenzo!
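To illustrate the algorithmic difference only (this is a hypothetical sketch, not the actual Geant4 code; all names are invented): a 10.6-style lookup scans a flat list of (volume1, volume2, surface) entries on every call, while a 10.7-style lookup keys a map on the volume pair:

```python
# Hypothetical sketch of the two lookup strategies; names are illustrative,
# not the actual Geant4 classes or signatures.

# 10.6-style: flat table, linear scan -> O(n) per lookup
table_v106 = [("world", "detector", "surfA"),
              ("detector", "sensor", "surfB")]

def get_surface_v106(vol1, vol2):
    for v1, v2, surf in table_v106:  # scans every entry until a match
        if v1 == vol1 and v2 == vol2:
            return surf
    return None

# 10.7-style: map keyed on the ordered volume pair -> O(log n) with a
# std::map-like structure, O(1) average with a hash map (a dict here)
table_v107 = {("world", "detector"): "surfA",
              ("detector", "sensor"): "surfB"}

def get_surface_v107(vol1, vol2):
    return table_v107.get((vol1, vol2))
```

Since GetSurface is queried very frequently during stepping, replacing a linear scan over all border surfaces with a keyed lookup is exactly the kind of change that can account for a large fraction of the 40% hotspot seen in the profile.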
My draft MR !90 (merged) essentially uses G4 10.6 as a baseline for everything in G4, but replaces a few files and functions with the G4 10.7 implementation. This includes some G4 public headers that are used in Gauss, so a Gauss rebuild is needed.
Here I discuss some simple performance tests comparing the official LHCb/Geant4 tag v10r6p2t6 (the same as the current master) with my branch valassi_v10r6patches that is used in my MR.
The test I am using is the one from the latest benchmark container lhcb-sim-run3 for the HEP-SCORE project: https://gitlab.cern.ch/valassi/hep-workloads/-/blob/4c26f8b4765cde729ad7baba35ac83a515f5d656/lhcb/sim-run3/lhcb-sim-run3/lhcb-sim-run3-bmk.sh NB this is a more recent version than the one used in Vincenzo's tests in September (lhcb-gen-sim-2021), so it is possible that some speedups have already materialised elsewhere.
For simplicity, my tests are not based on the current v0.1 HEP-SCORE candidate, which uses the official Gauss release v56r2, but on the slightly modified version that is being prototyped for ARM; see Gauss#87 (closed). This is itself based on Marco's recipes in https://codimd.web.cern.ch/Urg1_B6RQPeXRNQUpiVtGQ?view#
This is what I do on my centos7 pmpe04 node to build the "old" code:
```shell
cd /data/avalassi/LHCb2023
git clone https://gitlab.cern.ch/lhcb/upgrade-hackathon-setup.git workspace
echo platform=x86_64_v2-centos7-gcc11-opt >> workspace/configuration.mk
echo LCG_VERSION=102b >> workspace/configuration.mk
echo WITH_GITCONDDB = 1 >> workspace/configuration.mk
echo PROJECTS += GitCondDB >> workspace/configuration.mk
cd workspace
. /cvmfs/lhcb.cern.ch/lib/LbEnv
lb-set-platform x86_64_v2-centos7-gcc11-opt
export LCG_VERSION=102b
git clone -b add-aarch64 https://gitlab.cern.ch/lhcb-core/lcg-toolchains.git
git clone https://gitlab.cern.ch/lhcb-core/mirrors/Catch2.git -b v2.13.10
cmake -S Catch2 -B Catch2/build.$BINARY_TAG -DCMAKE_TOOLCHAIN_FILE=${PWD}/lcg-toolchains/LCG_${LCG_VERSION}/$BINARY_TAG.cmake -DCATCH_ENABLE_WERROR=NO -DCATCH_BUILD_STATIC_LIBRARY=YES -DCMAKE_INSTALL_PREFIX=${PWD}/Catch2/InstallArea/$BINARY_TAG -DBUILD_TESTING=NO -GNinja
cmake --build Catch2/build.$BINARY_TAG --target install
git clone https://gitlab.cern.ch/lhcb-core/mirrors/yaml-cpp.git -b yaml-cpp-0.7.0
cmake -S yaml-cpp -B yaml-cpp/build.$BINARY_TAG -DCMAKE_TOOLCHAIN_FILE=${PWD}/lcg-toolchains/LCG_${LCG_VERSION}/$BINARY_TAG.cmake -DCMAKE_INSTALL_PREFIX=${PWD}/yaml-cpp/InstallArea/$BINARY_TAG -DYAML_BUILD_SHARED_LIBS=ON -DBUILD_TESTING=NO -GNinja
cmake --build yaml-cpp/build.$BINARY_TAG --target install
git clone https://gitlab.cern.ch/lhcb-core/mirrors/DD4hep.git -b v01-23
(
  . /cvmfs/sft.cern.ch/lcg/views/LCG_${LCG_VERSION}/x86_64-centos7-gcc11-opt/setup.sh
  cmake -S DD4hep -B DD4hep/build.$BINARY_TAG -DCMAKE_INSTALL_PREFIX=${PWD}/DD4hep/InstallArea/$BINARY_TAG -DBUILD_TESTING=NO -GNinja -DCMAKE_CXX_STANDARD=17 -DDD4HEP_USE_XERCESC=ON -DDD4HEP_USE_GEANT4=OFF -DDD4HEP_USE_TBB=ON -DDD4HEP_BUILD_PACKAGES="DDRec DDDetectors DDCond DDAlign"
  cmake --build DD4hep/build.$BINARY_TAG --target install
)
git clone ssh://git@gitlab.cern.ch:7999/gaudi/Gaudi.git
git clone ssh://git@gitlab.cern.ch:7999/lhcb/GitCondDB.git
git clone ssh://git@gitlab.cern.ch:7999/lhcb/Detector.git
git clone ssh://git@gitlab.cern.ch:7999/lhcb/LHCb.git
git clone ssh://git@gitlab.cern.ch:7999/lhcb/Run2Support.git
git clone ssh://git@gitlab.cern.ch:7999/lhcb/Geant4.git
git clone ssh://git@gitlab.cern.ch:7999/lhcb/Gauss.git
cd Gaudi; git checkout master; cd -
cd GitCondDB; git checkout master; cd -
cd Detector; git checkout v1r2; cd -
cd LHCb; git checkout valassi_sim10_aarch64; cd -
cd Run2Support; git checkout master; cd -
cd Geant4; git checkout v10r6p2t6; cd -
cd Gauss; git checkout valassi_sim10_aarch64_lhcb7; cd -
make Gaudi
make GitCondDB
make Detector
make LHCb
make Run2Support
make Geant4
make Gauss
```
Then I run the same test as in HEP-SCORE:

```shell
cd /data/avalassi/LHCb2023/workspace/Gauss
wget https://gitlab.cern.ch/valassi/hep-workloads/-/raw/qa-build-lhcb-sim-run3/lhcb/sim-run3/lhcb-sim-run3/prodConf_Gauss_0bmk2023_00000726_1.py
cat prodConf_Gauss_0bmk2023_00000726_1.py | sed -e "s/NOfEvents=5/NOfEvents=25/" > prodConf025evt.py
PRODCONFROOT=/cvmfs/lhcb.cern.ch/lib/lhcb/DBASE/ProdConf/v3r5 PYTHONPATH=/cvmfs/lhcb.cern.ch/lib/lhcb/DBASE/ProdConf/v3r5/python ./run gaudirun.py -T '$APPCONFIGOPTS/Gauss/Beam6800GeV-mu100-2022-nu3.2.py' '$DECFILESROOT/options/10000000.py' '$LBPYTHIA8ROOT/options/Pythia8.py' '$APPCONFIGOPTS/Gauss/Run3-detector.py' '$APPCONFIGOPTS/Gauss/DataType-Upgrade.py' '$APPCONFIGOPTS/Gauss/G4PL_FTFP_BERT_EmOpt2.py' '$APPCONFIGOPTS/Persistency/Compression-ZLIB-1.py' prodConf025evt.py 2>&1 | tee out025evt_old.log
```
For the "alternative" code, I simply change the Geant4 branch:

```shell
cp -dpr workspace workspaceALT
cd workspaceALT
cd Geant4; git checkout valassi_v10r6patches; cd -
make Geant4
make Gauss
cd Gauss
PRODCONFROOT=/cvmfs/lhcb.cern.ch/lib/lhcb/DBASE/ProdConf/v3r5 PYTHONPATH=/cvmfs/lhcb.cern.ch/lib/lhcb/DBASE/ProdConf/v3r5/python ./run gaudirun.py -T '$APPCONFIGOPTS/Gauss/Beam6800GeV-mu100-2022-nu3.2.py' '$DECFILESROOT/options/10000000.py' '$LBPYTHIA8ROOT/options/Pythia8.py' '$APPCONFIGOPTS/Gauss/Run3-detector.py' '$APPCONFIGOPTS/Gauss/DataType-Upgrade.py' '$APPCONFIGOPTS/Gauss/G4PL_FTFP_BERT_EmOpt2.py' '$APPCONFIGOPTS/Persistency/Compression-ZLIB-1.py' prodConf025evt.py 2>&1 | tee out025evt_alt.log
```
First of all, the physics results are identical. This script reports no physics differences:

```shell
./diffLogs.sh out025evt_old.log out025evt_alt.log
```
Here diffLogs.sh simply masks out all dates and timing information:

```bash
#!/bin/bash
if [ "$2" == "" ] || [ "$3" != "" ]; then
  echo "Usage: $0 <oldlog> <newlog>"
  exit 1
fi
###cat $1 | sed "s/^2023.*UTC/2023-mm-dd hh:mm:ss.fff UTC/" > ${1}.tmp
###cat $2 | sed "s/^2023.*UTC/2023-mm-dd hh:mm:ss.fff UTC/" > ${2}.tmp
###tkdiff ${1}.tmp ${2}.tmp &
cat $1 | sed "s/^2023.*UTC/2023-mm-dd hh:mm:ss.fff UTC/" | grep -v TimingAuditor > ${1}.tmp2
cat $2 | sed "s/^2023.*UTC/2023-mm-dd hh:mm:ss.fff UTC/" | grep -v TimingAuditor > ${2}.tmp2
diff ${1}.tmp2 ${2}.tmp2
```
To analyse the performance I use the HEP-SCORE parser:

```shell
wget https://gitlab.cern.ch/valassi/hep-workloads/-/raw/qa-build-lhcb-sim-run3/lhcb/sim-run3/lhcb-sim-run3/parseResults.py
python -c "from parseResults import *; parseLogTxt('out025evt_old.log')" | \grep '\[SKIP FIRST\] Total time' | tail -1
python -c "from parseResults import *; parseLogTxt('out025evt_alt.log')" | \grep '\[SKIP FIRST\] Total time' | tail -1
```
At face value this gives:

```
[SKIP FIRST] Total time[sec] 1128.4039999999998
[SKIP FIRST] Total time[sec] 1070.5489999999998
```
So the difference is about 58 seconds out of 1128, i.e. an average speedup of 5.1% over 25 events.
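For the record, the 5.1% figure is just the relative difference of the two totals quoted above:

```python
# Relative speedup computed from the two 25-event totals quoted above
old_total = 1128.404  # seconds, v10r6p2t6
alt_total = 1070.549  # seconds, valassi_v10r6patches
speedup_pct = (old_total - alt_total) / old_total * 100
print(f"{speedup_pct:.1f}%")  # 5.1%
```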
In more detail, with a script diffEvts.sh:

```bash
#!/bin/bash
if [ "$2" == "" ] || [ "$3" != "" ]; then
  echo "Usage: $0 <oldlog> <newlog>"
  exit 1
fi
python -c "from parseResults import *; parseLogTxt('$1')" \
  | awk /^iCPU=00.*SIM/'{gsub(",","",$0);printf("%5d %8.3f\n",$4,$3)}' > ${1}.tmp
python -c "from parseResults import *; parseLogTxt('$2')" \
  | awk /^iCPU=00.*SIM/'{gsub(",","",$0);printf("%5d %8.3f\n",$4,$3)}' > ${2}.tmp
paste ${1}.tmp ${2}.tmp | awk '{printf("%5d %8.3f %8.3f %8.3f %5.1f%%\n",$1,$2,$4,$2-$4,($2-$4)/$2*100)}' | tail -n+2
```

(Note the `%%` in the last printf: a literal percent sign must be escaped in awk's printf format.)
This gives:

```shell
./diffEvts.sh out025evt_old.log out025evt_alt.log
 1002   33.352   33.642   -0.290  -0.9%
 1003   67.889   68.117   -0.228  -0.3%
 1004   36.862   35.994    0.868   2.4%
 1005   41.743   36.787    4.956  11.9%
 1006   56.214   47.959    8.255  14.7%
 1007   56.668   43.755   12.913  22.8%
 1008   24.893   19.689    5.204  20.9%
 1009  118.020  105.878   12.142  10.3%
 1010   55.584   54.131    1.453   2.6%
 1011   52.777   50.142    2.635   5.0%
 1012   56.410   56.538   -0.128  -0.2%
 1013   53.080   53.067    0.013   0.0%
 1014   56.097   55.655    0.442   0.8%
 1015   30.779   30.627    0.152   0.5%
 1016   40.514   39.164    1.350   3.3%
 1017   42.963   42.530    0.433   1.0%
 1018   56.348   54.502    1.846   3.3%
 1019   32.976   32.439    0.537   1.6%
 1020   18.341   17.441    0.900   4.9%
 1021   32.318   32.348   -0.030  -0.1%
 1022   15.729   15.657    0.072   0.5%
 1023   28.431   26.143    2.288   8.0%
 1024   72.346   70.982    1.364   1.9%
 1025   48.070   47.362    0.708   1.5%
```
In other words:
- on average, this first test shows a 5% speedup on 25 events
- but the speedup of the new implementation varies a lot from one event to another
- on some events the new implementation is even slightly slower (by up to 1%)
- but in general it is faster, and on some events it is up to 23% faster
Of course, this was only the very first test. I am now rerunning exactly the same 25-event test to see how reproducible this result is.
Next steps:
- I must repeat these tests further to see if the numbers are reproducible!
- Choosing the right number of events, and the right events, for HEP-SCORE can be tricky.
- I should also run a profiler (as Vincenzo did).
- If the speedup is confirmed and there is interest in making a release, one could first try Vincenzo's further suggestions instead of the G4 10.7 map.