Fully SIMD vectorised RICH photon reconstruction
Builds on LHCb!933 (merged) and Lbcom!190 (merged)
Firstly, as in Lbcom, I have retired the runtime CPU capabilities detection and dispatching code. Instead I simply rely on the build platform settings to determine the SIMD level. This is necessary to allow the use of SIMD (Vc) types more generally, rather than only in very localised places.
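To illustrate what relying on the build platform settings can look like, here is a minimal sketch that fixes the SIMD float width at compile time from the macros GCC and Clang predefine for the enabled instruction sets. The namespace and constant names are hypothetical and are not taken from the Rich or Lbcom code; Vc does its own equivalent selection internally.

```cpp
#include <cstddef>

// Hypothetical illustration only: the SIMD level is fixed by the compiler
// flags of the build (e.g. -msse4.2 or -mavx2 -mfma), with no runtime
// CPU detection or dispatching.
namespace simd_level_sketch {

#if defined(__AVX512F__)
constexpr std::size_t floatWidth = 16;  // 512-bit registers
constexpr const char* level      = "AVX512";
#elif defined(__AVX2__)
constexpr std::size_t floatWidth = 8;   // 256-bit registers
constexpr const char* level      = "AVX2";
#elif defined(__AVX__)
constexpr std::size_t floatWidth = 8;
constexpr const char* level      = "AVX";
#elif defined(__SSE4_2__)
constexpr std::size_t floatWidth = 4;   // 128-bit registers
constexpr const char* level      = "SSE4.2";
#elif defined(__SSE2__)
constexpr std::size_t floatWidth = 4;
constexpr const char* level      = "SSE2";
#else
constexpr std::size_t floatWidth = 1;   // scalar fallback
constexpr const char* level      = "scalar";
#endif

#if defined(__FMA__)
constexpr bool hasFMA = true;
#else
constexpr bool hasFMA = false;
#endif

// The width is a compile-time constant, so it can size arrays and types.
static_assert( floatWidth >= 1, "at least one lane is always available" );

}  // namespace simd_level_sketch
```

The consequence of this design is that a binary built for a given level only runs on CPUs that support it, which is the trade-off accepted for being able to use SIMD types everywhere.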
Secondly, I have added new development algorithms to start fully vectorising the RICH reconstruction. So far these are:
1. A pixel algorithm that takes the scalar information and summarises it into SIMD TES data objects. Longer term this algorithm might well be retired itself, as once everything is based on SIMD types the first step of making the scalar versions can be removed. For the moment, though, it's useful to have both scalar and SIMD versions.
2. An (almost) fully SIMD vectorised (using Vc and GenVector types) quartic photon reconstruction algorithm.
3. An (almost) fully SIMD vectorised version of the photon pixel probability (the next in line after the quartic algorithm).
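As a rough picture of what the pixel summarising step does, the sketch below packs a list of scalar pixels into fixed-width SIMD-style objects with a per-lane validity mask. It is an assumption-laden illustration: `ScalarPixel`, `SIMDPixel` and `packPixels` are hypothetical names, the width is hard-coded to 4, and `std::array` stands in for the Vc vector and mask types used in the real code.

```cpp
#include <array>
#include <cstddef>
#include <vector>

namespace simd_pack_sketch {

constexpr std::size_t W = 4;  // assumed lane count (4 floats, as for SSE)

// Stand-ins for SIMD float and mask types (Vc plays this role in the real code).
using FloatV = std::array<float, W>;
using MaskV  = std::array<bool, W>;

// Hypothetical scalar pixel: one position per object.
struct ScalarPixel {
  float x, y, z;
};

// Hypothetical SIMD pixel: W positions per object, stored component-wise.
struct SIMDPixel {
  FloatV x{}, y{}, z{};
  MaskV  valid{};  // false for padding lanes in the last object
};

// Pack scalar pixels into ceil(n / W) SIMD objects.
inline std::vector<SIMDPixel> packPixels( const std::vector<ScalarPixel>& in ) {
  std::vector<SIMDPixel> out( ( in.size() + W - 1 ) / W );
  for ( std::size_t i = 0; i < in.size(); ++i ) {
    SIMDPixel&        p    = out[i / W];
    const std::size_t lane = i % W;
    p.x[lane]     = in[i].x;
    p.y[lane]     = in[i].y;
    p.z[lane]     = in[i].z;
    p.valid[lane] = true;
  }
  return out;
}

}  // namespace simd_pack_sketch
```

With 6 scalar pixels and a width of 4 this gives 2 SIMD objects, the second having only its first two lanes marked valid. Downstream SIMD algorithms can then work on whole objects and use the mask to ignore padding.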
I say 'almost' in 2. and 3. because there are a few places where I have had to fall back to scalar loops over the SIMD types to perform some calculations, as I have yet to see an obvious way to do them fully vectorised. Generally this happens when, for instance, I have to follow a pointer to, say, a mirror segment for each 'scalar' photon.
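The kind of scalar fallback described here can be sketched as follows: each lane holds its own pointer, so the values behind the pointers are collected lane by lane, after which the arithmetic can proceed on whole vectors again. `MirrorSegment`, its `radius` member and the function names are hypothetical, and `std::array` again stands in for the Vc types.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

namespace simd_lanes_sketch {

constexpr std::size_t W = 4;  // assumed lane count
using FloatV = std::array<float, W>;

// Hypothetical mirror segment with a single property of interest.
struct MirrorSegment {
  float radius;
};

// One mirror pointer per lane, i.e. per 'scalar' photon.
using MirrorPtrs = std::array<const MirrorSegment*, W>;

// Scalar loop over the lanes: every pointer has to be followed individually.
inline FloatV gatherRadius( const MirrorPtrs& mirrors ) {
  FloatV r{};
  for ( std::size_t lane = 0; lane < W; ++lane ) {
    assert( mirrors[lane] != nullptr );
    r[lane] = mirrors[lane]->radius;
  }
  return r;
}

// Once gathered, the per-lane arithmetic is uniform, which is the part a
// real SIMD type would execute as a single vector operation.
inline FloatV scaleByRadius( const FloatV& a, const FloatV& r ) {
  FloatV out{};
  for ( std::size_t lane = 0; lane < W; ++lane ) { out[lane] = a[lane] * r[lane]; }
  return out;
}

}  // namespace simd_lanes_sketch
```

Only the gather step stays scalar in this pattern, so its cost grows with the vector width while the rest of the calculation does not.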
The results so far are looking quite good. For an SSE4.2 build (the default) I get
RichPhotonRecoLong | 27.060 | 27.468 | 0.311 218.9 25.96 | 1000 | 27.469 |
RichPredPixelSignalLong | 2.750 | 2.686 | 0.036 20.0 2.38 | 1000 | 2.687 |
RichSIMDPhotonRecoLong | 14.220 | 14.168 | 0.194 110.1 12.99 | 1000 | 14.169 |
RichSIMDPredPixelSignalLong | 1.690 | 1.621 | 0.027 11.9 1.41 | 1000 | 1.622 |
where the first two are the scalar versions and the last two the SIMD (SSE4.2).
If I instead build my stack allowing AVX2+FMA I get
RichPhotonRecoLong | 23.100 | 22.873 | 0.263 184.1 21.60 | 1000 | 22.874 |
RichPredPixelSignalLong | 2.310 | 2.301 | 0.032 15.6 2.03 | 1000 | 2.302 |
RichSIMDPhotonRecoLong | 8.160 | 8.508 | 0.140 63.5 7.60 | 1000 | 8.508 |
RichSIMDPredPixelSignalLong | 1.370 | 1.473 | 0.027 10.6 1.22 | 1000 | 1.474 |
So the gain from the factor 2 increase in the SIMD vector size (4 to 8 floats) is seen, most clearly in the photon reconstruction. It's not quite perfect, but I have ideas as to why this is... (Note the scalar version for AVX2+FMA is already faster than the SSE4.2 one, as it is able to gain from the FMA part.)
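To put numbers on this, the ratios for the photon reconstruction can be worked out directly from the tables above. The small sketch below does so using the values in the second timing column; the names are hypothetical and the choice of column is mine.

```cpp
namespace timing_sketch {

// Ratio of two timings; a value above 1 means 'after' is faster.
constexpr double speedup( double before, double after ) { return before / after; }

// Photon reconstruction timings copied from the second timing column above.
constexpr double scalarPhotonSSE  = 27.468;  // RichPhotonRecoLong, SSE4.2 build
constexpr double simdPhotonSSE    = 14.168;  // RichSIMDPhotonRecoLong, SSE4.2 build
constexpr double scalarPhotonAVX2 = 22.873;  // RichPhotonRecoLong, AVX2+FMA build
constexpr double simdPhotonAVX2   = 8.508;   // RichSIMDPhotonRecoLong, AVX2+FMA build

// Scalar versus SIMD within each build.
constexpr double sseGain  = speedup( scalarPhotonSSE, simdPhotonSSE );    // about 1.94
constexpr double avx2Gain = speedup( scalarPhotonAVX2, simdPhotonAVX2 );  // about 2.69

// SIMD algorithm going from 4 to 8 floats per vector.
constexpr double widthGain = speedup( simdPhotonSSE, simdPhotonAVX2 );  // about 1.67

}  // namespace timing_sketch
```

So doubling the vector width gives about 1.67 for the SIMD photon reconstruction rather than the ideal 2.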