Fully SIMD vectorised RICH photon reconstruction (!779) · Merge requests · LHCb / Rec

Christopher Rob Jones requested to merge RichFuture-SIMDPhotonReco into master Nov 01, 2017

Builds on LHCb!933 (merged) and Lbcom!190 (merged)

First, as in Lbcom, I have retired the runtime CPU capability detection and dispatching code. Instead, I simply rely on the build platform settings to determine the SIMD level. This is necessary to allow SIMD (Vc) types to be used more generally, rather than only in very localised places.
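To illustrate what fixing the SIMD level at build time can look like, here is a minimal sketch. It is not the code in this merge request and does not use Vc; `floatLanes` is a hypothetical helper, and the macros are the ones GCC and Clang predefine for the selected instruction set (other compilers may differ).

```cpp
#include <cstddef>

// Hypothetical helper (not from this merge request): number of 32-bit float
// lanes implied by the instruction-set macros that GCC/Clang predefine for
// the chosen build flags (e.g. -msse4.2 or -mavx2 -mfma).
constexpr std::size_t floatLanes() {
#if defined(__AVX512F__)
  return 16;  // 512-bit registers
#elif defined(__AVX__)
  return 8;   // 256-bit registers (also defined for AVX2 builds)
#elif defined(__SSE__) || defined(__SSE2__) || defined(__x86_64__)
  return 4;   // 128-bit registers
#else
  return 1;   // scalar fallback
#endif
}

// The value is a compile-time constant, so it can size data types anywhere
// in the code base without any runtime dispatch.
static_assert(floatLanes() >= 1, "at least one lane is always available");
```

Because the width is a constant expression, it can appear in type definitions (array sizes, template arguments), which is what a runtime dispatch scheme cannot easily offer.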

Second, I have added new development algorithms to start fully vectorising the RICH reconstruction. So far these are:

1. A pixel algorithm that takes the scalar information and summarises it into SIMD TES data objects. Longer term this algorithm might well be retired itself, as once everything is based on SIMD types the first step of making the scalar versions can be removed. For the moment though it's useful to have both scalar and SIMD versions.
2. An (almost) fully SIMD vectorised (using Vc and GenVector types) quartic photon reconstruction algorithm.
3. An (almost) fully SIMD vectorised version of the photon pixel probability (the next in line after the quartic algorithm).
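The idea behind the pixel algorithm, turning per-pixel scalar data into SIMD-sized groups, can be sketched as below. This is only an illustration under stated assumptions: `ScalarPixel`, `SIMDPixelGroup` and `packPixels` are hypothetical names, `std::array` stands in for a Vc vector type, and the real TES objects certainly carry more information.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scalar pixel: just a global position.
struct ScalarPixel { float x, y, z; };

// Hypothetical SIMD pixel group: one array per coordinate (structure of
// arrays), plus a per-lane flag marking padding lanes in the last group.
template <std::size_t N>
struct SIMDPixelGroup {
  std::array<float, N> x{}, y{}, z{};
  std::array<bool, N> valid{};
};

// Pack scalar pixels into groups of N lanes, padding the final group.
template <std::size_t N>
std::vector<SIMDPixelGroup<N>> packPixels(const std::vector<ScalarPixel>& pixels) {
  static_assert(N > 0, "need at least one lane");
  std::vector<SIMDPixelGroup<N>> groups((pixels.size() + N - 1) / N);
  assert(groups.size() * N >= pixels.size());
  for (std::size_t i = 0; i < pixels.size(); ++i) {
    auto& g = groups[i / N];          // group this pixel falls into
    const std::size_t lane = i % N;   // lane within that group
    g.x[lane] = pixels[i].x;
    g.y[lane] = pixels[i].y;
    g.z[lane] = pixels[i].z;
    g.valid[lane] = true;
  }
  return groups;
}
```

With five pixels and `N = 4`, this yields two groups, the second having one valid lane and three padding lanes.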

I say 'almost' in 2. and 3. because there are a few places where I have had to fall back to scalar loops over the SIMD types to perform some calculations, as I have yet to see an obvious way to do them fully vectorised. Generally this happens when, for instance, I have to follow a pointer to, say, a mirror segment for each 'scalar' photon.
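The kind of scalar fallback described here might look like the following sketch. `MirrorSegment`, `FloatV` and `gatherRadius` are hypothetical stand-ins (with `std::array` in place of a Vc type), not the classes used in the actual algorithms.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a mirror segment reached through a pointer.
struct MirrorSegment { float radius; };

// Stand-in for a SIMD float vector with N lanes.
template <std::size_t N>
using FloatV = std::array<float, N>;

// Each lane ('scalar' photon) may point at a different mirror segment, so
// the values are collected with an ordinary scalar loop over the lanes.
template <std::size_t N>
FloatV<N> gatherRadius(const std::array<const MirrorSegment*, N>& mirrors) {
  FloatV<N> r{};
  for (std::size_t lane = 0; lane < N; ++lane) {
    assert(mirrors[lane] != nullptr);
    r[lane] = mirrors[lane]->radius;  // pointer chase, one lane at a time
  }
  return r;
}
```

Once the per-lane values sit in a single SIMD-like object, the arithmetic that follows can again operate on all lanes at once; only the pointer-following step stays scalar.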

The results so far are looking quite good. For an SSE4.2 build (the default) I get

RichPhotonRecoLong              |    27.060 |    27.468 |    0.311     218.9    25.96 |    1000 |    27.469 |
RichPredPixelSignalLong         |     2.750 |     2.686 |    0.036      20.0     2.38 |    1000 |     2.687 |
RichSIMDPhotonRecoLong          |    14.220 |    14.168 |    0.194     110.1    12.99 |    1000 |    14.169 |
RichSIMDPredPixelSignalLong     |     1.690 |     1.621 |    0.027      11.9     1.41 |    1000 |     1.622 |

where the first two are the scalar versions and the last two the SIMD (SSE4.2).

If I instead build my stack allowing AVX2+FMA I get

RichPhotonRecoLong              |    23.100 |    22.873 |    0.263     184.1    21.60 |    1000 |    22.874 |
RichPredPixelSignalLong         |     2.310 |     2.301 |    0.032      15.6     2.03 |    1000 |     2.302 |
RichSIMDPhotonRecoLong          |     8.160 |     8.508 |    0.140      63.5     7.60 |    1000 |     8.508 |
RichSIMDPredPixelSignalLong     |     1.370 |     1.473 |    0.027      10.6     1.22 |    1000 |     1.474 |

So the factor of 2 increase in the SIMD vector size (4 to 8 floats) is visible in the SIMD timings. It's not quite perfect, but I have ideas as to why this is... (Note the scalar version for AVX2+FMA is already faster, as it is able to gain from the FMA part...).
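For reference, the ratios implied by the two tables can be worked out as in the sketch below. It assumes the second numeric column is the mean time per event (that column times the 1000 entries matches the last column); `speedup` and the constant names are illustrative only.

```cpp
#include <cassert>

// Speed-up of an improved timing relative to a reference timing.
inline double speedup(double reference, double improved) {
  assert(improved > 0.0);
  return reference / improved;
}

// Second numeric column of the tables above, in the tables' own units.
constexpr double recoScalarSSE = 27.468, recoSimdSSE = 14.168;
constexpr double pixScalarSSE  = 2.686,  pixSimdSSE  = 1.621;
constexpr double recoScalarAVX = 22.873, recoSimdAVX = 8.508;
constexpr double pixScalarAVX  = 2.301,  pixSimdAVX  = 1.473;
```

With these numbers, `speedup(recoScalarSSE, recoSimdSSE)` is about 1.94 and `speedup(recoScalarAVX, recoSimdAVX)` about 2.69 for the photon reconstruction, while the pixel signal algorithm gains about 1.66 (SSE4.2) and 1.56 (AVX2+FMA). Going from SSE4.2 to AVX2+FMA, the SIMD photon reconstruction speeds up by about 1.67 and the scalar one by about 1.20.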

Edited Nov 14, 2017 by Marco Cattaneo
