Templatise Stride and Vector width in vecIdxOfValue (!72441) · Merge requests · atlas / athena

Lucy Lewitt requested to merge llewitt/athena:ALittleTemplatingInGsfFindIndexOfMin into main Jun 24, 2024

The purpose of this MR is to move towards multiversioning the min index finder functions used in the GSF refit procedure. By default SSE4.2 is used, which only supports 128-bit vectors (4 x 32-bit floats). Most machines support AVX2 256-bit vectors (8 x 32-bit floats), but this isn't required, so it can't be enforced.

For technical reasons, simply increasing the vector width to 8 (x 32 bits) degrades performance when only SSE4.2 is available, so some fancy footwork is needed to support both.

Promoting the vector width to a template parameter therefore allows a multi-versioned wrapper function to call different vector-width implementations of the min finder without relying on the compiler to figure everything out on its own (the wrapper will come in a future MR).
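As a rough illustration of the idea above, here is a minimal sketch of a min index finder with the vector width as a template parameter. It is not the athena code: `idxOfMinSketch` and `idxOfMinDispatch` are hypothetical names, plain `std::array` lanes stand in for real SIMD vector types, and the `useWide` flag stands in for whatever CPU-feature dispatch the future wrapper might use. It assumes the input length is a non-zero multiple of the width.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: index of the minimum, WIDTH lanes per step.
// Assumption: n > 0 and n is a multiple of WIDTH.
template <std::size_t WIDTH>
std::int32_t idxOfMinSketch(const float* data, std::size_t n) {
  assert(n > 0 && n % WIDTH == 0);
  std::array<float, WIDTH> laneMin{};
  std::array<std::int32_t, WIDTH> laneIdx{};
  // Seed each lane with the first WIDTH elements.
  for (std::size_t l = 0; l < WIDTH; ++l) {
    laneMin[l] = data[l];
    laneIdx[l] = static_cast<std::int32_t>(l);
  }
  // Each lane tracks the minimum of its own strided subsequence.
  for (std::size_t i = WIDTH; i < n; i += WIDTH) {
    for (std::size_t l = 0; l < WIDTH; ++l) {
      if (data[i + l] < laneMin[l]) {
        laneMin[l] = data[i + l];
        laneIdx[l] = static_cast<std::int32_t>(i + l);
      }
    }
  }
  // Horizontal reduction; ties resolve to the lowest index.
  float best = laneMin[0];
  std::int32_t bestIdx = laneIdx[0];
  for (std::size_t l = 1; l < WIDTH; ++l) {
    if (laneMin[l] < best || (laneMin[l] == best && laneIdx[l] < bestIdx)) {
      best = laneMin[l];
      bestIdx = laneIdx[l];
    }
  }
  return bestIdx;
}

// Hypothetical wrapper: picks a width at run time. A real multi-versioned
// wrapper would select on CPU features rather than on a bool.
inline std::int32_t idxOfMinDispatch(const float* data, std::size_t n,
                                     bool useWide) {
  return useWide ? idxOfMinSketch<8>(data, n) : idxOfMinSketch<4>(data, n);
}
```

Both instantiations return the same index for the same input; only the number of lanes processed per step differs, which is what lets a wrapper choose between them.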

Original:

6: ------------------------------------------------------------------------------------------------------------------
6: Benchmark                                                                        Time             CPU   Iterations
6: ------------------------------------------------------------------------------------------------------------------
6: benchmarkFindIdxOfMinimum<findIdxOfMinimum::Impl::C>                          6811 ns         6196 ns       112463
6: benchmarkFindIdxOfMinimum<findIdxOfMinimum::Impl::STL>                       19547 ns        18958 ns        36810
6: benchmarkFindIdxOfMinimum<findIdxOfMinimum::Impl::VecAlwaysTrackIdx>          1794 ns         1745 ns       401326
6: benchmarkFindIdxOfMinimum<findIdxOfMinimum::Impl::VecUpdateIdxOnNewMin>        556 ns          498 ns      1500257
6: benchmarkFindIdxOfMinimum<findIdxOfMinimum::Impl::VecMinThenIdx>               505 ns          494 ns      1266073

Updated:

6: ------------------------------------------------------------------------------------------------------------------
6: Benchmark                                                                        Time             CPU   Iterations
6: ------------------------------------------------------------------------------------------------------------------
6: benchmarkFindIdxOfMinimum<findIdxOfMinimum::Impl::C>                          6600 ns         6325 ns       109736
6: benchmarkFindIdxOfMinimum<findIdxOfMinimum::Impl::STL>                       20206 ns        19847 ns        36219
6: benchmarkFindIdxOfMinimum<findIdxOfMinimum::Impl::VecAlwaysTrackIdx>          1836 ns         1797 ns       386947
6: benchmarkFindIdxOfMinimum<findIdxOfMinimum::Impl::VecUpdateIdxOnNewMin>        495 ns          486 ns      1438054
6: benchmarkFindIdxOfMinimum<findIdxOfMinimum::Impl::VecMinThenIdx>               494 ns          482 ns      1362759 <<< The important one

I'm not totally happy with the hard-coded GAUDI_LOOP_UNROLL(4), but some sort of explicit unrolling seems to be needed to avoid slowing things down relative to the hard-coded version this replaces.
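For readers unfamiliar with explicit unroll hints, the sketch below shows the general technique with a stand-in macro. `SKETCH_LOOP_UNROLL` and `minValueSketch` are hypothetical names used only here; the macro is an assumption about how such a hint can be written with compiler pragmas, not the definition of GAUDI_LOOP_UNROLL. The hint only affects code generation, never the result.

```cpp
#include <cstddef>

// Hypothetical stand-in for an unroll-hint macro, built on compiler pragmas.
#define SKETCH_PRAGMA(x) _Pragma(#x)
#if defined(__clang__)
#define SKETCH_LOOP_UNROLL(n) SKETCH_PRAGMA(clang loop unroll_count(n))
#elif defined(__GNUC__)
#define SKETCH_LOOP_UNROLL(n) SKETCH_PRAGMA(GCC unroll n)
#else
#define SKETCH_LOOP_UNROLL(n)  // no hint on other compilers
#endif

// Minimum value of data[0..n), asking the compiler to unroll the loop 4x.
// Assumption: n > 0.
inline float minValueSketch(const float* data, std::size_t n) {
  float best = data[0];
  SKETCH_LOOP_UNROLL(4)
  for (std::size_t i = 1; i < n; ++i) {
    best = data[i] < best ? data[i] : best;
  }
  return best;
}
```

One way to avoid hard-coding the factor would be to pass it as a template parameter alongside the vector width, though whether the pragma accepts a dependent constant varies by compiler and version.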

Edited Jun 28, 2024 by Lucy Lewitt
