Templatise Stride and Vector width in vecIdxOfValue
The purpose of this MR is to move towards multiversioning the min index finder functions used in the GSF refit procedure. By default SSE4.2 is used, which only supports 128-bit vectors (4 x 32-bit floats). Most machines support AVX2 with 256-bit vectors (8 x 32-bit floats), but this isn't required so it can't be enforced.
For technical reasons, simply increasing the vector width to 8 (x 32 bits) degrades performance when only SSE4.2 is available, so some fancy footwork is needed to support both.
Promoting the vector width to a template parameter therefore allows a multiversioned wrapper function (to come in a future MR) to call different vector-width implementations of the min finder, without relying on the compiler to figure everything out on its own.
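To illustrate the idea, here is a minimal sketch of what a width-templated min finder and a dispatching wrapper could look like. It is not the code in this MR: the names idxOfMinSketch and idxOfMinDispatch are hypothetical, plain arrays stand in for the real vector types, and the input length is assumed to be a non-zero multiple of the vector width. The sketch only shows the shape of the dispatch; a real multiversioned wrapper would also need the wide variant compiled for AVX2.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: find the minimum lane-wise, then look up its index.
// VecWidth is the number of 32-bit float lanes handled per iteration.
// Assumes n > 0 and n is a multiple of VecWidth.
template <std::size_t VecWidth>
std::int32_t idxOfMinSketch(const float* data, std::size_t n) {
  // One running minimum per lane, seeded from the first block.
  float lanes[VecWidth];
  for (std::size_t l = 0; l < VecWidth; ++l) {
    lanes[l] = data[l];
  }
  // Lane-wise minimum over the remaining blocks.
  for (std::size_t i = VecWidth; i < n; i += VecWidth) {
    for (std::size_t l = 0; l < VecWidth; ++l) {
      lanes[l] = data[i + l] < lanes[l] ? data[i + l] : lanes[l];
    }
  }
  // Horizontal reduction of the per-lane minima.
  float minVal = lanes[0];
  for (std::size_t l = 1; l < VecWidth; ++l) {
    if (lanes[l] < minVal) minVal = lanes[l];
  }
  // Second pass: first index holding the minimum value.
  for (std::size_t i = 0; i < n; ++i) {
    if (data[i] == minVal) return static_cast<std::int32_t>(i);
  }
  return -1;  // not reached for valid input
}

// Hypothetical wrapper: pick the width at run time from the CPU features.
// __builtin_cpu_supports is a GCC/Clang builtin on x86; elsewhere fall back to 4.
inline std::int32_t idxOfMinDispatch(const float* data, std::size_t n) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  const bool hasAvx2 = __builtin_cpu_supports("avx2");
#else
  const bool hasAvx2 = false;
#endif
  // n must be a multiple of the chosen width (assumption of this sketch).
  return (hasAvx2 && n % 8 == 0) ? idxOfMinSketch<8>(data, n)
                                 : idxOfMinSketch<4>(data, n);
}
```

Both instantiations return the same index for the same input; only the number of lanes processed per iteration differs.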
Original:
----------------------------------------------------------------------------------------------------
Benchmark                                                                    Time       CPU  Iterations
----------------------------------------------------------------------------------------------------
benchmarkFindIdxOfMinimum<findIdxOfMinimum::Impl::C>                      6811 ns   6196 ns      112463
benchmarkFindIdxOfMinimum<findIdxOfMinimum::Impl::STL>                   19547 ns  18958 ns       36810
benchmarkFindIdxOfMinimum<findIdxOfMinimum::Impl::VecAlwaysTrackIdx>      1794 ns   1745 ns      401326
benchmarkFindIdxOfMinimum<findIdxOfMinimum::Impl::VecUpdateIdxOnNewMin>    556 ns    498 ns     1500257
benchmarkFindIdxOfMinimum<findIdxOfMinimum::Impl::VecMinThenIdx>           505 ns    494 ns     1266073
Updated:
----------------------------------------------------------------------------------------------------
Benchmark                                                                    Time       CPU  Iterations
----------------------------------------------------------------------------------------------------
benchmarkFindIdxOfMinimum<findIdxOfMinimum::Impl::C>                      6600 ns   6325 ns      109736
benchmarkFindIdxOfMinimum<findIdxOfMinimum::Impl::STL>                   20206 ns  19847 ns       36219
benchmarkFindIdxOfMinimum<findIdxOfMinimum::Impl::VecAlwaysTrackIdx>      1836 ns   1797 ns      386947
benchmarkFindIdxOfMinimum<findIdxOfMinimum::Impl::VecUpdateIdxOnNewMin>    495 ns    486 ns     1438054
benchmarkFindIdxOfMinimum<findIdxOfMinimum::Impl::VecMinThenIdx>           494 ns    482 ns     1362759  <<< The important one
I'm not totally happy with the hard-coded GAUDI_LOOP_UNROLL(4),
but some sort of explicit unrolling seems to be needed to avoid slowing things down relative to the hard-coded version this replaces.
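For readers unfamiliar with unroll hints, the sketch below shows roughly what an explicit unroll request on a reduction loop looks like. SKETCH_LOOP_UNROLL and minOfSketch are hypothetical names invented for this illustration. I am assuming the hint maps onto the compiler's unroll pragma; this is not a statement about how GAUDI_LOOP_UNROLL itself is defined.

```cpp
#include <cstddef>

// Hypothetical stand-in for an unroll-hint macro (name invented for this sketch).
// The helper stringifies its argument so it can be passed to _Pragma.
#define SKETCH_PRAGMA(x) _Pragma(#x)
#if defined(__clang__)
#define SKETCH_LOOP_UNROLL(n) SKETCH_PRAGMA(clang loop unroll_count(n))
#elif defined(__GNUC__)
#define SKETCH_LOOP_UNROLL(n) SKETCH_PRAGMA(GCC unroll n)
#else
#define SKETCH_LOOP_UNROLL(n)  // no hint on other compilers
#endif

// Minimum of n floats (n > 0), asking the compiler to unroll the loop 4 times.
// The hint changes code generation only, not the result.
inline float minOfSketch(const float* data, std::size_t n) {
  float m = data[0];
  SKETCH_LOOP_UNROLL(4)
  for (std::size_t i = 1; i < n; ++i) {
    m = data[i] < m ? data[i] : m;
  }
  return m;
}
```

The factor is written as a literal here, as in the MR; whether it could instead be derived from the stride or vector-width template parameters depends on the compiler and is left open.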