Master
Description
A version of the Kalman filter where the transposition (AoS -> SoA) is done in SIMD. It supports SSE, AVX and AVX512 ISAs and gracefully fallback on scalar code otherwise.
It chooses the best possible implementation for the target at compile time (with a bunch of #ifdef
s)
Results
Other implementations are also shown in order to compare them, but in practice, only the most advanced version is used.
KNL
100 000 experiments, 64 cores, 256 threads
Standard deviation of the measurement is really high due to the high number of threads.
reference
Total statistics: 1104800000 fitted, 0 smoothed
(Deprecated statistics: 1769600000 filtered, 1769600000 predicted, 1047300000 smoothed)
Fit timers mean: 3.53128e-06 sum: 0.331941 min: 6.25467e-07 max: 1.42715e-05 stddev: 3.32958e-06
Smoother timers mean: 0 sum: 0 min: 0 max: 0 stddev: 0
tbb default_num_threads reports execution with 256 threads
Throughput per processor, estimated total throughput:
Forward and Backward fit: 283183 / s, 7.24949e+07 / s
Smoother: inf / s, inf / s
unspecialized scalar
Total statistics: 1104800000 fitted, 0 smoothed
(Deprecated statistics: 1769600000 filtered, 1769600000 predicted, 1047300000 smoothed)
Fit timers mean: 4.48244e-06 sum: 0.421349 min: 5.66135e-07 max: 1.38183e-05 stddev: 3.90935e-06
Smoother timers mean: 0 sum: 0 min: 0 max: 0 stddev: 0
tbb default_num_threads reports execution with 256 threads
Throughput per processor, estimated total throughput:
Forward and Backward fit: 223093 / s, 5.71118e+07 / s
Smoother: inf / s, inf / s
specialized scalar
Total statistics: 1104800000 fitted, 0 smoothed
(Deprecated statistics: 1769600000 filtered, 1769600000 predicted, 1047300000 smoothed)
Fit timers mean: 3.53128e-06 sum: 0.331941 min: 6.25467e-07 max: 1.42715e-05 stddev: 3.32958e-06
Smoother timers mean: 0 sum: 0 min: 0 max: 0 stddev: 0
tbb default_num_threads reports execution with 256 threads
Throughput per processor, estimated total throughput:
Forward and Backward fit: 283183 / s, 7.24949e+07 / s
Smoother: inf / s, inf / s
sse
Total statistics: 1104800000 fitted, 0 smoothed
(Deprecated statistics: 1769600000 filtered, 1769600000 predicted, 1047300000 smoothed)
Fit timers mean: 2.5622e-06 sum: 0.240847 min: 4.24956e-07 max: 1.10585e-05 stddev: 2.44762e-06
Smoother timers mean: 0 sum: 0 min: 0 max: 0 stddev: 0
tbb default_num_threads reports execution with 256 threads
Throughput per processor, estimated total throughput:
Forward and Backward fit: 390289 / s, 9.9914e+07 / s
Smoother: inf / s, inf / s
avx
Total statistics: 1104800000 fitted, 0 smoothed
(Deprecated statistics: 1769600000 filtered, 1769600000 predicted, 1047300000 smoothed)
Fit timers mean: 2.77701e-06 sum: 0.261039 min: 3.9401e-07 max: 1.17124e-05 stddev: 2.71119e-06
Smoother timers mean: 0 sum: 0 min: 0 max: 0 stddev: 0
tbb default_num_threads reports execution with 256 threads
Throughput per processor, estimated total throughput:
Forward and Backward fit: 360100 / s, 9.21856e+07 / s
Smoother: inf / s, inf / s
avx512
Total statistics: 1104800000 fitted, 0 smoothed
(Deprecated statistics: 1769600000 filtered, 1769600000 predicted, 1047300000 smoothed)
Fit timers mean: 1.96184e-06 sum: 0.184413 min: 4.06507e-07 max: 8.63925e-06 stddev: 1.78076e-06
Smoother timers mean: 0 sum: 0 min: 0 max: 0 stddev: 0
tbb default_num_threads reports execution with 256 threads
Throughput per processor, estimated total throughput:
Forward and Backward fit: 509726 / s, 1.3049e+08 / s
Smoother: inf / s, inf / s
Haswell Xeon
100 000 experiments, 14 cores, 28 threads
reference
Total statistics: 1104800000 fitted, 0 smoothed
(Deprecated statistics: 1769600000 filtered, 1769600000 predicted, 1047300000 smoothed)
Fit timers mean: 1.19894e-06 sum: 0.1127 min: 5.14212e-07 max: 1.66019e-06 stddev: 1.25836e-07
Smoother timers mean: 0 sum: 0 min: 0 max: 0 stddev: 0
tbb default_num_threads reports execution with 28 threads
Throughput per processor, estimated total throughput:
Forward and Backward fit: 834073 / s, 2.3354e+07 / s
Smoother: inf / s, inf / s
unspecialized scalar
Total statistics: 1104800000 fitted, 0 smoothed
(Deprecated statistics: 1769600000 filtered, 1769600000 predicted, 1047300000 smoothed)
Fit timers mean: 1.17691e-06 sum: 0.110629 min: 2.0701e-07 max: 1.75273e-06 stddev: 2.09345e-07
Smoother timers mean: 0 sum: 0 min: 0 max: 0 stddev: 0
tbb default_num_threads reports execution with 28 threads
Throughput per processor, estimated total throughput:
Forward and Backward fit: 849684 / s, 2.37911e+07 / s
Smoother: inf / s, inf / s
specialized scalar
Total statistics: 1104800000 fitted, 0 smoothed
(Deprecated statistics: 1769600000 filtered, 1769600000 predicted, 1047300000 smoothed)
Fit timers mean: 1.18789e-06 sum: 0.111661 min: 2.6805e-07 max: 1.81788e-06 stddev: 1.92684e-07
Smoother timers mean: 0 sum: 0 min: 0 max: 0 stddev: 0
tbb default_num_threads reports execution with 28 threads
Throughput per processor, estimated total throughput:
Forward and Backward fit: 841830 / s, 2.35712e+07 / s
Smoother: inf / s, inf / s
sse
Total statistics: 1104800000 fitted, 0 smoothed
(Deprecated statistics: 1769600000 filtered, 1769600000 predicted, 1047300000 smoothed)
Fit timers mean: 1.121e-06 sum: 0.105374 min: 1.3497e-07 max: 1.5189e-06 stddev: 2.0252e-07
Smoother timers mean: 0 sum: 0 min: 0 max: 0 stddev: 0
tbb default_num_threads reports execution with 28 threads
Throughput per processor, estimated total throughput:
Forward and Backward fit: 892063 / s, 2.49778e+07 / s
Smoother: inf / s, inf / s
avx
Total statistics: 1104800000 fitted, 0 smoothed
(Deprecated statistics: 1769600000 filtered, 1769600000 predicted, 1047300000 smoothed)
Fit timers mean: 1.08495e-06 sum: 0.101986 min: 1.01471e-07 max: 1.50432e-06 stddev: 1.88445e-07
Smoother timers mean: 0 sum: 0 min: 0 max: 0 stddev: 0
tbb default_num_threads reports execution with 28 threads
Throughput per processor, estimated total throughput:
Forward and Backward fit: 921698 / s, 2.58076e+07 / s
Smoother: inf / s, inf / s