Master

Florian Lemaitre requested to merge flemaitr/kalfit-offloaded:master into master

Description

A version of the Kalman filter in which the transposition (AoS -> SoA) is done with SIMD instructions. It supports the SSE, AVX and AVX512 ISAs and gracefully falls back to scalar code otherwise.
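To illustrate what an AoS -> SoA transposition in SIMD can look like, here is a minimal sketch for the SSE case with a scalar fallback. It is not the code from this merge request: the `State4` and `State4x4` types, the field names, and the `aos_to_soa` function are hypothetical and chosen only for illustration, and single precision is an assumption. Only the intrinsics (`_mm_loadu_ps`, `_mm_store_ps`, `_MM_TRANSPOSE4_PS`) are real.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define SKETCH_HAS_SSE 1
#else
#define SKETCH_HAS_SSE 0
#endif

// Hypothetical AoS element: one small state made of four floats.
struct State4 {
  float x, y, tx, ty;
};
static_assert(sizeof(State4) == 4 * sizeof(float), "State4 must be tightly packed");

// Hypothetical SoA block: each field stored contiguously for four states.
struct State4x4 {
  alignas(16) float x[4];
  alignas(16) float y[4];
  alignas(16) float tx[4];
  alignas(16) float ty[4];
};

// Transposes four AoS states into one SoA block.
inline void aos_to_soa(const State4 (&in)[4], State4x4& out) {
#if SKETCH_HAS_SSE
  // Load one state per register: r_i = (x_i, y_i, tx_i, ty_i).
  __m128 r0 = _mm_loadu_ps(reinterpret_cast<const float*>(&in[0]));
  __m128 r1 = _mm_loadu_ps(reinterpret_cast<const float*>(&in[1]));
  __m128 r2 = _mm_loadu_ps(reinterpret_cast<const float*>(&in[2]));
  __m128 r3 = _mm_loadu_ps(reinterpret_cast<const float*>(&in[3]));
  // In-register 4x4 transpose: afterwards r0 = (x_0, x_1, x_2, x_3), etc.
  _MM_TRANSPOSE4_PS(r0, r1, r2, r3);
  _mm_store_ps(out.x, r0);
  _mm_store_ps(out.y, r1);
  _mm_store_ps(out.tx, r2);
  _mm_store_ps(out.ty, r3);
#else
  // Scalar fallback: same result, one element at a time.
  for (std::size_t i = 0; i < 4; ++i) {
    out.x[i] = in[i].x;
    out.y[i] = in[i].y;
    out.tx[i] = in[i].tx;
    out.ty[i] = in[i].ty;
  }
#endif
}
```

Calling `aos_to_soa` on four states fills `out.x` with the four `x` values, `out.y` with the four `y` values, and so on, so later arithmetic can process four states per instruction. Wider ISAs would follow the same pattern with 8 or 16 lanes and a correspondingly larger transpose.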

It chooses the best available implementation for the target at compile time (with a bunch of #ifdefs).
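As a rough sketch of this kind of compile-time selection, the snippet below picks a vector width from the compiler's predefined ISA macros. The macros `__AVX512F__`, `__AVX__` and `__SSE__` are the ones GCC and Clang define for the enabled target features; the names `kVectorWidth`, `kIsaName` and `padded_count` are hypothetical, and counting lanes in 32-bit floats is an assumption rather than something stated in this merge request.

```cpp
#include <cassert>
#include <cstddef>

// Select the widest ISA enabled for this translation unit.
// Lane counts assume 32-bit floats (an assumption for this sketch).
#if defined(__AVX512F__)
constexpr std::size_t kVectorWidth = 16;
constexpr const char* kIsaName = "avx512";
#elif defined(__AVX__)
constexpr std::size_t kVectorWidth = 8;
constexpr const char* kIsaName = "avx";
#elif defined(__SSE__) || defined(_M_X64)
constexpr std::size_t kVectorWidth = 4;
constexpr const char* kIsaName = "sse";
#else
constexpr std::size_t kVectorWidth = 1;
constexpr const char* kIsaName = "scalar";
#endif

// Hypothetical helper: rounds a count up to a whole number of vectors,
// so that SoA buffers can always be processed in full SIMD steps.
constexpr std::size_t padded_count(std::size_t n) {
  return (n + kVectorWidth - 1) / kVectorWidth * kVectorWidth;
}
```

Because the choice is made by the preprocessor, the selected path depends on the flags the file is compiled with (for example `-mavx2` or `-march=native` with GCC or Clang), not on the CPU the binary eventually runs on.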

Results

Other implementations are also shown for comparison, but in practice only the most advanced version is used.

KNL

100 000 experiments, 64 cores, 256 threads

The standard deviation of the measurements is really high (of the same order as the mean) due to the high number of threads.

reference

Total statistics: 1104800000 fitted, 0 smoothed

(Deprecated statistics: 1769600000 filtered, 1769600000 predicted, 1047300000 smoothed)

Fit timers mean: 3.53128e-06 sum: 0.331941 min: 6.25467e-07 max: 1.42715e-05 stddev: 3.32958e-06
Smoother timers mean: 0 sum: 0 min: 0 max: 0 stddev: 0

tbb default_num_threads reports execution with 256 threads
Throughput per processor, estimated total throughput:
 Forward and Backward fit: 283183 / s, 7.24949e+07 / s 
 Smoother: inf / s, inf / s

unspecialized scalar

Total statistics: 1104800000 fitted, 0 smoothed

(Deprecated statistics: 1769600000 filtered, 1769600000 predicted, 1047300000 smoothed)

Fit timers mean: 4.48244e-06 sum: 0.421349 min: 5.66135e-07 max: 1.38183e-05 stddev: 3.90935e-06
Smoother timers mean: 0 sum: 0 min: 0 max: 0 stddev: 0

tbb default_num_threads reports execution with 256 threads
Throughput per processor, estimated total throughput:
 Forward and Backward fit: 223093 / s, 5.71118e+07 / s 
 Smoother: inf / s, inf / s

specialized scalar

Total statistics: 1104800000 fitted, 0 smoothed

(Deprecated statistics: 1769600000 filtered, 1769600000 predicted, 1047300000 smoothed)

Fit timers mean: 3.53128e-06 sum: 0.331941 min: 6.25467e-07 max: 1.42715e-05 stddev: 3.32958e-06
Smoother timers mean: 0 sum: 0 min: 0 max: 0 stddev: 0

tbb default_num_threads reports execution with 256 threads
Throughput per processor, estimated total throughput:
 Forward and Backward fit: 283183 / s, 7.24949e+07 / s 
 Smoother: inf / s, inf / s

sse

Total statistics: 1104800000 fitted, 0 smoothed

(Deprecated statistics: 1769600000 filtered, 1769600000 predicted, 1047300000 smoothed)

Fit timers mean: 2.5622e-06 sum: 0.240847 min: 4.24956e-07 max: 1.10585e-05 stddev: 2.44762e-06
Smoother timers mean: 0 sum: 0 min: 0 max: 0 stddev: 0

tbb default_num_threads reports execution with 256 threads
Throughput per processor, estimated total throughput:
 Forward and Backward fit: 390289 / s, 9.9914e+07 / s 
 Smoother: inf / s, inf / s

avx

Total statistics: 1104800000 fitted, 0 smoothed

(Deprecated statistics: 1769600000 filtered, 1769600000 predicted, 1047300000 smoothed)

Fit timers mean: 2.77701e-06 sum: 0.261039 min: 3.9401e-07 max: 1.17124e-05 stddev: 2.71119e-06
Smoother timers mean: 0 sum: 0 min: 0 max: 0 stddev: 0

tbb default_num_threads reports execution with 256 threads
Throughput per processor, estimated total throughput:
 Forward and Backward fit: 360100 / s, 9.21856e+07 / s 
 Smoother: inf / s, inf / s

avx512

Total statistics: 1104800000 fitted, 0 smoothed

(Deprecated statistics: 1769600000 filtered, 1769600000 predicted, 1047300000 smoothed)

Fit timers mean: 1.96184e-06 sum: 0.184413 min: 4.06507e-07 max: 8.63925e-06 stddev: 1.78076e-06
Smoother timers mean: 0 sum: 0 min: 0 max: 0 stddev: 0

tbb default_num_threads reports execution with 256 threads
Throughput per processor, estimated total throughput:
 Forward and Backward fit: 509726 / s, 1.3049e+08 / s 
 Smoother: inf / s, inf / s

Haswell Xeon

100 000 experiments, 14 cores, 28 threads

reference

Total statistics: 1104800000 fitted, 0 smoothed

(Deprecated statistics: 1769600000 filtered, 1769600000 predicted, 1047300000 smoothed)

Fit timers mean: 1.19894e-06 sum: 0.1127 min: 5.14212e-07 max: 1.66019e-06 stddev: 1.25836e-07
Smoother timers mean: 0 sum: 0 min: 0 max: 0 stddev: 0

tbb default_num_threads reports execution with 28 threads
Throughput per processor, estimated total throughput:
 Forward and Backward fit: 834073 / s, 2.3354e+07 / s 
 Smoother: inf / s, inf / s

unspecialized scalar

Total statistics: 1104800000 fitted, 0 smoothed

(Deprecated statistics: 1769600000 filtered, 1769600000 predicted, 1047300000 smoothed)

Fit timers mean: 1.17691e-06 sum: 0.110629 min: 2.0701e-07 max: 1.75273e-06 stddev: 2.09345e-07
Smoother timers mean: 0 sum: 0 min: 0 max: 0 stddev: 0

tbb default_num_threads reports execution with 28 threads
Throughput per processor, estimated total throughput:
 Forward and Backward fit: 849684 / s, 2.37911e+07 / s 
 Smoother: inf / s, inf / s

specialized scalar

Total statistics: 1104800000 fitted, 0 smoothed

(Deprecated statistics: 1769600000 filtered, 1769600000 predicted, 1047300000 smoothed)

Fit timers mean: 1.18789e-06 sum: 0.111661 min: 2.6805e-07 max: 1.81788e-06 stddev: 1.92684e-07
Smoother timers mean: 0 sum: 0 min: 0 max: 0 stddev: 0

tbb default_num_threads reports execution with 28 threads
Throughput per processor, estimated total throughput:
 Forward and Backward fit: 841830 / s, 2.35712e+07 / s 
 Smoother: inf / s, inf / s

sse

Total statistics: 1104800000 fitted, 0 smoothed

(Deprecated statistics: 1769600000 filtered, 1769600000 predicted, 1047300000 smoothed)

Fit timers mean: 1.121e-06 sum: 0.105374 min: 1.3497e-07 max: 1.5189e-06 stddev: 2.0252e-07
Smoother timers mean: 0 sum: 0 min: 0 max: 0 stddev: 0

tbb default_num_threads reports execution with 28 threads
Throughput per processor, estimated total throughput:
 Forward and Backward fit: 892063 / s, 2.49778e+07 / s 
 Smoother: inf / s, inf / s

avx

Total statistics: 1104800000 fitted, 0 smoothed

(Deprecated statistics: 1769600000 filtered, 1769600000 predicted, 1047300000 smoothed)

Fit timers mean: 1.08495e-06 sum: 0.101986 min: 1.01471e-07 max: 1.50432e-06 stddev: 1.88445e-07
Smoother timers mean: 0 sum: 0 min: 0 max: 0 stddev: 0

tbb default_num_threads reports execution with 28 threads
Throughput per processor, estimated total throughput:
 Forward and Backward fit: 921698 / s, 2.58076e+07 / s
 Smoother: inf / s, inf / s
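The throughput lines in the logs above appear consistent with a simple relation: the per-processor figure matches the inverse of the mean fit time, and the estimated total matches that value multiplied by the reported thread count. The sketch below only reproduces that arithmetic. The `Throughput` struct and `estimate_throughput` function are hypothetical names, and the relation itself is inferred from the numbers rather than taken from the benchmark source.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical container for the two figures printed in the logs.
struct Throughput {
  double per_thread;  // fits per second for one thread
  double total;       // per_thread scaled by the thread count
};

// Inferred relation: per-thread throughput is 1 / mean fit time,
// and the total estimate is that value times the number of threads.
inline Throughput estimate_throughput(double mean_seconds, unsigned threads) {
  assert(mean_seconds > 0.0);
  const double per_thread = 1.0 / mean_seconds;
  return {per_thread, per_thread * static_cast<double>(threads)};
}
```

For the KNL reference run, `estimate_throughput(3.53128e-06, 256)` gives roughly 283183 fits per second per thread and about 7.249e+07 in total, which agrees with the logged values. The same relation would explain the `inf / s` smoother lines, since the smoother timers have a mean of 0.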
