
Vectorized prefix sum

Daniel Campora Perez requested to merge dcampora_vectorized_ps into master

This MR vectorizes the CPU prefix sum for any x86_64 target. The overall impact on the sequence is small. However, it should allow for a more flexible -n setting, since the CPU becomes less of a bottleneck in scenarios with higher -n.
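For readers unfamiliar with the technique, here is a minimal sketch of how a prefix sum can be vectorized with SSE2, which is available on every x86_64 target. It is not the code from this MR: the function name `exclusive_prefix_sum`, the choice of an in-place exclusive scan over 32-bit unsigned counters, and the scalar fallback are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Hypothetical helper (not taken from the MR): in-place exclusive prefix sum.
// On return, data[i] holds the sum of the original data[0..i-1];
// the return value is the sum of all n original elements.
inline std::uint32_t exclusive_prefix_sum(std::uint32_t* data, std::size_t n) {
  assert(data != nullptr || n == 0);
  std::uint32_t carry = 0;
  std::size_t i = 0;
#if defined(__SSE2__)
  // Running total of all previous blocks, replicated in every lane.
  __m128i offset = _mm_setzero_si128();
  for (; i + 4 <= n; i += 4) {
    const __m128i x = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
    // Inclusive scan of the 4 lanes using two shift-and-add steps.
    __m128i s = _mm_add_epi32(x, _mm_slli_si128(x, 4));  // shift by one lane
    s = _mm_add_epi32(s, _mm_slli_si128(s, 8));          // shift by two lanes
    // Add the total carried over from earlier blocks.
    s = _mm_add_epi32(s, offset);
    // Exclusive result = inclusive result minus the element itself.
    _mm_storeu_si128(reinterpret_cast<__m128i*>(data + i), _mm_sub_epi32(s, x));
    // Broadcast the last lane (the new running total) to all lanes.
    offset = _mm_shuffle_epi32(s, 0xFF);
  }
  carry = static_cast<std::uint32_t>(_mm_cvtsi128_si32(offset));
#endif
  // Scalar tail (and full fallback when SSE2 is unavailable).
  for (; i < n; ++i) {
    const std::uint32_t v = data[i];
    data[i] = carry;
    carry += v;
  }
  return carry;
}
```

Each block of four elements needs only two shift-and-add steps instead of a four-step serial dependency chain, which is where a speedup of this kind would come from. The remaining dependency between blocks is the broadcast running total.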

The following prefix sum speedups were measured in a run with -n 1000:

Prefix sum length: 164, speedup SSE: 1.964677x
Prefix sum length: 164, speedup SSE: 4.442857x
Prefix sum length: 4239, speedup SSE: 1.683824x
Prefix sum length: 164, speedup SSE: 2.571429x
Prefix sum length: 164, speedup SSE: 3.833333x
Prefix sum length: 29907, speedup SSE: 1.565231x
Prefix sum length: 13693, speedup SSE: 1.742566x
Prefix sum length: 490, speedup SSE: 3.772277x
Prefix sum length: 3095, speedup SSE: 2.340463x
Prefix sum length: 88673, speedup SSE: 2.564199x
Prefix sum length: 164, speedup SSE: 5.510000x
Prefix sum length: 1554, speedup SSE: 2.249110x
Prefix sum length: 1554, speedup SSE: 1.968966x
Prefix sum length: 20865, speedup SSE: 1.739200x
Prefix sum length: 653, speedup SSE: 5.246073x
Prefix sum length: 164, speedup SSE: 4.827273x
Prefix sum length: 1554, speedup SSE: 2.436137x
Prefix sum length: 845, speedup SSE: 1.882353x
Prefix sum length: 1554, speedup SSE: 2.743772x
Prefix sum length: 164, speedup SSE: 3.020000x
Prefix sum length: 164, speedup SSE: 4.025000x
Prefix sum length: 164, speedup SSE: 7.500000x
Prefix sum length: 5054, speedup SSE: 1.987385x
Prefix sum length: 164, speedup SSE: 2.250000x
Prefix sum length: 164, speedup SSE: 4.420000x
Prefix sum length: 338, speedup SSE: 3.484472x
Prefix sum length: 338, speedup SSE: 14.150000x
Prefix sum length: 8763, speedup SSE: 1.967170x
Prefix sum length: 338, speedup SSE: 3.380282x
Prefix sum length: 338, speedup SSE: 1.500000x
Prefix sum length: 63403, speedup SSE: 1.994542x
Prefix sum length: 28309, speedup SSE: 1.800994x
Prefix sum length: 1012, speedup SSE: 2.426316x
Prefix sum length: 6489, speedup SSE: 1.796820x
Prefix sum length: 183329, speedup SSE: 2.484096x
Prefix sum length: 338, speedup SSE: 2.322727x
Prefix sum length: 3413, speedup SSE: 2.382696x
Prefix sum length: 3413, speedup SSE: 2.289340x
Prefix sum length: 43137, speedup SSE: 1.749932x
Prefix sum length: 1349, speedup SSE: 1.950207x
Prefix sum length: 338, speedup SSE: 4.009091x
Prefix sum length: 3413, speedup SSE: 1.822866x
Prefix sum length: 1767, speedup SSE: 1.620588x
Prefix sum length: 3413, speedup SSE: 1.833611x
Prefix sum length: 338, speedup SSE: 1.637500x
Prefix sum length: 338, speedup SSE: 3.512500x
Prefix sum length: 338, speedup SSE: 1.585714x
Prefix sum length: 10448, speedup SSE: 1.638155x
Prefix sum length: 338, speedup SSE: 4.626374x
Prefix sum length: 338, speedup SSE: 4.728571x
Edited by Daniel Campora Perez
