Vectorized prefix sum
This MR vectorizes the CPU prefix sum for any x86_64 target. The overall impact on the sequence is small; however, it should allow for a more flexible -n setting, since the CPU becomes less of a bottleneck in scenarios with higher -n.
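For readers unfamiliar with the technique, here is a minimal sketch of how a prefix sum can be hand-vectorized with SSE2, which is available on every x86_64 target. It is not the code from this MR: the function names `prefix_sum_scalar` and `prefix_sum_sse`, the `uint32_t` element type, and the choice of an in-place inclusive scan are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics (baseline on x86_64)

// Scalar reference: in-place inclusive prefix sum (hypothetical helper).
inline void prefix_sum_scalar(std::uint32_t* data, std::size_t n)
{
  std::uint32_t acc = 0;
  for (std::size_t i = 0; i < n; ++i) {
    acc += data[i];
    data[i] = acc;
  }
}

// SSE2 sketch: scan four 32-bit lanes per iteration (hypothetical helper).
inline void prefix_sum_sse(std::uint32_t* data, std::size_t n)
{
  __m128i carry = _mm_setzero_si128(); // running total, broadcast to all lanes
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    __m128i x = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
    // In-register scan: [a, b, c, d] -> [a, a+b, b+c, c+d]
    x = _mm_add_epi32(x, _mm_slli_si128(x, 4));
    // -> [a, a+b, a+b+c, a+b+c+d]
    x = _mm_add_epi32(x, _mm_slli_si128(x, 8));
    // Add the total of all previous blocks.
    x = _mm_add_epi32(x, carry);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(data + i), x);
    // Broadcast the highest lane (the new running total) for the next block.
    carry = _mm_shuffle_epi32(x, _MM_SHUFFLE(3, 3, 3, 3));
  }
  // Scalar tail for the remaining (n % 4) elements.
  std::uint32_t acc = (i > 0) ? data[i - 1] : 0;
  for (; i < n; ++i) {
    acc += data[i];
    data[i] = acc;
  }
}
```

The two shift-and-add steps replace the serial dependency inside each block of four elements, so only the carry is passed from one block to the next. Unaligned loads and stores are used here so that the sketch places no alignment requirement on `data`.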
The following prefix sum speedups were obtained in a run with -n 1000:
Prefix sum length: 164, speedup SSE: 1.964677x
Prefix sum length: 164, speedup SSE: 4.442857x
Prefix sum length: 4239, speedup SSE: 1.683824x
Prefix sum length: 164, speedup SSE: 2.571429x
Prefix sum length: 164, speedup SSE: 3.833333x
Prefix sum length: 29907, speedup SSE: 1.565231x
Prefix sum length: 13693, speedup SSE: 1.742566x
Prefix sum length: 490, speedup SSE: 3.772277x
Prefix sum length: 3095, speedup SSE: 2.340463x
Prefix sum length: 88673, speedup SSE: 2.564199x
Prefix sum length: 164, speedup SSE: 5.510000x
Prefix sum length: 1554, speedup SSE: 2.249110x
Prefix sum length: 1554, speedup SSE: 1.968966x
Prefix sum length: 20865, speedup SSE: 1.739200x
Prefix sum length: 653, speedup SSE: 5.246073x
Prefix sum length: 164, speedup SSE: 4.827273x
Prefix sum length: 1554, speedup SSE: 2.436137x
Prefix sum length: 845, speedup SSE: 1.882353x
Prefix sum length: 1554, speedup SSE: 2.743772x
Prefix sum length: 164, speedup SSE: 3.020000x
Prefix sum length: 164, speedup SSE: 4.025000x
Prefix sum length: 164, speedup SSE: 7.500000x
Prefix sum length: 5054, speedup SSE: 1.987385x
Prefix sum length: 164, speedup SSE: 2.250000x
Prefix sum length: 164, speedup SSE: 4.420000x
Prefix sum length: 338, speedup SSE: 3.484472x
Prefix sum length: 338, speedup SSE: 14.150000x
Prefix sum length: 8763, speedup SSE: 1.967170x
Prefix sum length: 338, speedup SSE: 3.380282x
Prefix sum length: 338, speedup SSE: 1.500000x
Prefix sum length: 63403, speedup SSE: 1.994542x
Prefix sum length: 28309, speedup SSE: 1.800994x
Prefix sum length: 1012, speedup SSE: 2.426316x
Prefix sum length: 6489, speedup SSE: 1.796820x
Prefix sum length: 183329, speedup SSE: 2.484096x
Prefix sum length: 338, speedup SSE: 2.322727x
Prefix sum length: 3413, speedup SSE: 2.382696x
Prefix sum length: 3413, speedup SSE: 2.289340x
Prefix sum length: 43137, speedup SSE: 1.749932x
Prefix sum length: 1349, speedup SSE: 1.950207x
Prefix sum length: 338, speedup SSE: 4.009091x
Prefix sum length: 3413, speedup SSE: 1.822866x
Prefix sum length: 1767, speedup SSE: 1.620588x
Prefix sum length: 3413, speedup SSE: 1.833611x
Prefix sum length: 338, speedup SSE: 1.637500x
Prefix sum length: 338, speedup SSE: 3.512500x
Prefix sum length: 338, speedup SSE: 1.585714x
Prefix sum length: 10448, speedup SSE: 1.638155x
Prefix sum length: 338, speedup SSE: 4.626374x
Prefix sum length: 338, speedup SSE: 4.728571x
Edited by Daniel Hugo Campora Perez
Activity
added RTA label
removed RTA label
- A deleted user
added hlt1-throughput-decreased label
added RTA label
- Resolved by Rosen Matev
Hi @dcampora, shall we test this?
added 31 commits
- 5f3c6ff1...b25f1d5f - 29 commits from branch master
- c01168c3 - Implemented hand-vectorized prefix sum.
- 50ee7793 - Cleanup.
removed hlt1-throughput-decreased label
added enhancement label
mentioned in issue Moore#382 (closed)
assigned to @thboettc
added ci-test-triggered label
- [2022-02-07 20:59] Validation started with lhcb-master-mr#3652
assigned to @rmatev
unassigned @thboettc
mentioned in commit 6bf2aa95
mentioned in issue Moore#390 (closed)
mentioned in issue MooreAnalysis#30 (closed)