RungeKuttaUtils attempts to vectorize
Let me mention @ssnyder, @goblirsc and @amorley, with whom I have discussed this.
The issue is that gcc (even with -ftree-vectorize) does not perform so-called SLP (superword-level parallelism) vectorization on this code, while clang does.
Note that although it fits in a few lines, the block below is 30 multiplications and 30 subtractions:
P[ 7]-=(s0*P[ 3]); P[ 8]-=(s0*P[ 4]); P[ 9]-=(s0*P[ 5]);
P[10]-=(s0*P[42]); P[11]-=(s0*P[43]); P[12]-=(s0*P[44]);
P[14]-=(s1*P[ 3]); P[15]-=(s1*P[ 4]); P[16]-=(s1*P[ 5]);
P[17]-=(s1*P[42]); P[18]-=(s1*P[43]); P[19]-=(s1*P[44]);
P[21]-=(s2*P[ 3]); P[22]-=(s2*P[ 4]); P[23]-=(s2*P[ 5]);
P[24]-=(s2*P[42]); P[25]-=(s2*P[43]); P[26]-=(s2*P[44]);
P[28]-=(s3*P[ 3]); P[29]-=(s3*P[ 4]); P[30]-=(s3*P[ 5]);
P[31]-=(s3*P[42]); P[32]-=(s3*P[43]); P[33]-=(s3*P[44]);
P[35]-=(s4*P[ 3]); P[36]-=(s4*P[ 4]); P[37]-=(s4*P[ 5]);
P[38]-=(s4*P[42]); P[39]-=(s4*P[43]); P[40]-=(s4*P[44]);
A few solutions were tried.
- The most readable one would probably be (Vec6) to change the representation to something like this (one of @ssnyder's suggestions): https://github.com/AnChristos/RkTransforms/blob/718782818a6e00288824fe8c063b8b232a173a23/transforms.h#L15, https://github.com/AnChristos/RkTransforms/blob/718782818a6e00288824fe8c063b8b232a173a23/transforms.cxx#L333. It seems to be only a bit slower than the version here. For it to work, though, we need to avoid too many conversions from the current P[45] to that Pstruct. This will probably require following the call chain a bit and changing callers, as P is passed in from outside.
- The other way is the one here (Vec2), where I still tried to make the names a bit more readable compared to P[...].
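As a rough illustration of what an explicit two-wide approach can look like, here is a minimal sketch using the GCC/Clang `vector_size` extension directly. It is not the implementation in this MR: the names `vec2`, `load2`, `store2`, `subtractRowVec2` and `subtractAllVec2` are made up for the example, and it assumes a compiler that supports the extension.

```cpp
#include <cassert>
#include <cstring>

// GCC/Clang vector extension: two doubles handled as one SIMD value.
typedef double vec2 __attribute__((vector_size(16)));

// Unaligned load/store through memcpy, so no alignment is assumed for P.
inline vec2 load2(const double* p) {
  vec2 v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}
inline void store2(double* p, vec2 v) { std::memcpy(p, &v, sizeof(v)); }

// row[k] -= s * d[k] for k = 0..5, processed two elements at a time.
inline void subtractRowVec2(double* row, double s, const double* d) {
  const vec2 sv = {s, s};  // broadcast the scalar to both lanes
  for (int k = 0; k < 6; k += 2) {
    store2(row + k, load2(row + k) - sv * load2(d + k));
  }
}

// Hypothetical driver: applies the update to the five rows of P[45].
void subtractAllVec2(double* P, const double* s) {
  assert(P != nullptr && s != nullptr);
  // Shared multiplicands, copied once since the rows never overwrite them.
  const double d[6] = {P[3], P[4], P[5], P[42], P[43], P[44]};
  for (int r = 0; r < 5; ++r) {
    subtractRowVec2(P + 7 + 7 * r, s[r], d);
  }
}
```

The point of writing the vector type out is that the result no longer depends on the auto-vectorizer recognizing the pattern, at the cost of compiler-specific syntax.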
To give an idea, these are the timings from @ssnyder (roughly 30% less time for these operations), compiled with -O3, gcc 10.2.1:
Benchmark                    Time         CPU          Iterations
transform_bench/8192         122138 ns    121867 ns    5125
transformVec_bench/8192      93239 ns     93050 ns     7568
transformVec2_bench/8192     79113 ns     78945 ns     8892
transformVec6_bench/8192     84106 ns     83939 ns     8313
The RunTier0 tests show no diff in output: RunTier0Tests.log
Let me also mention @amete, in case he has a good idea of how to profile this in production. These are called quite often as part of the propagation, if I recall correctly.
Activity
added 1 commit
- dc74f91b - Also use globalToLocalVecHelper in transformGlobalToCurvilinear
added review-pending-level-1 label
CI Result SUCCESS (hash 0e05a616)
Projects: Athena, AthSimulation, AthGeneration, AnalysisBase. Stages: externals, cmake, make, required tests, optional tests.
Athena: number of compilation errors 0, warnings 0
AthSimulation: number of compilation errors 0, warnings 0
AthGeneration: number of compilation errors 0, warnings 0
AnalysisBase: number of compilation errors 0, warnings 0
For experts only: Jenkins output [CI-MERGE-REQUEST-CC7 21016]
CI Result SUCCESS (hash dc74f91b)
Projects: Athena, AthSimulation, AthGeneration, AnalysisBase. Stages: externals, cmake, make, required tests, optional tests.
Athena: number of compilation errors 0, warnings 0
AthSimulation: number of compilation errors 0, warnings 0
AthGeneration: number of compilation errors 0, warnings 0
AnalysisBase: number of compilation errors 0, warnings 0
For experts only: Jenkins output [CI-MERGE-REQUEST-CC7 21021]
added 1 commit
- 98fd0db1 - Vectorize part of the globalToLocal tranforms, try to use more meaning full names
CI Result SUCCESS (hash 98fd0db1)
Projects: Athena, AthSimulation, AthGeneration, AnalysisBase. Stages: externals, cmake, make, required tests, optional tests.
Athena: number of compilation errors 0, warnings 0
AthSimulation: number of compilation errors 0, warnings 0
AthGeneration: number of compilation errors 0, warnings 0
AnalysisBase: number of compilation errors 0, warnings 0
For experts only: Jenkins output [CI-MERGE-REQUEST-CC7 21040]
added 1 commit
- b5e2eb0f - RungeKuttaUtils vectorize parts , simplify implementation a bit more