RungeKuttaUtils: attempts to vectorize
Let me mention @ssnyder, @goblirsc and @amorley, with whom I have discussed this.
The issue is that gcc (even with -ftree-vectorize) does not perform so-called SLP (superword-level parallelism) vectorization on this code, while clang does.
Note that although each row fits on one line, the block below amounts to 30 multiplications and 30 subtractions:
P[ 7]-=(s0*P[ 3]); P[ 8]-=(s0*P[ 4]); P[ 9]-=(s0*P[ 5]);
P[10]-=(s0*P[42]); P[11]-=(s0*P[43]); P[12]-=(s0*P[44]);
P[14]-=(s1*P[ 3]); P[15]-=(s1*P[ 4]); P[16]-=(s1*P[ 5]);
P[17]-=(s1*P[42]); P[18]-=(s1*P[43]); P[19]-=(s1*P[44]);
P[21]-=(s2*P[ 3]); P[22]-=(s2*P[ 4]); P[23]-=(s2*P[ 5]);
P[24]-=(s2*P[42]); P[25]-=(s2*P[43]); P[26]-=(s2*P[44]);
P[28]-=(s3*P[ 3]); P[29]-=(s3*P[ 4]); P[30]-=(s3*P[ 5]);
P[31]-=(s3*P[42]); P[32]-=(s3*P[43]); P[33]-=(s3*P[44]);
P[35]-=(s4*P[ 3]); P[36]-=(s4*P[ 4]); P[37]-=(s4*P[ 5]);
P[38]-=(s4*P[42]); P[39]-=(s4*P[43]); P[40]-=(s4*P[44]);
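For illustration, the block above can be written with explicit two-wide vectors using the GCC/Clang vector_size extension, so that the result does not depend on the compiler's SLP pass. This is only a minimal sketch under assumptions: the names vec2, load2, store2 and subtractRows are hypothetical, s0..s4 are assumed to be passed as an array s[5], and this is not necessarily how the code in this change is written.

```cpp
#include <cstring>

// Two doubles packed in one 16-byte vector (GCC/Clang extension).
typedef double vec2 __attribute__((vector_size(16)));

// Unaligned load/store through memcpy, which compilers turn into plain moves.
static inline vec2 load2(const double* p) {
  vec2 v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}
static inline void store2(double* p, vec2 v) {
  std::memcpy(p, &v, sizeof(v));
}

// Hypothetical helper: same arithmetic as the scalar block,
// P[7+7*i+k] -= s[i] * src[k], with src = {P[3],P[4],P[5],P[42],P[43],P[44]}.
static inline void subtractRows(double* P, const double (&s)[5]) {
  // The six source values, grouped in pairs. They are never written below.
  const vec2 a = {P[3], P[4]};
  const vec2 b = {P[5], P[42]};
  const vec2 c = {P[43], P[44]};
  for (int i = 0; i < 5; ++i) {
    double* row = P + 7 + 7 * i;      // rows start at P[7], P[14], ..., P[35]
    const vec2 si = {s[i], s[i]};     // broadcast the scale factor
    store2(row,     load2(row)     - si * a);
    store2(row + 2, load2(row + 2) - si * b);
    store2(row + 4, load2(row + 4) - si * c);
  }
}
```

Each row then costs three vector multiplies and three vector subtractions instead of six scalar ones of each.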
A few solutions were tried:
- The most readable one is probably (Vec6) to change the representation to something like this (one of @ssnyder's suggestions): https://github.com/AnChristos/RkTransforms/blob/718782818a6e00288824fe8c063b8b232a173a23/transforms.h#L15, https://github.com/AnChristos/RkTransforms/blob/718782818a6e00288824fe8c063b8b232a173a23/transforms.cxx#L333. It seems to be only a bit slower than the version here. But for it to work we need to avoid too many conversions from the current P[45] to that Pstruct. This will probably need some follow-up along the call chain (changing the callers), as P is passed in from outside.
- The other way (where I still tried to make the names a bit more readable with respect to P[...]) is the one here (Vec2).
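To make the conversion concern concrete, here is a rough sketch of what a grouped representation and its round trip from and to P[45] could look like. The struct PGrouped and the functions toGrouped, fromGrouped and subtractRowsGrouped are hypothetical names invented for this illustration. The layout is only inferred from the indices used in the block above and is not the struct from the linked transforms.h.

```cpp
#include <algorithm>

// Hypothetical grouped layout (assumption, not the linked Pstruct).
struct PGrouped {
  double head[7];     // P[0..6]
  double rows[5][7];  // P[7..41], five rows of seven values
  double tail[3];     // P[42..44]
};

// Copy the flat array into the grouped layout (45 doubles copied).
inline PGrouped toGrouped(const double* P) {
  PGrouped g;
  std::copy(P, P + 7, g.head);
  for (int r = 0; r < 5; ++r) {
    std::copy(P + 7 + 7 * r, P + 14 + 7 * r, g.rows[r]);
  }
  std::copy(P + 42, P + 45, g.tail);
  return g;
}

// Copy the grouped layout back into the flat array (45 doubles again).
inline void fromGrouped(const PGrouped& g, double* P) {
  std::copy(g.head, g.head + 7, P);
  for (int r = 0; r < 5; ++r) {
    std::copy(g.rows[r], g.rows[r] + 7, P + 7 + 7 * r);
  }
  std::copy(g.tail, g.tail + 3, P + 42);
}

// Same arithmetic as the scalar block, written as short fixed-size loops.
inline void subtractRowsGrouped(PGrouped& g, const double (&s)[5]) {
  for (int r = 0; r < 5; ++r) {
    for (int k = 0; k < 3; ++k) {
      g.rows[r][k]     -= s[r] * g.head[3 + k];  // uses P[3..5]
      g.rows[r][3 + k] -= s[r] * g.tail[k];      // uses P[42..44]
    }
  }
}
```

If callers keep passing a flat P[45], every call pays for toGrouped and fromGrouped, which is why the grouped type would ideally be used further up the call chain.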
To give an idea, these are the timings from @ssnyder (~30% improvement for these operations), compiled with -O3, gcc 10.2.1:
transform_bench/8192 122138 ns 121867 ns 5125
transformVec_bench/8192 93239 ns 93050 ns 7568
transformVec2_bench/8192 79113 ns 78945 ns 8892
transformVec6_bench/8192 84106 ns 83939 ns 8313
The RunTier0 tests show no diff in output: RunTier0Tests.log
Let me also mention @amete, in case he has a good idea of how to profile this in production. These functions are called quite a lot as part of the propagation, if I recall correctly.