RungeKuttaUtils attempts to vectorize

Let me mention @ssnyder, @goblirsc and @amorley, with whom I have discussed this.

The issue is that gcc (even with -ftree-vectorize) does not apply so-called SLP (superword-level parallelism) vectorization to this code, while clang does.
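
For readers less familiar with the term: SLP vectorization packs independent, identically shaped scalar statements in straight-line code into vector operations. The sketch below is illustrative only and is not code from this merge request. The names `vec2d`, `scalarPair` and `packedPair` are hypothetical, and the packed version assumes the GCC/Clang `vector_size` extension is available. It shows by hand the kind of packing an SLP vectorizer would be expected to do automatically.

```cpp
#include <cstring>

// GCC/Clang vector extension: a pack of two doubles (16 bytes).
// Hypothetical type name, used only for this illustration.
typedef double vec2d __attribute__((vector_size(16)));

// Straight-line scalar form: two independent statements of the same shape.
// This is the pattern an SLP vectorizer looks for.
inline void scalarPair(double* a, const double* b, double s) {
  a[0] -= s * b[0];
  a[1] -= s * b[1];
}

// The same computation packed by hand into one vector multiply and
// one vector subtract.
inline void packedPair(double* a, const double* b, double s) {
  vec2d va;
  vec2d vb;
  // memcpy avoids any alignment assumption on a and b.
  std::memcpy(&va, a, sizeof(va));
  std::memcpy(&vb, b, sizeof(vb));
  const vec2d vs = {s, s};  // broadcast the scalar factor
  va -= vs * vb;            // element-wise on both lanes
  std::memcpy(a, &va, sizeof(va));
}
```

Both functions should give the same result; the difference is only in whether the compiler has to discover the packing itself.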

Note that, although the statements are packed several per line, the following block amounts to 30 multiplications and 30 subtractions:

    P[ 7]-=(s0*P[ 3]); P[ 8]-=(s0*P[ 4]); P[ 9]-=(s0*P[ 5]);
    P[10]-=(s0*P[42]); P[11]-=(s0*P[43]); P[12]-=(s0*P[44]);
    P[14]-=(s1*P[ 3]); P[15]-=(s1*P[ 4]); P[16]-=(s1*P[ 5]);
    P[17]-=(s1*P[42]); P[18]-=(s1*P[43]); P[19]-=(s1*P[44]);
    P[21]-=(s2*P[ 3]); P[22]-=(s2*P[ 4]); P[23]-=(s2*P[ 5]);
    P[24]-=(s2*P[42]); P[25]-=(s2*P[43]); P[26]-=(s2*P[44]);
    P[28]-=(s3*P[ 3]); P[29]-=(s3*P[ 4]); P[30]-=(s3*P[ 5]);
    P[31]-=(s3*P[42]); P[32]-=(s3*P[43]); P[33]-=(s3*P[44]);
    P[35]-=(s4*P[ 3]); P[36]-=(s4*P[ 4]); P[37]-=(s4*P[ 5]);
    P[38]-=(s4*P[42]); P[39]-=(s4*P[43]); P[40]-=(s4*P[44]);

A few solutions were tried.
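
As an illustration of the general direction, and not necessarily what any of the benchmarked variants does, the unrolled block above can be written as a small loop nest. This is a minimal sketch: the function name `subtractScaledRows` and the array `s` holding `s0` to `s4` are assumptions made for the example. The six inputs are copied into a local array first, so that the compiler can see they do not alias the stores into `P[7]` to `P[40]`.

```cpp
#include <cstddef>

// Hypothetical helper, written only for illustration.
// Computes P[7 + 7*i + j] -= s[i] * d[j] for i in [0,5) and j in [0,6),
// with d = {P[3], P[4], P[5], P[42], P[43], P[44]}, which matches the
// 30 unrolled statements term by term.
inline void subtractScaledRows(double* P, const double (&s)[5]) {
  // Local copy of the six inputs; none of them is written by the loops.
  const double d[6] = {P[3], P[4], P[5], P[42], P[43], P[44]};
  for (std::size_t i = 0; i < 5; ++i) {
    double* row = P + 7 + 7 * i;  // rows start at P[7], P[14], ..., P[35]
    for (std::size_t j = 0; j < 6; ++j) {
      row[j] -= s[i] * d[j];
    }
  }
}
```

Whether a given compiler actually vectorizes this form would need to be checked, for example with the vectorization reports from gcc (`-fopt-info-vec`) or clang (`-Rpass=slp-vectorizer`).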

Benchmark results (compiled with -O3, gcc 10.2.1):

    Benchmark                        Time            CPU   Iterations
    transform_bench/8192        122138 ns      121867 ns         5125
    transformVec_bench/8192      93239 ns       93050 ns         7568
    transformVec2_bench/8192     79113 ns       78945 ns         8892
    transformVec6_bench/8192     84106 ns       83939 ns         8313

The RunTier0 tests show no differences in the output (RunTier0Tests.log).

Let me also mention @amete, in case he has a good idea of how to profile this in production. These functions are called quite a bit as part of the propagation, if I recall correctly.

Edited by Christos Anastopoulos