Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clang generates unoptimal code for code with lots of temporary variables #78506

Open
gbaraldi opened this issue Jan 17, 2024 · 4 comments
Open

Comments

@gbaraldi
Copy link
Contributor

I started a discussion in discourse but I think this might be worthy of an issue, I know the code looks strange but it's a C reproducer of what julia generates.

#include <stddef.h>

struct wtf {
    int a[30];
};

struct wtf __attribute__ ((noinline))  foo(struct wtf *b, int i){
    struct wtf new;


    int idx0 = (0 + i) % 29;
    int idx1 = (1 + i) % 29;
    int idx2 = (2 + i) % 29;
    int idx3 = (3 + i) % 29;
    int idx4 = (4 + i) % 29;
    int idx5 = (5 + i) % 29;
    int idx6 = (6 + i) % 29;
    int idx7 = (7 + i) % 29;
    int idx8 = (8 + i) % 29;
    int idx9 = (9 + i) % 29;
    int idx10 = (10 + i) % 29;
    int idx11 = (11 + i) % 29;
    int idx12 = (12 + i) % 29;
    int idx13 = (13 + i) % 29;
    int idx14 = (14 + i) % 29;
    int idx15 = (15 + i) % 29;
    int idx16 = (16 + i) % 29;
    int idx17 = (17 + i) % 29;
    int idx18 = (18 + i) % 29;
    int idx19 = (19 + i) % 29;
    int idx20 = (20 + i) % 29;
    int idx21 = (21 + i) % 29;
    int idx22 = (22 + i) % 29;
    int idx23 = (23 + i) % 29;
    int idx24 = (24 + i) % 29;
    int idx25 = (25 + i) % 29;
    int idx26 = (26 + i) % 29;
    int idx27 = (27 + i) % 29;
    int idx28 = (28 + i) % 29;
    int idx29 = (29 + i) % 29;

    int val0 = b->a[idx0];
    int val1 = b->a[idx1];
    int val2 = b->a[idx2];
    int val3 = b->a[idx3];
    int val4 = b->a[idx4];
    int val5 = b->a[idx5];
    int val6 = b->a[idx6];
    int val7 = b->a[idx7];
    int val8 = b->a[idx8];
    int val9 = b->a[idx9];
    int val10 = b->a[idx10];
    int val11 = b->a[idx11];
    int val12 = b->a[idx12];
    int val13 = b->a[idx13];
    int val14 = b->a[idx14];
    int val15 = b->a[idx15];
    int val16 = b->a[idx16];
    int val17 = b->a[idx17];
    int val18 = b->a[idx18];
    int val19 = b->a[idx19];
    int val20 = b->a[idx20];
    int val21 = b->a[idx21];
    int val22 = b->a[idx22];
    int val23 = b->a[idx23];
    int val24 = b->a[idx24];
    int val25 = b->a[idx25];
    int val26 = b->a[idx26];
    int val27 = b->a[idx27];
    int val28 = b->a[idx28];
    int val29 = b->a[idx29];
    new.a[0] = val0;
    new.a[1] = val1;
    new.a[2] = val2;
    new.a[3] = val3;
    new.a[4] = val4;
    new.a[5] = val5;
    new.a[6] = val6;
    new.a[7] = val7;
    new.a[8] = val8;
    new.a[9] = val9;
    new.a[10] = val10;
    new.a[11] = val11;
    new.a[12] = val12;
    new.a[13] = val13;
    new.a[14] = val14;
    new.a[15] = val15;
    new.a[16] = val16;
    new.a[17] = val17;
    new.a[18] = val18;
    new.a[19] = val19;
    new.a[20] = val20;
    new.a[21] = val21;
    new.a[22] = val22;
    new.a[23] = val23;
    new.a[24] = val24;
    new.a[25] = val25;
    new.a[26] = val26;
    new.a[27] = val27;
    new.a[28] = val28;
    new.a[29] = val29;
    return new;
}

#include <stddef.h>
#include <stdio.h>
#include <time.h>

volatile struct wtf result;

int main() {
    struct wtf b;
    for (int i = 0; i < 30; i++) {
        b.a[i] = i;
    }

    const int num_iterations = 100000000; // Number of times to call foo
    clock_t start, end;
    double cpu_time_used;

    start = clock();

    for (int i = 0; i < num_iterations; i++) {
        result = foo(&b, i % 30);
        // Optionally use result to prevent the compiler from optimizing out the function call
    }

    end = clock();
    cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;

    printf("Time taken: %f seconds\n", cpu_time_used);

    return 0;
}

This code when compiled with gcc 12 is about 40% faster than what clang-17 generates, and by allowing it to inline it becomes around 60% faster, a quick peak at the assemblies shows a very different piece of code, with gcc using mostly xmm register and avx instructions while clang uses mostly normal registers.
https://godbolt.org/z/nWrKKe7s9

@github-actions github-actions bot added the clang Clang issues not falling into any other category label Jan 17, 2024
@EugeneZelenko EugeneZelenko added llvm:optimizations missed-optimization and removed clang Clang issues not falling into any other category labels Jan 17, 2024
@vient
Copy link
Member

vient commented Jan 17, 2024

You may help compiler by providing information about possible values of i. After adding i = (30 + i % 30) % 30; to beginning of foo code became quite smaller.

Also, is it supposed to be % 29 if the size of array is 30?

Here are some loop versions https://godbolt.org/z/571aPj9zh, when written as loop all spills are gone. Somehow both compilers fail to use the fact that i is in range [0, N) and increments by 1 so remainder operation can be replaced with a naive version. In bar both compilers managed to insert memcpy calls after all.

@vient
Copy link
Member

vient commented Jan 17, 2024

Another weird example, setting aligned value to 8 or higher brings a lot of spills in Clang https://godbolt.org/z/EKExxvxv3

@gbaraldi
Copy link
Contributor Author

Originally this was a circular shift, so I guess that is an optimization.
But yeah this is basically an unrolled loop but without interleaving the operations, and it simply surprised me that this didn't generate pretty good code, because it's not super complicated.

@gbaraldi
Copy link
Contributor Author

If anyone wants to play with the directly unrolled version (even though LLVM seems to like fully unrolling this loop anyway)
the following awk script generates the code for any size one wants, credits @aravindh-krishnamoorthy.

$0 ~ /^##.*/ {gsub("^## ",""); print}
$0 ~ /^#~.*/ {gsub("^#~ ",""); gsub("N",N); if($0 ~ /?/) {for (i=1; i<N; i++) {a = $0; gsub("?",i,a); print a}} else {print}}

## #include <stddef.h>
## #define assume(x) if (!(x)) { __builtin_unreachable(); }
## struct wtf {
#~     int a[N];
## };
## struct wtf __attribute__ ((noinline))  foo(struct wtf *b, int i)
## {
#~     assume(i >= 0 && i < N);
##     struct wtf new;
#~     int idx? = (? + i) % N ;
#~     int val? = b->a[idx?] ;
#~     new.a[?] = val? ;
##     return new;
## }
## 

usage with gawk -v N=50 -f rs.awk rs.awk > rs50.c

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants