Oscillations in performance after different precompilation runs #51988

Open
Tortar opened this issue Nov 2, 2023 · 23 comments
Labels
performance (Must go faster)

Comments

@Tortar (Contributor) commented Nov 2, 2023

When trying to optimize the Julia code for the benchmark at https://github.com/jinyus/related_post_gen, we realized that the results for the Julia code varied much more than for the other languages (for more on this, see https://discourse.julialang.org/t/funny-benchmark-with-julia-no-longer-at-the-bottom/104611/141). I found the same behaviour on my machine, reproduced it on a second one, and so concluded that something strange is going on.

The procedure to reproduce this behaviour is:

  1. Run git clone https://github.com/Tortar/related_post_gen.git
  2. Run julia -e 'using Pkg; Pkg.develop(path="./related_post_gen/julia/Related")'
  3. Run julia -e "using Related; for _ in 1:10 main() end"
  4. Mimic a change in the source code so that precompilation is triggered again on the next run (e.g. I added a space somewhere; see the sketch after this list)
  5. Run julia -e "using Related; for _ in 1:10 main() end"
  6. If necessary, repeat steps 4 and 5 a few times to reproduce the effect
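
For step 4, a no-op edit is enough. A minimal sketch, assuming the standard package layout with the module file at related_post_gen/julia/Related/src/Related.jl (the exact path is an assumption):

echo "# touch to retrigger precompilation" >> related_post_gen/julia/Related/src/Related.jl

Appending a comment after the module's closing end keeps the code unchanged while still invalidating the cache, so the next "using Related" precompiles again.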

This is what I see on my computer:

bob@bob-Victus-by-HP-Gaming-Laptop-15-fb0xxx:~/related_post_gen/julia$ julia -e "using Related; for _ in 1:10 main() end"
Processing time (w/o IO): 305 milliseconds
Processing time (w/o IO): 25 milliseconds
Processing time (w/o IO): 26 milliseconds
Processing time (w/o IO): 26 milliseconds
Processing time (w/o IO): 26 milliseconds
Processing time (w/o IO): 25 milliseconds
Processing time (w/o IO): 24 milliseconds
Processing time (w/o IO): 26 milliseconds
Processing time (w/o IO): 25 milliseconds
Processing time (w/o IO): 25 milliseconds
Processing time (w/o IO): 25 milliseconds

# this will retrigger precompilation as specified above
bob@bob-Victus-by-HP-Gaming-Laptop-15-fb0xxx:~/related_post_gen/julia$ julia -e "using Related; for _ in 1:10 main() end"
Processing time (w/o IO): 294 milliseconds
Processing time (w/o IO): 18 milliseconds
Processing time (w/o IO): 18 milliseconds
Processing time (w/o IO): 19 milliseconds
Processing time (w/o IO): 18 milliseconds
Processing time (w/o IO): 18 milliseconds
Processing time (w/o IO): 18 milliseconds
Processing time (w/o IO): 18 milliseconds
Processing time (w/o IO): 18 milliseconds
Processing time (w/o IO): 18 milliseconds
Processing time (w/o IO): 18 milliseconds

As you can see, there is a variation of nearly 40% in performance. Note that the algorithm in the Related source file is deterministic.

Version info for the two computers where I reproduced this behaviour:

Julia Version 1.9.3
Commit bed2cd540a1 (2023-08-24 14:43 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 12 × AMD Ryzen 5 5600H with Radeon Graphics
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 1 on 12 virtual cores
-----------------------------------------------------------
Julia Version 1.9.3
Commit bed2cd540a (2023-08-24 14:43 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 6 × Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 1 on 6 virtual cores

Apart from the performance that might be recovered by solving this issue, I want to stress that these oscillations make it hard to optimize code in a time-efficient way, since one can easily be misled by them.

@Tortar (Contributor, Author) commented Nov 2, 2023

I tried substituting the MVector and SVector types from StaticArrays with plain Vector to see if this issue only happens with that package's types... and no, it also happens with plain vectors, but remarkably it happens less often; you need to try quite a few times.

@vtjnash (Sponsor Member) commented Nov 2, 2023

That result sounds fairly typical. There are some tools you can use to reduce noise on your system, but they will likely only help you a little: https://juliaci.github.io/BenchmarkTools.jl/dev/linuxtips/#Additional-resources

Even with those, we usually see ±20% performance variation running exactly the same scalar code on exactly the same CPU frequency settings every day in our testing: https://github.com/JuliaCI/NanosoldierReports/blob/master/benchmark/by_date/2023-08/02/report.md

I cannot find the link now, but prior researchers have shown that you can have even larger, reproducible differences in performance due to minute changes such as the length of the environment-variable block or the current working directory from which the program is run.
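
For illustration only (the variable name and padding sizes below are arbitrary, not something from this thread), that effect can be probed by re-running the identical benchmark while varying nothing but the size of the environment block:

for n in 0 256 1024 4096; do
    pad=$(head -c "$n" /dev/zero | tr '\0' 'x')
    echo "== env padding: $n bytes =="
    DUMMY_PAD="$pad" julia -e 'using Related; for _ in 1:10 main() end'
done

If the reported times shift with the padding size alone, the variation comes from memory-layout luck rather than from the code being measured.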

@Tortar (Contributor, Author) commented Nov 2, 2023

To me this sounds a bit different, because I can wait an arbitrary amount of time after precompilation, or even restart the OS, and I still get the same timings, e.g.

bob@bob-Victus-by-HP-Gaming-Laptop-15-fb0xxx:~$ julia -e "using Related; for _ in 1:10 main() end"
Processing time (w/o IO): 297 milliseconds
Processing time (w/o IO): 24 milliseconds
Processing time (w/o IO): 25 milliseconds
Processing time (w/o IO): 24 milliseconds
Processing time (w/o IO): 25 milliseconds
Processing time (w/o IO): 23 milliseconds
Processing time (w/o IO): 25 milliseconds
Processing time (w/o IO): 24 milliseconds
Processing time (w/o IO): 24 milliseconds
Processing time (w/o IO): 24 milliseconds
# I restart, wait ten minutes, or anything, and still same times
bob@bob-Victus-by-HP-Gaming-Laptop-15-fb0xxx:~$ julia -e "using Related; for _ in 1:10 main() end"
Processing time (w/o IO): 25 milliseconds
Processing time (w/o IO): 24 milliseconds
Processing time (w/o IO): 25 milliseconds
Processing time (w/o IO): 24 milliseconds
Processing time (w/o IO): 25 milliseconds
Processing time (w/o IO): 23 milliseconds
Processing time (w/o IO): 25 milliseconds
Processing time (w/o IO): 24 milliseconds
Processing time (w/o IO): 24 milliseconds
Processing time (w/o IO): 24 milliseconds

If I retrigger precompilation and I'm lucky, I then consistently get 18/19 ms afterwards.

Is this also expected? More specifically, should the point in time at which precompilation happens consistently affect the results afterwards? Can we be "lucky" all the time?

@vchuravy (Member) commented Nov 2, 2023

You could use perf or VTune to record two profiles and try to see the difference.
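
For example, a rough perf workflow (the file names are placeholders): record one profile per "build", then compare them with perf diff, which reports per-symbol overhead deltas.

perf record -o perf-slow.data julia -e 'using Related; for _ in 1:10 main() end'
# retrigger precompilation, then record the other build:
perf record -o perf-fast.data julia -e 'using Related; for _ in 1:10 main() end'
perf diff perf-slow.data perf-fast.data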

@vtjnash (Sponsor Member) commented Nov 2, 2023

Yes, that is expected. You can't tell from the benchmark report directly, but the measurement is likely highly consistent for that particular build arrangement, yet it will likely change from build to build as various unrelated changes occur in other, unrelated code. It is unlikely you can get lucky all the time: Intel hasn't documented how, and there are too many different things that all need to land in one of the many possible fast arrangements.

@Tortar (Contributor, Author) commented Nov 2, 2023

Oh, this is very interesting. I can only provide "empirical evidence", since I'm certainly not an expert on this: I tried the Rust code in the fork I provided about 20 times, and all the builds resulted in nearly equal times. From what I saw on the VM for that repository, I'm almost sure this will hold for (maybe almost) all the other programming languages in that comparison apart from Julia. So this raises the question: what is special about Julia? I will run the perf profiler; maybe something interesting can still be seen there.

@vtjnash (Sponsor Member) commented Nov 2, 2023

The Rust compiler is likely fairly stable at compiling identical code from identical input, but it would likely also run into similar issues if you introduced unrelated changes into the code.

@gbaraldi (Member) commented Nov 2, 2023

Like vchuravy said, you can use perf together with export ENABLE_JITPROFILING=1 and inspect the generated assembly. While some variation is expected, we've also seen a bug with this before.

@Tortar (Contributor, Author) commented Nov 2, 2023

I added the results of

$ ENABLE_JITPROFILING=1 perf record -o /tmp/perf.data --call-graph dwarf -k 1 julia -e "using Related; for _ in 1:10 main() end"
$ perf inject --jit --input /tmp/perf.data --output /tmp/perf-jit.data # with slowv and fastv

for both versions (slow and fast) inside the fork: https://github.com/Tortar/related_post_gen/blob/main/perf-jit-results.tar.xz. There are some differences, but I'm not able to say whether they are significant. Someone more capable than me is needed on this front.

Maybe it is not very informative, but I also ran this:

Fast version:

root@bob-Victus-by-HP-Gaming-Laptop-15-fb0xxx:~/related_post_gen/julia/Related/src# perf stat julia -e "using Related; for _ in 1:10 main() end"

Processing time (w/o IO): 19 milliseconds
Processing time (w/o IO): 18 milliseconds
Processing time (w/o IO): 18 milliseconds
Processing time (w/o IO): 18 milliseconds
Processing time (w/o IO): 18 milliseconds
Processing time (w/o IO): 19 milliseconds
Processing time (w/o IO): 19 milliseconds
Processing time (w/o IO): 19 milliseconds
Processing time (w/o IO): 18 milliseconds
Processing time (w/o IO): 18 milliseconds

 Performance counter stats for '/root/julia-1.9.3/bin/julia -e using Related; for _ in 1:10 main() end':

          1.427,72 msec task-clock                       #    1,398 CPUs utilized             
                28      context-switches                 #   19,612 /sec                      
                 2      cpu-migrations                   #    1,401 /sec                      
            65.678      page-faults                      #   46,002 K/sec                     
     5.836.093.356      cycles                           #    4,088 GHz                         (82,62%)
        10.268.143      stalled-cycles-frontend          #    0,18% frontend cycles idle        (82,90%)
       364.253.214      stalled-cycles-backend           #    6,24% backend cycles idle         (83,76%)
    11.044.985.927      instructions                     #    1,89  insn per cycle            
                                                  #    0,03  stalled cycles per insn     (84,02%)
     2.530.619.291      branches                         #    1,772 G/sec                       (84,03%)
        75.221.276      branch-misses                    #    2,97% of all branches             (82,84%)

       1,021072282 seconds time elapsed

       0,926903000 seconds user
       0,503069000 seconds sys

Slow version:

root@bob-Victus-by-HP-Gaming-Laptop-15-fb0xxx:~/related_post_gen/julia/Related/src# perf stat julia -e "using Related; for _ in 1:10 main() end"
Processing time (w/o IO): 24 milliseconds
Processing time (w/o IO): 25 milliseconds
Processing time (w/o IO): 24 milliseconds
Processing time (w/o IO): 23 milliseconds
Processing time (w/o IO): 24 milliseconds
Processing time (w/o IO): 24 milliseconds
Processing time (w/o IO): 24 milliseconds
Processing time (w/o IO): 24 milliseconds
Processing time (w/o IO): 24 milliseconds
Processing time (w/o IO): 24 milliseconds

 Performance counter stats for '/root/julia-1.9.3/bin/julia -e using Related; for _ in 1:10 main() end':

          1.494,16 msec task-clock                       #    1,375 CPUs utilized             
                23      context-switches                 #   15,393 /sec                      
                 0      cpu-migrations                   #    0,000 /sec                      
            65.434      page-faults                      #   43,793 K/sec                     
     6.134.480.189      cycles                           #    4,106 GHz                         (82,80%)
         9.888.406      stalled-cycles-frontend          #    0,16% frontend cycles idle        (83,94%)
       343.693.440      stalled-cycles-backend           #    5,60% backend cycles idle         (83,94%)
    11.211.154.093      instructions                     #    1,83  insn per cycle            
                                                  #    0,03  stalled cycles per insn     (83,94%)
     2.573.868.880      branches                         #    1,723 G/sec                       (83,21%)
        75.194.423      branch-misses                    #    2,92% of all branches             (82,19%)

       1,086947161 seconds time elapsed

       0,979065000 seconds user
       0,516910000 seconds sys

@vtjnash (Sponsor Member) commented Nov 2, 2023

1.021 sec vs. 1.086 sec seems fine (that measures as roughly a 7% perf difference here, vs. the 33% slowdown in the reported per-iteration times).

@Tortar (Contributor, Author) commented Nov 2, 2023

Yes, but there is a lot of other work inside that function (I/O); I should probably change the code so that it doesn't do these extra, unrelated computations. I will update the code in the fork.

edit: actually I think this is really necessary to be able to analyze the perf results, so I will update the code, rerun everything, and let you know.

@gbaraldi (Member) commented Nov 2, 2023

Do something like

export ENABLE_JITPROFILING=1
perf record -e cycles,cycle_activity.cycles_mem_any,instructions,cache-references,cache-misses --call-graph fp -k 1 ./julia file_to_run.jl
perf inject --jit --input perf.data --output perf-jit.data
perf report -n --no-children

So that you can analyze the assembly with call graphs and JIT code.

@Tortar (Contributor, Author) commented Nov 3, 2023

OK, new results where I minimized the I/O but didn't remove all of it, because otherwise I wasn't able to reproduce the issue (since some I/O remains, the measured difference is a bit smaller than the real one):

 Performance counter stats for '/root/julia-1.9.3/bin/julia -e using Related; for i in 1:100 main() end':

          3.803,99 msec task-clock                       #    1,119 CPUs utilized             
               228      context-switches                 #   59,937 /sec                      
                 4      cpu-migrations                   #    1,052 /sec                      
            74.152      page-faults                      #   19,493 K/sec                     
    15.598.227.230      cycles                           #    4,100 GHz                         (83,06%)
        12.291.529      stalled-cycles-frontend          #    0,08% frontend cycles idle        (83,31%)
       372.336.701      stalled-cycles-backend           #    2,39% backend cycles idle         (83,50%)
    41.575.077.159      instructions                     #    2,67  insn per cycle            
                                                  #    0,01  stalled cycles per insn     (83,54%)
    11.152.200.744      branches                         #    2,932 G/sec                       (83,60%)
        91.040.325      branch-misses                    #    0,82% of all branches             (83,08%)

       3,399003245 seconds time elapsed

       3,251535000 seconds user
       0,553068000 seconds sys


 Performance counter stats for '/root/julia-1.9.3/bin/julia -e using Related; for i in 1:100 main() end':

          3.212,95 msec task-clock                       #    1,130 CPUs utilized             
                39      context-switches                 #   12,138 /sec                      
                 5      cpu-migrations                   #    1,556 /sec                      
            73.087      page-faults                      #   22,748 K/sec                     
    13.072.904.174      cycles                           #    4,069 GHz                         (83,05%)
        10.544.766      stalled-cycles-frontend          #    0,08% frontend cycles idle        (83,40%)
       222.939.657      stalled-cycles-backend           #    1,71% backend cycles idle         (83,57%)
    41.415.800.979      instructions                     #    3,17  insn per cycle            
                                                  #    0,01  stalled cycles per insn     (83,57%)
    11.157.758.283      branches                         #    3,473 G/sec                       (83,51%)
        90.838.049      branch-misses                    #    0,81% of all branches             (83,02%)

       2,842628434 seconds time elapsed

       2,639978000 seconds user
       0,574853000 seconds sys

I still haven't had the time to analyze this more carefully.

@Tortar (Contributor, Author) commented Nov 3, 2023

I used

export ENABLE_JITPROFILING=1
perf record -e "cycles,instructions,cache-references,cache-misses" --call-graph fp -k 1 /root/julia-1.9.3/bin/julia -e "using Related; for i in 1:100 main() end"
perf inject --jit --input perf.data --output perf-jit.data

I'm not sure whether I'm identifying the right thing, but these are quite different:

[Screenshot: perf report output for the two runs, 2023-11-03]

I used perf report on /tmp/perf-jit.data and went to the instructions section.

@vchuravy (Member) commented Nov 3, 2023

Could you upload the two files?

@gbaraldi (Member) commented Nov 3, 2023

You can also inspect the assembly by pressing 'a', IIRC.

@Tortar (Contributor, Author) commented Nov 3, 2023

Here are two fresh new runs (one slow, one fast): https://github.com/Tortar/related_post_gen/blob/main/perf-jit-results.tar.xz.

I actually see a lot of variation in the collected instructions between runs. I will try to inspect the assembly.

@Tortar (Contributor, Author) commented Nov 4, 2023

I inspected the assembly. Even though the profiler doesn't sample the same instructions in both runs, the assembly seems equivalent; the only thing I notice is that some instructions are hit more in one case than in the other, e.g. these instructions are sampled a lot only in the slow version:

 45,30 │1f5:   mov        jl_undefref_exception,%rax
  0,26 │       mov        (%rax),%rdi 
 44,17 │     → call       ijl_throw@plt

but I have no idea what this means or implies; maybe if someone more expert looks into that profiler data, something more insightful can be understood.

@Tortar (Contributor, Author) commented Nov 4, 2023

I also saw that in the @code_llvm of main() there is non-determinism in some pointers; they change after each restart, e.g. inttoptr (i64 140059622168704 to {}*), inttoptr (i64 139882960320640 to {}*), ...

This also happens with much simpler code, e.g. @code_llvm [1,2], so I'm not sure whether it is related.

Also, I found this very interesting Rust issue, rust-lang/rust#69060, where they discuss non-deterministic LLVM performance and manage to reduce the variance in their benchmarks. I'm not sure whether the same thing affects Julia in the same way.

Citing @eddyb:

LLVM is known to have trouble maintaining deterministic performance, in a similar vein to rustc hashing pointers, but potentially even affecting output for LLVM optimizations.

I highly advise disabling at least userspace ASLR, and manually check (by dumping /proc/self/maps from rustc? or even interned e.g. Ty pointers) that the heap addresses match between two runs, even with a slightly different rustc.
There's also the question of randomness sensitivity, but I suspect we don't want that even in real builds.

IIRC emberian managed to get the nondeterminism down to a bimodal (i.e. two possible profiles) situation for "irsy" (Is Rust Slim Yet), back in the day.
It's a bit of a shame that the "irsy" lessons were lost in time.

(Yet the deterministic-performance troubles of LLVM in that benchmark appear to be much smaller than the ones here.)
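
As a sketch of the check suggested in that quote, adapted to Julia (the setarch invocation and the output file name are assumptions, not something tested in this thread): disable userspace ASLR for a single run and dump /proc/self/maps, so the memory layout of a slow and a fast run can be compared side by side.

setarch "$(uname -m)" --addr-no-randomize \
    julia -e 'write("maps-$(getpid()).txt", read("/proc/self/maps")); using Related; for _ in 1:10 main() end'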

@Tortar (Contributor, Author) commented Jan 10, 2024

Just to let you know: I tried exploring this problem again (nothing new to report about the cause, though), but I see much worse performance oscillations in 1.10.0 using the procedure described in the first comment.

It is trimodal now :D (I wonder if this was already the case in 1.9)

bob@bob-Victus-by-HP-Gaming-Laptop-15-fb0xxx:~/Desktop$ julia -e  "using Related;  for i in 1:10 main() end"
Processing time (w/o IO): 30.253747ms
Processing time (w/o IO): 30.309371000000002ms
Processing time (w/o IO): 31.134902999999998ms
Processing time (w/o IO): 30.449536ms
Processing time (w/o IO): 30.864696000000002ms
Processing time (w/o IO): 30.317126ms
Processing time (w/o IO): 30.905683ms
Processing time (w/o IO): 30.394802ms
Processing time (w/o IO): 30.972208000000002ms
Processing time (w/o IO): 30.119494ms
bob@bob-Victus-by-HP-Gaming-Laptop-15-fb0xxx:~/Desktop$ julia -e  "using Related;  for i in 1:10 main() end"
Processing time (w/o IO): 24.008126ms
Processing time (w/o IO): 24.196011ms
Processing time (w/o IO): 25.057724999999998ms
Processing time (w/o IO): 24.419883ms
Processing time (w/o IO): 24.002025ms
Processing time (w/o IO): 24.793638ms
Processing time (w/o IO): 24.269709ms
Processing time (w/o IO): 24.784820999999997ms
Processing time (w/o IO): 24.010531ms
Processing time (w/o IO): 24.051418ms
bob@bob-Victus-by-HP-Gaming-Laptop-15-fb0xxx:~/Desktop$ julia -e  "using Related;  for i in 1:10 main() end"
Processing time (w/o IO): 17.505882ms
Processing time (w/o IO): 17.751333000000002ms
Processing time (w/o IO): 17.738951ms
Processing time (w/o IO): 17.976658ms
Processing time (w/o IO): 18.051338ms
Processing time (w/o IO): 18.367432ms
Processing time (w/o IO): 17.441492ms
Processing time (w/o IO): 18.033594ms
Processing time (w/o IO): 18.176954ms
Processing time (w/o IO): 18.074452ms

Notice the 70% difference between worst and best "build"

@Tortar (Contributor, Author) commented Jan 10, 2024

I checked the cached binaries for the Related package, and they change after each precompilation: specifically, 3 lines out of 328 differ (the 6th, the 9th, and the last). Is this expected?
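
For reference, a rough way to capture that difference (the depot path and the assumption of a single cache file per package are mine, not from the thread): copy the .ji aside, retrigger precompilation, and diff hex dumps of the two versions.

cp ~/.julia/compiled/v1.10/Related/*.ji /tmp/Related-before.ji
# edit the source and load the package once so it precompiles again, then:
diff <(xxd /tmp/Related-before.ji) <(xxd ~/.julia/compiled/v1.10/Related/*.ji) | head -n 40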

@Tortar (Contributor, Author) commented Jan 10, 2024

I tried with two different packages from the General registry; in both, again, the 6th, 9th, and last lines are the only ones that differ between precompilation runs. At the moment I have no idea whether this can affect performance (I naively tried to patch the .ji with the one from the most performant build, but that retriggers precompilation). It seems strange that these three lines differ in every package I tested; maybe this is an issue in itself?

@IanButterworth (Sponsor Member)
I recommend not modifying the source between runs, but renaming the cache files instead. That way you know the source hasn't changed at all (including its mtime).
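
A minimal sketch of that approach (the default depot path and the Julia 1.10 cache directory are assumptions; here the whole per-package cache directory is moved rather than individual files):

mv ~/.julia/compiled/v1.10/Related ~/.julia/compiled/v1.10/Related.bak
julia -e 'using Related; for _ in 1:10 main() end'   # precompiles again without any source change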

@nsajko added the performance (Must go faster) label on Jan 26, 2024