[WebAssembly] Switch lowering #63909
Comments
@llvm/issue-subscribers-backend-webassembly
The reason the first function is lowered to a `br_table` can be seen here:

llvm-project/llvm/lib/CodeGen/SwitchLoweringUtils.cpp, lines 46 to 188 in 2306f89
llvm-project/llvm/lib/CodeGen/TargetLoweringBase.cpp, lines 1674 to 1694 in 2306f89

But yeah, as you suggested, even though the first one is 'dense' from LLVM's perspective, it may still be too large. It is not lowered to a `br_table` with

`~/llvm-git/install.debug/bin/llc test.ll -max-jump-table-size=19`

while it is with

`~/llvm-git/install.debug/bin/llc test.ll -max-jump-table-size=20`

The default value for the maximum jump table size is set here:

llvm-project/llvm/lib/CodeGen/TargetLoweringBase.cpp, lines 77 to 79 in 2306f89

I think maybe we can consider setting the default value for this to a lower value. This is currently not a virtual function, but I guess we can change that so that we can override it?

llvm-project/llvm/include/llvm/CodeGen/TargetLowering.h, lines 1850 to 1852 in d937836

By the way, I am not sure about the number of cases for which […]
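The suitability check the linked code implements can be modeled roughly as below. This is a simplified sketch, not LLVM's actual code; the function name and the 10%/40% density thresholds (assumed to mirror the `jump-table-density` and `optsize-jump-table-density` defaults) are assumptions here:

```python
def is_suitable_for_jump_table(num_cases, min_case, max_case,
                               opt_for_size=False, max_table_size=None):
    """Simplified model of LLVM's jump-table suitability check.

    The real logic is in the files linked above; the 10%/40% density
    thresholds are assumptions mirroring the jump-table-density /
    optsize-jump-table-density defaults.
    """
    table_range = max_case - min_case + 1  # entries the table would need
    if max_table_size is not None and table_range > max_table_size:
        return False  # the limit -max-jump-table-size enforces
    min_density = 40 if opt_for_size else 10  # percent
    return num_cases * 100 >= table_range * min_density

# Three cases spanning values 0..19: 3/20 = 15% density, dense enough.
print(is_suitable_for_jump_table(3, 0, 19))                     # True
# Capping the table at 19 entries rejects it, as in the llc runs above.
print(is_suitable_for_jump_table(3, 0, 19, max_table_size=19))  # False
```

This also shows why `-max-jump-table-size=19` versus `=20` flips the lowering for a table that needs 20 entries.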
There is […]. But now that you've pointed out the density logic, I've also noticed there's also a […]
AFAICT, for the optimizing compiler, it is target dependent, as high-level switch constructs are effectively passed straight to the backend, which is quite nice. But currently codegen for aarch64 isn't great here, as we don't optimize out the 'empty' cases; this is something I'm still thinking about. For Liftoff, the baseline compiler, it appears we generate a sequence of compare-branches :(
Aha, there's also […]
As @sparker-arm already said: in TurboFan, generally yes, if there are more than 4 cases and a few other heuristics are met. I don't know how these rules were derived, nor how optimal they are.
That certainly depends on the hardware (pipeline length, branch predictor strength, relative costs of computed and conditional branches, ...), and also on the use case (frequency distribution of cases). As an example: in our Wasm instruction decoder, we have empirically determined that we get the best results with a hybrid approach, roughly: […]

But of course the fact that pulling out exactly two common cases works best is due to the typical distribution of Wasm instructions in Wasm functions, and doesn't necessarily carry over to any other use case.
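The hybrid approach described above can be sketched as follows. This is an illustrative model only, not V8's actual decoder; the handler table and opcode values are made up:

```python
def make_hybrid_dispatch(handlers, hot_opcodes, default_handler):
    # Exactly two empirically common cases are pulled out as direct
    # compare-branches; everything else goes through a table lookup
    # (the dict here standing in for a jump table).
    hot_a, hot_b = hot_opcodes

    def dispatch(opcode):
        if opcode == hot_a:  # compare-branch for hot case 1
            return handlers[hot_a](opcode)
        if opcode == hot_b:  # compare-branch for hot case 2
            return handlers[hot_b](opcode)
        return handlers.get(opcode, default_handler)(opcode)

    return dispatch

# Hypothetical opcodes and handlers, just to exercise the shape.
handlers = {0x20: lambda op: "local.get",
            0x10: lambda op: "call",
            0x01: lambda op: "nop"}
decode = make_hybrid_dispatch(handlers, (0x20, 0x10), lambda op: "unknown")
print(decode(0x20), decode(0x01), decode(0xFF))  # local.get nop unknown
```

Which two cases to pull out would come from profiling, which is the empirical tuning the comment describes.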
@jakobkummerow Thanks for the explanation!
Can you elaborate on what "we don't optimize out the 'empty' cases" means, and what kind of option or optimization you would like to have? I'm not sure limiting the max jump table size for wasm wholesale is a good solution, because for other architectures it might help performance even though the code size is bigger. It looks like you can control this behavior by passing […]
I honestly haven't looked into the performance aspect yet. From a V8 perspective, x64 codegen certainly looks better, albeit about two times larger than I would have expected (aarch64 is 3x).
In the given example, with LLVM IR with three cases, we can generate a br_table with 20(?) branch targets. I refer to most of these targets as being 'empty' because they're really just branches to the default target. Currently, V8 will create a branch target block for all those cases. I think we should be able to clean up the CFG later with V8's jump threading pass, but there are currently complications with how tables are lowered for aarch64 (not an LLVM problem!).
Thanks, but I'm not really looking to play with developer options; I just want to determine whether the behaviour I've encountered is expected. If so, then we probably need to look into br_table lowering in V8.
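The CFG cleanup hinted at here, where a jump-threading pass collapses the 'empty' br_table targets, can be sketched as below. This is a hypothetical illustration of the idea, not V8's actual pass; the block names are made up:

```python
def thread_jump_table(targets, trivial_successor):
    """Rewrite each br_table target that is an 'empty' block (one that
    does nothing but jump on to another block) to its final destination,
    so mostly-default tables stop creating useless intermediate blocks."""
    def resolve(block):
        seen = set()  # guard against cycles of empty blocks
        while block in trivial_successor and block not in seen:
            seen.add(block)
            block = trivial_successor[block]
        return block
    return [resolve(t) for t in targets]

# A mostly-empty table: entries 1..4 only forward to the default block.
table = ["case0", "empty1", "empty2", "empty3", "empty4"]
forwards = {f"empty{i}": "default" for i in range(1, 5)}
print(thread_jump_table(table, forwards))
# ['case0', 'default', 'default', 'default', 'default']
```

After this rewrite, all the 'empty' entries share the default target, which is exactly the cleanup the comment says jump threading should enable.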
So TurboFan's jump table lowering doesn't work in AArch64?
I see. So yeah, the current behavior is at least not a bug, and it looks like it thinks lowering to a […]

I ran the Emscripten core benchmarks, in non-LTO and LTO modes, and while I haven't compared individual programs' sizes, the aggregate total size of all programs is very similar for these three options: […] (The default optimization flag here is […].)

The difference between them is less than 0.02%. So yeah, I'm not sure if I should put in a random threshold by default at this point, given that it doesn't seem to affect code size a lot in practice and I don't fully understand its performance implications.
Fair enough, thanks for getting some numbers.
It works, but the codegen looks suboptimal. I think we should be able to do better for all targets with 'sparse' tables. Thanks for your help.
The WebAssembly backend is very keen to lower switches to branch tables, with the argument being that it is better for code size. I've found that this isn't the case, and that the behaviour of switch lowering is dependent on the case values. The following two functions are the same except for their case values, but the first is lowered to a `br_table` whereas the second isn't: […]

Using a `br_table` results in the first function being 53 bytes, versus 37 bytes with three conditional branches.

I noticed this when looking at some rather ugly codegen from V8, as a largely empty `br_table` currently isn't optimised. Is there something LLVM can do about this, or is this something consumers just need to handle?

From a brief browse, I don't think the current interface between the backend and switch lowering is sufficient, and I don't understand why the specific case values are affecting the behaviour.