
[WebAssembly] Switch lowering #63909

Closed
sparker-arm opened this issue Jul 17, 2023 · 10 comments

@sparker-arm
Contributor

The WebAssembly backend is very keen to lower switches to branch tables, the argument being that this is better for code size. I've found that this isn't always the case, and that the behaviour of switch lowering depends on the case values. The following two functions are identical except for their case values, yet the first is lowered to a br_table whereas the second isn't:

; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn writeonly
define hidden void @_Z25conditional_store_1_20_50Pii(i32* nocapture noundef writeonly %0, i32 noundef %1) local_unnamed_addr #0 {
  switch i32 %1, label %4 [
    i32 50, label %3
    i32 20, label %3
    i32 1, label %3
  ]

3:                                                ; preds = %2, %2, %2
  store i32 %1, i32* %0, align 4, !tbaa !2
  br label %4

4:                                                ; preds = %2, %3
  ret void
}

define hidden void @_Z26conditional_store_1_50_100Pii(i32* nocapture noundef writeonly %0, i32 noundef %1) local_unnamed_addr #0 {
  switch i32 %1, label %4 [
    i32 100, label %3
    i32 50, label %3
    i32 1, label %3
  ]

3:                                                ; preds = %2, %2, %2
  store i32 %1, i32* %0, align 4, !tbaa !2
  br label %4

4:                                                ; preds = %2, %3
  ret void
}

Using a br_table results in the first function being 53 bytes, versus 37 bytes when using three conditional branches.

000071 func[1] <_Z25conditional_store_1_20_50Pii>:
 000072: 02 40                      | block
 000074: 02 40                      |   block
 000076: 02 40                      |     block
 000078: 20 01                      |       local.get 1
 00007a: 41 7f                      |       i32.const 4294967295
 00007c: 6a                         |       i32.add
 00007d: 0e 14 01 02 02 02 02 02 02 |       br_table 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 0
 000086: 02 02 02 02 02 02 02 02 02 | 
 00008f: 02 02 02 01 00             | 
 000094: 0b                         |     end
 000095: 20 01                      |     local.get 1
 000097: 41 32                      |     i32.const 50
 000099: 47                         |     i32.ne
 00009a: 0d 01                      |     br_if 1
 00009c: 0b                         |   end
 00009d: 20 00                      |   local.get 0
 00009f: 20 01                      |   local.get 1
 0000a1: 36 02 00                   |   i32.store 2 0
 0000a4: 0b                         | end
 0000a5: 0b                         | end

I noticed this when looking at some rather ugly codegen from V8, as a largely empty br_table currently isn't optimised there. Is there something LLVM can do about this, or is this something consumers just need to handle?

From a brief browse, I don't think the current interface between the backend and switch lowering is sufficient, and I don't understand why the specific case values are affecting the behaviour.

@llvmbot
Collaborator

llvmbot commented Jul 17, 2023

@llvm/issue-subscribers-backend-webassembly

@aheejin
Member

aheejin commented Jul 18, 2023

The reason the first function is lowered to a br_table while the second one is not is that the first function's case values are more 'dense', so the switch lowering algorithm considers it more suitable for a jump table. Relevant code:

void SwitchCG::SwitchLowering::findJumpTables(CaseClusterVector &Clusters,
                                              const SwitchInst *SI,
                                              MachineBasicBlock *DefaultMBB,
                                              ProfileSummaryInfo *PSI,
                                              BlockFrequencyInfo *BFI) {
#ifndef NDEBUG
  // Clusters must be non-empty, sorted, and only contain Range clusters.
  assert(!Clusters.empty());
  for (CaseCluster &C : Clusters)
    assert(C.Kind == CC_Range);
  for (unsigned i = 1, e = Clusters.size(); i < e; ++i)
    assert(Clusters[i - 1].High->getValue().slt(Clusters[i].Low->getValue()));
#endif

  assert(TLI && "TLI not set!");
  if (!TLI->areJTsAllowed(SI->getParent()->getParent()))
    return;

  const unsigned MinJumpTableEntries = TLI->getMinimumJumpTableEntries();
  const unsigned SmallNumberOfEntries = MinJumpTableEntries / 2;

  // Bail if not enough cases.
  const int64_t N = Clusters.size();
  if (N < 2 || N < MinJumpTableEntries)
    return;

  // Accumulated number of cases in each cluster and those prior to it.
  SmallVector<unsigned, 8> TotalCases(N);
  for (unsigned i = 0; i < N; ++i) {
    const APInt &Hi = Clusters[i].High->getValue();
    const APInt &Lo = Clusters[i].Low->getValue();
    TotalCases[i] = (Hi - Lo).getLimitedValue() + 1;
    if (i != 0)
      TotalCases[i] += TotalCases[i - 1];
  }

  uint64_t Range = getJumpTableRange(Clusters, 0, N - 1);
  uint64_t NumCases = getJumpTableNumCases(TotalCases, 0, N - 1);
  assert(NumCases < UINT64_MAX / 100);
  assert(Range >= NumCases);

  // Cheap case: the whole range may be suitable for jump table.
  if (TLI->isSuitableForJumpTable(SI, NumCases, Range, PSI, BFI)) {
    CaseCluster JTCluster;
    if (buildJumpTable(Clusters, 0, N - 1, SI, DefaultMBB, JTCluster)) {
      Clusters[0] = JTCluster;
      Clusters.resize(1);
      return;
    }
  }

  // The algorithm below is not suitable for -O0.
  if (TM->getOptLevel() == CodeGenOpt::None)
    return;

  // Split Clusters into minimum number of dense partitions. The algorithm uses
  // the same idea as Kannan & Proebsting "Correction to 'Producing Good Code
  // for the Case Statement'" (1994), but builds the MinPartitions array in
  // reverse order to make it easier to reconstruct the partitions in ascending
  // order. In the choice between two optimal partitionings, it picks the one
  // which yields more jump tables.

  // MinPartitions[i] is the minimum nbr of partitions of Clusters[i..N-1].
  SmallVector<unsigned, 8> MinPartitions(N);
  // LastElement[i] is the last element of the partition starting at i.
  SmallVector<unsigned, 8> LastElement(N);
  // PartitionsScore[i] is used to break ties when choosing between two
  // partitionings resulting in the same number of partitions.
  SmallVector<unsigned, 8> PartitionsScore(N);
  // For PartitionsScore, a small number of comparisons is considered as good as
  // a jump table and a single comparison is considered better than a jump
  // table.
  enum PartitionScores : unsigned {
    NoTable = 0,
    Table = 1,
    FewCases = 1,
    SingleCase = 2
  };

  // Base case: There is only one way to partition Clusters[N-1].
  MinPartitions[N - 1] = 1;
  LastElement[N - 1] = N - 1;
  PartitionsScore[N - 1] = PartitionScores::SingleCase;

  // Note: loop indexes are signed to avoid underflow.
  for (int64_t i = N - 2; i >= 0; i--) {
    // Find optimal partitioning of Clusters[i..N-1].
    // Baseline: Put Clusters[i] into a partition on its own.
    MinPartitions[i] = MinPartitions[i + 1] + 1;
    LastElement[i] = i;
    PartitionsScore[i] = PartitionsScore[i + 1] + PartitionScores::SingleCase;

    // Search for a solution that results in fewer partitions.
    for (int64_t j = N - 1; j > i; j--) {
      // Try building a partition from Clusters[i..j].
      Range = getJumpTableRange(Clusters, i, j);
      NumCases = getJumpTableNumCases(TotalCases, i, j);
      assert(NumCases < UINT64_MAX / 100);
      assert(Range >= NumCases);

      if (TLI->isSuitableForJumpTable(SI, NumCases, Range, PSI, BFI)) {
        unsigned NumPartitions = 1 + (j == N - 1 ? 0 : MinPartitions[j + 1]);
        unsigned Score = j == N - 1 ? 0 : PartitionsScore[j + 1];
        int64_t NumEntries = j - i + 1;

        if (NumEntries == 1)
          Score += PartitionScores::SingleCase;
        else if (NumEntries <= SmallNumberOfEntries)
          Score += PartitionScores::FewCases;
        else if (NumEntries >= MinJumpTableEntries)
          Score += PartitionScores::Table;

        // If this leads to fewer partitions, or to the same number of
        // partitions with better score, it is a better partitioning.
        if (NumPartitions < MinPartitions[i] ||
            (NumPartitions == MinPartitions[i] && Score > PartitionsScore[i])) {
          MinPartitions[i] = NumPartitions;
          LastElement[i] = j;
          PartitionsScore[i] = Score;
        }
      }
    }
  }

  // Iterate over the partitions, replacing some with jump tables in-place.
  unsigned DstIndex = 0;
  for (unsigned First = 0, Last; First < N; First = Last + 1) {
    Last = LastElement[First];
    assert(Last >= First);
    assert(DstIndex <= First);
    unsigned NumClusters = Last - First + 1;

    CaseCluster JTCluster;
    if (NumClusters >= MinJumpTableEntries &&
        buildJumpTable(Clusters, First, Last, SI, DefaultMBB, JTCluster)) {
      Clusters[DstIndex++] = JTCluster;
    } else {
      for (unsigned I = First; I <= Last; ++I)
        std::memmove(&Clusters[DstIndex++], &Clusters[I], sizeof(Clusters[I]));
    }
  }
  Clusters.resize(DstIndex);
}

bool TargetLoweringBase::isSuitableForJumpTable(const SwitchInst *SI,
                                                uint64_t NumCases,
                                                uint64_t Range,
                                                ProfileSummaryInfo *PSI,
                                                BlockFrequencyInfo *BFI) const {
  // FIXME: This function check the maximum table size and density, but the
  // minimum size is not checked. It would be nice if the minimum size is
  // also combined within this function. Currently, the minimum size check is
  // performed in findJumpTable() in SelectionDAGBuiler and
  // getEstimatedNumberOfCaseClusters() in BasicTTIImpl.
  const bool OptForSize =
      SI->getParent()->getParent()->hasOptSize() ||
      llvm::shouldOptimizeForSize(SI->getParent(), PSI, BFI);
  const unsigned MinDensity = getMinimumJumpTableDensity(OptForSize);
  const unsigned MaxJumpTableSize = getMaximumJumpTableSize();

  // Check whether the number of cases is small enough and
  // the range is dense enough for a jump table.
  return (OptForSize || Range <= MaxJumpTableSize) &&
         (NumCases * 100 >= Range * MinDensity);
}

But yeah, as you suggested, even though the first one is 'dense' from LLVM's perspective, it may still be too large for br_table to be profitable. I think we can limit the maximum size of the jump table using the -max-jump-table-size option. If we do this, the first function also uses br_if:

~/llvm-git/install.debug/bin/llc test.ll -max-jump-table-size=19

while this is lowered down to br_table:

~/llvm-git/install.debug/bin/llc test.ll -max-jump-table-size=20

The default value for -max-jump-table-size is UINT_MAX:

static cl::opt<unsigned> MaximumJumpTableSize
    ("max-jump-table-size", cl::init(UINT_MAX), cl::Hidden,
     cl::desc("Set maximum size of jump tables."));

I think we could consider setting the default value for this to something lower. This is currently not a virtual function, but I guess we can change that so that we can override it?

/// Return upper limit for number of entries in a jump table.
/// Zero if no limit.
unsigned getMaximumJumpTableSize() const;

By the way I am not sure about the number of cases for which br_table can be better than the alternative in V8. Does V8 lower br_tables to jump tables internally? If so, how many case values are desirable and how many are detrimental to the performance? @backes @thibaudmichaud

@sparker-arm
Contributor Author

This is currently not a virtual function, but I guess we can change that so that we can override it?

There is TargetLoweringBase::setMaximumJumpTableSize though, which is used in the AArch64 backend.

Now that you've pointed out the density logic, I've noticed there's also a jump-table-density option, so maybe that can help out as well.

Does V8 lower br_tables to jump tables internally?

AFAICT, for the optimizing compiler, it is target dependent, as high-level switch constructs are effectively passed straight to the backend, which is quite nice. But current codegen for aarch64 isn't great here, as we don't optimize out the 'empty' cases; this is something I'm still thinking about.

For Liftoff, the baseline compiler, it appears we generate a sequence of compare-and-branch :(

@aheejin
Member

aheejin commented Jul 18, 2023

Aha, there's also setMaximumJumpTableSize. Yeah, it would make sense to restrict it to a certain value once we have more info on V8. jump-table-density doesn't seem to have a corresponding set method (but I guess we can add one if necessary).

@aheejin
Member

aheejin commented Jul 20, 2023

By the way I am not sure about the number of cases for which br_table can be better than the alternative in V8. Does V8 lower br_tables to jump tables internally? If so, how many case values are desirable and how many are detrimental to the performance? @backes @thibaudmichaud

cc @gahaas @jakobkummerow

@jakobkummerow

Does V8 lower br_tables to jump tables internally?

As @sparker-arm already said: in Turbofan generally yes, if there are more than 4 cases and a few other heuristics are met. I don't know how these rules have been derived, nor how optimal they are.
In Liftoff, we emit a binary search cascade, with a comment "TODO: Generate a real branch table" that apparently so far wasn't urgent enough to spend the time on. Within certain limits, the code generated by Liftoff is not performance-critical: if a function is hot enough to matter, we'll optimize it anyway.

how many case values are desirable and how many are detrimental to the performance?

That certainly depends on the hardware (pipeline length, branch predictor strength, relative costs of computed and conditional branches, ...), and also on the use case (frequency distribution of cases).

As an example: in our Wasm instruction decoder, we have empirically determined that we get best results with a hybrid approach, roughly:

if (opcode == kMostCommonCase) {
  // handle that...
} else if (opcode == kSecondMostCommonCase) {
  // handle that...
} else {
  // jump table with 256 entries to handle everything else
}

but of course the fact that pulling out exactly two common cases works best is due to the typical distribution of Wasm instructions in Wasm functions, and doesn't necessarily carry over to any other use case.

@aheejin
Member

aheejin commented Jul 21, 2023

@jakobkummerow Thanks for the explanation!

@sparker-arm

AFAICT, for the optimizing compiler, it is target dependent, as high-level switch constructs are effectively passed straight to the backend, which is quite nice. But current codegen for aarch64 isn't great here, as we don't optimize out the 'empty' cases; this is something I'm still thinking about.

Can you elaborate on what "we don't optimize out the 'empty' cases" means, and what kind of option or optimization you would like to have? I'm not sure if limiting the max jump table size for wasm wholesale is a good solution because for other architectures it might help performance even though code size is bigger. It looks like you can control this behavior by passing -mllvm -max-jump-table-size=N to clang though. Would that work for you?

@sparker-arm
Contributor Author

I'm not sure if limiting the max jump table size for wasm wholesale is a good solution because for other architectures it might help performance even though code size is bigger.

I honestly haven't looked into the performance aspect yet. From a V8 perspective, x64 codegen certainly looks better, albeit about twice as large as I would have expected (aarch64 is 3x).

Can you elaborate on what "we don't optimize out the 'empty' cases" means, and what kind of option or optimization you would like to have?

In the given example, from LLVM IR with three cases we can generate a br_table with 20(?) branch targets. I refer to most of these targets as 'empty' because they're really just branches to the default target. Currently, V8 will create a branch target block for all those cases. I think we should be able to clean up the CFG later with V8's jump threading pass, but there are currently complications with how tables are lowered for aarch64 (not an LLVM problem!).

It looks like you can control this behavior by passing -mllvm -max-jump-table-size=N to clang though. Would that work for you?

Thanks, but I'm not really looking to play with developer options; I just want to determine whether the behaviour I've encountered is expected. If so, then we probably need to look into br_table lowering in V8.

@aheejin
Member

aheejin commented Jul 24, 2023

Can you elaborate on what "we don't optimize out the 'empty' cases" means, and what kind of option or optimization you would like to have?

In the given example, from LLVM IR with three cases we can generate a br_table with 20(?) branch targets. I refer to most of these targets as 'empty' because they're really just branches to the default target. Currently, V8 will create a branch target block for all those cases. I think we should be able to clean up the CFG later with V8's jump threading pass, but there are currently complications with how tables are lowered for aarch64 (not an LLVM problem!).

So TurboFan's jump table lowering doesn't work in AArch64?

It looks like you can control this behavior by passing -mllvm -max-jump-table-size=N to clang though. Would that work for you?

Thanks, but I'm not really looking to play with developer options; I just want to determine whether the behaviour I've encountered is expected. If so, then we probably need to look into br_table lowering in V8.

I see. So yeah, the current behavior is at least not a bug; it looks like LLVM thinks that lowering to a br_table (and thus to a jump table) is generally profitable whenever it is feasible. Wasm mostly follows the common settings, except for b9f282d. Note that that change is related to the number of switch cases and not necessarily the jump table size, which is the issue here.

I ran the Emscripten core benchmarks in non-LTO and LTO modes, and while I haven't compared individual programs' sizes, the aggregate total size of all programs is very similar for these three options (the default optimization flag here is -O3, but some benchmarks have their own flags):

  • Minimum jump table entries == 2 (current): 9459097
  • Minimum jump table entries == 4 (default, if we haven't overridden it): 9461314
  • Minimum jump table entries == 2 && maximum jump table size == 15: 9459785 (Here 15 is chosen arbitrarily to prevent your example from being lowered to br_if)

The difference between them is less than 0.02%.

So yeah, I'm not sure if I should set an arbitrary threshold by default at this point, given that it doesn't seem to affect code size much in practice and I don't fully understand its performance implications.

@sparker-arm
Contributor Author

The difference between them is less than <0.02%.

Fair enough, thanks for getting some numbers.

So TurboFan's jump table lowering doesn't work in AArch64?

It works, but codegen looks sub-optimal. I think we should be able to do better for all targets with 'sparse' tables.

Thanks for your help.
