Benchmarking of Slacked Gas Metering performance impact #11677
Here are some benchmark numbers. The times in the first three benchmarks are in microseconds, and for the last benchmark they're in seconds.

- call_empty_function (this measures instantiation overhead when no memory is touched)
- dirty_1mb_of_memory (this mostly measures instantiation overhead when a bunch of memory is dirtied)
- burn_cpu_cycles (this does a lot of work summing numbers in a loop, essentially burning a lot of CPU inside of WASM)
- block_production (a more real-world benchmark which produces a block with 5925 transactions)
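For reference, here's a minimal sketch of what a burn_cpu_cycles-style guest could look like (a hypothetical shape, not the actual benchmark code):

```rust
// Hypothetical guest function: a tight summing loop that stays inside wasm,
// so the measurement is dominated by pure compute rather than host calls.
#[no_mangle]
pub extern "C" fn burn_cpu_cycles(iterations: u64) -> u64 {
    let mut acc: u64 = 0;
    for i in 0..iterations {
        // black_box stops the optimizer from collapsing the loop into a
        // closed-form sum, so the CPU actually does the work.
        acc = core::hint::black_box(acc.wrapping_add(i));
    }
    acc
}
```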
This was run with a 1ms interval to trigger the fuel check signal handler. With the same setup I'll also check the vanilla fuel metering. @pepyakin How short do you think we'll want to go with the fuel check interval? (Assuming we're going to use this in production.)
Some more numbers. Everything is in microseconds, and for all of the benchmarks the 1-thread variant was run. Exactly the same base commit was used. (Results table: time/iter [us], deviation (+/-) [us].)
Unless I did something wrong (and I checked the results manually, so I don't think so?) it seems like the vanilla fuel metering is roughly just as efficient as this async one. (Maybe a little bit less so, but there's a lot of variation in the results, so it's kind of a wash.) So either we don't need it, or the benchmarks have a gap where they don't stress a scenario where the async metering would be a win. I'll try again with more of an end-to-end test running a full node to compare the two.
Those are fascinating findings. Unless my benchmarks were broken, they showed vanilla had a considerable overhead. The lowest difference was in keccak and the highest was in regex and wasmi. The question now is how that applies to the workflows we are interested in (and even, what those workflows are). The situation with wasmi is concerning, since I anticipate that people will want to implement STFs that use VMs. So I think we should test that. Would you be able to hack up a test case with wasmi? I also hear that seal/ink is going to use wasmi compiled to wasm for contracts initially, so maybe @athei can help with that.

Then, as I mentioned in the OP, we should evaluate the numbers on our users' workflows. That is, try to gather the numbers for the workloads that the parachain teams think are representative. Moonbeam also uses EVM, which is a VM, so it would be doubly interesting.

Risking stating the obvious, but if vanilla works as well as the async version, vanilla wins hands down. Vanilla is already implemented, it is easier to test and reason about, and it does not require additional low-level machinery. For our use case, it also does not require any additional security assumptions.
Okay, indeed, the difference can be bigger depending on the benchmark. I ported the regex_redux benchmark into our test runtime (it uses wasmi from within the runtime to interpret a WASM kernel which uses the regex crate) and here are the results:
So the overhead is very highly dependent on the exact workload we're running, although the difference (while it is there and is significant) is not as bad as I'd expected. The run with vanilla fuel checks takes ~125% of the baseline time, and the runs with the async fuel checks take ~110%. An interesting note here: I had to significantly increase the fuel amount to get this benchmark to run. The heaviest benchmark I ran before (the block production one) required me to set the fuel to 100 million (and this is not exact, since I was just adding zeros at the end until it worked), while this new benchmark required it to be set to 10000 million, so two orders of magnitude more. I'll maybe try to jig up a benchmark with Moonbeam and see what the difference is there. I've found our smart-bench repo, so maybe that can be easily repurposed to test this.
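For context, the runtime-side shape of that benchmark is roughly this (a sketch against a recent wasmi; exact signatures vary between wasmi versions, and `wasm_kernel` and `run` are placeholder names):

```rust
use wasmi::{Engine, Linker, Module, Store};

// `wasm_kernel` stands for the compiled-to-wasm kernel that itself links the
// `regex` crate; the runtime interprets it with wasmi.
fn run_regex_redux(wasm_kernel: &[u8]) -> anyhow::Result<()> {
    let engine = Engine::default();
    let module = Module::new(&engine, wasm_kernel)?;
    let mut store = Store::new(&engine, ());
    let linker: Linker<()> = Linker::new(&engine);
    // Instantiate the kernel and call its (hypothetical) entry point.
    let instance = linker.instantiate(&mut store, &module)?.start(&mut store)?;
    let run = instance.get_typed_func::<(), ()>(&store, "run")?;
    run.call(&mut store, ())?;
    Ok(())
}
```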
Those numbers are significantly different from what I saw locally, and I am curious to know why we see this difference. Could you update your branch if you haven't, so that I can reproduce? Those are fascinating results. It looks like vanilla can maybe be used already. Later on, we might optimize it with the slacked metering if it's needed and worth it. For that, we can refocus on implementing the schedule: I assume a cost of 1 for each instruction is not good. Then, regarding the transfers, I assume the results we were seeing are because most of the execution happens in the node/IO?
Yeah, it'd be a good idea if another set of eyes could double-check those results. I've pushed all of the changes to my branch; go into the …
I could check if necessary, but it's certainly doing less work in the WASM. It's still not an insignificant amount of work, though, since I do have to set the fuel fairly high for it to work.
Below are the results of running …. The numbers were in ns, so I converted them into us. One weird thing: I had to increase the amount of fuel for the M1 Max by adding 3 zeroes. I also took the liberty of reformatting the table; I find it way easier to read this way, hope it's not troubling. The percentages are normalized to the no_fuel case on the same machine.
The numbers overall are in line with what I observed before. I guess the differences we see can be explained by the different microarchitectures. This is a bit annoying, since we now need to pick the golden machine.
Those are very interesting results; I wouldn't have expected that a newer microarchitecture would make the fuel checks essentially zero cost. This is a stupid question, but you did enable the fuel checks for the async fuel benchmarks, and if you lower the amount of fuel it interrupts itself properly, right? (: Anyway, I just had an idea and ran a quick experiment: I took your branch, applied this diff:

```diff
diff --git a/crates/cranelift/src/func_environ.rs b/crates/cranelift/src/func_environ.rs
index 5acb66ec7..76f22d38e 100644
--- a/crates/cranelift/src/func_environ.rs
+++ b/crates/cranelift/src/func_environ.rs
@@ -355,7 +355,7 @@ impl<'module_environment> FuncEnvironment<'module_environment> {
fn fuel_function_entry(&mut self, builder: &mut FunctionBuilder<'_>) {
// self.fuel_load_into_var(builder);
- // self.fuel_check(builder);
+ self.fuel_check(builder);
}
fn fuel_function_exit(&mut self, builder: &mut FunctionBuilder<'_>) {
@@ -526,7 +526,7 @@ impl<'module_environment> FuncEnvironment<'module_environment> {
/// Checks the amount of remaining, and if we've run out of fuel we call
/// the out-of-fuel function.
fn fuel_check(&mut self, builder: &mut FunctionBuilder) {
- self.fuel_increment_var(builder);
+ // self.fuel_increment_var(builder);
let out_of_gas_block = builder.create_block();
builder.set_cold_block(out_of_gas_block);
let continuation_block = builder.create_block();
@@ -554,7 +554,7 @@ impl<'module_environment> FuncEnvironment<'module_environment> {
// Note that we save/reload fuel around this since the out-of-gas
// intrinsic may alter how much fuel is in the system.
builder.switch_to_block(out_of_gas_block);
- self.fuel_save_from_var(builder);
+ // self.fuel_save_from_var(builder);
let out_of_gas_sig = self.builtin_function_signatures.out_of_gas(builder.func);
let (vmctx, out_of_gas) = self.translate_load_builtin_function_address(
&mut builder.cursor(),
@@ -563,7 +563,7 @@ impl<'module_environment> FuncEnvironment<'module_environment> {
builder
.ins()
.call_indirect(out_of_gas_sig, out_of_gas, &[vmctx]);
- self.fuel_load_into_var(builder);
+ // self.fuel_load_into_var(builder);
builder.ins().jump(continuation_block, &[]);
builder.seal_block(continuation_block);
```

and ran the benchmark with it; here are the results:
So basically this is a hybrid of the vanilla fuel checks: it checks for fuel synchronously, but it uses your pinned register instead of storing the counter in a variable, and the performance is still quite competitive. And I can confirm that the fuel checks work here: if I lower the fuel, it gets interrupted properly. So maybe this could be a still-fast-but-simpler alternative to going full async? Could you also check how this performs on your machines? Command line invocation to run this after applying the patch and switching to that branch:
I just re-ran all the benchmarks and can confirm that they result in very similar numbers. Lowering the fuel available results in a proper interrupt, which indicates it is actually doing metering. Regarding the vanilla fuel + pinned reg: I think your snippet lacks the fuel check in the loop headers. I uncommented it there and ran the benchmarks. I actually tried this before (I think revisions before b36a6bae3ba945cb83f758097cbd382a51e24746 correspond to that approach) and was not satisfied with the results. In my local benchmarks (based on wasmi keccak, regex_redux and rev_comp), it shows pretty nasty regressions: around -10% on keccak and rev_comp and -21% on regex_redux (FWIW, they do not include a separate interrupter thread). However, for your run2.sh benchmark, it performs way better on the x5950: 367594 us (a 1.6% increase over no fuel).
M1 results
I cross-checked the result several times, and yes, it seems that the vanilla metering that leverages the pinned register is slower than flushing/reloading the counter and allowing it to spill.
And to top it off, run1.sh on x5950 shows similar performance on the block production benchmark.
From what I understand from what @pepyakin told me in DM, this fuel metering is baked into the Wasm runtime and exists whether or not we use it. If so, can we generate a Wasm runtime with the fuel metering and simply run all the runtime benchmarks to see the impact on each extrinsic?
To give the context: when you initialize wasmtime, you need to create an engine. You can configure it to embed the fuel metering. If you do so, then all the modules you create with that engine will consume gas. When you create a module from wasm bytecode, wasmtime translates it to native code. The discussion was about introducing, perhaps, a special host function that would execute the given function from the module and do so with metering enabled. For this to work, there are two options:
So with that clarification out of the way,
Yes, I think this is possible and easy to achieve. We just need to:
Those two steps should be enough to run the benchmarks.
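To make the engine side concrete, here's a minimal sketch of enabling wasmtime's built-in fuel metering (assuming a recent wasmtime; older versions spell `set_fuel` as `add_fuel`, and the exact error types differ):

```rust
use wasmtime::{Config, Engine, Instance, Module, Store};

fn main() -> anyhow::Result<()> {
    // Enabling fuel on the engine bakes fuel checks into all native code
    // compiled by this engine.
    let mut config = Config::new();
    config.consume_fuel(true);
    let engine = Engine::new(&config)?;

    // An infinitely looping guest; with metering enabled it must trap.
    let module = Module::new(&engine, r#"(module (func (export "spin") (loop (br 0))))"#)?;

    let mut store = Store::new(&engine, ());
    // Fuel budget for this store (`add_fuel` on older wasmtime versions).
    store.set_fuel(1_000_000)?;

    let instance = Instance::new(&mut store, &module, &[])?;
    let spin = instance.get_typed_func::<(), ()>(&mut store, "spin")?;
    // The loop burns through the budget and traps with out-of-fuel.
    assert!(spin.call(&mut store, ()).is_err());
    Ok(())
}
```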
If my math is right, the pinned reg percentage should be 0.2% here, no? (: Which is faster than even the async fuel run, but from what I've seen in my runs, I'm assuming this is within the margin of error and they're essentially comparable here.
I guess this is a side effect of the M1's huge internal register file; since that memory location is accessed so often, it's kept in a register internally anyway, but by letting the hardware manage it we're not hogging an ISA register. Anyway, I think we now have conclusive evidence that:
Wouldn't it be enough to just set the fuel to …?
Oh god, I am not sure what happened there. Decided not to risk it and re-run:
When I said "replenish as needed" I did not mean between calls, but within a call. If you run the code without metering you should never observe OOG, and that's why I felt the need to pedantically mention that. But you are right, …
Yes, agreed. From here, we should look at what the impact on extrinsics is. Maybe we should run the same tests on the reference hardware?
Yep. I'm currently doing that; I got a moonbeam node running, and got the EVM benches from smart-bench running. It's a little awkward and slow to run, but it works. The benches don't provide the actual real execution time, but I hacked that in, and it seems to work and is reliable enough from run to run: the execution time increases the more extrinsics I queue, and the times are mostly consistent between independent runs. So I'll just port over all of the necessary substrate changes and get some numbers. I can also run these on the reference hardware, since I got access to it a few months back.
Yeah, that's correct.
Here are the numbers for block execution on Moonbeam running 256 extrinsics of the erc20 contract benchmark in a single block on my Threadripper 3070x:
This looks really, really good. At least for this benchmark there's essentially no impact on performance with the async fuel checks. (I made sure to check that lowering the fuel breaks the execution, and it does, so metering is working.) I'll grab numbers on our reference hardware next week.
Here are the numbers from our reference hardware (i7-7700K running on bare metal). (Results table: time/iter [us], deviation.)
And here are numbers for block execution on Moonbeam (exactly the same benchmark as the last time):
Looks pretty good to me. So (assuming we want to use this) I guess the next step would be to get the implementation to be less hacky and get it upstreamed?
I think we would need to perform more tests. Specifically:
I think it's worth starting to iterate on a better prototype. Right now, there is no flushing and checking of the fuel counter before the libcalls. I don't think this will affect the performance, but it's better to be safe than sorry. Besides that, there is some non-benchmark-related work. Namely, we should change wasmtime in a way that allows customizing the costs of instructions. Then, we should look into the security-related implications. These things come to mind:
After doing all that, we may consider switching to fully metered execution.
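On the cost-customization point, a hypothetical schedule could be as simple as a function over wasmparser's `Operator` that the translation consults per instruction (the names and cost values here are made up for illustration):

```rust
use wasmparser::Operator;

// Hypothetical cost schedule. Setting the cheapest instruction well above 1
// leaves headroom to differentiate costs later without rescaling everything.
fn instruction_cost(op: &Operator) -> u64 {
    match op {
        Operator::I32Const { .. } | Operator::I64Const { .. } => 10,
        Operator::Call { .. } | Operator::CallIndirect { .. } => 100,
        Operator::MemoryGrow { .. } => 10_000,
        _ => 20,
    }
}
```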
So just rerun all of our weight generation machinery with this enabled? Yeah, I think I can do that.
What would that entail exactly? I can prepare a …
With the current setup that I have, I can now easily run any of the benches from smart-bench (and AFAIK it can also run wasmi-based benchmarks), so if we could get some good representative benchmarks that are more gas-heavy in there, I could run them. For EVM the only benches in there are …. I could maybe jig some new benchmark contracts up, but it isn't exactly my area of expertise, so it'd be nice if someone could lend us a hand here.
Agreed.
Good question. The counter is 64-bit, right? In which case I don't think it should be possible to directly overflow it, assuming each instruction consumes ~1 fuel. Consider the following program:

```rust
#![feature(bench_black_box)]
fn main() {
for nth in 0..u64::MAX {
std::hint::black_box(nth);
}
}
```

This compiles down to the following assembly:
On my machine in roughly ~25 seconds this has counted up to 105100371379, so assuming the full range of a u64 it'd roughly take 139 years for this to overflow, and this is pretty much a best case scenario. So unless the costs of instructions will be orders of magnitude higher or there's an indirect way to influence/overwrite the counter I don't think an overflow is realistic?
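For the record, here's the back-of-the-envelope arithmetic behind that estimate (using the measured rate as the assumption):

```rust
fn main() {
    let counted: f64 = 105_100_371_379.0; // iterations observed in ~25 s
    let rate = counted / 25.0;            // ~4.2e9 fuel units per second
    let seconds = u64::MAX as f64 / rate; // time to walk the full u64 range
    let years = seconds / (365.25 * 24.0 * 3600.0);
    println!("~{years:.0} years to overflow"); // ~139
}
```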
Yeah, this is going to be really tough, because the actual execution time will vary widely depending on which instructions cranelift emits, on which exact microarchitecture it runs, and also on the neighboring code (cache pressure, register pressure, etc.). I guess we could take a page from what we're doing right now with the weights and rig up some benchmarks which would measure this on reference hardware; that would simplify it a little bit. It's definitely not going to be perfect, though. I don't have a good answer to this. I think we'd need to experiment here.
Each instruction takes 1 fuel currently, but as I mentioned in the last message, we should override that behavior in wasmtime with a proper schedule. We would probably want to assign a value higher than 1 to the cheapest instruction; we may want to accommodate quicker hardware and/or compilers in the future. Maybe setting the cheapest instruction to 10-100 would make sense. Then, the counter is 64-bit. However, it is set up in such a way that the amount of fuel is represented as a negative value. For each basic block, we increment the counter by the corresponding cost of that basic block. Should the counter become positive, that means the fuel ran out. That essentially makes the counter 32-bit.
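If I read that right, the scheme is something like this (a sketch of my understanding, not wasmtime's actual code):

```rust
// Remaining fuel is stored negated: consuming fuel is a single add, and
// "out of fuel" is the counter crossing into the non-negative range.
struct FuelCounter(i64);

impl FuelCounter {
    fn with_fuel(fuel: u64) -> Self {
        // Assumes `fuel` fits in i64; a real implementation would saturate.
        FuelCounter(-(fuel as i64))
    }

    /// Charge the cost of one basic block; returns true on out-of-gas.
    fn consume(&mut self, block_cost: u64) -> bool {
        self.0 = self.0.wrapping_add(block_cost as i64);
        self.0 >= 0
    }
}
```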
That basically means the counter has |fuel amount| + 2^32 until the overflow. The interrupt has 2^32 units to catch the overflow: from 0 to 0x7FFFFFFF_FFFFFFFF. In case that does not happen, there will be another chance to do it again when the counter gets into the positive area. If we do the synchronous check before calling into the host functions and libcalls, then the attacker will only be left with wasm code. Even things like …. We can also do a couple of things:
Those are thoughts off the top of my head. We need to dig deeper into this and make sure it is sound.
Sure. We don't need to be absolutely precise though. We just need to make sure either:
In either case, we have a big margin for error, it seems. If the variability is low and the worst case is not far from the best case, there will be more computation available for the PVF authors. I think this may look something like what @athei did for measuring contracts.
Hopefully it's not too naive to assume that we can politely ask the parachain teams to run the workflows they care about on an experimental version. This would give us additional information on whether we are regressing performance too much. I am not sure what that would look like, but an experimental branch from a release should work.
I think for a start you could run the …
What we need to do is benchmark every instruction and every host function. This is exactly what the …
Here are some numbers from our weight benchmarks. They were run locally on my 3070x. I'll post the results from the reference hardware later (they're still running, since I queued up more pallets there). The raw weights were all divided by 1000 to make the table easier to read.

Benchmark results
Notes

Some of the benchmarks seem... broken? A few of the benchmarks sometimes randomly didn't emit results at all, or randomly just returned a weight of 0. I'm not sure if I was doing something wrong, maybe? Here are the issues that my post-processing script filtered out (the "missing weights" entries are for those which sometimes don't get emitted at all, and the "zero weights" entries are for those which sometimes just return a 0):
The benchmarks were run with the following command:
Okay, I got the numbers from the reference hardware. However, after analyzing how these are calculated I don't think these numbers are appropriate for comparison purposes. The reason for that is twofold:
Fortunately, the benchmarking command has a command line argument to get it to output the raw measurements it made as a …. So, here are the numbers anyway, since I've already gathered the data, but please take them with a huge grain of salt. I'll post new numbers next week, generated from the raw measurements.

Benchmark results
FWIW: All the contract benchmarks which have …
Is there an issue for this bug?
Okay, I've made an issue describing this.
Newest results from the reference hardware, based on raw measurements. All numbers are in % relative to the baseline case when running with no gas metering. For extrinsics with components, the values you see here are averages over percentage scores for each unique set of component values. (I could generate a table with those not averaged, but it'd be insanely long.)

Benchmark results
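For clarity, the averaging works roughly like this (a sketch of my post-processing, with hypothetical names):

```rust
/// Average the per-component-set ratios of a variant against the
/// no-metering baseline, expressed as a percentage.
fn relative_percent(variant_ns: &[f64], baseline_ns: &[f64]) -> f64 {
    assert_eq!(variant_ns.len(), baseline_ns.len());
    let sum: f64 = variant_ns
        .iter()
        .zip(baseline_ns)
        .map(|(v, b)| v / b * 100.0) // percentage score per component set
        .sum();
    sum / variant_ns.len() as f64 // average over all component sets
}
```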
And here's a table with the top-50 slowest ….

Benchmark results
The …

Benchmark results
Maybe by accident this generates worse assembly under the …. The case of …:

Benchmark results
I can't explain why …:

```rust
fn to_eth_address(&self) -> Result<[u8; 20], ()> {
use k256::{elliptic_curve::sec1::ToEncodedPoint, PublicKey};
PublicKey::from_sec1_bytes(self.as_slice()).map_err(drop).and_then(|pub_key| {
// uncompress the key
let uncompressed = pub_key.to_encoded_point(false);
// convert to ETH address
<[u8; 20]>::try_from(
sp_io::hashing::keccak_256(&uncompressed.as_bytes()[1..])[12..].as_ref(),
)
.map_err(drop)
})
}
```

The hashing is delegated to the client, but …
Maybe not even by accident, but one less register available shows?
This task aims to:
In that fork, I abused the `consume_fuel` feature, making the out-of-gas checks async. The usage can be seen in the example:
My changes are incomplete, and some parts of wasmtime do not work in my branch. Specifically, it lacks typed func support. Only Linux x86_64 is supported. The API is subject to change.
My proposal to approach this is:
Since the host functions are now untyped, this affects the performance. Therefore, the baseline (for the no-fuel metering scenario) should also be collected with untyped host functions.
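To illustrate why that matters, here's the typed-versus-untyped distinction on the calling side in wasmtime (a sketch; the untyped `call` taking a results slice is the shape of recent wasmtime versions):

```rust
use wasmtime::{Engine, Instance, Module, Store, Val};

fn main() -> anyhow::Result<()> {
    let engine = Engine::default();
    let module = Module::new(
        &engine,
        r#"(module (func (export "add") (param i32 i32) (result i32)
               local.get 0 local.get 1 i32.add))"#,
    )?;
    let mut store = Store::new(&engine, ());
    let instance = Instance::new(&mut store, &module, &[])?;

    // Typed path: the signature is checked once up front, calls are cheap.
    let add = instance.get_typed_func::<(i32, i32), i32>(&mut store, "add")?;
    assert_eq!(add.call(&mut store, (2, 3))?, 5);

    // Untyped path: arguments and results are boxed through `Val`, adding
    // per-call overhead -- hence collecting the baseline untyped as well.
    let add = instance.get_func(&mut store, "add").unwrap();
    let mut results = [Val::I32(0)];
    add.call(&mut store, &[Val::I32(2), Val::I32(3)], &mut results)?;
    assert_eq!(results[0].i32(), Some(5));
    Ok(())
}
```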
Bonus points: … the `consume_fuel` feature in the current upstream wasmtime version. There may be bugs in the implementation. Please share if any are discovered.