Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update to a more recent version of wasmi #18

Merged
merged 11 commits into from
Feb 13, 2023
Merged

Conversation

tomaka
Copy link
Contributor

@tomaka tomaka commented Feb 9, 2023

Work in progress. Blocked by wasmi-labs/wasmi#658 and wasmi-labs/wasmi#659.

cc #17

@tomaka
Copy link
Contributor Author

tomaka commented Feb 10, 2023

I have absolutely no idea why but I'm getting out of bound memory accesses errors coming from wasmi.

I have made sure that the behavior of all the memory-growth-related functions are identical to before this PR and also identical to wasmtime.

@tomaka
Copy link
Contributor Author

tomaka commented Feb 10, 2023

@Robbepop I hope you don't mind pinging you in case you have an idea, as I'm a bit clueless.

After this PR, wasmi emits a trap due to out of bound memory accesses (Trap(Trap { reason: InstructionTrap(MemoryOutOfBounds) })).

This didn't happen before this PR, and also doesn't happen with wasmtime.

The PR was relatively straight forward. The code in itself is also relatively straight forward.

There could be a bug in the upper layers, but then I don't see why the problem wouldn't happen with wasmtime.

It could maybe be caused because the ext_allocator_malloc_version_1 host function is called, and the return value of that host function isn't properly injected back into the Wasm VM? Which would then make the Wasm code use an invalid pointer.

@Robbepop
Copy link

Robbepop commented Feb 10, 2023

Hmmm for questions like these it would have been amazing to have better debugging support built-into wasmi. 🤔
I haven't yet noticed any problems with respect to out-of-bounds memory accesses from wasmi users such as the contracts-pallet.
What does the called host function ext_allocator_malloc_version_1 exactly do and in which contexts is it called?
What instructions are causing the out of bounds errors?

edit: Are you using reference-types or bulk-memory Wasm proposals? Even though all Wasm spec testsuite tests are passing they are kind of new and not as much tested as the MVP functionality.

@tomaka
Copy link
Contributor Author

tomaka commented Feb 10, 2023

What does the called host function ext_allocator_malloc_version_1 exactly do and in which contexts is it called?

The Substrate/Polkadot client has its own memory allocator. It's not the Wasm code that manages its memory. Instead, it calls a host function to allocate memory and another host function to free it.
So ext_allocator_malloc_version_1 picks a not-allocated-yet memory location and returns a pointer to it. It grows the memory if necessary.

While the behavior of this function is very complicated, my suspicion would be that the return value of that function isn't communicated properly to the Wasm code, maybe due to a bug in the new resumable functions feature. The Wasm code would then use an incorrect value as if it was the pointer.

Of course I'm basing that suspicion on the fact that the code in smoldot has always been relatively robust, and based on what has changed between before and after this PR.

edit: Are you using reference-types or bulk-memory Wasm proposals? Even though all Wasm spec testsuite tests are passing they are kind of new and not as much tested as the MVP functionality.

I don't disable them in the config 🤷 Maybe bulk-memory is used, I have no idea.

@Robbepop
Copy link

Robbepop commented Feb 10, 2023

Yeah it could be connected to a bug in the relatively new resumable function feature. I will probably need some more tests for it. Unfortunately there is no Wasm spec testsuite for resumable functions. :(
It would greatly help if you can create a minimized fail case for wasmi if it is caused by the resumable function feature.

@tomaka tomaka mentioned this pull request Feb 10, 2023
@github-actions
Copy link

github-actions bot commented Feb 10, 2023

twiggy diff report

Difference in .wasm size before and after this pull request.


 Delta Bytes │ Item
─────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────
      -45996 ┊ smoldot::json_rpc::methods::MethodCall::from_defs::hb2a47243e59e25dc
      +45996 ┊ smoldot::json_rpc::methods::MethodCall::from_defs::hf88d34ec847a230f
      -40639 ┊ smoldot::executor::host::ReadyToRun::run_once::h043d454b780bb679
      +39035 ┊ smoldot::executor::host::ReadyToRun::run_once::h984e6f27f9550b95
      +26916 ┊ data[0]
      -21534 ┊ wasmi::runner::Interpreter::do_run_function::hd3eb522c06ab1432
      +19070 ┊ <wasmparser_nostd::readers::core::operators::Operator as core::fmt::Debug>::fmt::h2fc7455473655f17
      -17430 ┊ wasmi::prepare::compile::Compiler::compile_instruction::h0e5077d39e980dd8
      +12035 ┊ <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll::h12137fe0900df598
      -12035 ┊ <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll::h39e0827c6c665991
      +11925 ┊ <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll::h10c5da09dc34d220
      -11925 ┊ <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll::he6eccb1360f3791f
      +11867 ┊ smoldot_light::Client<TPlat,TChain>::add_chain::h3ef438e8561ba726
      -11867 ┊ smoldot_light::Client<TPlat,TChain>::add_chain::h3f6d5b56f0d2a77d
      -11693 ┊ wasmi_validation::func::FunctionValidationContext::step::h1ff32704af6dabe9
      -11612 ┊ <parity_wasm::elements::ops::Instruction as parity_wasm::elements::Deserialize>::deserialize::he379c8d39818af57
       +9486 ┊ wasmi::module::parser::ModuleParser::process_payload::h3a59cdd8698dc800
       -9162 ┊ smoldot::json_rpc::methods::Response::to_json_response::h4e3669cccc1be00a
       +9162 ┊ smoldot::json_rpc::methods::Response::to_json_response::h9f525b7dcf5efa65
       -8908 ┊ <&mut serde_json::de::Deserializer<R> as serde::de::Deserializer>::deserialize_struct::h15b6bd6554906e90
     +337986 ┊ ... and 26628 more.
     +688687 ┊ Σ [26648 Total Rows]

@tomaka
Copy link
Contributor Author

tomaka commented Feb 10, 2023

Note for myself:

The easiest way to reproduce would be to call Metadata_metadata. Maybe Core_version as well, but Metadata_metadata surely triggers the issue.

The next step to debug would be to log the list of all functions being called, alongside with their parameters and return value, before and after this PR, and compare. If that doesn't give any hint, also log the calls to memory.grow.

@tomaka
Copy link
Contributor Author

tomaka commented Feb 12, 2023

My hypothesis was wrong, as the error happens after a single host function call to ext_logging_max_level_version_1. No call to ext_allocator_malloc_version_1 happens before the error.

However, one interesting thing is that the error doesn't happen with Polkadot runtime v0.9.16. It happens with Westend v0.9.30.

@tomaka
Copy link
Contributor Author

tomaka commented Feb 12, 2023

Disabling all Wasm features (such as bulk-memory) also doesn't change anything, which makes me think that the runtime doesn't any these features.

@tomaka
Copy link
Contributor Author

tomaka commented Feb 12, 2023

@Robbepop

I think I've managed to reproduce the problem.

Here's a "minimal" failing test:

#[test]
fn repro() {
    let engine = wasmi::Engine::default();
    let module = wasmi::Module::new(
        &engine,
        &include_bytes!("wasm_data.wasm")[..]
    )
    .unwrap();
    let mut store = wasmi::Store::new(&engine, ());

    #[derive(Debug, Clone)]
    struct InterruptedTrap {
        name: String,
    }
    impl core::fmt::Display for InterruptedTrap {
        fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {
            write!(f, "Interrupted")
        }
    }
    impl wasmi::core::HostError for InterruptedTrap {}

    let mut linker = wasmi::Linker::<()>::new();
    for import in module.imports() {
        match import.ty() {
            wasmi::ExternType::Func(func_type) => {
                let name = import.name().to_owned();
                let func = wasmi::Func::new(
                    &mut store,
                    func_type.clone(),
                    move |_caller, _params, _ret| {
                        Err(wasmi::core::Trap::from(InterruptedTrap {
                            name: name.clone(),
                        }))
                    },
                );
                linker.define(import.module(), import.name(), func).unwrap();
            }
            wasmi::ExternType::Memory(memory_type) => {
                let memory = wasmi::Memory::new(&mut store, *memory_type).unwrap();
                linker
                    .define(import.module(), import.name(), memory)
                    .unwrap();
            }
            _ => unreachable!(),
        }
    }

    let instance = linker.instantiate(&mut store, &module).unwrap();
    let instance = instance.ensure_no_start(&mut store).unwrap();

    let mut out = [wasmi::Value::I64(0)];

    let mut call = instance
        .get_func(&store, "Core_version")
        .unwrap()
        .call_resumable(
            &mut store,
            &[wasmi::Value::I32(0), wasmi::Value::I32(0)],
            &mut out,
        )
        .unwrap();

    loop {
        match call {
            wasmi::ResumableCall::Resumable(r) => {
                let err = r.host_error().downcast_ref::<InterruptedTrap>().unwrap();
                if err.name == "ext_logging_max_level_version_1" {
                    call = r
                        .resume(&mut store, &[wasmi::Value::I32(0)], &mut out)
                        .unwrap()
                } else {
                    println!("{:?}", err.name);
                    println!("success!");
                    break;
                }
            }
            wasmi::ResumableCall::Finished => unreachable!(),
        }
    }
}

I have attached the wasm data here: wasm_data.zip (zipped in order to satisfy GitHub)

The way the Wasm code behaves is that it calls ext_logging_max_level_version_1, then the test panics when calling resume because of InstructionTrap(MemoryOutOfBounds), which isn't supposed to happen. Normally, the Wasm code should then call ext_allocator_malloc_version_1, then other functions.

If I replace the body of the closure with this, then the test succeeds (the test succeeds when ext_allocator_malloc_version_1 is called):

move |_caller, _params, _ret| {
    if name == "ext_logging_max_level_version_1" {
        _ret[0] = wasmi::Value::I32(0);
        Ok(())
    } else {
        Err(wasmi::core::Trap::from(InterruptedTrap {
            name: name.clone(),
        }))
    }
}

In other words, if the function directly returns when it is called, then the test works. If, however, the function emits a trap, then is resumed, then the test fails.
This makes me think that there is a bug in the resumable functions feature.

Please note, as I've mentioned in a comment above, that this happens only with a specific Wasm module (the Westend v0.9.30 runtime). A different but very similar Wasm module (the Polkadot v0.9.16 runtime) works just fine.

@Robbepop
Copy link

Robbepop commented Feb 12, 2023

@tomaka Thanks a lot for digging into the problem and producing a "minimal" test case!

I am going to try to further minimize it for wasmi (if possible) and try to find and fix the issue. :)
Generally, after having implemented so many new features in wasmi over the course of the last 2 months the focus for the next weeks is to stabilize, fix bugs, create more tests, improve CI and fuzzing etc.

Please note, as I've mentioned in a comment above, that this happens only with a specific Wasm module (the Westend v0.9.30 runtime). A different but very similar Wasm module (the Polkadot v0.9.16 runtime) works just fine.

Also thanks for letting me know about this. Feels good to know that at least sometimes it is actually working. 😅

Robbepop added a commit to wasmi-labs/wasmi that referenced this pull request Feb 12, 2023
Fixes bug encountered in this GitHub PR: smol-dot/smoldot#18
@Robbepop
Copy link

Robbepop commented Feb 12, 2023

@tomaka I fixed the bug in this PR: wasmi-labs/wasmi#671
On my machine your provided testcase runs fine with the fix. I added some smaller testcases after I found out about the issue.
Please test yourself that this bug is fixed and please check if now smoldot works with the fixed wasmi version.
The fix was very simple fortunately.

Robbepop added a commit to wasmi-labs/wasmi that referenced this pull request Feb 12, 2023
* fix bug in resumable calls

Fixes bug encountered in this GitHub PR: smol-dot/smoldot#18

* add tests for the resumable call bug fix
@tomaka
Copy link
Contributor Author

tomaka commented Feb 13, 2023

Seems to work, thank you!

Cargo.toml Outdated
@@ -82,7 +82,7 @@ smallvec = "1.10.0"
snow = { version = "0.9.1", default-features = false, features = ["default-resolver"] }
tiny-keccak = { version = "2.0", features = ["keccak"] }
twox-hash = { version = "1.6.3", default-features = false }
wasmi = { version = "0.9.1", default-features = false, features = ["core"] } # TODO: having to add `core` is sketchy; maybe report this
wasmi = { git = "https://github.com/paritytech/wasmi", default-features = false } # TODO: { version = "0.25.0", default-features = false }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR is ready to go after this is switched back to a published version

@Robbepop
Copy link

Robbepop commented Feb 13, 2023

@tomaka I just release wasmi v0.26.1 that includes the bug fix to unblock you.

@Robbepop
Copy link

Robbepop commented Feb 13, 2023

@tomaka btw. if you want to run wasmi on steroids it is important that you compile it with:

[profile.release]
lto = "fat"
codegen-units = 1

makes a huge difference for wasmi, especially the newer versions of it.

@tomaka
Copy link
Contributor Author

tomaka commented Feb 13, 2023

@tomaka btw. if you want to run wasmi on steroids it is important that you compile it with:

[profile.release]
lto = "fat"
codegen-units = 1

makes a huge difference for wasmi, especially the newer versions of it.

It's already the case:

smoldot/Cargo.toml

Lines 128 to 136 in 3b8f08b

[profile.release]
panic = "abort"
lto = true
# codegen-units set to 1 to avoid performance regressions when combined with LTO
# See https://github.com/rust-lang/rust/issues/47745
# https://doc.rust-lang.org/rustc/codegen-options/index.html#codegen-units
codegen-units = 1
incremental = false
#strip = "symbols" # TODO: uncomment once stable

@tomaka tomaka marked this pull request as ready for review February 13, 2023 11:10
@tomaka tomaka added this pull request to the merge queue Feb 13, 2023
Merged via the queue into smol-dot:main with commit d61f5cf Feb 13, 2023
@tomaka tomaka deleted the update-wasmi branch February 13, 2023 11:27
@Robbepop
Copy link

Robbepop commented Feb 15, 2023

@tomaka Do you have any benchmark comparison between old wasmi and new wasmi for smoldot? I'd be very interested in.

Also another question that was brought up by @athei was if it was tested and verified that smoldot is can still properly sync relay chain from genesis with the new wasmi.

cc @athei

@athei
Copy link

athei commented Feb 15, 2023

Disabling all Wasm features (such as bulk-memory) also doesn't change anything, which makes me think that the runtime doesn't any these features.

You are correct. The runtime (and PvF) are restricted to the wasm MVP for now. AFAIK the only exception is mutable globals. But don't nail me down on this. But most def nothing as big as bulk memory.

Also another question that was brought up by @athei was if it was tested and verified that smoldot is can still properly sync relay chain from genesis with the new wasmi.

I brought this up because substrate itself will most likely never upgrade to a newer wasmi. We doubled down on wasmtime and will probably simplify the code soon my removing swappable executor support. So smoldot right now would be the only piece of software which we can use to do a burn-in using wasmi (sync a big chain from genesis). Maybe having a manually triggerable CI job in this repo would be a nice addition. However, it is probably not possible to run something for days using gha.

@tomaka Sorry for posting this into this closed issue. This is a separate discussion and I can transfer it to a new issue if you want to engage with it at all.

@tomaka
Copy link
Contributor Author

tomaka commented Feb 15, 2023

@tomaka Do you have any benchmark comparison between old wasmi and new wasmi for smoldot? I'd be very interested in.

No. I don't benchmark this because the light client has no performance issue at all w.r.t. running the runtime. Even if wasmi was ten times slower it would still be fine.

Also another question that was brought up by @athei was if it was tested and verified that smoldot is can still properly sync relay chain from genesis with the new wasmi.

Haven't tested either, as the light client doesn't sync from the genesis, and the full node uses wasmtime by default. Syncing from genesis with smoldot takes a long time and might stall due to unfixed networking issues, so I'm not very motivated to try it out. However, I've just tried syncing a bit around block 5 million (where my database is) and everything seems to work.

@tomaka
Copy link
Contributor Author

tomaka commented Feb 15, 2023

Also, smoldot can only sync up to around block 6 million or so if I remember correctly. At some point, the runtime starts using features that smoldot doesn't implement yet (code substitutes, state_version v1, and child tries come to mind).

@athei
Copy link

athei commented Feb 15, 2023

Haven't tested either, as the light client doesn't sync from the genesis, and the full node uses wasmtime by default. Syncing from genesis with smoldot takes a long time and might stall due to unfixed networking issues

Sorry for my ignorance: Those network issues arise when you use smoldot as a full node with wasmi? Sorry if this didn't make sense. I didn't look into smoldot for some time.

At some point, the runtime starts using features that smoldot doesn't implement yet (code substitutes, state_version v1, and child tries come to mind).

Child tries are pretty old. But were only used for contracts which are not on the relay chain. But I think they were used for crowdloans and this is probably when they become a problem for smoldot. But I get it. It adds a lot of API surface when this could have been implemented with prefix trees as well.

@tomaka
Copy link
Contributor Author

tomaka commented Feb 15, 2023

Sorry for my ignorance: Those network issues arise when you use smoldot as a full node with wasmi? Sorry if this didn't make sense. I didn't look into smoldot for some time.

Completely orthogonal to wasmi.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants