Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core dumped when initializing a pyclass on Linux only #1120

Closed
Hywan opened this issue Aug 27, 2020 · 11 comments
Closed

Core dumped when initializing a pyclass on Linux only #1120

Hywan opened this issue Aug 27, 2020 · 11 comments
Labels

Comments

@Hywan
Copy link
Contributor

Hywan commented Aug 27, 2020

  • Your operating system and version: Linux (Ubuntu 18/20, doesn't matter)
  • Your python version: Python3.8 (3.6, 3.7 also)
  • How did you install python (e.g. apt or pyenv)? Did you use a virtualenv?: Buildin
  • Your Rust version (rustc --version): 1.45.2
  • Your PyO3 version: 0.11.1
  • Have you tried using latest PyO3 master (replace version = "0.x.y" with git = "https://github.com/PyO3/pyo3")?: No.

So, this bug is weird. It happens only on Linux (not on macOS nor Windows, I've tested).

I'm rewriting python-ext-wasm, a project that uses Pyo3. Here is the PR, wasmerio/wasmer-python#212.

To build:

$ git clone https://gtihub.com/Hywan/python-ext-wasm
$ cd python-ext-wasm
$ git fetch
$ git checkout -b reboot
$ just prelude
$ source .env/bin/activate
$ just build

To test, comment everything inside the tests/test_module.py file except line 20 to 21 (def test_compile_wat) as it is a small test case involving very few lines of code, and it fails.

Then run with:

$ just test tests/test_module.py
…
Fatal Python error: Aborted
[stack trace]
Aborted (core dumped)

I'm currently tracing what happens. The Module pyclass is defined as: struct Module { inner: wasmer::Module }. If I change to inner: u32, it works. So it seems that wasmer::Module is not liked by Pyo3 or Python.

The Module pyclass is defined in src/module.rs. The wasmer::Module is defined in https://github.com/wasmerio/wasmer/blob/80290e48055655a63f05f2c329f2ace4adbc712f/lib/api/src/module.rs#L33-L37 (for the record: struct Module { store: Store, artifact: Arc<dyn Artifact> }).

I'm digging for the moment. PyClassInitiliazer.create_cell_from_subtype is working so far. Since it is the last thing returned by the __new__ implementation, I don't understand where it can fail. Or maybe the class isn't well initialized.

Important: It works on macOS and Windows. It fails only on Linux. It's not random, but for some Module (Python class), it works! If you take the def test_exports() test case, in the same test_module.py file, you'll see that it works.

It drives me crazy. Help :-).

@Hywan
Copy link
Contributor Author

Hywan commented Aug 27, 2020

I've rewritten the test case to separate each line, and see dbg statements from Pyo3 in-between.

def test_compile_wat():
    print("--> store\n")
    store = Store()
    print("/ store\n");

    print("--> module\n")
    module = Module(store, '(module)')
    print("/ module\n")

    print("--> isinstance")
    assert isinstance(module, Module)
    print("/ isinstance")

The Fatal Python error: Aborted comes after / isinstance. It seems to be related to the destructor of the object.

@sebpuetz
Copy link
Contributor

sebpuetz commented Aug 27, 2020

I can reproduce the issue locally, this reliably causes crashes:

>>> from wasmer import Store, Module
>>> store = Store()
>>> module = Module(store, '(module)')
>>> del store
>>> del module
[1]    32123 abort (core dumped)  ipython

Switching the order of the dels fixes the problem. There was an issue with unsendable classes which got fixed in #1058 and #1060, but I could still reproduce the issue after cherry-picking the commits on 0.11.1.

I took a quick look at the Store and Module implementations but didn't have to time to dig through all of that whether Module steals a reference to the &Store. That could lead to a double-free and cause the breakage. But that's just a blind stab, I haven't checked for any of that. At least within the Python module everything looked sound.

@Hywan
Copy link
Contributor Author

Hywan commented Aug 27, 2020

So.

The destructor of Store is called first. It is called by PyCell::py_drop. The current code (as of 0.11.1, same on master at the time of writing) is the following:

ManuallyDrop::drop(&mut self.inner.value);
self.dict.clear_dict(py);
self.weakref.clear_weakrefs(self.as_ptr(), py);
self.inner.ob_base.py_drop(py);

So the destructor Store is called on the Rust (impl Drop for Store), then dict, weakref and inner are executed correctly.

Then, the destructor of Module is called:

  • ManuallyDrop::drop is OK, the destructor on the Rust side is called (impl Drop for Module).
  • dict.clear_dict is failing.

Second run is OK. Having printf fixes the problem. Yepee…

To be continued :-p.

@sebpuetz
Copy link
Contributor

Most definitely not related to unsendable, I removed the attribute and the issue persists.

It's not even necessary to delete any variables, it sometimes crashes while cleaning up after exiting Python:

>>> from wasmer import Store, Module
>>> store = Store()
>>> module = Module(store, '(module)')
Do you really want to exit ([y]/n)? 
[1]    7632 abort (core dumped)  ipython

@Hywan
Copy link
Contributor Author

Hywan commented Aug 27, 2020

Yes, I come to the same conclusion.

@sebpuetz
Copy link
Contributor

sebpuetz commented Aug 27, 2020

So.

The destructor of Store is called first. It is called by PyCell::py_drop. The current code (as of 0.11.1, same on master at the time of writing) is the following:

ManuallyDrop::drop(&mut self.inner.value);
self.dict.clear_dict(py);
self.weakref.clear_weakrefs(self.as_ptr(), py);
self.inner.ob_base.py_drop(py);

So the destructor Store is called on the Rust (impl Drop for Store), then dict, weakref and inner are executed correctly.

Then, the destructor of Module is called:

* `ManuallyDrop::drop` is OK, the destructor on the Rust side is called (`impl Drop for Module`).

* `dict.clear_dict` is failing.

I'm actually seeing failure before dict.clear_dict:

    unsafe fn py_drop(&mut self, py: Python) {
        println!("enter pydrop");
        ManuallyDrop::drop(&mut self.inner.value);
        println!("dropped");
        self.dict.clear_dict(py);
        println!("cleared dict");
        self.weakref.clear_weakrefs(self.as_ptr(), py);
        self.inner.ob_base.py_drop(py);
    }

"dropped" never gets printed. So my hunch is that something within the module is gone once the store is gone?

@Hywan
Copy link
Contributor Author

Hywan commented Aug 27, 2020

That's what I'm digging right now. It doesn't seem related to Pyo3. Want to be sure before closing the issue :-).

@davidhewitt
Copy link
Member

Many thanks @sebpuetz for helping investigate this! I don't see any obvious red flags from the source I just scanned. Are you able to reproduce this issue using cargo run or in a test instead of the Python REPL? Should be possible to create Py<Store>, borrow() it to create a module (and eventually Py<Module>), and then drop the two objects.

It doesn't seem like this is a PyO3 issue at the moment, but happy to be proved wrong!

@sebpuetz
Copy link
Contributor

That reminder made me take another look and it's not related to pyo3:

    #[test]
    fn test2() {
        let store = wasmer::Store::default();
        let module = wasmer::Module::new(&store, "(module)").unwrap();
        mem::drop(store);
        mem::drop(module); // abort
    }

Stepping through the code in a debugger shows that the abort happens while dropping wasmer_engine_jit::unwind::systemv::UnwindRegistry. More specifically, the last call before abort is __deregister_frame in UnwindRegistry::drop. __deregister_frame is an extern "C" function.

A fix is switching the order of the struct members in wasmer::Module:

pub struct Module {
    artifact: Arc<dyn Artifact>,
    store: Store,
}

Artifact holds a pointer to a Vec owned by Store that gets passed to __deregister_frame. If the Store goes out first, the pointer in Artifact's UnwindRegistry becomes invalid but gets used anyways.

Allocation happens here: https://github.com/wasmerio/wasmer/blob/80290e48055655a63f05f2c329f2ace4adbc712f/lib/engine-jit/src/artifact.rs#L165-L166

Subverting lifetimes happens here: https://github.com/wasmerio/wasmer/blob/80290e48055655a63f05f2c329f2ace4adbc712f/lib/engine-jit/src/unwind/systemv.rs#L86

Assuming dead-pointers are still alive happens here: https://github.com/wasmerio/wasmer/blob/80290e48055655a63f05f2c329f2ace4adbc712f/lib/engine-jit/src/unwind/systemv.rs#L104-L106

I think you can close this, I'll open an issue over at wasmer.

@davidhewitt
Copy link
Member

Superb sleuthing, thanks again!

@Hywan
Copy link
Contributor Author

Hywan commented Aug 31, 2020

It's not related to Pyo3. I found the bug in Wasmer, wasmerio/wasmer#1581. Thanks for your time and sorry for the noise!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants