
Upgrade pyo3 to 0.16 #956

Merged
merged 8 commits into main from pyo3 on May 5, 2022

Conversation

h-vetinari
Contributor

h-vetinari commented Mar 21, 2022

Rebased #650 by @messense

Closes #934

@h-vetinari
Contributor Author

Looking at the abundance of CI errors, I'm not sure I'm going to be the best person to shepherd this to completion. I don't know rust that well, and pyo3/tokenizers even less. With some guidance I might get there, but I'm not going to be autonomous.

@messense
Contributor

@h-vetinari Here is the pyo3 migration guide: https://pyo3.rs/v0.15.1/migration.html

@Narsil
Collaborator

Narsil commented Mar 21, 2022

Hi here.

First, I think we should move to 0.16 directly since it's the latest version of pyo3 (as long as we're making an update here, we might as well get the latest one), unless there are some unintended breaking changes that could prevent this from happening.

@Narsil
Collaborator

Narsil commented Mar 21, 2022

For the linting, you can normally use make style within bindings/python to fix the formatting.
We also use clippy (cargo clippy) as part of the formatting checks.

@messense
Contributor

FYI, moving to pyo3 0.16 requires dropping Python 3.6 support.

@Narsil
Collaborator

Narsil commented Mar 21, 2022

FYI, moving to pyo3 0.16 requires dropping Python 3.6 support.

It will refuse to compile? If that's the case, it's not great.
tokenizers stopped building for 3.6 because there's no GH runner anymore, but if people are still able to build it themselves, that would be better indeed. I don't see any killer feature for 0.16 in the changelog.

@messense
Contributor

It will refuse to compile? If that's the case, it's not great.

I think so.

Python 3.6 reached EOL on 23 Dec 2021, so users should upgrade if they care about security.

@h-vetinari
Contributor Author

If someone wants to push into this PR, I'd be thrilled to receive support (and make people collaborators on my fork if necessary).

I mainly attempted this because pyo3>=0.15 is a hard requirement to support python 3.10 in conda-forge, and several NLP packages are blocked on not having tokenizers for 3.10.

@h-vetinari
Contributor Author

h-vetinari commented Mar 21, 2022

However, a minimal backport of #650 to tags/python-v0.11.6 also didn't work.

It fails with something like:

import: 'tokenizers'
TypeError: type 'tokenizers.models.Model' is not an acceptable base type
thread '<unnamed>' panicked at 'An error occurred while initializing class BPE', /home/conda/feedstock_root/build_artifacts/tokenizers_1647850774618/_build_env/.cargo/registry/src/github.com-1ecc6299db9ec823/pyo3-0.15.0/src/type_object.rs:102:17
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
pyo3_runtime.PanicException: An error occurred while initializing class BPE
thread '<unnamed>' panicked at 'Python API call failed'

@h-vetinari
Contributor Author

h-vetinari commented Mar 21, 2022

PS. I don't care much about pyo3 0.15 or 0.16, though IMO python 3.6 support should really not be a determining factor in anything anymore. For perspective - a lot of projects following NEP 29 (among them numpy, scipy, pandas, etc.) also dropped support for python 3.7 in their latest releases already.

@Narsil
Collaborator

Narsil commented Mar 21, 2022

PS. I don't care much about pyo3 0.15 or 0.16, though IMO python 3.6 support should really not be a determining factor in anything anymore. For perspective - a lot of projects following NEP 29 (among them numpy, scipy, pandas, etc.) also dropped support for python 3.7 in their latest releases already.

Fair enough! Let's try 0.16 then.

@messense
Contributor

Hey @h-vetinari, I've sent you a PR to upgrade to 0.16: h-vetinari#1

@h-vetinari
Contributor Author

Hey @h-vetinari, I've sent you a PR to upgrade to 0.16: h-vetinari#1

Thanks a lot! 🙃

I sent you an invite to collaborate on my fork, then you can push into this PR directly

@h-vetinari h-vetinari changed the title Upgrade pyo3 to 0.15 (redux) Upgrade pyo3 to 0.16 Mar 21, 2022
@messense
Contributor

TypeError: PreTokenizedEncodeInput must be Union[PreTokenizedInputSequence, Tuple[PreTokenizedInputSequence, PreTokenizedInputSequence]]

I'm not sure what changed in rust-numpy or pyo3 that makes this test case fail: https://github.com/huggingface/tokenizers/runs/5625562727?check_suite_focus=true

@adamreichold Any idea?

@adamreichold

@adamreichold Any idea?

I am sorry, but I have a hard time following the layers here. My first impression is that the test in question does not even reach the Rust code yet, but fails already in the Python around it? That said, I think the best candidate for surfacing typing issues at runtime is that before 0.16, rust-numpy (incorrectly) did not check element type and dimension when downcasting to arrays, i.e. PyO3/rust-numpy#265. (I did not find any mention of downcasts with PyArray though; actually, I did not find PyArray mentioned at all?)

@messense
Contributor

My first impression is that the test in question does not even reach the Rust code yet but fails already in the Python around it?

It's rejected here in the Rust code:

Err(exceptions::PyTypeError::new_err(
    "PreTokenizedEncodeInput must be Union[PreTokenizedInputSequence, \
     Tuple[PreTokenizedInputSequence, PreTokenizedInputSequence]]",
))

@messense
Contributor

And I suspect it has something to do with this code in PyArrayUnicode or PyArrayStr

struct PyArrayUnicode(Vec<String>);

impl FromPyObject<'_> for PyArrayUnicode {
    fn extract(ob: &PyAny) -> PyResult<Self> {
        let array = ob.downcast::<PyArray1<u8>>()?;
        let arr = array.as_array_ptr();
        let (type_num, elsize, alignment, data) = unsafe {
            let desc = (*arr).descr;
            (
                (*desc).type_num,
                (*desc).elsize as usize,
                (*desc).alignment as usize,
                (*arr).data,
            )
        };
        let n_elem = array.shape()[0];
        // type_num == 19 => Unicode
        if type_num != 19 {
            return Err(exceptions::PyTypeError::new_err(
                "Expected a np.array[dtype='U']",
            ));
        }
        unsafe {
            let all_bytes = std::slice::from_raw_parts(data as *const u8, elsize * n_elem);
            let seq = (0..n_elem)
                .map(|i| {
                    let bytes = &all_bytes[i * elsize..(i + 1) * elsize];
                    let unicode = pyo3::ffi::PyUnicode_FromUnicode(
                        bytes.as_ptr() as *const _,
                        elsize as isize / alignment as isize,
                    );
                    let gil = Python::acquire_gil();
                    let py = gil.python();
                    let obj = PyObject::from_owned_ptr(py, unicode);
                    let s = obj.cast_as::<PyString>(py)?;
                    Ok(s.to_string_lossy().trim_matches(char::from(0)).to_owned())
                })
                .collect::<PyResult<Vec<_>>>()?;
            Ok(Self(seq))
        }
    }
}

impl From<PyArrayUnicode> for tk::InputSequence<'_> {
    fn from(s: PyArrayUnicode) -> Self {
        s.0.into()
    }
}

struct PyArrayStr(Vec<String>);

impl FromPyObject<'_> for PyArrayStr {
    fn extract(ob: &PyAny) -> PyResult<Self> {
        let array = ob.downcast::<PyArray1<u8>>()?;
        let arr = array.as_array_ptr();
        let (type_num, data) = unsafe { ((*(*arr).descr).type_num, (*arr).data) };
        let n_elem = array.shape()[0];
        if type_num != 17 {
            return Err(exceptions::PyTypeError::new_err(
                "Expected a np.array[dtype='O']",
            ));
        }
        unsafe {
            let objects = std::slice::from_raw_parts(data as *const PyObject, n_elem);
            let seq = objects
                .iter()
                .map(|obj| {
                    let gil = Python::acquire_gil();
                    let py = gil.python();
                    let s = obj.cast_as::<PyString>(py)?;
                    Ok(s.to_string_lossy().into_owned())
                })
                .collect::<PyResult<Vec<_>>>()?;
            Ok(Self(seq))
        }
    }
}

@adamreichold

And I suspect it has something to do with this code in PyArrayUnicode or PyArrayStr

(quoting the PyArrayUnicode/PyArrayStr code above)

This would indeed point to the downcast fixes, and hence this probably only worked by accident before. I think using .unwrap() instead of ? in

let array = ob.downcast::<PyArray1<u8>>()?;

might shed some light on this.

@messense
Contributor

messense commented Mar 21, 2022

Fails with PyDowncastError

pyo3_runtime.PanicException: called Result::unwrap() on an Err value: PyDowncastError { from: array(['My', 'name', 'is', 'John'], dtype='<U7'), to: "PyArray<T, D>" }

I guess tokenizers also wants PyO3/rust-numpy#141

@adamreichold

Fails with PyDowncastError

pyo3_runtime.PanicException: called Result::unwrap() on an Err value: PyDowncastError { from: array(['My', 'name', 'is', 'John'], dtype='<U7'), to: "PyArray<T, D>" }

This should not have worked in the first place, as <U7 would be an array whose elements are seven little-endian Py_UCS4 code points each (i.e. 28 bytes per element). So treating these as single-byte u8 elements should yield at least incorrect strides.
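To make the element-size point concrete, here is a small standalone NumPy check (illustrative only, not code from this PR); NumPy stores string arrays as fixed-width UCS4, so the elements are never single bytes:

```python
import numpy as np

# An array of Python strings becomes a fixed-width Unicode array,
# not an array of single-byte (u8) elements, which is why the old
# downcast to a byte array was never type-correct.
arr = np.array(["My", "name", "is", "John"])
print(arr.dtype)           # <U4: up to 4 code points per element
print(arr.dtype.itemsize)  # 16: 4 code points * 4 bytes (UCS4) each
```

The dtype width is set by the longest string, so mixing in a longer word widens every element's storage.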

I guess tokenizers also wants PyO3/rust-numpy#141

This would be the best solution, but the existing code does not really use the API provided by rust-numpy anyway (which is why this worked in the past), so I think an immediate fix would be to just use PyArray_Check directly (which checks only whether it is an array, but does not consider element type and dimensionality):

So instead of

fn extract(ob: &PyAny) -> PyResult<Self> {
        let array = ob.downcast::<PyArray1<u8>>()?;
        let arr = array.as_array_ptr();
...

one could do

fn extract(ob: &PyAny) -> PyResult<Self> {
        if npyffi::PyArray_Check(ob.py(), ob.as_ptr()) == 0 {
            return Err(exceptions::PyTypeError::new_err(
                "Expected an np.array",
            ));
        }
        let arr = ob.as_ptr() as *mut npyffi::PyArrayObject;
...

and the rest of the code should continue to work as-is. (I have not actually tried to compile this so there are certainly errors in there.)

@adamreichold

I am sorry that I did not try this out myself, but from reading the code, I think the part

let shape =
    unsafe { slice::from_raw_parts((*arr).dimensions as *mut usize, (*arr).nd as usize) };
let n_elem = shape[0];

should probably check nd for correctness and also verify contiguity, as it then accesses the data as a slice (even a one-dimensional array could have non-unit strides), i.e.

if (*arr).nd != 1 { /* return dimensionality error */ }
let n_elem = *(*arr).dimensions;

if (*arr).flags & (npyffi::NPY_ARRAY_C_CONTIGUOUS | npyffi::NPY_ARRAY_F_CONTIGUOUS) == 0 { /* return non-contiguous error */ }

@messense messense force-pushed the pyo3 branch 2 times, most recently from 02210b8 to f4e3d48 Compare March 21, 2022 14:43
numpy = "0.12"
ndarray = "0.13"
env_logger = "0.9.0"
pyo3 = "0.16.2"
@adamreichold Mar 21, 2022

I think the remaining test failures could be resolved by adding

resolver = "2" # or edition = "2021"

[dev-dependencies]
pyo3 = { version = "0.16", features = ["auto-initialize"] }

This used to be part of the default features but has not been since 0.14.

(If Rust 1.51, which introduced the resolver = "2" option, is too new, then the feature can just be added to the normal [dependencies] entry.)

@Narsil
Collaborator

Narsil commented Mar 21, 2022

Hi @messense ,

I don't have time today to do a full review (I tried to make the tests run during the day so you could see what was happening).

This is becoming a very big PR, which I don't think is a good thing for a PR.

Do you mind adding comments yourself on the PR about what is going on and why the changes are important? It would help me review tremendously faster (otherwise I will just ask questions :))

The unsafe calls are basically a big NO in tokenizers.

@adamreichold

The unsafe calls are basically big NO in tokenizers.

If this refers to the calls related to npyffi, this is not materially "more unsafe" than it already was, as the old (and incorrect) version of downcast::<PyArray1<u8>> was doing the exact same thing. The only difference is the manual access to the dimensions, but I would say that this is balanced out by adding the check for contiguous arrays that is missing on main (alternatively, the accesses would need to consider the array's stride to be fully general).

Hopefully, we will be able to implement PyO3/rust-numpy#141 eventually and the whole business can be done using safe code.

let (type_num, data) = unsafe { ((*(*arr).descr).type_num, (*arr).data) };
let n_elem = array.shape()[0];

if type_num != 17 {
@adamreichold Mar 21, 2022

I just noticed that this second case is not about a Unicode array, but actually an array containing PyObject, which we do support since 0.16, so this whole method should be able to become safe by using downcast::<PyArray1<PyObject>>() and then array.readonly().as_array().iter(), which would even remove the requirement of a contiguous array.
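As an illustration only (not code from this PR, and not compiled against the repository), a sketch of what that safe extraction could look like with pyo3/rust-numpy 0.16; the type names mirror the PyArrayStr snippet quoted earlier:

```rust
use numpy::PyArray1;
use pyo3::prelude::*;
use pyo3::types::PyString;

struct PyArrayStr(Vec<String>);

impl FromPyObject<'_> for PyArrayStr {
    fn extract(ob: &PyAny) -> PyResult<Self> {
        // Since rust-numpy 0.16 the downcast itself verifies element type
        // and dimensionality, so only 1-D object arrays get past this line.
        let array = ob.downcast::<PyArray1<PyObject>>()?;
        let seq = array
            .readonly()
            .as_array()
            // Iterating the view also handles non-contiguous arrays,
            // so no manual stride or contiguity checks are needed.
            .iter()
            .map(|obj| {
                let s = obj.cast_as::<PyString>(ob.py())?;
                Ok(s.to_string_lossy().into_owned())
            })
            .collect::<PyResult<Vec<_>>>()?;
        Ok(Self(seq))
    }
}
```

Compared with the raw-pointer version, all the descr/type_num/data poking disappears: the element-type check is done by the downcast, and the borrow is mediated by readonly().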

@Narsil
Collaborator

Narsil commented Mar 23, 2022

Thank you so much for this !

This is a very cool and valuable PR.

In terms of merging, what I plan to do is to first do a release, 0.12, without this change; it does contain some slight backward-breaking changes (for the decoder) and the ability to drop the regex in ByteLevel (those are already on master). These are relatively significant changes and are necessary for HF's BigScience project https://bigscience.huggingface.co/.

I will probably wait a week or so afterwards to make sure those changes have no unintended consequences and we have a safe base for our BigScience project.

Then I will merge this PR, and release 0.13 probably shortly after with all due tests.

@McPatate (Member) left a comment

👌🏻

@Narsil
Collaborator

Narsil commented Mar 28, 2022

Following the trend of other HF repos, we're moving to the main branch instead of master.

git branch -m master main
git fetch origin
git branch -u origin/main main
git remote set-head origin -a

is all that is needed on your end.

@h-vetinari
Contributor Author

Sorry, messed up the rebase. 🤦

Fixing it

@h-vetinari
Contributor Author

Hey @Narsil 👋

I'm happy to keep rebasing this PR, but just wanted to check where things stand currently with the plans for 0.13.0 🙃

@Narsil
Collaborator

Narsil commented Apr 19, 2022

Hi @h-vetinari ,

It's coming; don't worry about rebasing if you don't want to, I can always rebase later.
Currently, as mentioned, we're letting the version with just the needed changes for BigScience prove itself before merging this PR. In the end 0.12.1 was only released last week (0.12.0 had a breaking change which ended up being pretty bad for transformers, so it had to be reverted; it took some time to run the full test suite before getting 0.12.1 done).

@Narsil Narsil merged commit 519cc13 into huggingface:main May 5, 2022
@h-vetinari h-vetinari deleted the pyo3 branch May 5, 2022 16:07
@h-vetinari
Contributor Author

Thanks for merging this @Narsil! :)

Any timeline for 0.13? 🙃

@Narsil
Collaborator

Narsil commented May 19, 2022

Unfortunately no definite timeline. As you might have guessed, handling tokenizers is only a little part of what I do at HF and releases do take quite a bit of attention.

@h-vetinari
Contributor Author

A very gentle ping for a tokenizer release with the updated pyo3 :)

@h-vetinari
Contributor Author

Another month, another ping... 🙃

@Narsil
Collaborator

Narsil commented Sep 19, 2022

Hey @h-vetinari, after quite a long time (sorry, but there's definitely a lot to do, and this is basically done in my spare time):

I wanted to release 0.13.0 today, but afaik I cannot, because the manylinux2010 wheel is built with a static interpreter and I don't really know how to fix that issue.

How can we run the manylinux build and make it work?

I had 2 ideas:

  • Finding some quay image with a shared python interpreter (couldn't find one, even in the recommended crates for doing distribution)
  • Removing auto-initialize just for those manylinux builds and placing the interpreter inside for them, but it seems it needs a bit more work: Enabling static interpreter embedding for manylinux. #1064

@davidhewitt
Contributor

@Narsil do you have the error from the manylinux2010 build? Maybe I can offer insight.

@Narsil
Collaborator

Narsil commented Sep 19, 2022

Yes, it claims that it's using a static interpreter (which it is), but the extension-module feature should be used, which should (afaik) disable the warning and let it compile properly.
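For context, a minimal sketch of how the extension-module feature is typically wired up in a PyO3 project's Cargo.toml (the version pin and the split between normal and dev dependencies here are illustrative, not taken from this repository):

```toml
[lib]
# cdylib is required to produce a loadable Python extension module
crate-type = ["cdylib"]

[dependencies]
# extension-module tells pyo3 not to link against libpython,
# which is what statically-linked manylinux builds need
pyo3 = { version = "0.16", features = ["extension-module"] }

[dev-dependencies]
# tests embed an interpreter instead, so they need auto-initialize
pyo3 = { version = "0.16", features = ["auto-initialize"] }
```

The two feature sets conflict by design: extension-module is for building the wheel, while auto-initialize is for Rust-side tests that spin up their own interpreter.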

Successfully merging this pull request may close these issues.

Upgrade pyo3 version
6 participants