-
Notifications
You must be signed in to change notification settings - Fork 802
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Upgrade pyo3 to 0.16 #956
Upgrade pyo3 to 0.16 #956
Conversation
Looking at the abundance of CI errors, I'm not sure I'm going to be the best person to shepherd this to completion. I don't know rust that well, and pyo3/tokenizers even less. With some guidance I might get there, but I'm not going to be autonomous. |
@h-vetinari Here is the pyo3 migration guide: https://pyo3.rs/v0.15.1/migration.html |
Hi here. First, I think we should move onto |
For the linting, you can normally use |
FYI, move onto pyo3 0.16 requires dropping Python 3.6 support. |
It will refuse to compile ? If that's the case it it's not great. |
I think so. Python 3.6 reached EOL in 23 Dec 2021 so users should upgrade if they care about security. |
If someone wants to push into this PR, I'd be thrilled to receive support (and make people collaborators on my fork if necessary). I mainly attempted this because pyo3>=0.15 is a hard requirement to support python 3.10 in conda-forge, and several NLP packages are blocked on not having tokenizers for 3.10. |
However, a minimal backport of #650 to It fails with something like:
|
PS. I don't care much about pyo3 0.15 or 0.16, though IMO python 3.6 support should really not be a determining factor in anything anymore. For perspective - a lot of projects following NEP 29 (among them numpy, scipy, pandas, etc.) also dropped support for python 3.7 in their latest releases already. |
Fair enough ! Let's try 0.16 then. |
Hey @h-vetinari, I've sent you a PR to upgrade to 0.16: h-vetinari#1 |
Thanks a lot! 🙃 I sent you an invite to collaborate on my fork, then you can push into this PR directly |
I'm not sure what changed in rust-numpy or pyo3 that makes this test case fail: https://github.com/huggingface/tokenizers/runs/5625562727?check_suite_focus=true @adamreichold Any idea? |
I am sorry but I have a hard time following the layers here. My first impression is that the test in question does not even reach the Rust code yet but fails already in the Python around it? That said, I think the best candidate for surfacing typing issues at runtime is that before 0.16, rust-numpy (incorrectly) did not check element type and dimension when downcasting to arrays, i.e. PyO3/rust-numpy#265 (I did not find any mention of downcasts with |
It's rejected here in Rust code tokenizers/bindings/python/src/tokenizer.rs Lines 420 to 423 in 1bb9884
|
And I suspect it has something to do with these code in tokenizers/bindings/python/src/tokenizer.rs Lines 259 to 340 in 1bb9884
|
This would indeed point to the downcast fixes and hence did probably only work by accident before. I think adding an tokenizers/bindings/python/src/tokenizer.rs Line 262 in 1bb9884
|
Fails with
I guess tokenizers also wants PyO3/rust-numpy#141 |
This should not have worked in the first place as
This would be the best solution, but the existing code does not really use the API provided by So instead of fn extract(ob: &PyAny) -> PyResult<Self> {
let array = ob.downcast::<PyArray1<u8>>()?;
let arr = array.as_array_ptr();
... one could do fn extract(ob: &PyAny) -> PyResult<Self> {
if npyffi::PyArray_Check(ob.py(), ob.as_ptr()) == 0 {
return Err(exceptions::PyTypeError::new_err(
"Expected an np.array",
));
}
let arr = ob.as_ptr() as *mut npyffi::PyArrayObject;
... and the rest of the code should continue to work as-is. (I have not actually tried to compile this so there are certainly errors in there.) |
I am sorry that I did not try this out myself, but from reading the code, I think the part let shape =
unsafe { slice::from_raw_parts((*arr).dimensions as *mut usize, (*arr).nd as usize) };
let n_elem = shape[0]; should probably check if (*arr).nd != 1 { /* return dimensionality error */ }
let n_elem = *(*arr).dimenions;
if (*arr).flags & (npyffi::NPY_ARRAY_C_CONTIGUOUS | npyffi::NPY_ARRAY_F_CONTIGUOUS) == 0 { /* return non-contiguours error */ } |
02210b8
to
f4e3d48
Compare
numpy = "0.12" | ||
ndarray = "0.13" | ||
env_logger = "0.9.0" | ||
pyo3 = "0.16.2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the remaining test failures could be resolved by adding
resolver = "2" # or edition = "2021"
[dev-dependencies]
pyo3 = { version = "0.16", features = ["auto-initialize"]
This used to be part of the default features but is not any more since 0.14.
(If Rust 1.51 which introduced the resolver = "2"
option is too new, then the feature can just be added to the normal [dependencies]
entry.)
Hi @messense , I don't have time today to do a full review (I tried to make the tests run during the day so you could see what was happening). This is a becoming a very big PR, which I don't think is a good thing about a PR. Do you mind adding comments yourself on the PRs of what is going on, and why changes are important ? It would help me tremendously review faster (otherwise I will just ask questions :)) The |
If this refers to the calls related to Hopefully, we will be able to implement PyO3/rust-numpy#141 eventually and the whole business can be done using safe code. |
bindings/python/src/tokenizer.rs
Outdated
let (type_num, data) = unsafe { ((*(*arr).descr).type_num, (*arr).data) }; | ||
let n_elem = array.shape()[0]; | ||
|
||
if type_num != 17 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just noticed that this second case is not about a Unicode array, but actually an array containing PyObject
which we do support since 0.16, so this whole method be able to become safe by using downcast::<PyArray1<PyObject>>()
and then array.readonly().as_array().iter()
which would even remove the requirement of a contiguous array.
Thank you so much for this ! This is a very cool and valuable PR. In terms of merging, what I project to do is to first do a release without this change I will probably wait a week or so afterwards to make sure those changes have no unintended consequences and we have a safe base for our Bigscience project. Then I will merge this PR, and release |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👌🏻
Following the trend of other HF repos we're moving the the
Is all that is needed on your end. |
Sorry, messed up the rebase. 🤦 Fixing it |
Hey @Narsil 👋 I'm happy to keep rebasing this PR, but just wanted to check where things stand currently with the plans for 0.13.0 🙃 |
Rebase-conflicts-fixed-by: H. Vetinari <[email protected]>
Rebase-conflicts-fixed-by: H. Vetinari <[email protected]>
--no-default-features`
Why do they change?
Replace deprecated `PyUnicode_FromUnicode` with `PyUnicode_FromKindAndData`
Hi @h-vetinari , It's coming don't worry about rebasing if you want, I can always rebase later. |
Thanks for merging this @Narsil! :) Any timeline for 0.13? 🙃 |
Unfortunately no definite timeline. As you might have guessed, handling |
A very gentle ping for a tokenizer release with the updated pyo3 :) |
Another month, another ping... 🙃 |
Hey @h-vetinari After quite a long time (sorry, but there's definitely a lot to do and this is basically done on spare time from me.). I wanted to release 0.13.0 today, but afaik I cannot because How can we run the manylinux build and make it work. I had 2 ideas:
|
@Narsil do you have the error from the |
Yes it claims that it's using a static interpreter (which it is), but the feature |
Rebased #650 by @messense
Closes #934