Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CHORE] Update PyO3 and use their new Bound API #2793

Merged
merged 12 commits into from
Sep 10, 2024
Merged

Conversation

kevinzwang
Copy link
Member

@kevinzwang kevinzwang commented Sep 5, 2024

We cannot yet update to the latest minor version, v0.22, because rust-numpy only supports v0.21 right now.

There may be some small additional memory/performance optimizations we can do with this new API, but I will leave it for another PR. This one should not have any regressions at least.

Maybe next time i'll try using Cursor to do this 😄

@github-actions github-actions bot added the chore label Sep 5, 2024
Copy link

codspeed-hq bot commented Sep 5, 2024

CodSpeed Performance Report

Merging #2793 will not alter performance

Comparing kevin/pyo3-bound-api (495666b) with main (3e2d25b)

Summary

✅ 16 untouched benchmarks

@kevinzwang kevinzwang marked this pull request as ready for review September 7, 2024 01:11
@kevinzwang
Copy link
Member Author

There seems to be a strange issue where the order of dataframe columns is wrong now. Looking into that

Copy link

codecov bot commented Sep 7, 2024

Codecov Report

Attention: Patch coverage is 86.38132% with 70 lines in your changes missing coverage. Please review.

Project coverage is 64.06%. Comparing base (3e2d25b) to head (495666b).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/daft-micropartition/src/python.rs 34.04% 31 Missing ⚠️
src/common/io-config/src/python.rs 55.55% 12 Missing ⚠️
src/daft-table/src/ffi.rs 30.00% 7 Missing ⚠️
src/daft-parquet/src/python.rs 84.84% 5 Missing ⚠️
src/common/file-formats/src/file_format_config.rs 0.00% 2 Missing ⚠️
src/common/py-serde/src/python.rs 88.88% 2 Missing ⚠️
src/daft-local-execution/src/run.rs 0.00% 2 Missing ⚠️
src/daft-parquet/src/read.rs 91.30% 2 Missing ⚠️
src/daft-scheduler/src/adaptive.rs 0.00% 2 Missing ⚠️
src/daft-scheduler/src/scheduler.rs 95.12% 2 Missing ⚠️
... and 2 more
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2793      +/-   ##
==========================================
+ Coverage   63.36%   64.06%   +0.70%     
==========================================
  Files        1016     1007       -9     
  Lines      114231   112934    -1297     
==========================================
- Hits        72381    72350      -31     
+ Misses      41850    40584    -1266     
Files with missing lines Coverage Δ
src/common/arrow-ffi/src/lib.rs 99.23% <100.00%> (+0.01%) ⬆️
src/common/daft-config/src/lib.rs 82.14% <100.00%> (ø)
src/common/daft-config/src/python.rs 70.62% <ø> (+3.63%) ⬆️
src/common/file-formats/src/python.rs 42.85% <100.00%> (ø)
src/common/resource-request/src/lib.rs 60.86% <100.00%> (ø)
src/common/system-info/src/lib.rs 96.15% <100.00%> (ø)
src/daft-core/src/array/ops/cast.rs 85.69% <100.00%> (-0.01%) ⬇️
src/daft-core/src/array/ops/concat_agg.rs 100.00% <100.00%> (ø)
src/daft-core/src/array/ops/len.rs 100.00% <100.00%> (ø)
src/daft-core/src/array/ops/list_agg.rs 74.64% <100.00%> (ø)
... and 40 more

... and 13 files with indirect coverage changes

Copy link
Member

@samster25 samster25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good! But Let's make sure to add pyo3::intern! everywhere it was removed.

src/common/io-config/src/python.rs Show resolved Hide resolved
src/common/io-config/src/python.rs Outdated Show resolved Hide resolved
src/common/io-config/src/python.rs Outdated Show resolved Hide resolved
src/common/py-serde/src/python.rs Outdated Show resolved Hide resolved
serialized
.extract::<&PyBytes>(py)
.map(|s| $crate::bincode::deserialize(s.as_bytes()).unwrap())
pub fn _from_serialized(serialized: &[u8]) -> Self {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this is a good contender for the bound api so we don't have to Clone / Copy the underlying bytes before passing them into rust. We could instead of a GILBound to the bytes and create the underlying rust obj

Copy link
Member Author

@kevinzwang kevinzwang Sep 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you mean. When a pyfunction parameter is of type &[u8], pyo3 just calls .as_bytes() internally and returns a reference to the actual bytes in the python object without copying.

https://github.com/PyO3/pyo3/blob/main/src/conversions/std/slice.rs#L42

src/daft-parquet/src/python.rs Outdated Show resolved Hide resolved
@@ -18,7 +18,7 @@ pub mod pylib {
pub fn read_parquet(
py: Python,
uri: &str,
columns: Option<Vec<&str>>,
columns: Option<Vec<String>>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does this have to be owned now?

Copy link
Member Author

@kevinzwang kevinzwang Sep 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why pyo3 doesn't allow Vec<&str> anymore but I suspect it's because of some sort of lifetime issue with the new Bound stuff

convert_pyarrow_parquet_read_result_into_py(py, schema, all_arrays, num_rows, pyarrow)
}
#[allow(clippy::too_many_arguments)]
#[pyfunction]
pub fn read_parquet_bulk(
py: Python,
uris: Vec<&str>,
columns: Option<Vec<&str>>,
uris: Vec<String>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we making this owned if we just take the ref anyways?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pyo3 limitations. See comment on read_parquet

@@ -460,9 +461,9 @@ async fn stream_parquet_single(
}

#[allow(clippy::too_many_arguments)]
async fn read_parquet_single_into_arrow(
async fn read_parquet_single_into_arrow<T: AsRef<str>>(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this instead be ToString, since we call that right after?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the generic here and just pass around Strings

src/daft-scheduler/src/scheduler.rs Show resolved Hide resolved
Copy link
Member

@samster25 samster25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@kevinzwang kevinzwang merged commit 4582192 into main Sep 10, 2024
39 checks passed
@kevinzwang kevinzwang deleted the kevin/pyo3-bound-api branch September 10, 2024 21:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants