
fix(python): from_arrow_fixed_size_list #16751

Closed · wants to merge 5 commits

Conversation

deanm0000 (Collaborator) commented Jun 5, 2024

fixes #16614

While putting this fix in I noticed we'd have a bug if a table contained both structs and dictionaries: when those columns are added back, the dictionary_cols and struct_cols mappings are never merged. There isn't any reason to keep the special cases separate, so I put them in a single dict along with fixed_size_list; see the sketch below.
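
For illustration, a minimal sketch of the "single dict" idea; the names and structure here are hypothetical, not the actual py-polars internals:

```python
import pyarrow as pa

# Sketch of the approach described above (illustrative, not the real
# py-polars code): every column whose type needs special handling goes
# into one dict keyed by column index, instead of one dict per special
# case, so the re-insertion step runs over a single mapping.
def split_special_columns(tbl: pa.Table) -> dict[int, pa.ChunkedArray]:
    special: dict[int, pa.ChunkedArray] = {}
    for i, field in enumerate(tbl.schema):
        t = field.type
        if (
            pa.types.is_dictionary(t)
            or pa.types.is_struct(t)
            or pa.types.is_fixed_size_list(t)
        ):
            special[i] = tbl.column(i)
    return special

tbl = pa.table({
    "a": pa.array([[1, 2], [3, 4]], type=pa.list_(pa.int64(), 2)),
    "b": [1, 2],
})
print(split_special_columns(tbl))  # {0: <fixed-size-list column>}
```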

codecov bot commented Jun 5, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.45%. Comparing base (6f3fd8e) to head (cad6050).
Report is 4 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff            @@
##             main   #16751    +/-   ##
========================================
  Coverage   81.45%   81.45%            
========================================
  Files        1413     1413            
  Lines      186306   186045   -261     
  Branches     2777     2754    -23     
========================================
- Hits       151750   151552   -198     
+ Misses      34036    33990    -46     
+ Partials      520      503    -17     


deanm0000 (Collaborator, Author) commented

My test is probably too big, but I didn't know what the alternative was. I did some internal tests with row_group sizing and with creating pyarrow tables from parquet files, but the only way I could reproduce the bug was by saving a parquet file with >=131073 rows and then reopening it. With fewer rows than that, it doesn't manifest.
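
For reference, a hypothetical repro along those lines (the file name and column shape are made up; the row-count threshold is the part that matters):

```python
import polars as pl

# Hypothetical repro sketch: at >=131073 rows the file reads back as
# more than one record batch, which is what triggers the bug; fewer
# rows come back as a single batch and the round trip is fine.
n = 131073
df = pl.DataFrame(
    {"a": [[1.0, 2.0]] * n},
    schema={"a": pl.Array(pl.Float64, 2)},
)
df.write_parquet("repro.parquet")
out = pl.read_parquet("repro.parquet", use_pyarrow=True)
assert out.height == n  # before the fix, height is a multiple of n
```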

Additionally, the Windows runner threw an error because the file was still open, while the Linux runners had no problem, so I added a means to retry 5 times with a 1 second wait.
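
The retry is along these lines (a sketch, not the actual test helper; the names are illustrative):

```python
import time

# Windows can keep the freshly written file locked briefly, so retry
# the read a few times before giving up; Linux never needed this.
def read_with_retry(path, reader, attempts=5, delay=1.0):
    for attempt in range(attempts):
        try:
            return reader(path)
        except OSError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)
```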

ritchie46 (Member) commented

I want to check whether this is maybe something wrong on the Rust side.

ritchie46 (Member) commented

I want to wait on this work before looking into this one: #16747

deanm0000 (Collaborator, Author) commented

Yeah, I definitely think there's something upstream to address. There's also the goal of moving over to the stream/capsule protocol instead of pyarrow. One thing I found in the current state of things is that in

```rust
let dfs = rb
    .iter()
    .map(|rb| {
        let mut run_parallel = false;
        let columns = (0..names.len())
            .map(|i| {
                let array = rb.call_method1("column", (i,))?;
                let arr = array_to_rust(&array)?;
                run_parallel |= matches!(
                    arr.data_type(),
                    ArrowDataType::Utf8 | ArrowDataType::Dictionary(_, _, _)
                );
                Ok(arr)
            })
            .collect::<PyResult<Vec<_>>>()?;
```
what happens is that rb holds however many chunks/batches the Table has, but arr comes back at the full table length for each batch, which is how/why the result is a multiple of the number of chunks. One thing the OP's issue didn't show is that with more than one column, from_arrow panics with a different error, because the Array column ends up longer than the other columns.
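
A small pyarrow illustration of that kind of length mismatch (an assumption about the mechanism based on reading the snippet; the actual internals differ):

```python
import pyarrow as pa

# Table.column(i) spans the whole table, while RecordBatch.column(i)
# covers only that batch; a loop over batches that fetches full-length
# columns each iteration produces a multiple of the table length.
batch = pa.RecordBatch.from_pydict({"a": [1, 2, 3]})
tbl = pa.Table.from_batches([batch, batch])  # 2 chunks, 6 rows

print(len(tbl.column(0)))                  # 6: full table length
print(len(tbl.to_batches()[0].column(0)))  # 3: one batch's length
```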

I only took up band-aiding this on the Python side when I saw that structs and dictionaries already needed the band-aid. I would suggest that if the Rust fix isn't ready before the next release, we put this band-aid on until it is. It can always be taken out, but if the core functionality is broken... well, that's not good for anybody.

Labels: fix (Bug fix), python (Related to Python Polars)
Successfully merging this pull request may close these issues.

read_parquet cannot read fixed size array cell correctly with use_pyarrow=True