
fix(python): from_arrow_fixed_size_list #16751

Closed · wants to merge 5 commits

Conversation

deanm0000 (Collaborator) commented Jun 5, 2024

fixes #16614

While putting this fix in I noticed we'd have a bug if a table contained both structs and dictionaries: when those columns are added back, the dictionary_cols and struct_cols mappings are never merged. There isn't any reason to keep the special cases separate, so I put them in a single dict along with fixed_size_list; see the sketch below.
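
For illustration, a minimal sketch of the "single dict" idea; the names and structure here are hypothetical, not the actual py-polars internals:

```python
import pyarrow as pa

# Sketch of the approach described above (illustrative, not the real
# py-polars code): every column whose type needs special handling goes
# into one dict keyed by column index, instead of one dict per special
# case, so the re-insertion step runs over a single mapping.
def split_special_columns(tbl: pa.Table) -> dict[int, pa.ChunkedArray]:
    special: dict[int, pa.ChunkedArray] = {}
    for i, field in enumerate(tbl.schema):
        t = field.type
        if (
            pa.types.is_dictionary(t)
            or pa.types.is_struct(t)
            or pa.types.is_fixed_size_list(t)
        ):
            special[i] = tbl.column(i)
    return special

tbl = pa.table({
    "a": pa.array([[1, 2], [3, 4]], type=pa.list_(pa.int64(), 2)),
    "b": [1, 2],
})
print(split_special_columns(tbl))  # {0: <fixed-size-list column>}
```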

codecov bot commented Jun 5, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.45%. Comparing base (6f3fd8e) to head (cad6050).
Report is 4 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff            @@
##             main   #16751    +/-   ##
========================================
  Coverage   81.45%   81.45%            
========================================
  Files        1413     1413            
  Lines      186306   186045   -261     
  Branches     2777     2754    -23     
========================================
- Hits       151750   151552   -198     
+ Misses      34036    33990    -46     
+ Partials      520      503    -17     


deanm0000 (Collaborator, Author) commented

My test is probably too big, but I didn't know what the alternative was. I did some internal tests with row_group sizing and with creating pyarrow tables from parquet files, but the only way I could reproduce the bug was by saving a parquet file with >=131073 rows and then reopening it. With fewer rows than that, it doesn't manifest.
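
For reference, a hypothetical repro along those lines (the file name and column shape are made up; the row-count threshold is the part that matters):

```python
import polars as pl

# Hypothetical repro sketch: at >=131073 rows the file reads back as
# more than one record batch, which is what triggers the bug; fewer
# rows come back as a single batch and the round trip is fine.
n = 131073
df = pl.DataFrame(
    {"a": [[1.0, 2.0]] * n},
    schema={"a": pl.Array(pl.Float64, 2)},
)
df.write_parquet("repro.parquet")
out = pl.read_parquet("repro.parquet", use_pyarrow=True)
assert out.height == n  # before the fix, height is a multiple of n
```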

Additionally, the Windows runner threw an error because the file was still open, while the Linux runners had no problem, so I added a means to retry 5 times with a 1 second wait.
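
The retry is along these lines (a sketch, not the actual test helper; the names are illustrative):

```python
import time

# Windows can keep the freshly written file locked briefly, so retry
# the read a few times before giving up; Linux never needed this.
def read_with_retry(path, reader, attempts=5, delay=1.0):
    for attempt in range(attempts):
        try:
            return reader(path)
        except OSError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)
```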

ritchie46 (Member) commented

I want to check whether this is maybe something wrong on the Rust side.

ritchie46 (Member) commented

I want to wait on this work before looking into this one: #16747

deanm0000 (Collaborator, Author) commented

Yeah, I definitely think there's something upstream to address. There's also the goal of moving over to the stream/capsule protocol instead of pyarrow. One thing I found in the current state of things is that in

```rust
let dfs = rb
    .iter()
    .map(|rb| {
        let mut run_parallel = false;
        let columns = (0..names.len())
            .map(|i| {
                let array = rb.call_method1("column", (i,))?;
                let arr = array_to_rust(&array)?;
                run_parallel |= matches!(
                    arr.data_type(),
                    ArrowDataType::Utf8 | ArrowDataType::Dictionary(_, _, _)
                );
                Ok(arr)
            })
            .collect::<PyResult<Vec<_>>>()?;
```
what happens is that rb holds however many chunks/batches the Table has, but arr comes back at the full table length for each batch, which is how/why the result is a multiple of the number of chunks. One thing the OP's issue didn't show is that with more than one column, from_arrow panics with a different error, because the Array column ends up longer than the other columns.
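
A small pyarrow illustration of that kind of length mismatch (an assumption about the mechanism based on reading the snippet; the actual internals differ):

```python
import pyarrow as pa

# Table.column(i) spans the whole table, while RecordBatch.column(i)
# covers only that batch; a loop over batches that fetches full-length
# columns each iteration produces a multiple of the table length.
batch = pa.RecordBatch.from_pydict({"a": [1, 2, 3]})
tbl = pa.Table.from_batches([batch, batch])  # 2 chunks, 6 rows

print(len(tbl.column(0)))                  # 6: full table length
print(len(tbl.to_batches()[0].column(0)))  # 3: one batch's length
```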

I only took up band-aiding this on the Python side when I saw that structs and dictionaries already needed the band-aid. I would suggest that if the Rust fix isn't ready before the next release, we put this band-aid on until it is. It can always be taken out, but if the core functionality is broken... well, that's not good for anybody.

Labels: fix (Bug fix), python (Related to Python Polars)
Successfully merging this pull request may close these issues.

read_parquet cannot read fixed size array cell correctly with use_pyarrow=True