feat(python): Add 'drop_empty_rows' parameter for `read_excel` #18253

Rashik-raj · 2024-08-18T09:42:06Z

fixes: 18250
fixes: 14874

I've added a new `drop_empty_rows` flag to `read_excel` so that users can choose whether empty rows are dropped. To maintain backward compatibility, empty rows are still dropped automatically unless the flag says otherwise. Since dropping empty rows may be necessary for some users, it could be useful to implement a similar option for other readers, such as `read_csv`.
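To make the semantics of the flag concrete, here is a minimal plain-Python sketch of the behaviour described above. It is not the code from this PR and does not call Polars; `filter_rows` and the sample `sheet` are hypothetical names used only for illustration, and the default of `True` is assumed from the backward-compatibility note.

```python
from typing import Any, List, Optional, Sequence

Row = Sequence[Optional[Any]]


def filter_rows(rows: Sequence[Row], drop_empty_rows: bool = True) -> List[Row]:
    """Return the rows, optionally discarding those whose cells are all None."""
    if not drop_empty_rows:
        return list(rows)
    # A row counts as "empty" only when every cell in it is missing.
    return [row for row in rows if any(cell is not None for cell in row)]


# Hypothetical sheet contents: the middle row is completely empty.
sheet = [
    (1, "ok"),
    (None, None),
    (3, "ok"),
]

print(len(filter_rows(sheet)))                         # 2
print(len(filter_rows(sheet, drop_empty_rows=False)))  # 3

# With the parameter named in this PR's title, the equivalent opt-out in
# Polars would presumably look like:
#   pl.read_excel("data.xlsx", drop_empty_rows=False)
```

Keeping the empty row preserves the original row count, which matters when an empty row is itself meaningful.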


codecov · 2024-08-18T10:34:19Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.23%. Comparing base (a284174) to head (7c8ebe3).
Report is 421 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main   #18253   +/-   ##
=======================================
  Coverage   80.23%   80.23%           
=======================================
  Files        1500     1500           
  Lines      198871   198883   +12     
  Branches     2837     2838    +1     
=======================================
+ Hits       159556   159582   +26     
+ Misses      38788    38774   -14     
  Partials      527      527


alexander-beedie · 2024-08-22T08:05:44Z

Thanks - I'll look at this shortly 👍

alexander-beedie · 2024-10-11T22:15:09Z

Apologies for the delay in review!
Thanks for the PR :)

D3SL · 2024-10-13T06:38:35Z

If I'm reading this right, it doesn't address the second problem mentioned in #14874: columns that are sorted so that they have more leading NAs than the default schema inference length will be set to Null, silently deleting the entire column's worth of data.

For the same reason, I'd strongly urge that this PR's flag be False by default. Silently deleting data should always be treated as a showstopping bug, never as default behavior.

Think about it. Someone uses Polars to read a CSV or Excel file. The data is new to them, so they don't know the exact contents or descriptive statistics. That's the whole reason they're opening it in Polars. A couple dozen rows and two columns are silently deleted. There is no error or warning message.

How should they test for that data loss? They don't know how many rows there should be, or whether those two columns were actually completely empty, so they can't programmatically test the data by asserting an expected count of rows or non-null values. The only way to spot this would be if they already knew the answer ahead of time, or if they opened the data in Excel and looked at the entire thing by eye every single time.

Now apply it to the real world. Two columns are for recording errors; when there are none, the value is left null. An empty row means there was a problem with the system: a record was created but no data was retrieved.

Polars' defaults would silently drop the empty rows and null out the error columns, leading someone to believe everything is fine when in reality a potentially large portion of their records show errors. Going by fastexcel's issue documenting this happening with only ~143k out of 500k rows, someone could lose over two thirds of their data completely silently, with not even a warning that entire columns were deleted due to the first ~30% being empty.

[edit]

As an update, the fastexcel team plans to address this in the version after next. I do think Polars' documentation should still note that, in pre-0.13.0 versions of the fastexcel backend, columns may be lost if they have more leading missing values than the default schema sample length, and that the solution is `schema_sample_rows=None`.
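The column-loss problem described in this comment can be illustrated with a small plain-Python sketch. It is only a toy model of sample-based type inference, not how fastexcel or Polars actually infer schemas; `infer_column_type` and the `error_codes` column are hypothetical, and `sample_rows` merely plays the role that `schema_sample_rows` is described as playing above.

```python
from typing import Any, Optional, Sequence


def infer_column_type(values: Sequence[Optional[Any]], sample_rows: Optional[int]) -> str:
    """Guess a column type from the first `sample_rows` values (all values if None)."""
    sample = values if sample_rows is None else values[:sample_rows]
    seen = {type(v).__name__ for v in sample if v is not None}
    if not seen:
        # Nothing but missing values in the sample: the column looks like Null.
        return "null"
    return seen.pop() if len(seen) == 1 else "mixed"


# Hypothetical error column: five leading missing values, then real data.
error_codes = [None] * 5 + [404, 500]

print(infer_column_type(error_codes, sample_rows=3))     # null
print(infer_column_type(error_codes, sample_rows=None))  # int
```

When the sample is shorter than the run of leading missing values, the toy inference never sees the real data, which is the failure mode the comment warns about; sampling every row avoids it at the cost of a full scan.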

Rashik-raj requested review from ritchie46, c-peters, alexander-beedie, MarcoGorelli and reswqa as code owners August 18, 2024 09:42

github-actions bot added the title needs formatting label Aug 18, 2024

FIX: fix empty rows being deleted in read_excel by introducing drop_empty_rows flag

7c8ebe3

Rashik-raj force-pushed the add-option-to-drop-empty-row-in-excel branch from 858ef8c to 7c8ebe3 Compare August 18, 2024 10:00

Rashik-raj changed the title ~~Fix: Add option to drop empty row in excel to fix auto dropping of empty rows~~ fix(python): Add option to drop empty row in excel to fix auto dropping of empty rows Aug 20, 2024

github-actions bot added fix Bug fix python Related to Python Polars and removed title needs formatting labels Aug 20, 2024

alexander-beedie changed the title ~~fix(python): Add option to drop empty row in excel to fix auto dropping of empty rows~~ feat(python): Add 'drop_empty_rows' parameter for read_excel Oct 11, 2024

github-actions bot added the enhancement New feature or an improvement of an existing feature label Oct 11, 2024

alexander-beedie approved these changes Oct 11, 2024


alexander-beedie merged commit fc970f7 into pola-rs:main Oct 11, 2024
16 checks passed

alexander-beedie mentioned this pull request Oct 11, 2024

feat(python): Add 'drop_empty_rows' parameter for read_ods #19202

Merged
