Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python): Add 'drop_empty_rows' parameter for read_excel #18253

Conversation

Rashik-raj
Copy link
Contributor

fixes: 18250
fixes: 14874

I've added a new flag to maintain backward compatibility for automatically dropping empty rows, allowing users to choose whether or not to drop them in Excel. It seems that dropping empty rows might be necessary for some users, so it could be useful to implement a similar option for other methods, such as read_csv.

@Rashik-raj Rashik-raj force-pushed the add-option-to-drop-empty-row-in-excel branch from 858ef8c to 7c8ebe3 Compare August 18, 2024 10:00
Copy link

codecov bot commented Aug 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.23%. Comparing base (a284174) to head (7c8ebe3).
Report is 421 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #18253   +/-   ##
=======================================
  Coverage   80.23%   80.23%           
=======================================
  Files        1500     1500           
  Lines      198871   198883   +12     
  Branches     2837     2838    +1     
=======================================
+ Hits       159556   159582   +26     
+ Misses      38788    38774   -14     
  Partials      527      527           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Rashik-raj Rashik-raj changed the title Fix: Add option to drop empty row in excel to fix auto dropping of empty rows fix(python): Add option to drop empty row in excel to fix auto dropping of empty rows Aug 20, 2024
@github-actions github-actions bot added fix Bug fix python Related to Python Polars and removed title needs formatting labels Aug 20, 2024
@alexander-beedie
Copy link
Collaborator

Thanks - I'll look at this shortly 👍

@alexander-beedie alexander-beedie changed the title fix(python): Add option to drop empty row in excel to fix auto dropping of empty rows feat(python): Add 'drop_empty_rows' parameter for read_excel Oct 11, 2024
@github-actions github-actions bot added the enhancement New feature or an improvement of an existing feature label Oct 11, 2024
@alexander-beedie
Copy link
Collaborator

Apologies for the delay in review!
Thanks for the PR :)

@alexander-beedie alexander-beedie merged commit fc970f7 into pola-rs:main Oct 11, 2024
16 checks passed
@D3SL
Copy link

D3SL commented Oct 13, 2024

If I'm reading this right it doesn't address the second problem mentioned in #14874: columns which are sorted to have more leading NAs than the default schema inference length will be set to Null, silently deleting the entire column's worth of data.

For the same reason I'd strongly urge that this PR's feature should be False by default. Silently deleting data should always be treated as a showstopping bug, never a default behavior.

Think about it. Someone uses polars to read a CSV or Excel file. The data is new to them so they don't know the exact contents or descriptive statistics. That's the whole reason they're opening it in Polars. A couple dozen rows and two columns are silently deleted. There is no error or warning message.

How should they test for that data loss? They don't know how many rows there should be, or whether those two columns were actually completely empty or not so they can't programmatically test the data by asserting an expected count of rows or non-null values. The only way to spot this would be if they already knew the answer ahead of time, or if they open the data in excel and look at the entire thing by eye every single time.

Now apply it to the real world. Two columns are for recording errors, when there are none that value is left null. An empty row means there was a problem with the system and a record was created but no data was retrieved.

Polars' defaults would silently drop the empty rows and null out the error columns, leading someone to believe everything is fine when in reality potentially a large portion of their records show errors. Going by fastexcel's issue documenting this happening with only ~143k out of 500k rows someone could lose over two thirds of their data completely silently. Not even a warning that entire columns were deleted due to the first ~30% being empty.

[edit]

As an update the Fastexcel team plan on addressing this in the version after next. I do think it should still be noted in Polars' documentation that in pre-0.13.0 versions of the fastexcel backend columns may be lost depending on if there are more missing values than the default schema length, and the solution is schema_sample_rows=None.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature fix Bug fix python Related to Python Polars
Projects
None yet
3 participants