feat(python): Add 'drop_empty_rows' parameter for read_excel #18253
Conversation
Force-pushed from 858ef8c to 7c8ebe3
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff            @@
##             main   #18253      +/-   ##
==========================================
  Coverage   80.23%   80.23%
==========================================
  Files        1500     1500
  Lines      198871   198883     +12
  Branches     2837     2838      +1
==========================================
+ Hits       159556   159582     +26
+ Misses      38788    38774     -14
  Partials      527      527

☔ View full report in Codecov by Sentry.
Thanks - I'll look at this shortly 👍
Apologies for the delay in review!
If I'm reading this right, it doesn't address the second problem mentioned in #14874: columns which are sorted to have more leading NAs than the default schema inference length will be set to null. For the same reason I'd strongly urge that this PR's feature should be opt-in.

Think about it. Someone uses Polars to read a CSV or Excel file. The data is new to them, so they don't know the exact contents or descriptive statistics. That's the whole reason they're opening it in Polars. A couple dozen rows and two columns are silently deleted. There is no error or warning message. How should they test for that data loss? They don't know how many rows there should be, or whether those two columns were actually completely empty, so they can't programmatically test the data by asserting an expected count of rows or non-null values. The only way to spot this would be if they already knew the answer ahead of time, or if they opened the data in Excel and checked the entire thing by eye every single time.

Now apply it to the real world. Two columns are for recording errors; when there are none, that value is left null. An empty row means there was a problem with the system: a record was created but no data was retrieved. Polars' defaults would silently drop the empty rows and null out the error columns, leading someone to believe everything is fine when in reality a potentially large portion of their records show errors.

Going by fastexcel's issue documenting this happening with only ~143k out of 500k rows, someone could lose over two thirds of their data completely silently. Not even a warning that entire columns were deleted due to the first ~30% being empty.

[edit] As an update, the fastexcel team plan on addressing this in the version after next. I do think it should still be noted in Polars' documentation that with pre-0.13.0 versions of the fastexcel backend, columns may be lost if they have more leading missing values than the default schema inference length, and the solution is to increase the schema inference sample size.
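A rough sketch of the workaround being discussed. This assumes the calamine (fastexcel) engine and its schema_sample_rows option, passed through read_options; the file name is hypothetical, and this is an illustration rather than the commenter's exact code.

```python
import polars as pl

# Default read: if a column's first N sampled rows are all empty, the whole
# column may be inferred as dtype Null and come back entirely null.
df = pl.read_excel("data.xlsx", engine="calamine")

# Workaround sketch: sample every row when inferring the schema, so leading
# blanks can no longer cause an entire column to be typed (and nulled) out.
df_safe = pl.read_excel(
    "data.xlsx",
    engine="calamine",
    read_options={"schema_sample_rows": None},  # None => sample all rows
)
```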
Fixes #18250
Fixes #14874

I've added a new flag that maintains backward compatibility by continuing to drop empty rows automatically, while allowing users to choose whether or not to drop them when reading Excel files (see the sketch below). Since dropping empty rows seems to be necessary for some users, it could be useful to implement a similar option for other methods, such as read_csv.
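A minimal usage sketch of the proposed parameter, assuming it defaults to True to preserve the existing behaviour (as the description implies); the file name is hypothetical.

```python
import polars as pl

# Default behaviour (backwards compatible): fully empty rows are dropped.
df = pl.read_excel("report.xlsx")

# New: keep empty rows, e.g. when a blank row is itself meaningful
# (a record that was created but never populated).
df_all = pl.read_excel("report.xlsx", drop_empty_rows=False)
```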