-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve schema inference - wrongly assigns null dtype and misses reading valid data at end of large file #208
Comments
Hello, this should be partially fixed by #195, which allows to specify dtypes. It will be part of 0.10.0, which we intend to release soon. In the meantime, maybe you could check with the master version of Regarding your file, I'm not sure what Dtype you expect for |
Hi @lukapeschke, Thank you for looking into this issue.
I gave the test.xlsx file with empty 1st column, to show the observation of dtype inferred as null.
Thank you for the reference to use dtypes parameter, i tried it and the observation is: polars library is applying schema_override after xlsx data is read by fastexcel as in below code: def _read_spreadsheet_calamine(
parser: Any,
sheet_name: str | None,
read_options: dict[str, Any],
schema_overrides: SchemaDict | None,
*,
raise_if_empty: bool,
) -> pl.DataFrame:
ws = parser.load_sheet_by_name(sheet_name, **read_options)
df = ws.to_polars()
if schema_overrides:
df = df.cast(dtypes=schema_overrides) And i do not see read_options override to pass dtypes to fastexcel ExcelReader class as in below code : class _ExcelReader:
"""A class representing an open Excel file and allowing to read its sheets"""
def load_sheet_by_name(
self,
name: str,
*,
header_row: int | None = 0,
column_names: list[str] | None = None,
skip_rows: int = 0,
n_rows: int | None = None,
schema_sample_rows: int | None = 1_000,
) -> _ExcelSheet: ... Hence, If we could have an option to read all xlsx fields "as-is" or as Strings [without schema inference] , it will be great! |
But in that case, the inferred
In your case, it should be possible to use With the file you provided, this would work as follows (requires import polars as pl
df = pl.read_excel('issue_208.xlsx', engine='calamine', read_options={'dtypes': {'Source.Name': 'string', 'Date': 'string'}}) This results in the following dataframe (types for columns which are not specified in
|
Details on steps followed to test this case :
1, Use the attached test xlsx file, Get the column count by reading first 2 lines: 2, Prepare the DTypeMap with data type as "string" for all columns, using integer indexing:
3, Used dtypes spec framed in Step 2 above and able to read data without any loss :
Details on steps followed to test this case :
1, Use the attached test xlsx file, read it as dataframe: 2, Write the dataframe as csv and validate data [1st column is all empty in output csv file but test.xlsx file has valid data for 1st column at last couple of rows]: To summarize, now able to read data without loss with dtypes specification, but will let you decide how to handle the case when default inference results in data loss. Also, let me know whether you want to keep this Issue ticket open until the default case is fixed or you want to track it separate. |
Hello @SanRepo thanks for the update! If you do not wish to specify |
Team,
Trying to read large xlsx file ( around 500k rows with 8 columns ), was using Pandas with calamine engine before, all good and got to know about polars with fastexcel [calamine], tried to adopt it and found this bug.
Observation:
During reading above large xlsx file with 1st column having valid value only at 143,407th row on overall 500k records xlsx file, I observe two issues:
Code details:
import polars as pl df = pl.read_excel(source = <<path of test.xlsx (attached)>>, sheet_id=1,engine="calamine", raise_if_empty = False print(df.sample(10))
import pandas as pd df = pd.read_excel(path, sheet_name=0, engine="calamine", dtype=str,na_filter=False)
Step to reproduce the issue:
Have the attached test.xlsx file with 1st column all empty/null and run above polars code. It prints 1st column inferred data type as null.
Please refer to pandas code above, that works for me without any data loss.
Technical Feature requests to resolve the issue:
1, ExcelReader option to read xlsx fields "as-is" / as Strings [without schema inference]
2, ExcelReader option to control Schema Inference row limit [ Hope it will help other use cases with less records, but not in my use case ]
System/lib details:
Windows OS , Python 3.11.5, polars==0.20.16, fastexcel==0.9.1,pandas==2.2.1,python-calamine==0.2.0
test.xlsx
The text was updated successfully, but these errors were encountered: