*_ndjson 'utf-8' encoding issue #10034

Closed
2 tasks done
getorca opened this issue Jul 23, 2023 · 4 comments
Labels
accepted Ready for implementation bug Something isn't working python Related to Python Polars regression Issue introduced by a new release

Comments

getorca commented Jul 23, 2023

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

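import polars as pl

# Keep rows with a non-empty list, explode the list into one row per struct,
# then unnest each struct into '@type', 'input' and 'output' columns.
df = pl.read_ndjson('./test.jsonl').filter(
    pl.col('structured_e').list.lengths() > 0
).select(
    ['structured_e']
).explode('structured_e').with_columns([
    pl.col('structured_e').struct.rename_fields(['@type', 'input', 'output']).alias('fields'),
]).unnest('fields').drop('structured_e')

# The third 'output' value contains non-ASCII characters (€, £).
print(df.select(['output'])[2,0])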

The test.jsonl file is available here: https://gist.github.com/getorca/d3a6460f0d14b573c1d38322828d34d8#file-test-jsonl

Issue description

Running the example throws a UnicodeDecodeError, followed by a PanicException:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 245: unexpected end of data
thread '<unnamed>' panicked at 'Python API call failed', /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-0.19.0/src/err/mod.rs:790:5
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
Cell In[103], line 1
----> 1 qa_df.select(['output'])[2,0]

File ~/Projects/fin_crawl/.venv/lib/python3.10/site-packages/polars/dataframe/frame.py:1537, in DataFrame.__getitem__(self, item)
   1533         raise ValueError(
   1534             f'Column index "{col_selection}" is out of bounds.'
   1535         )
   1536     series = self.to_series(col_selection)
-> 1537     return series[row_selection]
   1539 if isinstance(col_selection, list):
   1540     # df[:, [1, 2]]
   1541     if is_int_sequence(col_selection):

File ~/Projects/fin_crawl/.venv/lib/python3.10/site-packages/polars/series/series.py:950, in Series.__getitem__(self, item)
    948     if item < 0:
    949         item = self.len() + item
--> 950     return self._s.get_idx(item)
    952 # Slice.
    953 elif isinstance(item, slice):

PanicException: Python API call failed
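
For context on the first error message: 0xe2 is the lead byte of the three-byte UTF-8 encoding of "€", and "unexpected end of data" is what Python reports when a byte string stops partway through a multi-byte character. The following is a minimal standard-library sketch of that situation. It does not use Polars, the helper `decode_truncated` and the sample string are made up for illustration, and it only shows how such a message can arise, not where the truncation happens inside Polars.

```python
# The euro sign is three bytes in UTF-8, starting with 0xe2.
euro_bytes = "€".encode("utf-8")
print(euro_bytes)


def decode_truncated(text: str, n_bytes: int) -> str:
    """Encode text as UTF-8, keep only the first n_bytes, and try to decode.

    Hypothetical helper for illustration; returns the decoded text or the
    error message if the cut lands inside a multi-byte character.
    """
    raw = text.encode("utf-8")[:n_bytes]
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"{type(exc).__name__}: {exc}"


sample = "Germany: €10.45"
# Everything before the euro sign is ASCII, so its character index is also
# its byte index. Keeping one byte past that leaves a dangling 0xe2.
cut = sample.index("€") + 1
print(decode_truncated(sample, cut))

# Keeping every byte decodes cleanly.
print(decode_truncated(sample, len(sample.encode("utf-8"))))
```

A buffer cut one or two bytes into "€" would fail the same way, which is consistent with the report that appending more text to the value changes the outcome.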

This seems to be specific to read_ndjson and scan_ndjson: when the same data is loaded with from_dicts, the string is read correctly.

my_dict = [{"structured_e": [{"@type": "QA", "input": "What Is the Federal Minimum Wage?", "output": "The minimum wage is the lowest amount of money that employers must legally pay their workers. This amount is normally quoted as an hourly figure. In the United States, the federal minimum wage is $7.25, which has been in effect since July 2009. This figure applies to employees covered under the Fair Labor Standards Act who are non-exempt. Keep in mind that state and local governments may have their own minimum wages.\n\n"}, {"@type": "QA", "input": "What Is the Meaning of Living Wage?", "output": "A living wage is the amount of money that theoretically allows them to afford the necessities in life, including food, clothing, and shelter. The idea behind paying workers a living wage is to prevent them from falling into poverty and maintaining a good standard of living. There is not necessarily a consensus about what makes a suitable living wage because it all depends on economic conditions, location, and the cost of living.\n\n what if we add a ton more text does it get cut off??"}, {"@type": "QA", "input": "Which Countries Pay the Highest Minimum Wage?", "output": "According to the World Economic Forum, the countries with the highest minimum wage are:\n\n* Luxembourg: €15,66\n* Australia: $21.38\n* New Zealand: $21.20\n* Netherlands: €10.14\n* The United Kingdom: £9.50\n* France: €10.57\n* Germany: €10.45\n"}]}]

df = pl.from_dicts(my_dict).select(
    ['structured_e']
).explode('structured_e').with_columns([
    pl.col('structured_e').struct.rename_fields(['@type', 'input', 'output']).alias('fields'),
]).unnest('fields').drop('structured_e')

print(df.select(['output'])[2,0])

Also, appending several characters to the end of the JSON value, as in "...France: €10.57\n* Germany: €10.45\n some more random words make it work", results in the string being read correctly.

Expected behavior

The string is returned intact, with its non-ASCII characters (€, £) preserved.
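
The expected behavior can be phrased as a round-trip check: write records to an NDJSON file, read them back, and compare the `output` strings with the originals. The sketch below is one way to set that up. The names `check_ndjson_reader` and `stdlib_read_outputs` are hypothetical, and the reference reader uses the standard-library `json` module rather than Polars so that the snippet runs anywhere.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List


def stdlib_read_outputs(path: str) -> List[str]:
    """Reference reader: parse each NDJSON line and collect the 'output' values."""
    outputs: List[str] = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                row = json.loads(line)
                outputs.extend(item["output"] for item in row["structured_e"])
    return outputs


def check_ndjson_reader(records: list, reader: Callable[[str], List[str]]) -> bool:
    """Write records as NDJSON and report whether reader returns the same strings."""
    expected = [item["output"] for rec in records for item in rec["structured_e"]]
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "test.jsonl"
        with open(path, "w", encoding="utf-8") as fh:
            for rec in records:
                # ensure_ascii=False keeps "€" as raw UTF-8 bytes in the file.
                fh.write(json.dumps(rec, ensure_ascii=False) + "\n")
        return reader(str(path)) == expected


records = [
    {"structured_e": [
        {"@type": "QA", "input": "Q", "output": "* France: €10.57\n* Germany: €10.45\n"},
    ]}
]
print(check_ndjson_reader(records, stdlib_read_outputs))
```

To test Polars itself, pass a reader that loads the file with `pl.read_ndjson`, applies the explode/unnest steps from the reproducible example, and returns the `output` column as a list. That reader is left out here because the exact expression names depend on the installed Polars version.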

Installed versions

--------Version info---------
Polars:              0.18.8
Index type:          UInt32
Platform:            Linux-5.4.0-153-generic-x86_64-with-glibc2.31
Python:              3.10.12 (main, Jun  7 2023, 12:45:35) [GCC 9.4.0]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         <not installed>
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              2023.6.0
matplotlib:          <not installed>
numpy:               1.25.1
pandas:              2.0.3
pyarrow:             12.0.1
pydantic:            <not installed>
sqlalchemy:          <not installed>
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>
@getorca getorca added bug Something isn't working python Related to Python Polars labels Jul 23, 2023
@cjackal
Contributor

cjackal commented Jul 24, 2023

This is a regression introduced between polars==0.18.3 and polars==0.18.4. The bug appears to have two symptoms:

  • all escaped JSON strings are truncated and left unescaped in the output
  • all unescaped (raw UTF-8 encoded) Unicode characters very likely result in a decoding error or a wrongly parsed output string
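
To make the two categories concrete, here is a small sketch using only Python's standard `json` module, not Polars. The same text can be serialized with escapes (`\u20ac`) or with the raw UTF-8 character, and a correct JSON parser should return the identical string for both. The function name `two_serializations` is made up for this example.

```python
import json
from typing import Tuple


def two_serializations(text: str) -> Tuple[str, str]:
    """Return (escaped, unescaped) JSON serializations of the same text.

    Hypothetical helper for illustration only.
    """
    escaped = json.dumps(text)                        # non-ASCII becomes \uXXXX
    unescaped = json.dumps(text, ensure_ascii=False)  # non-ASCII kept as-is
    return escaped, unescaped


text = "France: €10.57\n"
escaped, unescaped = two_serializations(text)
print(escaped)
print(unescaped)

# Both forms must parse back to the same Python string.
print(json.loads(escaped) == json.loads(unescaped) == text)
```

In both forms the newline is written as a backslash followed by `n`, so even the "unescaped" variant still requires the parser to process escape sequences.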

@universalmind303 universalmind303 added regression Issue introduced by a new release accepted Ready for implementation labels Jul 24, 2023
@ritchie46
Member

Fixed by #10093

@versionbayjc

I believe this issue can be closed; I could not reproduce it in 0.19.8.

@ritchie46
Member

Yep, this is fixed. Thanks for reminding us.
