Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read_ndjson does not properly decode escaped chars in strings #9791

Closed
2 tasks done
versionbayjc opened this issue Jul 9, 2023 · 12 comments
Closed
2 tasks done

read_ndjson does not properly decode escaped chars in strings #9791

versionbayjc opened this issue Jul 9, 2023 · 12 comments
Assignees
Labels
accepted Ready for implementation bug Something isn't working python Related to Python Polars

Comments

@versionbayjc
Copy link

versionbayjc commented Jul 9, 2023

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

I'm no expert in this area, but I saw something very strange when using Polars read_ndjson from Python. I've simplified my example to show the (in my eyes) bug when importing a NDJSON-file with strings with \. Please save the following lines in a testData.njdson to reproduce the bug with the code below.

{"posixT":1688667089,"path":"C:\\Users\\bla\\bla"}
{"posixT":1688667129,"path":"C:\\Users\\bla\\bla"}

Since the data is imported correctly with Pandas, I believe this is a bug in Polars

Reproducible example

import polars as pl
import pandas as pd

ndjsonExampleFile = r"testData.ndjson"

# works as expected
print(pd.read_json(ndjsonExampleFile, lines=True))

# does not work correctly
print(pl.read_ndjson(ndjsonExampleFile))

Expected behavior

This is the output I see when running the repro example:

       posixT              path
0  1688667089  C:\Users\bla\bla
1  1688667129  C:\Users\bla\bla
shape: (2, 2)
┌────────────┬──────────────────┐
│ posixT     ┆ path             │
│ ---        ┆ ---              │
│ i64        ┆ str              │
╞════════════╪══════════════════╡
│ 1688667089 ┆ C:\\Users\\bla\\ │
│ 1688667129 ┆ C:\\Users\\bla\\ │
└────────────┴──────────────────┘

What is wrong: the backslashes in the outputs from Polars are shown duplicated. In JSON they are escaped, but in a string they should not be escaped anymore I think, also the last 3 characters are missing.

It's as if the 3 duplicated backslashes that are shown caused the last 3 characters to be cut off.

I expect to see the same string for "path" as Pandas is showing

Installed versions

>>> pl.show_versions()
--------Version info---------
Polars:              0.18.6
Index type:          UInt32
Platform:            Windows-10-10.0.19045-SP0
Python:              3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022, 16:36:42) [MSC v.1929 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              2023.6.0
matplotlib:          <not installed>
numpy:               1.25.0
pandas:              2.0.3
pyarrow:             12.0.1
pydantic:            1.10.11
sqlalchemy:          <not installed>
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>
@versionbayjc versionbayjc added bug Something isn't working python Related to Python Polars labels Jul 9, 2023
@cmdlineluser
Copy link
Contributor

Perhaps a more visually obvious example:

import io
import polars as pl
import pandas as pd

ndjson = br"""
{"posixT":1688667089,"path":"C:\\Users\\ABC\\DEF"}
{"posixT":1688667129,"path":"C:\\Users\\GHI\\JKL"}
"""

pl.from_pandas(pd.read_json(io.BytesIO(ndjson), lines=True))
# shape: (2, 2)
# ┌────────────┬──────────────────┐
# │ posixT     ┆ path             │
# │ ---        ┆ ---              │
# │ i64        ┆ str              │
# ╞════════════╪══════════════════╡
# │ 1688667089 ┆ C:\Users\ABC\DEF │
# │ 1688667129 ┆ C:\Users\GHI\JKL │
# └────────────┴──────────────────┘

pl.read_ndjson(ndjson)
# shape: (2, 2)
# ┌────────────┬──────────────────┐
# │ posixT     ┆ path             │
# │ ---        ┆ ---              │
# │ i64        ┆ str              │
# ╞════════════╪══════════════════╡
# │ 1688667089 ┆ C:\\Users\\ABC\\ │
# │ 1688667129 ┆ C:\\Users\\GHI\\ │
# └────────────┴──────────────────┘

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Jul 10, 2023

Possibly Windows-specific? I see the expected results on an M1 MacBook:

# shape: (2, 2)
# ┌────────────┬──────────────────┐
# │ posixT     ┆ path             │
# │ ---        ┆ ---              │
# │ i64        ┆ str              │
# ╞════════════╪══════════════════╡
# │ 1688667089 ┆ C:\Users\ABC\DEF │
# │ 1688667129 ┆ C:\Users\GHI\JKL │
# └────────────┴──────────────────┘

@cmdlineluser, could you confirm your platform? (to see if it's also Windows)

@cmdlineluser
Copy link
Contributor

cmdlineluser commented Jul 10, 2023

Ah, good point. I'm on macOS @alexander-beedie

Update: full version info:

In [7]: pl.show_versions()
--------Version info---------
Polars:              0.18.6
Index type:          UInt32
Platform:            macOS-12.6.7-arm64-arm-64bit
Python:              3.11.4 (main, Jun 15 2023, 07:52:47) [Clang 14.0.0 (clang-1400.0.29.202)]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              <not installed>
matplotlib:          <not installed>
numpy:               1.25.0
pandas:              2.0.3
pyarrow:             12.0.1
pydantic:            <not installed>
sqlalchemy:          <not installed>
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>

In [8]: pl.read_ndjson(ndjson)
Out[8]: 
shape: (2, 2)
┌────────────┬──────────────────┐
│ posixTpath             │
│ ------              │
│ i64str              │
╞════════════╪══════════════════╡
│ 1688667089C:\\Users\\ABC\\ │
│ 1688667129C:\\Users\\GHI\\ │
└────────────┴──────────────────┘

@versionbayjc
Copy link
Author

@alexander-beedie are you sure this was the result from the pl.read_ndjson() ? @cmdlineluser is also on Apple Silicon it seems.

On my intel MBP, I get very similar show_versions() dependencies as @cmdlineluser (only different platform: macOS kernel and arch). Also I just did the same exercise on a Debian 11 VM using python 3.9 and also same bug there:

Python 3.9.2 (default, Feb 28 2021, 17:03:44) 
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import io
>>> import polars as pl
>>> import pandas as pd
>>> 
>>> ndjson = br"""
... {"posixT":1688667089,"path":"C:\\Users\\ABC\\DEF"}
... {"posixT":1688667129,"path":"C:\\Users\\GHI\\JKL"}
... """
>>> 
>>> pl.from_pandas(pd.read_json(io.BytesIO(ndjson), lines=True))
shape: (2, 2)
┌────────────┬──────────────────┐
│ posixT     ┆ path             │
│ ---        ┆ ---              │
│ i64        ┆ str              │
╞════════════╪══════════════════╡
│ 1688667089 ┆ C:\Users\ABC\DEF │
│ 1688667129 ┆ C:\Users\GHI\JKL │
└────────────┴──────────────────┘
>>> pl.read_ndjson(ndjson)
shape: (2, 2)
┌────────────┬──────────────────┐
│ posixT     ┆ path             │
│ ---        ┆ ---              │
│ i64        ┆ str              │
╞════════════╪══════════════════╡
│ 1688667089 ┆ C:\\Users\\ABC\\ │
│ 1688667129 ┆ C:\\Users\\GHI\\ │
└────────────┴──────────────────┘
>>> pl.show_versions()
--------Version info---------
Polars:              0.18.6
Index type:          UInt32
Platform:            Linux-5.10.0-23-amd64-x86_64-with-glibc2.31
Python:              3.9.2 (default, Feb 28 2021, 17:03:44) 
[GCC 10.2.1 20210110]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              <not installed>
matplotlib:          <not installed>
numpy:               1.25.1
pandas:              2.0.3
pyarrow:             12.0.1
pydantic:            <not installed>
sqlalchemy:          <not installed>
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Jul 10, 2023

Hmm. Perhaps it's because I'm on the latest compiled HEAD version and something was fixed upstream... Too late to check now, it's gone midnight where I am - I'll see if that might make sense as an explanation tomorrow 🤔

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Jul 14, 2023

@ritchie46: Looks like we have ourselves a Heisenbug :)

Couldn't identify any obvious codepaths or changes that would explain why I was seeing the correct results but everyone else was not; then I realised I was likely the only one running a debug build and, sure enough... the example given above works with debug builds (eg: maturin develop) but not with optimised release builds (eg: compiled with maturin develop --release -- -C target-cpu=native). Any ideas?

@ritchie46 ritchie46 self-assigned this Jul 14, 2023
@ritchie46
Copy link
Member

Just to be sure JSON must escape \?

So if I run this in simd-json:

pub fn run()  {
    let mut d = r#"{"posixT":1688667089,"path":"C:\\Users\\ABC\\DEF"}"#.as_bytes().to_vec();
    let v = simd_json::to_owned_value(&mut d).unwrap();

    dbg!(v);

}
[src/issue.rs:6] v = Object(
    {
        "posixT": Static(
            I64(
                1688667089,
            ),
        ),
        "path": String(
            "C:\\Users\\ABC\\DEF",
        ),
    },
)

The result is wrong?

@versionbayjc
Copy link
Author

\ in strings in JSON need to be escaped yes, see for instance this reference in the RFC:
https://www.rfc-editor.org/rfc/rfc8259#section-7

   The representation of strings is similar to conventions used in the C
   family of programming languages.  A string begins and ends with
   quotation marks.  All Unicode characters may be placed within the
   quotation marks, except for the characters that MUST be escaped:
   quotation mark, reverse solidus, and the control characters (U+0000
   through U+001F).
   [...]
   Alternatively, there are two-character sequence escape
   representations of some popular characters.  So, for example, a
   string containing only a single reverse solidus character may be
   represented more compactly as "\\".

I'm not that familiar with rust (yet?), but with a quick modification of a Hello World, I think the dbg is escaping characters for visualization purposes, so then the results are actually correct. For instance:

    let s = "Hello\\\n, world!";

    println!("{}", s);
    dbg!(s);

for me shows the expected string printed with a single \ and a newline, and then the escaped string by dbg:

Hello\
, world!
[src/main.rs:6] s = "Hello\\\n, world!"

@ritchie46
Copy link
Member

fixed by #10093

@versionbayjc
Copy link
Author

Hi @ritchie46,
If I understand correctly, with #10093 referenced in https://github.com/pola-rs/polars/releases/tag/py-0.18.9, this bug should no longer be present in 0.18.9, however I can still reproduce the bug with 0.18.9...

Did I misunderstand something or was the ub fixed, but the bug with escaping is still present?

Thanks!

(venv) PS C:\Users\JorikCaljouw\src\polars> python
Python 3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022, 16:36:42) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import polars as pl
>>> pl.show_versions()
--------Version info---------
Polars:              0.18.9
Index type:          UInt32
Platform:            Windows-10-10.0.19045-SP0
Python:              3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022, 16:36:42) [MSC v.1929 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         <not installed>
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              <not installed>
matplotlib:          <not installed>
numpy:               1.25.1
pandas:              2.0.3
pyarrow:             12.0.1
pydantic:            <not installed>
sqlalchemy:          <not installed>
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>
>>> ndjson = br"""
... {"posixT":1688667089,"path":"C:\\Users\\ABC\\DEF"}
... {"posixT":1688667129,"path":"C:\\Users\\GHI\\JKL"}
... """
>>> pl.read_ndjson(ndjson)
shape: (2, 2)
┌────────────┬──────────────────┐
│ posixT     ┆ path             │
│ ---        ┆ ---              │
│ i64        ┆ str              │
╞════════════╪══════════════════╡
│ 1688667089 ┆ C:\\Users\\ABC\\ │
│ 1688667129 ┆ C:\\Users\\GHI\\ │
└────────────┴──────────────────┘

@ritchie46 ritchie46 reopened this Jul 31, 2023
@ritchie46
Copy link
Member

Sorry, my bad.

@versionbayjc
Copy link
Author

I thought I'd try this again and I think this issue was fixed as part of the mentioned #10761.
The PR does mention this Issue number, but that apparently didn't trigger this issue to be closed? I'll close it though 🙂. Thanks for the fix!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accepted Ready for implementation bug Something isn't working python Related to Python Polars
Projects
Archived in project
Development

No branches or pull requests

5 participants