read_ndjson does not properly decode escaped chars in strings #9791
Comments
Perhaps a more visually obvious example:
import io
import polars as pl
import pandas as pd
ndjson = br"""
{"posixT":1688667089,"path":"C:\\Users\\ABC\\DEF"}
{"posixT":1688667129,"path":"C:\\Users\\GHI\\JKL"}
"""
pl.from_pandas(pd.read_json(io.BytesIO(ndjson), lines=True))
# shape: (2, 2)
# ┌────────────┬──────────────────┐
# │ posixT ┆ path │
# │ --- ┆ --- │
# │ i64 ┆ str │
# ╞════════════╪══════════════════╡
# │ 1688667089 ┆ C:\Users\ABC\DEF │
# │ 1688667129 ┆ C:\Users\GHI\JKL │
# └────────────┴──────────────────┘
pl.read_ndjson(ndjson)
# shape: (2, 2)
# ┌────────────┬──────────────────┐
# │ posixT ┆ path │
# │ --- ┆ --- │
# │ i64 ┆ str │
# ╞════════════╪══════════════════╡
# │ 1688667089 ┆ C:\\Users\\ABC\\ │
# │ 1688667129 ┆ C:\\Users\\GHI\\ │
# └────────────┴──────────────────┘ |
Possibly Windows-specific? I see the expected results on an M1 MacBook:
# shape: (2, 2)
# ┌────────────┬──────────────────┐
# │ posixT ┆ path │
# │ --- ┆ --- │
# │ i64 ┆ str │
# ╞════════════╪══════════════════╡
# │ 1688667089 ┆ C:\Users\ABC\DEF │
# │ 1688667129 ┆ C:\Users\GHI\JKL │
# └────────────┴──────────────────┘
@cmdlineluser, could you confirm your platform? (to see if it's also Windows) |
Ah, good point. I'm on macOS @alexander-beedie
Update: full version info:
In [7]: pl.show_versions()
--------Version info---------
Polars: 0.18.6
Index type: UInt32
Platform: macOS-12.6.7-arm64-arm-64bit
Python: 3.11.4 (main, Jun 15 2023, 07:52:47) [Clang 14.0.0 (clang-1400.0.29.202)]
----Optional dependencies----
adbc_driver_sqlite: <not installed>
connectorx: <not installed>
deltalake: <not installed>
fsspec: <not installed>
matplotlib: <not installed>
numpy: 1.25.0
pandas: 2.0.3
pyarrow: 12.0.1
pydantic: <not installed>
sqlalchemy: <not installed>
xlsx2csv: <not installed>
xlsxwriter: <not installed>
In [8]: pl.read_ndjson(ndjson)
Out[8]:
shape: (2, 2)
┌────────────┬──────────────────┐
│ posixT ┆ path │
│ --- ┆ --- │
│ i64 ┆ str │
╞════════════╪══════════════════╡
│ 1688667089 ┆ C:\\Users\\ABC\\ │
│ 1688667129 ┆ C:\\Users\\GHI\\ │
└────────────┴──────────────────┘ |
@alexander-beedie, are you sure this was the result from pl.read_ndjson()? @cmdlineluser is also on Apple Silicon, it seems. On my Intel MBP, show_versions() reports very similar dependencies to @cmdlineluser's (only the platform differs: macOS kernel and arch). I also just ran the same exercise on a Debian 11 VM using Python 3.9, and the same bug appears there:
|
Hmm. Perhaps it's because I'm on the latest compiled HEAD version and something was fixed upstream... Too late to check now, it's gone midnight where I am - I'll see if that might make sense as an explanation tomorrow 🤔 |
@ritchie46: Looks like we have ourselves a Heisenbug :) Couldn't identify any obvious codepaths or changes that would explain why I was seeing the correct results but everyone else was not; then I realised I was likely the only one running a debug build and, sure enough... the example given above works with debug builds (eg: |
Just to be sure: so if I run this in
pub fn run() {
let mut d = r#"{"posixT":1688667089,"path":"C:\\Users\\ABC\\DEF"}"#.as_bytes().to_vec();
let v = simd_json::to_owned_value(&mut d).unwrap();
dbg!(v);
}
The result is wrong? |
Yes, \ in JSON strings needs to be escaped; see for instance this reference in the RFC:
I'm not that familiar with Rust (yet?), but with a quick modification of a Hello World, I think the
let s = "Hello\\\n, world!";
println!("{}", s);
dbg!(s);
for me shows the expected string printed with a single \ and a newline, and then the escaped string by
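The same distinction between the decoded string and its debug representation can be checked from Python. This is a minimal sketch using only the standard-library `json` module (not Polars or simd_json); the variable names are illustrative. It decodes one line of the sample data and prints both the display form and the `repr` form, which re-escapes backslashes much as a debug print would.

```python
import json

# One NDJSON line as it appears in the file. The raw-string prefix means
# each "\\" below is two literal backslash characters in the JSON text.
line = r'{"posixT":1688667089,"path":"C:\\Users\\ABC\\DEF"}'

record = json.loads(line)
path = record["path"]

print(path)        # display form: each JSON "\\" has become one backslash
print(repr(path))  # debug form: backslashes are re-escaped for display only
print(len(path))   # number of characters actually stored in the string
```

If a parser decodes the escapes correctly, the stored string contains single backslashes; doubled backslashes should only ever appear in a debug-style rendering.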
|
fixed by #10093 |
Hi @ritchie46, did I misunderstand something, or was the UB fixed while the bug with escaping is still present? Thanks!
|
Sorry, my bad. |
I thought I'd try this again and I think this issue was fixed as part of the mentioned #10761. |
Polars version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.
Issue description
I'm no expert in this area, but I saw something very strange when using Polars read_ndjson from Python. I've simplified my example to show what is (in my eyes) a bug when importing an NDJSON file containing strings with \. Please save the following lines as testData.ndjson to reproduce the bug with the code below.
Since the data is imported correctly with Pandas, I believe this is a bug in Polars
Reproducible example
Expected behavior
This is the output I see when running the repro example:
What is wrong: the backslashes in the output from Polars are shown duplicated. In JSON they are escaped, but in the decoded string they should no longer be escaped, I think. Also, the last 3 characters are missing.
It's as if the 3 extra backslashes that are shown caused the last 3 characters to be cut off.
I expect to see the same string for "path" as Pandas is showing.
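The arithmetic behind that guess can be checked with a short sketch. It assumes, purely for illustration, that the still-escaped text is cut to the length of the correctly decoded string; it reproduces the symptom only and makes no claim about the actual Polars code path. All variable names are hypothetical.

```python
import json

# Escaped text as it sits between the quotes in the NDJSON file.
raw = r"C:\\Users\\ABC\\DEF"

# What a correct JSON parser yields for that text.
decoded = json.loads(f'"{raw}"')

# Hypothetical failure mode: keep the escaped bytes, but only as many
# characters as the decoded string would have had.
truncated = raw[: len(decoded)]

# Value shown by pl.read_ndjson in the report above.
observed = r"C:\\Users\\ABC\\"

print(len(raw), len(decoded))  # the difference is the 3 unescaped backslashes
print(truncated == observed)
```

The escaped text is 3 characters longer than the decoded string, one for each escaped backslash, which matches the 3 characters missing from the reported output.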
Installed versions