CSV to JSONL: wrong conversion? #1418

aborruso · 2023-10-31T09:51:54Z

Hi,
I have this input

a,b,c
a,3,"lorem, ipsum",5
a,b
a,3,f,5,c

If I run

mlr --icsv --ojsonl -S -N --ragged cat input.csv

I get

{"1": "a", "2": "b", "3": "c"}
{"1": "a", "2": "3", "3": "lorem, ipsum", "4": "5"}
{"1": "a", "2": "b", "3": ""}
{"1": "a", "2": "3", "3": "f", "4": "5"}

Two questions:

Why do I have 3 fields in JSONL row 3?
Why don't I have the fifth field in JSONL row 4?

Thank you

aborruso · 2023-10-31T10:17:07Z

Using Miller 5 I get the result that feels right to me (mlr --icsv --ojson -N cat input.csv):

{ "1": "a", "2": "b", "3": "c" }
{ "1": "a", "2": 3, "3": "lorem, ipsum", "4": 5 }
{ "1": "a", "2": "b" }
{ "1": "a", "2": 3, "3": "f", "4": 5, "5": "c" }

aborruso · 2023-10-31T17:42:58Z

@johnkerl I discovered this while trying to give a good answer to #1417.

johnkerl · 2023-11-12T00:31:10Z

@aborruso

Why do I not have the fifth field in JSONL row 4? -- this is an embarrassing bug. There wasn't a loop where there should have been, so at most one extra column was handled. This was wrong and I will fix it.

Why do I have 3 fields in JSONL row 3? -- this was by design, although maybe that wasn't a good design decision. Namely, from https://miller.readthedocs.io/en/latest/record-heterogeneity/#ragged-data:

Using the --allow-ragged-csv-input flag we can fill values in too-short rows, and provide a key (column number starting with 1) for too-long rows

It's not really necessary (and, as you saw, contrary to your expectations) to do the "fill values in too-short rows" part.
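
For illustration, here is a minimal Python sketch of the ragged-row rules discussed above. It is not Miller's actual Go code: the function name ragged_record and the pad_short switch are hypothetical, introduced only to show the two behaviours side by side (filling too-short rows versus leaving them short) and the loop over every extra column.

```python
from typing import Dict, List


def ragged_record(header: List[str], fields: List[str], pad_short: bool = False) -> Dict[str, str]:
    """Pair one data line with the header under ragged-input rules (hypothetical helper)."""
    record = dict(zip(header, fields))  # zip stops at the shorter of the two lists
    if pad_short:
        # Older behaviour described above: fill the missing columns with "".
        for name in header[len(fields):]:
            record[name] = ""
    # Too-long row: every extra field is keyed by its 1-based position.
    # Handling only one extra field here, instead of looping, would reproduce the bug.
    for position in range(len(header) + 1, len(fields) + 1):
        record[str(position)] = fields[position - 1]
    return record


header = ["1", "2", "3"]  # implicit positional header, as with -N
rows = [
    ["a", "b", "c"],
    ["a", "3", "lorem, ipsum", "5"],
    ["a", "b"],
    ["a", "3", "f", "5", "c"],
]
records = [ragged_record(header, row) for row in rows]
```

With pad_short left off, the third record keeps only keys "1" and "2", and the fourth record gets both "4" and "5".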

aborruso · 2023-11-12T08:27:09Z

It's not really necessary (and, as you saw, contrary to your expectations) to do the "fill values in too-short rows" part.

I'm not sure I understood you, but if the output is JSONL, I also think that mlr should not add anything in row 3.

On the other hand, if the output is CSV, which is a rectangular format, I think unsparsify should be applied by default (this is an old proposal of mine). For example, with data/het/ragged.csv, running mlr --icsv --ocsv -S --ragged cat ragged.csv should give

a,b,c,4
1,2,3,
4,5,,
7,8,9,10

and not

a,b,c
1,2,3
4,5,

a,b,c,4
7,8,9,10

I would instead introduce a "sparsify" option for the CSV format.

Thank you always for your time and kindness

johnkerl · 2023-11-20T03:09:02Z

I agree that the "fill in values in too-short rows" does not belong there.

As we've discussed before, doing unsparsify by default appears to require reading all data in a non-streaming way, and I am unwilling to break one of Miller's fundamental features: it does streaming processing whenever possible.

That said, I had an idea the other day that may make it possible to do unsparsify in a memory-efficient, stream-friendly way ...

johnkerl · 2023-11-20T03:33:48Z

@aborruso #1428 fixes the first issue.

The other (auto-unsparsify for CSV) will require another PR.

Here is the approach I've always said is unacceptable to me, because it breaks streaming-by-default:

If input format is CSV:
- Insert an unsparsify -a verb before the user-provided then-chain

This is bad because unsparsify -a requires us to read all input before producing any output (e.g. what if the first 999,999,999 records have NF=5 but the 1,000,000,000th has NF=6?)
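
To make the streaming concern concrete, here is a minimal Python sketch of a fill over all field names. It is an illustration only, not Miller's implementation; unsparsify_all is a hypothetical name. The point it shows is that the full key set is only known after the last record has been read, so nothing can be emitted earlier.

```python
from typing import Dict, Iterable, List


def unsparsify_all(records: Iterable[Dict[str, str]], filler: str = "") -> List[Dict[str, str]]:
    """Two-pass fill: collect every key first, then pad each record (hypothetical helper)."""
    buffered = list(records)        # pass 1 has to consume the whole input
    all_keys: Dict[str, None] = {}  # a dict preserves first-seen key order
    for record in buffered:
        for key in record:
            all_keys.setdefault(key, None)
    # Pass 2: only now is the complete key list known, so only now can output start.
    return [{key: record.get(key, filler) for key in all_keys} for record in buffered]


filled = unsparsify_all([{"a": "1", "b": "2"}, {"a": "4", "b": "5", "c": "6"}])
```

Memory use grows with the input size, which is the cost being objected to here.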

What should work:

If input format is CSV:
- Modify the CSV record-reader itself to do the same thing mlr unsparsify -a would have done ...
- ... except here the CSV record-reader has the header row to use as its list of "all field names" ...
- ... and this is done only for CSV/TSV input wherein we can assume the header line applies for the entire input file (which isn't the case for JSON, JSONL, DKVP, NIDX, XTAB ...)
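
The reader-side idea in the list above can be sketched in Python as follows. This is an assumption-laden illustration rather than Miller's record-reader: the function name is hypothetical, Python's csv module stands in for the real parser, and too-long rows are keyed by position as described earlier in the thread. Because the header line supplies the list of field names up front, each record can be filled and emitted as soon as it is read.

```python
import csv
import io
from typing import Dict, Iterator


def read_csv_filling_short_rows(text: str, filler: str = "") -> Iterator[Dict[str, str]]:
    """Stream records from CSV text, padding short rows from the header (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:  # empty input: nothing to yield
        return
    for fields in reader:
        if not fields:  # skip blank lines
            continue
        record = dict(zip(header, fields))
        for name in header[len(fields):]:  # too-short row: fill from the header
            record[name] = filler
        for position in range(len(header) + 1, len(fields) + 1):
            record[str(position)] = fields[position - 1]  # too-long row: positional keys
        yield record  # emitted immediately, no buffering of earlier records


streamed = list(read_csv_filling_short_rows("a,b,c\n1,2\n4,5,6\n"))
```

This only works for formats where one header applies to the whole file, which is the restriction stated in the last bullet.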

johnkerl · 2023-11-20T04:59:23Z

... correction: the above is what we already had before #1428. My proposal above would simply put that back in.

Perhaps we could auto-unsparsify on CSV output ... 🤔

johnkerl · 2023-11-20T05:09:19Z

@aborruso let's talk about what we expect.

Some sample data:

$ cat ragged.csv
a,b,c
1,2
4,5,6

$ cat ragged.json
[
  {
    "a": 1,
    "b": 2
  },
  {
    "a": 4,
    "b": 5,
    "c": 6
  }
]

Status quo:

$ mlr --csv cat ragged.csv
mlr: mlr: CSV header/data length mismatch 3 != 2 at filename ragged.csv row 2.

$ mlr --ragged --csv cat ragged.csv
a,b
1,2

a,b,c
4,5,6

$ mlr --ragged --icsv --ojsonl cat ragged.csv
{"a": 1, "b": 2}
{"a": 4, "b": 5, "c": 6}

$ mlr --json cat ragged.json
[
{
  "a": 1,
  "b": 2
},
{
  "a": 4,
  "b": 5,
  "c": 6
}
]

$ mlr --ijson --ocsv cat ragged.json
a,b
1,2

a,b,c
4,5,6

My proposal:

# NO CHANGE:
$ mlr --csv cat ragged.csv
mlr: mlr: CSV header/data length mismatch 3 != 2 at filename ragged.csv row 2.

# CHANGE:
$ mlr --ragged --csv cat ragged.csv
a,b,c
1,2,
4,5,6

# NO CHANGE:
$ mlr --ragged --icsv --ojsonl cat ragged.csv
{"a": 1, "b": 2}
{"a": 4, "b": 5, "c": 6}

# NO CHANGE:
$ mlr --json cat ragged.json
[
{
  "a": 1,
  "b": 2
},
{
  "a": 4,
  "b": 5,
  "c": 6
}
]

# CHANGE:
$ mlr --ijson --ocsv cat ragged.json
a,b,c
1,2,
4,5,6

aborruso · 2023-11-20T07:13:13Z

@johnkerl thank you for your time and for your ideas. I like your proposals.

One last question. I did not understand whether, starting from this

a,b,c
a,3,"lorem, ipsum",5
a,b
a,3,f,5,c

and running mlr --icsv --ojsonl -S -N --ragged cat input.csv, I will still get this, with an empty value for field "3" in row three?

{"1": "a", "2": "b", "3": "c"}
{"1": "a", "2": "3", "3": "lorem, ipsum", "4": "5"}
{"1": "a", "2": "b", "3": ""}
{"1": "a", "2": "3", "3": "f", "4": "5"}

Thank you

johnkerl · 2023-11-20T14:20:06Z

@aborruso already as of #1428 (which has been merged) we have

$ mlr --icsv --ojsonl -S -N --ragged cat input.csv
{"1": "a", "2": "b", "3": "c"}
{"1": "a", "2": "3", "3": "lorem, ipsum", "4": "5"}
{"1": "a", "2": "b"}
{"1": "a", "2": "3", "3": "f", "4": "5", "5": "c"}

johnkerl · 2023-11-20T14:20:54Z

The next question I'm asking is what the requirements for CSV auto-unsparsify should be. My proposal is that we should do it on output, always, for CSV. (My doing it on CSV input, if --ragged, is what got us here.)

aborruso · 2023-11-20T15:32:37Z

my proposal is that we should do it on output, always, for CSV.

I love it, thank you very much

johnkerl · 2024-01-20T23:44:21Z

@aborruso I think you'll be happy with #1479 ... please let me know ...

aborruso · 2024-01-21T06:54:10Z

John, I think it's a great thing for all users, especially new users: CSV files are rectangular, so it's great to have Miller create a proper CSV file by default.

Thank you for all the things you do for us.

aborruso · 2024-02-13T09:48:06Z

@johnkerl I have an unsparsify-related question.

I have this input csv:

id,Year,Neighbourhood_name,Category,Gender,Amount
1,2019,Emilstorp,0-5 years,Male,15
2,2019,Emilstorp,6-15 years,Female,25
3,2021,Emilstorp,0-5 years,Male,20

If I run

mlr --csv cut -x -f Gender then reshape -s Category,Amount input.csv

I have this error:

mlr: CSV schema change: first keys "id,Year,Neighbourhood_name,0-5 years"; current keys "id,Year,Neighbourhood_name,6-15 years"
mlr: exiting due to data error.

It's a wrong reshape, because I should cut both Gender and id, but if I change the format to --c2m, I get no error, probably because unsparsify is not forced there.

So it is probably okay to have that error, but for a non-expert user the message does not help in finding the solution. What do you think? I'm not making any suggestions for improvement, because I don't have any at the moment.

Thank you
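
To illustrate why leaving id in place gives records with different keys, here is a generic long-to-wide pivot in Python. It is a sketch of the general technique, not Miller's reshape code; long_to_wide is a hypothetical name, and rows are grouped by the values of all fields other than the key and value fields.

```python
from typing import Dict, List, Tuple


def long_to_wide(records: List[Dict[str, str]], key_field: str, value_field: str) -> List[Dict[str, str]]:
    """Pivot key/value pairs into columns, grouping by all other fields (hypothetical helper)."""
    groups: Dict[Tuple, Dict[str, str]] = {}
    for record in records:
        others = {k: v for k, v in record.items() if k not in (key_field, value_field)}
        wide = groups.setdefault(tuple(others.items()), dict(others))
        wide[record[key_field]] = record[value_field]
    return list(groups.values())


rows = [
    {"id": "1", "Year": "2019", "Neighbourhood_name": "Emilstorp", "Category": "0-5 years", "Amount": "15"},
    {"id": "2", "Year": "2019", "Neighbourhood_name": "Emilstorp", "Category": "6-15 years", "Amount": "25"},
    {"id": "3", "Year": "2021", "Neighbourhood_name": "Emilstorp", "Category": "0-5 years", "Amount": "20"},
]

# With id kept, every row is its own group, so each output record has one category column.
with_id = long_to_wide(rows, "Category", "Amount")

# With id removed, the two 2019 rows share a group and merge into one record.
without_id = long_to_wide(
    [{k: v for k, v in row.items() if k != "id"} for row in rows], "Category", "Amount"
)
```

In the with_id case the first two records have the same number of keys but different names, which matches the "first keys" and "current keys" reported in the error message above.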

johnkerl · 2024-02-13T14:39:11Z

New issue: #1495

johnkerl added the bug label Oct 31, 2023

johnkerl self-assigned this Oct 31, 2023

johnkerl added the active label Nov 12, 2023

johnkerl mentioned this issue Nov 20, 2023: Fix ragged-CSV auto-pad #1428 (Merged)

johnkerl mentioned this issue Jan 20, 2024: Auto-unsparsify CSV and TSV on output #1479 (Merged)

johnkerl closed this as completed in #1479 Jan 20, 2024

johnkerl removed the active label Jan 20, 2024

johnkerl mentioned this issue Feb 13, 2024: Question related to unsparsify #1495 (Closed)

johnkerl mentioned this issue Mar 26, 2024: JSON to CSV Error #1534 (Open)
