Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSV to JSONL: wrong conversion? #1418

Closed
aborruso opened this issue Oct 31, 2023 · 16 comments · Fixed by #1479
Closed

CSV to JSONL: wrong conversion? #1418

aborruso opened this issue Oct 31, 2023 · 16 comments · Fixed by #1479
Assignees
Labels

Comments

@aborruso
Copy link
Contributor

Hi,
I have this input

a,b,c
a,3,"lorem, ipsum",5
a,b
a,3,f,5,c

If I run

mlr --icsv --ojsonl -S -N --ragged cat input.csv

I get

{"1": "a", "2": "b", "3": "c"}
{"1": "a", "2": "3", "3": "lorem, ipsum", "4": "5"}
{"1": "a", "2": "b", "3": ""}
{"1": "a", "2": "3", "3": "f", "4": "5"}

Two questions:

  • why I have 3 fields in jsonl row 3?
  • why I do not have the fifth field in in jsonl row 4?

Thank you

@aborruso
Copy link
Contributor Author

aborruso commented Oct 31, 2023

Using miller 5 I get the result that feels right to me (mlr --icsv --ojson -N cat input.csv)

{ "1": "a", "2": "b", "3": "c" }
{ "1": "a", "2": 3, "3": "lorem, ipsum", "4": 5 }
{ "1": "a", "2": "b" }
{ "1": "a", "2": 3, "3": "f", "4": 5, "5": "c" }

@johnkerl johnkerl added the bug label Oct 31, 2023
@johnkerl johnkerl self-assigned this Oct 31, 2023
@aborruso
Copy link
Contributor Author

@johnkerl I discovered it, because I wanted to give a good answer to this #1417

@johnkerl
Copy link
Owner

@aborruso

Why I do not have the fifth field in in jsonl row 4? -- this is an embarrassing bug. There wasn't a loop where there should have been, so at most one extra column was handled. This was wrong and I will fix it.

Why I have 3 fields in jsonl row 3? -- this was by intentional design, although maybe that wasn't a good design decision. Namely: https://miller.readthedocs.io/en/latest/record-heterogeneity/#ragged-data:

Using the --allow-ragged-csv-input flag we can fill values in too-short rows, and provide a key (column number starting with 1) for too-long rows

It's not really necessary (and, as you saw, contrary to your expectations) to do the "fill values in too-short rows" part.

@aborruso
Copy link
Contributor Author

aborruso commented Nov 12, 2023

It's not really necessary (and, as you saw, contrary to your expectations) to do the "fill values in too-short rows" part.

I don't know if I understood you: but if the output is jsonl, I also think that mlr should not add anything in row 3.

Conversely, I think that if here the output is csv, which is a rectangular format, there should be unsparsify by default (this is an old proposal of mine). Using data/het/ragged.csv and running mlr --icsv --ocsv -S --ragged cat ragged.csv to have

a,b,c,4
1,2,3,
4,5,,
7,8,9,10

and not

a,b,c
1,2,3
4,5,

a,b,c,4
7,8,9,10

I would, on the contrary, introduce for the CSV format the "sparsify" option.

Thank you always for your time and kindness

@johnkerl
Copy link
Owner

I agree that the "fill in values in too-short rows" does not belong there.

As we've discussed before, doing unsparsify by default appears to require reading all data in a non-streaming way and I am unwilling to break one of Miller's fundamental features, namely, it does streaming processing whenever possible.

That said, I had an idea the other day that may make it possible to make to do unsparsify in a memory-efficient, stream-friendly way ...

@johnkerl
Copy link
Owner

@aborruso #1428 fixes the first issue.

The other (auto-unsparsify for CSV) will require another PR.

Here is the approach I've always said is unacceptable to me, because it breaks streaming-by-default:

  • If input format is CSV:
    • Insert an unsparsify -a verb before the user-provided then-chain

This is bad because unsparsify -a requires us to read all input before producing any output (e.g. what if the first 999,999,999 records have NF=5 but the 1,000,000,000th has NF=6?)

What should work:

  • If input format is CSV:
    • Modify the CSV record-reader itself to do the same thing mlr unsparsify -a would have done ...
    • ... except here the CSV record-reader has the header row to use as its list of "all field names" ...
    • ... and this is done only for CSV/TSV input wherein we can assume the header line applies for the entire input file (which isn't the case for JSON, JSONL, DKVP, NIDX, XTAB ...)

@johnkerl
Copy link
Owner

... correction: the above is what we already had before #1428. My proposal above would simply put that back in.

Perhaps we could auto-unsparsify on CSV output ... 🤔

@johnkerl
Copy link
Owner

@aborruso let's talk about what we expect.

Some sample data:

$ cat ragged.csv
a,b,c
1,2
4,5,6

$ cat ragged.json
[
  {
    "a": 1,
    "b": 2
  },
  {
    "a": 4,
    "b": 5,
    "c": 6
  }
]

Status quo:

$ mlr --csv cat ragged.csv
mlr: mlr: CSV header/data length mismatch 3 != 2 at filename ragged.csv row 2.

$ mlr --ragged --csv cat ragged.csv
a,b
1,2

a,b,c
4,5,6

$ mlr --ragged --icsv --ojsonl cat ragged.csv
{"a": 1, "b": 2}
{"a": 4, "b": 5, "c": 6}

$ mlr --json cat ragged.json
[
{
  "a": 1,
  "b": 2
},
{
  "a": 4,
  "b": 5,
  "c": 6
}
]

$ mlr --ijson --ocsv cat ragged.json
a,b
1,2

a,b,c
4,5,6

My proposal:

# NO CHANGE:
$ mlr --csv cat ragged.csv
mlr: mlr: CSV header/data length mismatch 3 != 2 at filename ragged.csv row 2.
.

# CHANGE:
$ mlr --ragged --csv cat ragged.csv
a,b,c
1,2,
4,5,6

# NO CHANGE:
$ mlr --ragged --icsv --ojsonl cat ragged.csv
{"a": 1, "b": 2}
{"a": 4, "b": 5, "c": 6}

# NO CHANGE:
$ mlr --json cat ragged.json
[
{
  "a": 1,
  "b": 2
},
{
  "a": 4,
  "b": 5,
  "c": 6
}
]

# CHANGE:
$ mlr --ijson --ocsv cat ragged.json
a,b,c
1,2,
4,5,6

@aborruso
Copy link
Contributor Author

@johnkerl thank for your time and for your ideas. I like your proposals.

A last question. I did not understand whether starting from this

a,b,c
a,3,"lorem, ipsum",5
a,b
a,3,f,5,c

and running mlr --icsv --ojsonl -S -N --ragged cat input.csv, I still will get this, with "3" field null value at row three?

{"1": "a", "2": "b", "3": "c"}
{"1": "a", "2": "3", "3": "lorem, ipsum", "4": "5"}
{"1": "a", "2": "b", "3": ""}
{"1": "a", "2": "3", "3": "f", "4": "5"}

Thank you

@johnkerl
Copy link
Owner

@aborruso already as of #1428 (which has been merged) we have

$ mlr --icsv --ojsonl -S -N --ragged cat input.csv
{"1": "a", "2": "b", "3": "c"}
{"1": "a", "2": "3", "3": "lorem, ipsum", "4": "5"}
{"1": "a", "2": "b"}
{"1": "a", "2": "3", "3": "f", "4": "5", "5": "c"}

@johnkerl
Copy link
Owner

johnkerl commented Nov 20, 2023

The next question I'm asking is requirements for CSV auto-unsparsify -- my proposal is that we should do it on output, always, for CSV. (My doing it on CSV input, if --ragged, is what got us here.)

@aborruso
Copy link
Contributor Author

my proposal is that we should do it on output, always, for CSV.

I love it, thank you very much

@johnkerl
Copy link
Owner

@aborruso I think you'll be happy with #1479 ... please let me know ...

@aborruso
Copy link
Contributor Author

John I think it's a great thing for all users, especially the new users, because the CSV files are rectangular, then it's great to have Miller create by default a right CSV file.

Thank you for all the things you do for us.

@aborruso
Copy link
Contributor Author

@johnkerl I have a unsparsify related question.

I have this input csv:

id,Year,Neighbourhood_name,Category,Gender,Amount
1,2019,Emilstorp,0-5 years,Male,15
2,2019,Emilstorp,6-15 years,Female,25
3,2021,Emilstorp,0-5 years,Male,20

If I run

mlr --csv cut -x -f Gender then reshape -s Category,Amount input.csv

I have this error:

mlr: CSV schema change: first keys "id,Year,Neighbourhood_name,0-5 years"; current keys "id,Year,Neighbourhood_name,6-15 years"
mlr: exiting due to data error.

It's a wrong reshape, because I must cut Gender and id, but If I change format --c2m, I have no error, probably because the unsparsify command is not forced.

So it is probably okay to have that error, but for the non-expert user it is a message that does not help to find the solution. What do you think about? I'm not making any suggestions for improvement, because I don't have any at the moment.

Thank you

@johnkerl
Copy link
Owner

johnkerl commented Feb 13, 2024

New issue: #1495

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants