CSV to JSONL: wrong conversion? #1418
Using miller 5 I get the result that feels right to me (
|
Why I do not have the fifth field in jsonl row 4? -- This is an embarrassing bug. There wasn't a loop where there should have been, so at most one extra column was handled. This was wrong and I will fix it.
Why I have 3 fields in jsonl row 3? -- This was by intentional design, although maybe that wasn't a good design decision. Namely: https://miller.readthedocs.io/en/latest/record-heterogeneity/#ragged-data:
It's not really necessary (and, as you saw, contrary to your expectations) to do the "fill values in too-short rows" part. |
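To make the two points above concrete, here is a minimal Python sketch of ragged-row handling. It is not Miller's actual Go code. The function name `ragged_rows_to_records` is hypothetical, and labeling extra fields by their 1-up position is an assumption taken from the ragged-data docs linked above. Too-short rows keep only the keys they actually have, and the loop covers every extra column rather than just the first one.

```python
def ragged_rows_to_records(header, rows):
    """Turn ragged data rows into records (dicts), one per row.

    Assumption: fields beyond the header are keyed by their 1-up position.
    Too-short rows are NOT padded; they simply have fewer keys.
    """
    records = []
    for row in rows:
        record = {}
        # Loop over every field, so any number of extra columns is handled.
        for i, value in enumerate(row):
            key = header[i] if i < len(header) else str(i + 1)
            record[key] = value
        records.append(record)
    return records


header = ["a", "b", "c"]
rows = [["1", "2", "3"], ["4", "5"], ["6", "7", "8", "9", "10"]]
records = ragged_rows_to_records(header, rows)
```

With this sketch the second record has only `a` and `b`, and the third record carries both extra fields under the keys `4` and `5`.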
I don't know if I understood you, but if the output is JSONL, I also think that mlr should not add anything in row 3. Conversely, I think that if the output is CSV, which is a rectangular format, unsparsify should be applied by default (this is an old proposal of mine). Using
and not
I would instead introduce a "sparsify" option for the CSV format. Thank you, as always, for your time and kindness. |
I agree that the "fill in values in too-short rows" part does not belong there. As we've discussed before, doing unsparsify by default appears to require reading all data in a non-streaming way, and I am unwilling to break one of Miller's fundamental features, namely, that it does streaming processing whenever possible. That said, I had an idea the other day that may make it possible to do unsparsify in a memory-efficient, stream-friendly way ... |
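For readers wondering why a general unsparsify is non-streaming, here is a minimal Python sketch (the function name and `fill_with` parameter are illustrative, not Miller's API). The full set of keys is only known after the last record has been read, so every record has to be held in memory before the first one can be emitted.

```python
def unsparsify(records, fill_with=""):
    """Give every record the union of all keys, filling gaps with fill_with.

    Non-streaming by construction: all records are retained until the
    complete key set is known.
    """
    records = list(records)  # must hold everything in memory
    all_keys = {}            # dict used as an insertion-ordered set
    for record in records:
        for key in record:
            all_keys.setdefault(key, None)
    return [{key: record.get(key, fill_with) for key in all_keys}
            for record in records]


result = unsparsify([{"a": 1}, {"b": 2}, {"a": 3, "c": 4}])
```

The first output record already needs the keys `b` and `c`, which only appear in later input records. That dependency on future input is what conflicts with streaming-by-default.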
@aborruso #1428 fixes the first issue. The other (auto-unsparsify for CSV) will require another PR. Here is the approach I've always said is unacceptable to me, because it breaks streaming-by-default:
This is bad because What should work:
|
... correction: the above is what we already had before #1428. My proposal above would simply put that back in. Perhaps we could auto-unsparsify on CSV output ... 🤔 |
@aborruso let's talk about what we expect. Some sample data:
Status quo:
My proposal:
|
@johnkerl thank you for your time and for your ideas. I like your proposals. A last question. I did not understand whether starting from this
and running
Thank you |
The next question I'm asking is requirements for CSV auto-unsparsify -- my proposal is that we should do it on output, always, for CSV. (My doing it on CSV input, if |
I love it, thank you very much |
John, I think it's a great thing for all users, especially new users: CSV files are rectangular, so it's great to have Miller create a well-formed CSV file by default. Thank you for all the things you do for us. |
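As an illustration only, here is one possible reading of "unsparsify on CSV output" that stays streaming. This is an assumption about the proposal, not Miller's implementation, and the function name is hypothetical. The header is fixed by the first record, later records that lack some of those keys are filled with an empty value, and a record that brings new keys is rejected, because the header has already been written.

```python
import csv
import io


def write_csv_filling_missing(records, fill_with=""):
    """Stream records to CSV text, padding keys missing relative to the header.

    Assumption: the first record defines the header. Records with keys
    outside that header raise ValueError, since the header line is already out.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header = None
    for record in records:
        if header is None:
            header = list(record)
            writer.writerow(header)
        extra = [key for key in record if key not in header]
        if extra:
            raise ValueError(f"keys not in header: {extra}")
        writer.writerow([record.get(key, fill_with) for key in header])
    return out.getvalue()


text = write_csv_filling_missing([
    {"a": "1", "b": "2", "c": "3"},
    {"a": "4", "c": "6"},
])
```

Only one record is held at a time, so memory use does not grow with input size. The trade-off is that keys first seen after the header cannot be accommodated.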
@johnkerl I have an unsparsify-related question. I have this input CSV:
If I run
I have this error:
It's a wrong reshape, because I must cut Gender and id, but if I change format So it is probably okay to have that error, but for a non-expert user it is a message that does not help to find the solution. What do you think about it? I'm not making any suggestions for improvement, because I don't have any at the moment. Thank you |
New issue: #1495 |
Hi,
I have this input
If I run
I get
Two questions:
Thank you