JSON to CSV Error #1534

osevill · 2024-03-25T03:40:07Z

In previous versions mlr --j2c group-like worked fine.
In version 6.12 the same script with the same json file gives an error:
mlr: CSV schema change: first keys "..."; current keys "..."

The fields/properties do change from one object element of the json array to the next, but that's why I use group-like.

Thanks

The text was updated successfully, but these errors were encountered:

aborruso · 2024-03-25T06:21:16Z

Could you share a sample file and a sample command?

osevill · 2024-03-25T16:59:51Z

given this sample file:
[{"name":"Rixos The Palm Dubai","location_1":[{"lat":25.1212},{"long":55.1535}],"field_1":1},{"name":"Shangri-La Hotel","location_1":[{"lat":25.2084},{"long":55.2719}]},{"name":"Grand Hyatt","location_1":[{"lat":25.2285},{"long":55.3273}],"field_1":1,"field_2":2,"field_3":3}]

if I run mlr --j2c group-like sample.csv using ver 6.12 (or 6.11)
I get this:

name,location_1.1.lat,location_1.2.long,field_1
Rixos The Palm Dubai,25.1212,55.1535,1
Shangri-La Hotel,25.2084,55.2719,
Grand Hyatt,25.2285,55.3273,1,2,3

even though each object element has slightly different fields, where I would expect a new csv header row each time.

if I change the source file so that the nested array names differ from element to element of the outer array:
[{"name":"Rixos The Palm Dubai","location_1":[{"lat":25.1212},{"long":55.1535}],"field_1":1},{"name":"Shangri-La Hotel","location_2":[{"lat":25.2084},{"long":55.2719}]},{"name":"Grand Hyatt","location_3":[{"lat":25.2285},{"long":55.3273}],"field_1":1,"field_2":2,"field_3":3}]
I get this error:

mlr: CSV schema change: first keys "name,location_1.1.lat,location_1.2.long,field_1"; current keys "name,location_2.1.lat,location_2.2.long"
name,location_1.1.lat,location_1.2.long,field_1
Rixos The Palm Dubai,25.1212,55.1535,1
mlr: exiting due to data error.

in version 6.10, this works as expected:

name,location_1.1.lat,location_1.2.long,field_1
Rixos The Palm Dubai,25.1212,55.1535,1

name,location_1.1.lat,location_1.2.long
Shangri-La Hotel,25.2084,55.2719

name,location_1.1.lat,location_1.2.long,field_1,field_2,field_3
Grand Hyatt,25.2285,55.3273,1,2,3

and

name,location_1.1.lat,location_1.2.long,field_1
Rixos The Palm Dubai,25.1212,55.1535,1

name,location_2.1.lat,location_2.2.long
Shangri-La Hotel,25.2084,55.2719

name,location_3.1.lat,location_3.2.long,field_1,field_2,field_3
Grand Hyatt,25.2285,55.3273,1,2,3

Also, if I use --j2p (instead of --j2c) in 6.12, it seems to work fine however.

Thanks.

aborruso · 2024-03-25T18:33:00Z

Hi @osevill it's probably related to flatten/unflatten, but I do not have a solution for you.

@johnkerl will be able to help you.

johnkerl · 2024-03-26T13:47:18Z

@osevill the 6.11.0 release (https://github.com/johnkerl/miller/releases/tag/v6.11.0) contains PR #1479 which addresses issue #1418.

Before this, Miller was in some cases producing non-compliant CSV output:

$ cat i.j
[
  { "a": 1 },
  { "b": 2, "c": 3 }
]

$ mlr-6.10.0 --j2c cat i.j
a
1

b,c
2,3

After this, Miller now produces compliant CSV output, or says that it can't:

$ mlr-6.11.0 --j2c cat i.j
a
1
mlr: CSV schema change: first keys "a"; current keys "b,c"
mlr: exiting due to data error.

If one row's list of column names is a strict subset of the others it can auto-unsparsify:

$ cat k.j
[
  { "a": 1, "b": 2 },
  { "a": 3 }
]

$ mlr-6.10.0 --j2c cat k.j
a,b
1,2

a
3

$ mlr-6.11.0 --j2c cat k.j
a,b
1,2
3,

The concern raised by issue #1418, and addressed by PR #1479, is that Miller should stop producing "CSV" with non-compliant blank lines in it. @aborruso was right to request to #1418.

For the data files in this issue, the records are truly non-homogeneous and are truly not representable as compliant CSV.

Two options I can suggest:

Use csvlite output format (https://miller.readthedocs.io/en/latest/file-formats/#csvtsvasvusvetc)
- This doesn't claim to comply with RFC4180
- It allows non-homogeneous records, separated with line breaks, which is a good match for the kind of output data you want to obtain
- mlr --ijson --ocsvlite group-like i.j
Use unsparsify to obtain compliant CSV
- mlr --ijson --ocsv group-like then flatten then unsparsify i.j

aborruso · 2024-03-26T13:53:08Z

mlr --ijson --ocsv group-like then flatten then unsparsify i.j

I was looking for this, but each time I am unable to reconstruct it. Thank you @johnkerl

osevill · 2024-03-29T21:50:07Z

@johnkerl The feedback above is great help, particularly ...group-like then flatten then unsparsify to have one unique header row with fields from all json array elements.
But I find it useful sometimes to see csv records in groups, by distinct header, which is what your first suggestion does. You're correct that the sample I provided has no commas in the data fields and so csvlite works, but I then realized that my actual data does sometimes have commas inside double-quoted values, which is why the --j2c option worked well for me prior to 6.11.

Since my json data does have commas in the values, here's what I came up with...

Given this sample file (this time with commas in the values):
[{"name":"Rixos,The,Palm,Dubai","location":[{"lat":25.1212},{"long":55.1535}],"field_1":1},{"name":"Shangri,La,Hotel","location_2":[{"lat":25.2084},{"long":55.2719}]},{"name":"Grand,Hyatt","location_3":[{"lat":25.2285},{"long":55.3273}],"field_1":1,"field_2":2,"field_3":3}]

...if I convert from json to tsvlite when doing the group-like, I avoid field breaks after each comma in the value. (I'm assuming my data will not have tabs in the values, which has been the case so far...or embedded newlines); then I do a separate mlr cat just to convert from tab-delimiters to a symbol delimiter of my choice (so that the delimiter is a printable character). At this point, I tell whatever software I'm using that the delimiter is my symbol, and I'm good:

mlr --ijson --otsvlite --from ./sample_json_array.json group-like | mlr --itsvlite --ocsvlite --ofs '•' cat > ./output_file.csv

Is there a simpler way to change the delimiter than calling mlr again and changing the output field separator? tsvlite doesn't seem to support changing the output field separator.

Don't know your thoughts on this but would it be worthwhile to have a new file format that is "in between" csvlite and csv, in the sense that it would be csvlite + support for commas or newlines embedded in double quotes, but because it wouldn't adhere to the RFC4180 spec, it would allow for blank rows in the output?

In this particular instance, it would assist me in accomplishing my json to row-based transform in just one mlr group-like.

Thanks again for the feedback.

johnkerl · 2024-03-30T02:18:05Z

@osevill the best option would be to restore the behavior of #1479, default off, only behind a new opt-in flag -- I can do this, no problem.

osevill · 2024-03-30T17:37:41Z

Thanks!
So the proposed behavior would be to continue to auto-unsparsify all header fields of csv/tsv output, but with a new optional flag to not auto-unsparsify (which would make mlr --j2c group-like with the new flag work like it did in 6.10)?

johnkerl self-assigned this Jun 9, 2024

johnkerl added the active label Jun 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JSON to CSV Error #1534

JSON to CSV Error #1534

osevill commented Mar 25, 2024

aborruso commented Mar 25, 2024

osevill commented Mar 25, 2024

aborruso commented Mar 25, 2024

johnkerl commented Mar 26, 2024 •

edited

Loading

aborruso commented Mar 26, 2024

osevill commented Mar 29, 2024

johnkerl commented Mar 30, 2024 •

edited

Loading

osevill commented Mar 30, 2024

JSON to CSV Error #1534

JSON to CSV Error #1534

Comments

osevill commented Mar 25, 2024

aborruso commented Mar 25, 2024

osevill commented Mar 25, 2024

aborruso commented Mar 25, 2024

johnkerl commented Mar 26, 2024 • edited Loading

aborruso commented Mar 26, 2024

osevill commented Mar 29, 2024

johnkerl commented Mar 30, 2024 • edited Loading

osevill commented Mar 30, 2024

johnkerl commented Mar 26, 2024 •

edited

Loading

johnkerl commented Mar 30, 2024 •

edited

Loading