Double-quote-respecting Parser for NIDX? #1417

osevill · 2023-10-31T01:59:26Z

osevill
Oct 31, 2023

I would like to get some statistics on an irregular CSV file in which some rows have fewer fields than others. The error below is useful for learning that some rows are short, but it also prevents me from collecting statistics such as the actual number of fields (NF) per row:
mlr: mlr: CSV header/data length mismatch 15 != 17 at filename ... row 13

(I need to identify which rows have fewer fields in order to insert blank columns, i.e., two consecutive commas, in the appropriate location in the row... not at the end of the row.)

So I'm trying to treat the CSV file as number-indexed (NIDX), which avoids the above error but returns incorrect, inflated field counts, because comma is my NIDX field delimiter and it also appears inside values enclosed in double quotes.

I found the discussion below from 2020, where it's proposed that the Miller parser may respect double quotes for NIDX in a future version:

#369 (comment)

Did that ever happen? Or is there another way to collect a row-by-row count of the number of fields?

Here's the mlr expression I'm using, treating the irregular CSV file as NIDX to avoid the mismatch error:
mlr --from ./malformedFile.csv --fs comma --nidx put '$RowCount = NR; $FieldCount = NF' then cut -f RowCount,FieldCount

Thanks!

aborruso · 2023-10-31T09:20:52Z

aborruso
Oct 31, 2023

Hi,
if you have, for example,

a,b,c
a,3,"lorem, ipsum",5
a,b
a,3,f,5,c

You could run

while IFS= read -r line; do
  echo "$line" | mlr --csv -N put '$f=NF' then reorder -f f
done <input.csv

to get

3,a,b,c
4,a,3,"lorem, ipsum",5
2,a,b
5,a,3,f,5,c

5 replies

johnkerl Oct 31, 2023
Maintainer

Thanks @aborruso for #1417

There is also the older issue of #266 (comment) -- namely, DKVP simply isn't double-quote-aware the way CSV is.

This would be a great feature to have ...

osevill Nov 1, 2023
Author

@aborruso -- Thank you! This worked as-is when I tried it in zsh. Performance is slow for a file with 400k+ records, but it works. (It took a little over an hour to run the script below on an 8th-gen Intel i7.)
Two questions -- are the quotes around $line needed in the echo statement after the do? ...and what does the mlr -N parameter accomplish?

Since this solution passes 1 line at a time to mlr, I can't include NR as a field in the mlr put statement since NR is always 1. So I'm using a counter in the while loop to simulate row number. I came up with the following after doing some Stack Overflow research:

count=1; while read line; do echo "$line" | mlr --csv -N put '$fieldCount=NF' then cut -f fieldCount | {echo "$count,$(cat -)"}; ((count++)); done < ./malformedFile.csv >./mlr_output.csv
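
Because the loop above starts a separate mlr process for every record, most of the run time on a large file is probably process startup. A minimal single-pass sketch, using awk rather than Miller and assuming that no quoted field contains an embedded newline, counts the commas that fall outside double quotes and prints the row number with the field count. The function name count_fields is hypothetical.

```shell
# count_fields [FILE...]: print "row,fieldcount" for each CSV line.
# Assumption: every record sits on one physical line (no newlines inside quotes).
count_fields() {
  awk '{
    n = 1        # a line with no commas still has one field
    inq = 0      # 1 while the scan is inside a double-quoted field
    len = length($0)
    for (i = 1; i <= len; i++) {
      c = substr($0, i, 1)
      if (c == "\"") inq = !inq            # an escaped "" toggles twice, so state is unchanged
      else if (c == "," && !inq) n++       # only commas outside quotes separate fields
    }
    print NR "," n
  }' "$@"
}

# Usage with the sample rows from earlier in the thread:
printf '%s\n' 'a,b,c' 'a,3,"lorem, ipsum",5' 'a,b' 'a,3,f,5,c' | count_fields
# 1,3
# 2,4
# 3,2
# 4,5
```

The counts match the per-line Miller output shown earlier for the same rows. Unlike Miller's CSV reader, this sketch does not validate the quoting; it only tracks whether the scan is currently inside a quoted field.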

@johnkerl -- Agreed that being able to do it in Miller alone (without zsh scripting) would be great, particularly for performance reasons.

Thx both.

aborruso Nov 1, 2023

Two questions -- are the quotes around $line needed in the echo statement after the do? ...and what does the mlr -N parameter accomplish?

Using double quotes around variables in Bash is good practice for several reasons:

Variable Expansion: Inside double quotes the variable is still expanded to its value (unlike single quotes, which suppress expansion).
Handling Spaces: If the variable contains spaces, tabs, or glob characters such as *, double quotes preserve its exact value. Without quotes, Bash splits the value into separate words and may expand wildcards against filenames.
Security: Quoting prevents many kinds of injection bugs and errors that occur when the variable's value contains special characters.

In your specific case with $line, the variable could have spaces or other special characters. Using "$line" ensures that the line is passed as a single argument to echo, preserving any spaces or special characters that might be present.

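For example, consider an input line foo bar. With echo $line you would still get foo bar, but if the line contained runs of several spaces, Bash would collapse each run to a single space. With echo "$line", every space is preserved as is. (zsh, unlike Bash, does not word-split unquoted variables by default, but quoting is still the portable habit.)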
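
A quick way to see the difference, as a minimal sketch that assumes Bash or another POSIX-style shell with the default IFS (the sample value is made up for illustration):

```shell
# Hypothetical CSV line with a run of three spaces inside a quoted field.
line='a,3,"lorem,   ipsum",5'

# Unquoted: the shell splits the value on whitespace and echo rejoins
# the words with single spaces, so the run of spaces is lost.
unquoted=$(echo $line)

# Quoted: the value reaches echo as one argument, spaces intact.
quoted=$(echo "$line")

printf '%s\n' "$unquoted"   # a,3,"lorem, ipsum",5
printf '%s\n' "$quoted"     # a,3,"lorem,   ipsum",5
```

For a field count the two forms give the same answer, but the unquoted form silently changes the data that reaches mlr.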

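-N is a keystroke-saver for --implicit-csv-header --headerless-csv-output. I used it because, if I understand correctly, your input file does not have a header line. It also matters in the loop, where each line reaches mlr on its own and so has to be read as data rather than as a header.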

aborruso Nov 1, 2023

@osevill you could add the row number in a second step. So you could use my script to create the output.

And at the end use awk to add the row number:

awk -F',' '{print NR "," $0}' file.csv >output.csv
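
As a small illustration of that second step, here is a sketch that reads from standard input instead of file.csv. The function name add_row_numbers and the two input rows are made up for the example. The -F',' option is dropped because the command prints $0 whole and never uses the individual fields.

```shell
# add_row_numbers [FILE...]: prefix every line with its line number and a comma.
add_row_numbers() {
  awk '{ print NR "," $0 }' "$@"
}

# Usage: the input rows already start with a field count, as in the loop output.
printf '%s\n' '3,a,b,c' '2,a,b' | add_row_numbers
# 1,3,a,b,c
# 2,2,a,b
```

Because awk numbers physical lines, this relies on the first step writing exactly one output line per input record.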

Answer selected by osevill

osevill Nov 2, 2023
Author

OK, thx for the explanation and the suggestion to use awk for row numbers.
