Can't use ${field_name} if it contains UTF-8 characters also encodeable as Latin-1 #1358

Closed
clemente opened this issue Aug 19, 2023 · 10 comments · Fixed by #1363

@clemente

I would expect being able to quote any field name in curly brackets ${…}. Right now it's inconsistent:

  • I can use {} even when there are non-Latin-1 characters like Chinese characters
  • but I can't when there are Latin-1 characters like á:

Example:

echo "These work:"
(echo "a,b"; echo "1,2") | mlr --csv --rs lf put '${c三} = 3'
(echo "a,b"; echo "1,2") | mlr --csv --rs lf put '${ĉa} = 3'
(echo "a,b"; echo "1,2") | mlr --csv --rs lf put '${ca} = 3'
(echo "a,b"; echo "1,2") | mlr --csv --rs lf put '${cш} = 3'
(echo "a,b"; echo "1,2") | mlr --csv --rs lf put '${c#} = 3'
(echo "a,b"; echo "1,2") | mlr --csv --rs lf put '${c€} = 3'
(echo "a,b"; echo "1,2") | mlr --csv --rs lf put '${c%} = 3'
echo "None of these work:"
(echo "a,b"; echo "1,2") | mlr --csv --rs lf put '${cá} = 3'
(echo "a,b"; echo "1,2") | mlr --csv --rs lf put '${cß} = 3'
(echo "a,b"; echo "1,2") | mlr --csv --rs lf put '${cá%} = 3'

The error I see in the last examples is:

$ (echo "a,b"; echo "1,2") | mlr --csv --rs lf put '${cá} = 3'
mlr: cannot parse DSL expression.
Parse error on token "${cá" at line 1 column 1.
Please check for missing semicolon.
Expected one of:
  $ ; { unset filter print printn eprint eprintn dump edump tee emitf emit1
  emit ( emitp field_name $[ braced_field_name $[[ $[[[ full_srec oosvar_name
  @[ braced_oosvar_name full_oosvar all non_sigil_name arr bool float int
  map num str var funct + - .+ .- ! ~ string_literal regex_case_insensitive
  int_literal float_literal boolean_literal null_literal inf_literal nan_literal
  const_M_PI const_M_E panic [ ctx_IPS ctx_IFS ctx_IRS ctx_OPS ctx_OFS ctx_ORS
  ctx_FLATSEP ctx_NF ctx_NR ctx_FNR ctx_FILENAME ctx_FILENUM env call begin
  end if while do for break continue func subr return

This happens with mlr 6.6.0. In 5.10.0 this worked fine. I couldn't test a newer version. Tested on GNU/Linux, en_US.UTF-8 locale, from Unicode terminal and from scripts.

My end goal is to use a field called %año. Due to the % I need the curly braces (${%año}), but due to the ñ it doesn't work anymore in 6.6.0.

@aborruso
Contributor

> My end goal is to use a field called %año. Due to the % I need the curly braces (${%año}), but due to the ñ it doesn't work anymore in 6.6.0.

It does not work with 6.8.0 either.

@johnkerl johnkerl self-assigned this Aug 19, 2023
@johnkerl johnkerl added bug and removed bug labels Aug 19, 2023
@johnkerl
Owner

johnkerl commented Aug 19, 2023

@clemente I'll take a look.

Miller as of version 6 (the Go port) is UTF-8 throughout -- the Go language's support is great here, and Miller offers assurances about UTF-8 handling.

There is some ad-hoc Latin-1 support: see #954, #957, #997.

I'll see what I can do -- the parse error comes from the GoGGL parser generator, which is highly non-trivial software, and I'm unlikely to patch GoGGL itself. What may work, though, is feeding Latin-1 characters in DSL input strings through a Latin-1-to-UTF-8 conversion, so that by the time GoGGL gets the string to parse, it will work. But then we'd also need to convert the full record data from Latin-1 to UTF-8, so that %año in the DSL string would string-match the record-data keys (both as UTF-8).
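The conversion pass described above can be sketched like this -- illustrative Python, not Miller's actual (Go) code, and the function name is made up:

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    # Every Latin-1 byte value maps directly to the same Unicode code point,
    # so decoding as Latin-1 never fails; re-encoding as UTF-8 then yields
    # the byte form a UTF-8-only parser expects.
    return raw.decode("latin-1").encode("utf-8")

# 0xE1 is "á" in Latin-1; in UTF-8 it becomes the two bytes 0xC3 0xA1.
assert latin1_to_utf8(b"${c\xe1} = 3") == b"${c\xc3\xa1} = 3"
```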

@johnkerl
Owner

johnkerl commented Aug 19, 2023

@clemente I don't know about the size or provenance of your data -- is it realistic to suggest you first pre-process your data from Latin-1 to UTF-8, before handing it off to Miller? (See also #997.)
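For anyone who does need that pre-processing step, it amounts to something like this -- a Python sketch equivalent to `iconv -f ISO-8859-1 -t UTF-8`, with hypothetical file names:

```python
import tempfile
from pathlib import Path

# Hypothetical paths for illustration; point these at your real files.
tmp = Path(tempfile.mkdtemp())
src = tmp / "datos-latin1.csv"
dst = tmp / "datos-utf8.csv"

# A Latin-1-encoded CSV: byte 0xF1 is "ñ" in ISO-8859-1.
src.write_bytes(b"%a\xf1o,valor\n2020,100\n")

# Read as Latin-1 (never fails), write back out as UTF-8.
dst.write_text(src.read_text(encoding="latin-1"), encoding="utf-8")

assert dst.read_bytes() == "%año,valor\n2020,100\n".encode("utf-8")
```

The UTF-8 output file can then be handed to Miller as usual.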

@aborruso
Contributor

> @clemente I don't know about the size or provenance of your data -- is it realistic to suggest you first pre-process your data from Latin-1 to UTF-8, before handing it off to Miller? (See also #997.)

Hi @johnkerl, I'm probably making a naive assumption.

In mlr 6.8.0, if I run (echo "a,b"; echo "1,2") | mlrgo --csv --rs lf put '${cá} = 3' I get an error, because I'm using the accented character "á", and "á" is part of the UTF-8 character set.

So is this really a Latin-1 source problem?

@clemente
Author

@johnkerl @aborruso My data is just UTF-8: terminal, CSV file, bash scripts. I'm not using ISO-8859-1.

When I refer to Latin-1, I mean, approximately, "European accented characters" like ñ or á. Those are the ones that fail (even when encoded as UTF-8), whereas non-European alphabets don't fail.
I don't know Go's internals, but my guess is that Go or mlr detects in the background whether every character in a given string is encodable as ASCII plus "European accented characters" (rather than ASCII plus more complex scripts like Chinese) and applies a different algorithm in that case.

> In mlr 6.8.0, if I run (echo "a,b"; echo "1,2") | mlrgo --csv --rs lf put '${cá} = 3' I get an error, because I'm using the accented character "á", and "á" is part of the UTF-8 character set.

  • "á" is in both Latin-1 and UTF-8. Characters of this type are the ones that break the ${field} curly-brace syntax
  • characters that are in UTF-8 but not in Latin-1 don't break the curly braces
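The distinction in the two bullets above can be checked mechanically: the failing field names are the ones whose non-ASCII characters also exist in Latin-1. A Python sketch (the helper name is made up):

```python
def fits_latin1(s: str) -> bool:
    # ISO-8859-1 covers exactly the Unicode code points U+0000..U+00FF,
    # so encoding fails precisely for characters outside that range.
    try:
        s.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False

# Field names that broke ${...}: their accented characters fit in Latin-1.
assert fits_latin1("cá") and fits_latin1("cß")
# Field names that worked: characters outside Latin-1
# (note € is in ISO-8859-15 but not in ISO-8859-1).
assert not fits_latin1("c三")
assert not fits_latin1("cш")
assert not fits_latin1("c€")
```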

@johnkerl johnkerl changed the title Can't use curly braces around field name if it contains Latin-1 characters Can't use ${field_name} if it contains UTF-8 characters encodeable as Latin-1 Aug 20, 2023
@johnkerl johnkerl changed the title Can't use ${field_name} if it contains UTF-8 characters encodeable as Latin-1 Can't use ${field_name} if it contains UTF-8 characters also encodeable as Latin-1 Aug 20, 2023
@johnkerl
Owner

johnkerl commented Aug 20, 2023

Ahhhh thanks @aborruso and @clemente !! That makes sense. I'll take a look -- I may know what needs doing here.

@johnkerl
Owner

johnkerl commented Aug 20, 2023

Here is a repro.

Modern shells, Miller, and many other tools handle UTF-8 natively, so it isn't hard to generate test data:

$ cat datos-plurilingües.csv
año,ποσότητα
2020,100
2021,130
2022,145

And here I am seeing exactly what you are describing:

$ mlr --c2p filter '$año > 2020' datos-plurilingües.csv
año  ποσότητα
2021 130
2022 145
$ mlr --c2p filter '${año} > 2020' datos-plurilingües.csv
mlr: cannot parse DSL expression.
Parse error on token "${añ" at line 1 column 1.
Please check for missing semicolon.
Expected one of:
  ␚ ; { unset filter print printn eprint eprintn dump edump tee emitf emit1
  emit ( emitp field_name $[ braced_field_name $[[ $[[[ full_srec oosvar_name
  @[ braced_oosvar_name full_oosvar all non_sigil_name arr bool float int
  map num str var funct + - .+ .- ! ~ string_literal regex_case_insensitive
  int_literal float_literal boolean_literal null_literal inf_literal nan_literal
  const_M_PI const_M_E panic [ ctx_IPS ctx_IFS ctx_IRS ctx_OPS ctx_OFS ctx_ORS
  ctx_FLATSEP ctx_NF ctx_NR ctx_FNR ctx_FILENAME ctx_FILENUM env call begin
  end if while do for break continue func subr return
$ mlr --c2p filter '$ποσότητα > 100' datos-plurilingües.csv
año  ποσότητα
2021 130
2022 145
$ mlr --c2p filter '${ποσότητα} > 100' datos-plurilingües.csv
año  ποσότητα
2021 130
2022 145

Since Miller handles UTF-8 natively, you could simply say $año instead of ${año} -- except that your data has %año, so the % forces you to use curly braces: ${%año}.

My suspicion is that I didn't go far enough in #954 and #957.

At any rate, this is definitely a bug with how Miller handles certain UTF-8 field names within ${...}.

@johnkerl
Owner

With #1363:

$ cat test/input/datos-plurilingües.csv
año,ποσότητα
2020,100
2021,130
2022,145
$ mlr --c2p filter '$año > 2020' test/input/datos-plurilingües.csv
año  ποσότητα
2021 130
2022 145
$ mlr --c2p filter '${año} > 2020' test/input/datos-plurilingües.csv
año  ποσότητα
2021 130
2022 145
$ mlr --c2p filter '$ποσότητα > 100' test/input/datos-plurilingües.csv
año  ποσότητα
2021 130
2022 145
$ mlr --c2p filter '${ποσότητα} > 100' test/input/datos-plurilingües.csv
año  ποσότητα
2021 130
2022 145

@johnkerl
Owner

@clemente you can use this at head now. Or, it will be in 6.9.0 (upcoming -- probably a week or two away).

@johnkerl johnkerl removed the active label Aug 20, 2023
@clemente
Author

@johnkerl Thanks! After your change, the examples in the description work. My scripts that used to work in Miller 5.10.0 work again, without any modification.
