Can't use ${field_name} if it contains UTF-8 characters also encodeable as Latin-1 #1358

Closed
clemente opened this issue Aug 19, 2023 · 10 comments · Fixed by #1363

@clemente

I would expect being able to quote any field name in curly brackets ${…}. Right now it's inconsistent:

  • I can use {} even when there are non-Latin-1 characters like Chinese characters
  • but I can't when there are Latin-1 characters like á:

Example:

echo "These work:"
(echo "a,b"; echo "1,2") | mlr --csv --rs lf put '${c三} = 3'
(echo "a,b"; echo "1,2") | mlr --csv --rs lf put '${ĉa} = 3'
(echo "a,b"; echo "1,2") | mlr --csv --rs lf put '${ca} = 3'
(echo "a,b"; echo "1,2") | mlr --csv --rs lf put '${cш} = 3'
(echo "a,b"; echo "1,2") | mlr --csv --rs lf put '${c#} = 3'
(echo "a,b"; echo "1,2") | mlr --csv --rs lf put '${c€} = 3'
(echo "a,b"; echo "1,2") | mlr --csv --rs lf put '${c%} = 3'
echo "None of these work:"
(echo "a,b"; echo "1,2") | mlr --csv --rs lf put '${cá} = 3'
(echo "a,b"; echo "1,2") | mlr --csv --rs lf put '${cß} = 3'
(echo "a,b"; echo "1,2") | mlr --csv --rs lf put '${cá%} = 3'

The error I see in the last examples is:

$ (echo "a,b"; echo "1,2") | mlr --csv --rs lf put '${cá} = 3'
mlr: cannot parse DSL expression.
Parse error on token "${cá" at line 1 column 1.
Please check for missing semicolon.
Expected one of:
  $ ; { unset filter print printn eprint eprintn dump edump tee emitf emit1
  emit ( emitp field_name $[ braced_field_name $[[ $[[[ full_srec oosvar_name
  @[ braced_oosvar_name full_oosvar all non_sigil_name arr bool float int
  map num str var funct + - .+ .- ! ~ string_literal regex_case_insensitive
  int_literal float_literal boolean_literal null_literal inf_literal nan_literal
  const_M_PI const_M_E panic [ ctx_IPS ctx_IFS ctx_IRS ctx_OPS ctx_OFS ctx_ORS
  ctx_FLATSEP ctx_NF ctx_NR ctx_FNR ctx_FILENAME ctx_FILENUM env call begin
  end if while do for break continue func subr return

This happens with mlr 6.6.0. In 5.10.0 this worked fine. I couldn't test a newer version. Tested on GNU/Linux, en_US.UTF-8 locale, from Unicode terminal and from scripts.

My end goal is to use a field called %año. Due to the % I need the curly braces (${%año}), but due to the ñ it doesn't work anymore in 6.6.0.

@aborruso
Contributor

> My end goal is to use a field called %año. Due to the % I need the curly braces (${%año}), but due to the ñ it doesn't work anymore in 6.6.0.

It does not work with 6.8.0 either.

@johnkerl johnkerl self-assigned this Aug 19, 2023
@johnkerl johnkerl added bug and removed bug labels Aug 19, 2023
@johnkerl
Owner

johnkerl commented Aug 19, 2023

@clemente I'll take a look.

Miller as of version 6 (the Go port) is UTF-8 throughout -- the Go language's support is great here, and Miller offers assurances about UTF-8 handling.

There is some ad-hoc Latin-1 support: see #954, #957, #997.

I'll see what I can do -- the parse error comes from the GoGGL parser generator, which is highly non-trivial software, and I'm unlikely to patch GoGGL itself. What may work, though, is feeding Latin-1 characters in DSL input strings through a Latin-1-to-UTF-8 conversion, so that by the time GoGGL gets the string to parse, it will work. But then we'd also need to convert the full record data from Latin-1 to UTF-8, so that %año in the DSL string would string-match the record-data keys (both as UTF-8).
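The conversion pass described above can be sketched like this -- illustrative Python, not Miller's actual (Go) code, and the function name is made up:

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    # Every Latin-1 byte value maps directly to the same Unicode code point,
    # so decoding as Latin-1 never fails; re-encoding as UTF-8 then yields
    # the byte form a UTF-8-only parser expects.
    return raw.decode("latin-1").encode("utf-8")

# 0xE1 is "á" in Latin-1; in UTF-8 it becomes the two bytes 0xC3 0xA1.
assert latin1_to_utf8(b"${c\xe1} = 3") == b"${c\xc3\xa1} = 3"
```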

@johnkerl
Owner

johnkerl commented Aug 19, 2023

@clemente I don't know about the size or provenance of your data -- is it realistic to suggest you first pre-process your data from Latin-1 to UTF-8, before handing it off to Miller? (See also #997.)
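For anyone who does need that pre-processing step, it amounts to something like this -- a Python sketch equivalent to `iconv -f ISO-8859-1 -t UTF-8`, with hypothetical file names:

```python
import tempfile
from pathlib import Path

# Hypothetical paths for illustration; point these at your real files.
tmp = Path(tempfile.mkdtemp())
src = tmp / "datos-latin1.csv"
dst = tmp / "datos-utf8.csv"

# A Latin-1-encoded CSV: byte 0xF1 is "ñ" in ISO-8859-1.
src.write_bytes(b"%a\xf1o,valor\n2020,100\n")

# Read as Latin-1 (never fails), write back out as UTF-8.
dst.write_text(src.read_text(encoding="latin-1"), encoding="utf-8")

assert dst.read_bytes() == "%año,valor\n2020,100\n".encode("utf-8")
```

The UTF-8 output file can then be handed to Miller as usual.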

@aborruso
Contributor

> @clemente I don't know about the size or provenance of your data -- is it realistic to suggest you first pre-process your data from Latin-1 to UTF-8, before handing it off to Miller? (See also #997.)

Hi @johnkerl, I'm probably making a naive assumption.

In mlr 6.8.0, if I run (echo "a,b"; echo "1,2") | mlrgo --csv --rs lf put '${cá} = 3' I get an error, because I'm using the accented character "á", and "á" is part of the UTF-8 character set.

So is this really a Latin-1 source problem?

@clemente
Author

@johnkerl @aborruso My data is just UTF-8: terminal, CSV file, bash scripts. I'm not using ISO-8859-1.

When I refer to Latin-1, I mean, approximately, "European accented characters" like ñ or á. Those are the ones that fail (even when encoded as UTF-8), whereas non-European alphabets don't fail.
I don't know Go's internals, but my guess is that Go or mlr detects in the background whether every character in a given string is encodable as ASCII plus "European accented characters" (rather than ASCII plus more complex scripts like Chinese) and applies a different algorithm in that case.

> In mlr 6.8.0, if I run (echo "a,b"; echo "1,2") | mlrgo --csv --rs lf put '${cá} = 3' I get an error, because I'm using the accented character "á", and "á" is part of the UTF-8 character set.

  • "á" is in both Latin-1 and UTF-8. Characters of this type are the ones that break the ${field} curly-brace syntax
  • characters that are in UTF-8 but not in Latin-1 don't break the curly braces
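The distinction in the two bullets above can be checked mechanically: the failing field names are the ones whose non-ASCII characters also exist in Latin-1. A Python sketch (the helper name is made up):

```python
def fits_latin1(s: str) -> bool:
    # ISO-8859-1 covers exactly the Unicode code points U+0000..U+00FF,
    # so encoding fails precisely for characters outside that range.
    try:
        s.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False

# Field names that broke ${...}: their accented characters fit in Latin-1.
assert fits_latin1("cá") and fits_latin1("cß")
# Field names that worked: characters outside Latin-1
# (note € is in ISO-8859-15 but not in ISO-8859-1).
assert not fits_latin1("c三")
assert not fits_latin1("cш")
assert not fits_latin1("c€")
```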

@johnkerl johnkerl changed the title Can't use curly braces around field name if it contains Latin-1 characters Can't use ${field_name} if it contains UTF-8 characters encodeable as Latin-1 Aug 20, 2023
@johnkerl johnkerl changed the title Can't use ${field_name} if it contains UTF-8 characters encodeable as Latin-1 Can't use ${field_name} if it contains UTF-8 characters also encodeable as Latin-1 Aug 20, 2023
@johnkerl
Owner

johnkerl commented Aug 20, 2023

Ahhhh thanks @aborruso and @clemente !! That makes sense. I'll take a look -- I may know what needs doing here.

@johnkerl
Owner

johnkerl commented Aug 20, 2023

Here is a repro.

Modern shells, Miller, and many other tools handle UTF-8 natively, so it isn't hard to generate test data:

$ cat datos-plurilingües.csv
año,ποσότητα
2020,100
2021,130
2022,145

And here I am seeing exactly what you are describing:

$ mlr --c2p filter '$año > 2020' datos-plurilingües.csv
año  ποσότητα
2021 130
2022 145
$ mlr --c2p filter '${año} > 2020' datos-plurilingües.csv
mlr: cannot parse DSL expression.
Parse error on token "${añ" at line 1 column 1.
Please check for missing semicolon.
Expected one of:
  ␚ ; { unset filter print printn eprint eprintn dump edump tee emitf emit1
  emit ( emitp field_name $[ braced_field_name $[[ $[[[ full_srec oosvar_name
  @[ braced_oosvar_name full_oosvar all non_sigil_name arr bool float int
  map num str var funct + - .+ .- ! ~ string_literal regex_case_insensitive
  int_literal float_literal boolean_literal null_literal inf_literal nan_literal
  const_M_PI const_M_E panic [ ctx_IPS ctx_IFS ctx_IRS ctx_OPS ctx_OFS ctx_ORS
  ctx_FLATSEP ctx_NF ctx_NR ctx_FNR ctx_FILENAME ctx_FILENUM env call begin
  end if while do for break continue func subr return
$ mlr --c2p filter '$ποσότητα > 100' datos-plurilingües.csv
año  ποσότητα
2021 130
2022 145
$ mlr --c2p filter '${ποσότητα} > 100' datos-plurilingües.csv
año  ποσότητα
2021 130
2022 145

Since Miller handles UTF-8 natively, you could simply say $año instead of ${año} -- except that your data has %año, so the % forces you to use curly braces: ${%año}.

My suspicion is that I didn't go far enough in #954 and #957.

At any rate, this is definitely a bug with how Miller handles certain UTF-8 field names within ${...}.

@johnkerl
Owner

With #1363:

$ cat test/input/datos-plurilingües.csv
año,ποσότητα
2020,100
2021,130
2022,145
$ mlr --c2p filter '$año > 2020' test/input/datos-plurilingües.csv
año  ποσότητα
2021 130
2022 145
$ mlr --c2p filter '${año} > 2020' test/input/datos-plurilingües.csv
año  ποσότητα
2021 130
2022 145
$ mlr --c2p filter '$ποσότητα > 100' test/input/datos-plurilingües.csv
año  ποσότητα
2021 130
2022 145
$ mlr --c2p filter '${ποσότητα} > 100' test/input/datos-plurilingües.csv
año  ποσότητα
2021 130
2022 145

@johnkerl
Owner

@clemente you can use this at head now. Or, it will be in 6.9.0 (upcoming -- probably a week or two away).

@johnkerl johnkerl removed the active label Aug 20, 2023
@clemente
Author

@johnkerl Thanks! After your change, the examples in the description work. My scripts that used to work in Miller 5.10.0 work again, without any modification.
