Can't use ${field_name} if it contains UTF-8 characters also encodeable as Latin-1 #1358
Comments
It also does not work with 6.8.0.
@clemente I'll take a look. Miller as of version 6 (the Go port) is UTF-8 throughout -- the Go language's support is great here, and Miller offers assurances about UTF-8 handling. There is some ad-hoc Latin-1 support: see #954, #957, #997. I'll see what I can do -- the parse error comes from the GoGGL parser generator which is highly non-trivial software. I'll be unlikely to patch GoGGL. What may work, though, is feeding Latin-1 characters in DSL input strings through a conversion from Latin-1 to UTF-8, so that when GoGGL gets the string to parse it, it will work. But then we'd need to also convert full record data from Latin-1 to UTF-8 as well, so that
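The Latin-1 to UTF-8 conversion idea mentioned above can be sketched in a few lines. This is only an illustration in Python, not Miller's Go code and not the fix that was eventually merged; the helper names `latin1_to_utf8` and `ensure_utf8` are hypothetical, and the fallback rule (treat anything that is not valid UTF-8 as Latin-1) is an assumption made for the sketch.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Reinterpret raw bytes as Latin-1 text and re-encode them as UTF-8."""
    # Every byte value 0x00-0xFF maps to a Latin-1 character,
    # so this decode step cannot fail.
    return raw.decode("latin-1").encode("utf-8")


def ensure_utf8(raw: bytes) -> bytes:
    """Return raw unchanged if it is valid UTF-8; otherwise assume Latin-1 and convert."""
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Assumption for this sketch: invalid UTF-8 input is Latin-1.
        return latin1_to_utf8(raw)


latin1_name = b"a\xf1o"            # "año" encoded as Latin-1 (ñ is the single byte 0xF1)
utf8_name = "año".encode("utf-8")  # "año" encoded as UTF-8 (ñ is the two bytes 0xC3 0xB1)

print(ensure_utf8(latin1_name) == utf8_name)
print(ensure_utf8(utf8_name) == utf8_name)
```

Input that is already valid UTF-8 passes through untouched, which matters here because the reporter's data turned out to be plain UTF-8 rather than Latin-1.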
Hi @johnkerl, probably I'm making a stupid assumption. In Then is this a Latin-1 source problem?
@johnkerl @aborruso My data is just UTF-8: terminal, CSV file, bash scripts. I'm not using ISO-8859-1. When I refer to Latin-1, I mean, approximately, „European accented characters“ like
Here is a repro. Modern shells, and Miller, and many other tools, handle UTF-8 natively so it isn't hard to generate data:
And here I am seeing exactly what you are describing:
Since Miller handles UTF-8 natively, you could say simply:
My suspicion is that I did not go far enough on #954 and #957:
At any rate, this is definitely a bug with how Miller handles certain UTF-8 field names within
With #1363:
@clemente you can use this at head now. Or, it will be in 6.9.0 (upcoming -- probably a week or two away).
@johnkerl Thanks! After your change, the examples in the description work. My scripts that used to work in Miller 5.10.0 work again, without any modification.
I would expect to be able to quote any field name in curly brackets, ${…}. Right now it's inconsistent:
{} even when there are non-Latin-1 characters like Chinese characters
á:
Example:
The error I see in the last examples is:
This happens with mlr 6.6.0. In 5.10.0 this worked fine. I couldn't test a newer version. Tested on GNU/Linux, en_US.UTF-8 locale, from Unicode terminal and from scripts.
My end goal is to use a field called %año. Due to the % I need the curly braces (${%año}), but due to the ñ it doesn't work anymore in 6.6.0.
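To make the phrase "UTF-8 characters also encodeable as Latin-1" concrete, here is a small sketch of what distinguishes ñ from, say, a Chinese character. It is written in Python purely for illustration and says nothing about how Miller parses field names; `is_latin1_encodable` is a hypothetical helper name.

```python
def is_latin1_encodable(text: str) -> bool:
    """Return True if every character in text also exists in Latin-1 (hypothetical helper)."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


field_name = "%año"
utf8_bytes = field_name.encode("utf-8")      # ñ becomes two bytes: 0xC3 0xB1
latin1_bytes = field_name.encode("latin-1")  # ñ becomes one byte: 0xF1

print(is_latin1_encodable(field_name))  # ñ has a Latin-1 code point
print(is_latin1_encodable("年"))         # a Chinese character does not
print(utf8_bytes, latin1_bytes)
```

The same character therefore has two different byte representations, which is why a field name such as %año can be valid UTF-8 input and still fall into the class of names that triggered the parse error in this report.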