Blacklisting certain characters from variable and attribute names #323
Replies: 8 comments 3 replies
-
Surely we don't want to disallow underscores!
-
Hmm -- I like this idea. But first I think we should make clear what the (long term) goal is: Unicode is very complex, with a lot of subtleties -- there are efforts to manage that with normalization (https://www.unicode.org/reports/tr15/) and categorization of code points (General Category: partition of the characters into major classes such as letters, punctuation, and symbols, and further subclasses for each of the major classes), etc. So I think we have essentially three options:
I think the whole point of this discussion is that we don't want to do (1) anymore. For (2) -- it seems appealing, but there's a lot of complexity, e.g. (from the Unicode spec)
So it can get messy. Nevertheless, there is precedent -- for instance, Python has the following rules: https://docs.python.org/3/reference/lexical_analysis.html#identifiers A bit messy, but do-able. However, there are still a number of complications -- one is NFKC normalization, and another is that Python treats some different Unicode characters as equivalent (e.g. Blackboard Bold "B" U+1D539 is the same as capital B U+0042) -- but only in contexts where the normalization is done (e.g. processing source code, but not when meta-programming). Frankly, it's a bit of a mess if people really do use the broad range of allowable characters. That being said, I think that the CF problem is easier than Python's, as CF isn't providing normalization -- only enforcement. I'm inclined (at the moment -- I haven't thought it through too carefully) to go with (3) -- allow any Unicode code point except a given blacklist. Note that I say Code Point, not character, as some characters can be represented by different code points (e.g. accented characters). If we simply do "Code Point", then there is no issue of normalization, or anything else. (Hmm, option 3(b) -- any code point, but a particular normalization?) Though maybe that's too much of a wild west?
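To make the two normalization points in this comment concrete, here is a minimal Python sketch using only the standard-library `unicodedata` module. The variable names are illustrative and not part of any CF or netCDF tooling. It shows that NFKC folds the Blackboard Bold "B" onto plain "B", and that an accented character can be spelled with different code point sequences that only compare equal after normalization.

```python
import unicodedata

# Blackboard Bold "B" (U+1D539) folds to plain "B" (U+0042) under NFKC,
# the normalization Python applies to identifiers in source code.
double_struck_b = "\U0001D539"
folded = unicodedata.normalize("NFKC", double_struck_b)

# One visible character, two code point sequences.
precomposed = "\u00e9"   # "e with acute" as a single code point
combining = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

# Compared code point by code point, the two spellings differ ...
same_without_normalization = precomposed == combining

# ... and only match once both are brought to the same normal form.
same_after_nfc = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", combining)
)

print(folded, same_without_normalization, same_after_nfc)
```

A rule written purely in terms of code points would treat `precomposed` and `combining` as two different names, which is the trade-off behind the "option 3(b)" aside.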
-
I'd like to course-correct the discussion a bit, if I may. This is not a proposal to expand the list of allowed characters in a wide-reaching way. That's what #237 is about, and a number of folks (including me and Lars) concluded that it would be unwise; there are a lot of security and interoperability concerns that make it important to consider any expansions of the list carefully and cautiously before adding them. I believe what Lars is proposing is that we add an explicit, stand-alone listing of the sets of banned and allowed characters, rather than only having them defined implicitly in the text of section 3.2. I can see the value in that, but I think we shouldn't frame it as a list of banned characters, because that implies that anything not on the list is allowed, and as discussed in #237, there are important reasons that the default answer for whether a character is allowed should be "no". I think we should have an explicit list of allowed characters, with an accompanying list (maybe an extra column) of clarifications to cover the known disallowed characters that Lars suggests. So maybe something like:
-
@sethmcg: sorry about that -- I think it was me that expanded the conversation. However, the reason I did that is that I don't see how we can talk about a blacklist without the context of what's allowed, so I was trying to get at that. However, I think maybe I get it now -- this proposal for a "blacklist" is more internal, to clearly define the rules now and to guide any potential expansion in the future -- e.g.: whatever we do, we won't allow THESE characters :-) I see the point of that, so carry on :-) To that point:
I find this odd to say -- are ANY other non-ASCII characters (any number of other symbols, punctuation, etc.) allowed? I think I get the point here, but it's an odd phrasing. I think the point is that folks may be tempted to (or accidentally) use another symbol that "looks like" a dash. In fact, I've had that issue in a totally different context, where something was copied and pasted from an application that had (helpfully) auto-changed an ASCII dash to an en dash. So I don't see this as a blacklist so much as a "be cautious of these" list -- at least in that example. Which I do think is good to document. The real blacklist is the set of characters that will break other aspects of CF / netCDF (e.g. have special meaning in CDL). -CHB
-
I hadn't thought about compiling the list of characters that we definitely don't want to add for various technical reasons, just to have a consolidated reference for what they are and why they're banned. I agree that that would be a very useful thing to have, but I'm not sure about adding it to CF proper. I worry that people would see it and think of it as the complete list of all disallowed characters, and that everything else is allowed. Maybe we want to have that list, but make it an adjunct document of some kind, like the Guidelines for Constructing Standard Names? Or put it in an appendix?
-
Wow, I was away from this issue for a few days, and in the meantime there has been a lot of activity and many good points. When opening this discussion, what I had in mind was a rather modest extension to section 2.3, where the relevant part reads
Essentially this allows, as a recommendation, the US-ASCII (or their Unicode counterpart) letters and digits and underscore, as well as period and hyphen for attribute names. All other characters are implicitly not recommended (or "should not"), but not explicitly excluded or forbidden. What I had in mind was to marginally reduce this huge list of not recommended characters by explicitly disallowing the few characters that we already now know will create problems. So far I am aware of the following, all within the US-ASCII character set, control characters (decimal 0 ... 31, 127), Based on this, my simplistic suggestion is to immediately after the text cited above add a sentence, something like
In this minimal way we avoid all complications in relation to Unicode, and focus on those few we all agree, I think, cannot be used. All other punctuation (whether ASCII or Unicode), Unicode control and what not, remains as is, which basically means to be sorted out in the future.
-
I have now explored this in some more detail, using a Python script to insert various Unicode characters into the variable name in a small .cdl file and then using ncgen to generate a .nc file. In the same script I used NCO/ncrename to try to change the same character of a variable name in a working .nc file to each of the other characters in the list, and then used ncdump to create a .cdl file. Thus it is not a full round trip, because of the NCO step. I focussed on ASCII (decimal 0 - 127), ISO/IEC 8859-1 (decimal 0 - 255) and control (C1), as well as Unicode whitespace (WS) groups (all according to Wikipedia). Here is the result:
In doing this I used the most recent released version of the netCDF library tools (netcdf library version 4.9.2 of Jun 6 2024 10:57:38). With respect to ASCII, I think that this is a pretty strong indication of which characters (groups) should not be accepted in variable and attribute names. And, yes, I do think that it is better to be explicit about this and expressly rule out those characters we know are likely to cause problems, because the CF conventions are all about data exchange and interoperability. I think that it would be good to get such a statement into CF-1.12. What do you think? ping @sethmcg @ChrisBarker-NOAA @JonathanGregory @ethanrd @Dave-Allured @DocOtak @davidhassell
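The ncgen half of the experiment described in this comment could be sketched roughly as follows. This is not the script that was actually used. The helper names `build_cdl`, `ncgen_accepts` and `probe_names` are hypothetical, the candidate character is placed into the name unescaped (an assumption about the setup), and the NCO/ncrename and ncdump steps are left out. The only external call is `ncgen -o <output> <input>`, and the sketch returns `None` when ncgen is not installed.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def build_cdl(var_name: str) -> str:
    """Return a minimal CDL document declaring one variable called var_name."""
    return (
        "netcdf probe {\n"
        "dimensions:\n"
        "    x = 1 ;\n"
        "variables:\n"
        f"    float {var_name}(x) ;\n"
        "}\n"
    )


def ncgen_accepts(cdl_text: str):
    """Return True/False if ncgen succeeds/fails, or None if ncgen is missing."""
    if shutil.which("ncgen") is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        cdl_path = Path(tmp) / "probe.cdl"
        nc_path = Path(tmp) / "probe.nc"
        cdl_path.write_text(cdl_text, encoding="utf-8")
        result = subprocess.run(
            ["ncgen", "-o", str(nc_path), str(cdl_path)],
            capture_output=True,
        )
        return result.returncode == 0 and nc_path.exists()


def probe_names(chars):
    """Map each candidate character to the ncgen outcome for the name 'a<char>b'."""
    return {ch: ncgen_accepts(build_cdl(f"a{ch}b")) for ch in chars}


# A few candidates; the full ASCII range would be [chr(cp) for cp in range(128)].
results = probe_names(["_", "/", " "])
```

Placing the candidate between two plain letters keeps the first-character rule out of the picture, so a failure can be attributed to the character itself.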
-
Dear @larsbarring et al. Thanks for your thorough investigation, Lars, and thanks everyone for the discussion. The text which Lars quoted above is not the working version. Following conventions issue #237, section 2.3 now reads
which is consistent with the conformance document. That is, as Lars says, we recommend against a lot of characters. All characters except letters, digits, underscores and (for attributes only) ASCII 2D In the discussion of conventions #237 we agreed that all characters are allowed, despite the recommendation (which is not a requirement) not to use the majority of them. Lars commented that the CF conventions "essentially provide a whitelist of explicitly allowed characters. All other characters are not recommended (or recommended against) but not explicitly disallowed. But throughout this conversation there have been several remarks that some characters should indeed be explicitly disallowed. This could easily be done by ... creating a blacklist." That's what this discussion is about, if I understand correctly. The last sentence of the working text as above is unsatisfactory, despite #237, because it says The working text is also unsatisfactory because it implies that the NUG prohibits some characters ("it allows almost all Unicode characters ...") but it doesn't say which ones are not allowed. NUG Appendix B says that names should match the regular expression
I suppose we should understand the regular expression to begin with Since ASCII is a subset of UTF-8, I think that by "multibyte UTF-8 encoded", the NUG must mean a Unicode character which is encoded in more than one byte by UTF-8. That is, MUTF8 doesn't include one-byte characters, among them the ASCII characters 00-7F. Do you agree? If that's correct, the NUG does not allow I think we should explicitly state that we prohibit 00-1F, Also, the CF working text is inconsistent with the NUG in saying "It is recommended that variable, dimension, attribute and group names begin with a letter". This is not merely a recommendation, because the NUG says that names must begin with a letter, digit, underscore or multi-byte UTF-8 character. We should fix this. Our text currently implies it's OK to start a name with a punctuation mark, for instance, which the NUG prohibits. Lars's experiment shows that I think it would be reasonable for CF to prohibit all those characters which We've decided to allow Best wishes Jonathan
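The reading of "multibyte UTF-8" given in this comment, and the first-character rule as paraphrased in it, can be illustrated with a short Python sketch. The function names are hypothetical, and the check follows the wording of the comment (letter, digit, underscore or multi-byte UTF-8 character), not the NUG regular expression itself.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 uses to encode a single character."""
    return len(ch.encode("utf-8"))


def is_multibyte_utf8(ch: str) -> bool:
    """True for characters outside the one-byte (ASCII, 00-7F) range."""
    return utf8_length(ch) > 1


def first_char_ok(name: str) -> bool:
    """Check the first character against the rule as paraphrased above."""
    if not name:
        return False
    first = name[0]
    ascii_ok = first.isascii() and (first.isalnum() or first == "_")
    return ascii_ok or is_multibyte_utf8(first)


for candidate in ["temp", "_x", "2m", "-x", "\u00f8rsted"]:
    print(repr(candidate), first_char_ok(candidate))
```

Every ASCII character, including the control characters 00-1F and 7F, encodes to exactly one byte, so none of them can enter a name through the multi-byte route.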
-
Topic for discussion
In #237 it was suggested to substantially relax restrictions on which characters are allowed in variable and attribute names. The conversation is still ongoing and sprinkled in various comments there are examples of characters that should not be allowed, either because they have special meaning in the context of CF or netCDF as such, or otherwise identified as causing problems.
I suggest that we amend the text in section 2.3 to list which characters and character ranges CF explicitly disallows, i.e. creating a blacklist. While it may not be possible to identify all characters that should be in such a list (it may even evolve over time), I think that it is helpful to identify those characters that we now know belong to such a list.
So far, I believe the following have been identified from the standard ASCII character set: <space>, control characters (decimal 0 ... 31, 127), /, :, \. This blacklist should probably be expanded to also include Unicode control, whitespace and underscore characters. In addition, double underscores (__) have special meaning in relation to OGC netCDF-LD, specifically for prefixes, and should be mentioned as reserved for that purpose so as not to create interoperability clashes.
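As a rough illustration of how the ASCII part of such a blacklist could be checked, here is a small Python sketch. The names `FLAGGED_ASCII`, `flagged_characters` and `uses_double_underscore` are hypothetical. The set contains only the characters listed in this post and is not an agreed CF rule.

```python
# ASCII characters named in the post: space, control characters
# (decimal 0 ... 31 and 127), "/", ":" and "\".
FLAGGED_ASCII = frozenset(
    [chr(cp) for cp in range(32)]
    + [chr(127)]
    + [" ", "/", ":", "\\"]
)


def flagged_characters(name: str) -> list:
    """Return the characters of name that are on the illustrative blacklist."""
    return [ch for ch in name if ch in FLAGGED_ASCII]


def uses_double_underscore(name: str) -> bool:
    """True if name contains '__', which the post ties to OGC netCDF-LD prefixes."""
    return "__" in name


for candidate in ["air_temperature", "air temperature", "a/b:c", "bald__prefix"]:
    print(repr(candidate), flagged_characters(candidate), uses_double_underscore(candidate))
```

Keeping the double-underscore test separate from the character set reflects the distinction drawn in the post: "__" is reserved for a purpose, not listed as a forbidden character.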