Creating a blacklisting certain characters from variable and attribute names #323

larsbarring · 2024-05-31T09:55:25Z

larsbarring
May 31, 2024
Collaborator

Topic for discussion

In #237 it was suggested to substantially relax restrictions on which characters are allowed in variable and attribute names. The conversation is still ongoing and sprinkled in various comments there are examples of characters that should not be allowed, either because they have special meaning in the context of CF or netCDF as such, or otherwise identified as causing problems.

I suggest that we amend the text in section 2.3 to list which characters and character ranges CF explicitly disallows, i.e. creating a blacklist. While it may not be possible to identify all characters that should be in such a list (it may even evolve over time), I think that it is helpful to identify those characters that we now know belong to such a list.

So far, I believe the following have been identified from the standard ASCII character set: <space>, control characters (decimal 0 ... 31, 127), /, :, \. This blacklist should probably be expanded to also include Unicode control and whitespace ~~and underscore~~ characters.

In addition, double underscores __ have special meaning in relation to OGC netCDF-LD, specifically for prefixes, and should be mentioned as reserved for that purpose so as not to create interoperability clashes.

davidhassell · 2024-05-31T11:19:13Z

davidhassell
May 31, 2024
Maintainer

This blacklist should probably be expanded to also include Unicode controls, whitespaces and underscore characters.

Surely we don't want to disallow underscores!

3 replies

larsbarring May 31, 2024
Collaborator Author

Of course not!!!! My mistake (now fixed)

efisher008 Jun 13, 2024
Maintainer

Are (en) dashes/hyphens - supported in variable and attribute names? Should this character be included in the list? Apparently em dashes — are not standard ASCII characters, so probably that does not need to be specified if names are ASCII-only by default.

sethmcg Jun 13, 2024
Collaborator

In Issue #477 we decided to allow ASCII period and ASCII hyphen in attribute names only.

So either there will need to be two lists, or the list will need to be structured to allow for differences in different contexts.

ChrisBarker-NOAA · 2024-06-13T18:39:08Z

ChrisBarker-NOAA
Jun 13, 2024
Collaborator

Hmm -- I like this idea. But first I think we should make clear what the (long term) goal is:

Unicode is very complex, with a lot of subtleties -- There are efforts to manage that with normalization (https://www.unicode.org/reports/tr15/), and categorization of code points. (General Category. Partition of the characters into major classes such as letters, punctuation, and symbols, and further subclasses for each of the major classes.) Etc.

So I think we have essentially three options:

1. Stick with ASCII -- and maybe add some extras (Latin1?) - this is not great -- really doesn't allow real internationalization -- I think there's general consensus not to do that.
2. Use the Unicode categorization to restrict allowable characters -- there are a manageable number of such categories (30-ish).
3. Allow any Unicode code point, except for a defined blacklist (that's what this discussion is about)

I think the whole point of this discussion is that we don't want to do (1) anymore.

for (2) -- it seems appealing, but there's a lot of complexity, e.g. (from the Unicode spec)

Similarly, characters whose General_Category identifies them primarily as a symbol or as a mathematical symbol may function in other contexts as punctuation or even paired punctuation. The most obvious such case is for U+003C “<” less-than sign and U+003E “>” greater-than sign. These are given the General_Category gc = Sm because their primary identity is as mathematical relational signs. However, as is obvious from HTML and XML, they also serve ubiquitously as paired bracket punctuation characters in many formal syntaxes.
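To make option (2) concrete, here is a minimal Python sketch of filtering names by Unicode General_Category using the standard `unicodedata` module. The set of allowed categories (`Lu`, `Ll`, `Nd`, `Pc`) and the helper name `allowed_by_category` are assumptions chosen for illustration only; nothing in this discussion has settled on them.

```python
import unicodedata

# Hypothetical choice of categories: uppercase and lowercase letters,
# decimal digits, and connector punctuation (which includes "_").
ALLOWED_CATEGORIES = {"Lu", "Ll", "Nd", "Pc"}


def allowed_by_category(name: str) -> bool:
    """Return True if every code point in name has an allowed category."""
    return all(unicodedata.category(ch) in ALLOWED_CATEGORIES for ch in name)


# "<" and ">" are classified as math symbols (Sm), as the quoted text says,
# so a category-based rule would reject them regardless of how they are used.
for ch in "<>_B":
    print(f"U+{ord(ch):04X} {ch!r} -> {unicodedata.category(ch)}")
```

A rule like this is compact, but as the quoted passage notes, the category of a character does not always reflect how it is used in practice.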

So it can get messy. Nevertheless, there is precedent -- for instance, Python has the following rules:

https://docs.python.org/3/reference/lexical_analysis.html#identifiers

A bit messy, but do-able.

However, there are still a number of complications -- one is NFKC normalization, and another is that Python treats some different Unicode characters as equivalent (e.g. Blackboard Bold "B" U+1D539 is the same as capital B U+0042) -- but only in context where the normalization is done (e.g. processing source code, but not when meta-programming, like setattr()) (sorry can't find a reference at the moment).
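The Python behaviour described above can be checked with a short sketch. It assumes CPython 3, where identifiers in source code are NFKC-normalized by the parser while strings passed to `setattr()` are stored as given; the class name `Holder` is made up for the example.

```python
import unicodedata

DOUBLE_STRUCK_B = "\U0001D539"  # MATHEMATICAL DOUBLE-STRUCK CAPITAL B

# NFKC normalization maps the double-struck letter to plain ASCII "B".
nfkc_is_plain_b = unicodedata.normalize("NFKC", DOUBLE_STRUCK_B) == "B"

# When the character appears as an identifier in source code, the parser
# normalizes it, so the name that ends up in the namespace is "B".
namespace = {}
exec(f"{DOUBLE_STRUCK_B} = 1", namespace)
source_name_is_plain_b = "B" in namespace and DOUBLE_STRUCK_B not in namespace


class Holder:
    """Empty class used only to hold attributes."""


# setattr() does not normalize, so the attribute keeps the original name.
obj = Holder()
setattr(obj, DOUBLE_STRUCK_B, 2)
setattr_keeps_original = hasattr(obj, DOUBLE_STRUCK_B) and not hasattr(obj, "B")

print(nfkc_is_plain_b, source_name_is_plain_b, setattr_keeps_original)
```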

Frankly, it's a bit of a mess if people really do use the broad range of allowable characters.

That being said, I think that the CF problem is easier than Python, as CF isn't providing normalization -- only enforcement.

I'm inclined (at the moment -- I haven't thought it through too carefully) to go with (3) -- allow any Unicode code point except a given blacklist. Note that I say Code Point, not character, as some characters can be represented by different code points (e.g. accented characters). If we simply do "Code Point", then there is no issue of normalization, or anything else.
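As a small illustration of the code point versus character distinction, the sketch below compares two encodings of the same accented letter. It uses only the standard `unicodedata` module; the variable names are illustrative.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT, two code points

# Compared code point by code point, the two names differ.
same_code_points = precomposed == decomposed

# After NFC normalization they become identical.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

print(len(precomposed), len(decomposed), same_code_points, same_after_nfc)
```

A rule defined purely on code points would treat these as two different names, which is simple to enforce but may surprise users who see the same glyph.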

(hmm, option 3(b) -- any code point, but a particular normalization?)

Though maybe that's too much a wild west?

0 replies

sethmcg · 2024-06-14T16:24:27Z

sethmcg
Jun 14, 2024
Collaborator

I'd like to course-correct the discussion a bit, if I may. This is not a proposal to expand the list of allowed characters in a wide-reaching way. That's what #237 is about, and a number of folks (including me and Lars) concluded that it would be unwise; there are a lot of security and interoperability concerns that make it important to consider any expansions of the list carefully and cautiously before adding them.

I believe what Lars is proposing is that we add an explicit, stand-alone listing of the sets of banned and allowed characters, rather than only having them defined implicitly in the text of section 2.3. I can see the value in that, but I think we shouldn't frame it as a list of banned characters, because that implies that anything not on the list is allowed, and as discussed in #237, there are important reasons that the default answer for whether a character is allowed should be "no". I think we should have an explicit list of allowed characters, with an accompanying list (maybe an extra column) of clarifications to cover the known disallowed characters that Lars suggests. So maybe something like:

| Allowed | Clarification |
|---|---|
| `-` | `-` is the ASCII hyphen-minus, ASCII 45 / Unicode U+002D. Other dash characters (unicode en-dash, em-dash, minus sign, soft hyphen, non-breaking hyphen, etc.) are not allowed. This character is only allowed in attribute names, not variable names. |

0 replies

ChrisBarker-NOAA · 2024-06-14T18:00:45Z

ChrisBarker-NOAA
Jun 14, 2024
Collaborator

@sethmcg: sorry about that -- I think it was me that expanded the conversation.

However, the reason I did that is that I don't see how we can talk about a blacklist without the context of what's allowed, so I was trying to get at that.

However, I think maybe I get it now -- this proposal for a "blacklist" is more internal to clearly define the rules now, and to guide any potential expansion in the future -- e.g.: whatever we do we won't allow THESE characters :-)

I see the point of that, so carry on :-)

To that point:

"-" is the ASCII hyphen-minus, ASCII 45 / Unicode U+002D. Other dash characters (unicode en-dash, em-dash, minus sign, soft hyphen, non-breaking hyphen, etc.) are not allowed. This character is only allowed in attribute names, not variable names.

I find this odd to say -- are ANY other non-ASCII characters -- any number of other symbols, punctuation, etc. -- allowed?

I think I get the point here, but it's an odd phrasing.

I think the point is that folks may be tempted to (or accidentally) use another symbol that "looks like" a dash.

In fact, I've had that issue in a totally different context, where something was copied and pasted from an application that had (helpfully) auto-changed an ASCII dash to an en dash.
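A check for this kind of accidental substitution could look like the sketch below. The table of look-alike characters follows the examples named in the proposed clarification (en dash, em dash, minus sign, soft hyphen, non-breaking hyphen) plus U+2010 HYPHEN; it is an illustrative list, not a complete one, and `find_lookalike_dashes` is a hypothetical helper.

```python
# Characters that look like ASCII hyphen-minus (U+002D) but are not.
# Illustrative only; Unicode contains further dash-like characters.
LOOKALIKE_DASHES = {
    "\u00ad": "SOFT HYPHEN",
    "\u2010": "HYPHEN",
    "\u2011": "NON-BREAKING HYPHEN",
    "\u2013": "EN DASH",
    "\u2014": "EM DASH",
    "\u2212": "MINUS SIGN",
}


def find_lookalike_dashes(name: str) -> list:
    """Return (index, code point, description) for each look-alike dash."""
    return [
        (i, f"U+{ord(ch):04X}", LOOKALIKE_DASHES[ch])
        for i, ch in enumerate(name)
        if ch in LOOKALIKE_DASHES
    ]


print(find_lookalike_dashes("cell-methods"))        # ASCII hyphen-minus only
print(find_lookalike_dashes("cell\u2013methods"))   # contains an en dash
```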

So I don't see this as a blacklist so much as a "be cautious of these" list -- at least in that example.

Which I do think is good to document.

The real blacklist is the set of characters that will break other aspects of CF / netcdf (e.g. have special meaning in CDL)

-CHB

0 replies

sethmcg · 2024-06-14T18:57:45Z

sethmcg
Jun 14, 2024
Collaborator

@ChrisBarker-NOAA

I think the point is that folks may be tempted to (or accidentally) use another symbol that "looks like" a dash.
Yes, precisely. I think that would be a good addition to the conventions, and my impression is that that's what Lars is proposing, though I may be wrong.

I hadn't thought about compiling the list of characters that we definitely don't want to add for various technical reasons, just to have a consolidated reference for what they are and why they're banned. I agree that that would be a very useful thing to have, but I'm not sure about adding it to CF proper. I worry that people would see it and think of it as the complete list of all disallowed characters, and that everything else is allowed. Maybe we want to have that list, but make it an adjunct document of some kind, like the Guidelines for Constructing Standard Names? Or put it in an appendix?

0 replies

larsbarring · 2024-06-17T08:30:44Z

larsbarring
Jun 17, 2024
Collaborator Author

Wow, I was away from this issue for a few days while there has been a lot of activity and good points. When opening this discussion, what I had in mind was a rather modest extension to section 2.3, where the relevant part reads

Variable, dimension, attribute and group names should begin with a letter and be composed of letters, digits, and underscores. By the word letters we mean the standard ASCII letters uppercase A to Z and lowercase a to z. By the word digits we mean the standard ASCII digits 0 to 9, and similarly underscores means the standard ASCII underscore _. ... ... ... ASCII period (.) and ASCII hyphen (-) are also allowed in attribute names only.

Essentially this allows, as a recommendation, the US-ASCII (or their Unicode counterpart) letters and digits and underscore, as well as period and hyphen for attribute names. All other characters are implicitly not recommended (or "should not"), but not explicitly excluded or forbidden. What I had in mind was to marginally reduce this huge list of not recommended characters by explicitly disallowing the few characters that we already now know will create problems.

So far I am aware of the following, all within the US-ASCII character set, control characters (decimal 0 ... 31, 127), (space), /, \, : (I do not remember in what context the : surfaced, so maybe I am mistaken).

Based on this, my simplistic suggestion is to immediately after the text cited above add a sentence, something like

... ... ... ASCII period (.) and ASCII hyphen (-) are also allowed in attribute names only. The following ASCII characters must not be used: control characters (decimal 0 - 31, 127), (space), /, \ and :.

In this minimal way we avoid all complications in relation to Unicode, and focus on those few we all agree, I think, cannot be used. All other punctuation (whether ASCII or Unicode), Unicode control and what not, remains as is, which basically means to be sorted out in the future.
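The proposed sentence translates into a very small check. The sketch below encodes exactly the ASCII characters listed in the suggestion (control characters 0 - 31 and 127, space, /, \ and :); the names `FORBIDDEN` and `forbidden_characters` are made up for illustration.

```python
# ASCII characters that the suggested sentence says must not be used.
FORBIDDEN = {chr(code) for code in range(32)} | {chr(127), " ", "/", "\\", ":"}


def forbidden_characters(name: str) -> list:
    """Return the sorted, de-duplicated forbidden characters found in name."""
    return sorted({ch for ch in name if ch in FORBIDDEN})


print(forbidden_characters("air_temperature"))   # nothing forbidden
print(forbidden_characters("a/b c"))             # slash and space
```

Everything not in this set is left untouched by the check, which matches the intent of sorting out the remaining characters later.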

0 replies

larsbarring · 2024-10-04T15:23:32Z

larsbarring
Oct 4, 2024
Collaborator Author

I have now explored this in some more detail using a Python script to insert various Unicode characters into the variable name in a small .cdl file and then use ncgen to generate a .nc file. In the same script I used NCO/ncrename trying to change the same character of a variable name in a working nc-file to all other characters in the list, and then use ncdump to create a cdl file. Thus it is not a full round-trip because of the NCO step. I focussed on ASCII (decimal 0 - 127), ISO/IEC 8859-1 (decimal 0 - 255) and control (C1), as well as Unicode whitespace (WS) groups (all according to Wikipedia). Here is the result:

| Code point | Decimal | Character "group" | ncgen / nco+ncdump |
|---|---|---|---|
| U+0000 | 0 | ASCII "`nul`" Unicode control C0 | NOT/NOT |
| U+0001 - U+0008 | 1 - 8 | ASCII/Unicode control C0 | NOT/OK |
| U+0009 - U+000A | 9 - 10 | ASCII/Unicode control C0 | NOT/NOT |
| U+000B - U+001F | 11 - 31 | ASCII/Unicode control C0 | NOT/OK |
| U+0020 | 32 | ASCII/ISO/IEC 8859-1 (space) | NOT/NOT |
| U+0021 | 33 | ASCII/ISO/IEC 8859-1 `!` | NOT/OK |
| U+0022 | 34 | ASCII/ISO/IEC 8859-1 `"` | NOT/NOT |
| U+0023 - U+0025 | 35 - 37 | ASCII/ISO/IEC 8859-1 `#` `$` `%` | NOT/OK |
| U+0026 - U+0029 | 38 - 41 | ASCII/ISO/IEC 8859-1 `&` `'` `(` `)` | NOT/NOT |
| U+002A | 42 | ASCII/ISO/IEC 8859-1 `*` | NOT/OK |
| U+002B | 43 | ASCII/ISO/IEC 8859-1 `+` | OK/OK |
| U+002C | 44 | ASCII/ISO/IEC 8859-1 `,` | NOT/OK |
| U+002D - U+002E | 45 - 46 | ASCII/ISO/IEC 8859-1 `-` `.` | OK/OK |
| U+002F | 47 | ASCII/ISO/IEC 8859-1 `/` | NOT/OK |
| U+0030 - U+0039 | 48 - 57 | ASCII/ISO/IEC 8859-1 digits | OK/OK |
| U+003A | 58 | ASCII/ISO/IEC 8859-1 `:` | NOT/OK |
| U+003B | 59 | ASCII/ISO/IEC 8859-1 `;` | NOT/NOT |
| U+003C - U+003F | 60 - 63 | ASCII/ISO/IEC 8859-1 `<` `=` `>` `?` | NOT/OK |
| U+0040 | 64 | ASCII/ISO/IEC 8859-1 `@` | OK/OK |
| U+0041 - U+005A | 65 - 90 | ASCII/ISO/IEC 8859-1 `A` - `Z` | OK/OK |
| U+005B - U+005E | 91 - 94 | ASCII/ISO/IEC 8859-1 `[` `\` `]` `^` | NOT/OK |
| U+005F | 95 | ASCII/ISO/IEC 8859-1 `_` | OK/OK |
| U+0060 | 96 | ASCII/ISO/IEC 8859-1 `` ` `` | NOT/NOT |
| U+0061 - U+007A | 97 - 122 | ASCII/ISO/IEC 8859-1 `a` - `z` | OK/OK |
| U+007B | 123 | ASCII/ISO/IEC 8859-1 `{` | NOT/OK |
| U+007C | 124 | ASCII/ISO/IEC 8859-1 `\|` | NOT/NOT |
| U+007D - U+007E | 125 - 126 | ASCII/ISO/IEC 8859-1 `}` `~` | NOT/OK |
| U+007F | 127 | ASCII "`del`" Unicode control C0 | NOT/OK |
| U+0080 - U+009F | 128 - 159 | Unicode control C1 | OK/OK but screen printouts misbehave, some pretty badly (!) |
| U+00A0 | 160 | ISO/IEC 8859-1, Unicode WS | OK/OK |
| U+00A1 - U+00FF | 161 - 255 | ISO/IEC 8859-1 | OK/OK |
| U+1680, U+2000 - U+200A, U+2028, U+2029, U+202F, U+205F, U+3000 | 5760, 8192 - 8202, 8232, 8233, 8239, 8287, 12288 | Unicode WS whitespace | OK/OK but screen printouts look strange |

In doing this I used the most recent released version of the netCDF library tools (netcdf library version 4.9.2 of Jun 6 2024 10:57:38).
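For anyone who wants to repeat the ncgen half of this experiment, a rough sketch follows. It is not the script used above: the CDL template, the placement of the test character between `a` and `b`, and the helper names `make_cdl` and `ncgen_accepts` are all assumptions. The only external call is `ncgen -o <output> <input>`, and the function returns `None` when `ncgen` is not installed.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# Hypothetical minimal CDL file with one variable whose name is filled in.
CDL_TEMPLATE = """netcdf probe {{
dimensions:
    x = 1 ;
variables:
    float {name}(x) ;
}}
"""


def make_cdl(code_point: int) -> str:
    """Build CDL text with the test character placed inside the name."""
    name = f"a{chr(code_point)}b"
    return CDL_TEMPLATE.format(name=name)


def ncgen_accepts(code_point: int):
    """Return True/False for ncgen success, or None if ncgen is missing."""
    if shutil.which("ncgen") is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        cdl_path = Path(tmp) / "probe.cdl"
        nc_path = Path(tmp) / "probe.nc"
        cdl_path.write_text(make_cdl(code_point), encoding="utf-8")
        result = subprocess.run(
            ["ncgen", "-o", str(nc_path), str(cdl_path)],
            capture_output=True,
        )
        return result.returncode == 0 and nc_path.exists()


if __name__ == "__main__":
    for cp in (0x2B, 0x2F, 0x3A):
        print(f"U+{cp:04X}", ncgen_accepts(cp))
```

The sketch writes the character unescaped into the CDL, so the result reflects how ncgen parses raw CDL text and may differ for characters that CDL allows only in escaped form.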

With respect to ASCII, I think that this is a pretty strong indication of which characters (groups) should not be accepted in variable and attribute names.

And, yes, I do think that it is better to be explicit about this and expressly rule out those characters we know are likely to cause problems because the CF conventions are all about data exchange and interoperability.

I think that it would be good to get such a statement into CF-1.12, what do you think?

ping @sethmcg @ChrisBarker-NOAA @JonathanGregory @ethanrd @Dave-Allured @DocOtak @davidhassell

0 replies

JonathanGregory · 2024-10-04T17:43:14Z

JonathanGregory
Oct 4, 2024
Maintainer

Dear @larsbarring et al.

Thanks for your thorough investigation, Lars, and thanks everyone for the discussion. The text which Lars quoted above is not the working version. Following conventions issue #237, section 2.3 now reads

It is recommended that variable, dimension, attribute and group names begin with a letter and be composed of letters, digits, and underscores. By the word letters we mean the standard ASCII letters uppercase A to Z and lowercase a to z. By the word digits we mean the standard ASCII digits 0 to 9, and similarly underscores means the standard ASCII underscore _. Note that this is in conformance with the COARDS conventions, but is more restrictive than the netCDF interface which allows almost all Unicode characters encoded as multibyte UTF-8 characters (NUG Appendix B). The netCDF interface also allows leading underscores in names, but the NUG states that this is reserved for system use. ASCII period (.) and ASCII hyphen (-) are also allowed in attribute names only.

which is consistent with the conformance document. That is, as Lars says, we recommend against a lot of characters. All characters except letters, digits, underscores and (for attributes only) ASCII 2D - and 2E . are recommended not to be used.
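The current recommendation can be written as two regular expressions, one for attribute names and one for all other names. This is a sketch of the recommendation as quoted above, not of any requirement; the names `VARIABLE_NAME`, `ATTRIBUTE_NAME` and `follows_recommendation` are illustrative.

```python
import re

# Begin with an ASCII letter, then ASCII letters, digits and underscores.
VARIABLE_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")
# Attribute names may additionally contain ASCII period and hyphen.
ATTRIBUTE_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_.\-]*")


def follows_recommendation(name: str, is_attribute: bool = False) -> bool:
    """Return True if name follows the recommended character set."""
    pattern = ATTRIBUTE_NAME if is_attribute else VARIABLE_NAME
    return pattern.fullmatch(name) is not None


print(follows_recommendation("air_temperature"))
print(follows_recommendation("cell-methods"))
print(follows_recommendation("cell-methods", is_attribute=True))
```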

In the discussion of conventions #237 we agreed that all characters are allowed, despite the recommendation (which is not a requirement) not to use the majority of them. Lars commented that the CF conventions "essentially provide a whitelist of explicitly allowed characters. All other characters are not recommended (or recommended against) but not explicitly disallowed. But throughout this conversation there have been several remarks that some characters should indeed be explicitly disallowed. This could easily be done by ... creating a blacklist." That's what this discussion is about, if I understand correctly.

The last sentence of the working text as above is unsatisfactory, despite #237, because it says . and - are "allowed". Those two characters are certainly "allowed", because all characters are allowed. What it means is that we aren't recommending against them. We should fix that.

The working text is also unsatisfactory because it implies that the NUG prohibits some characters ("it allows almost all Unicode characters ...") but it doesn't say which ones are not allowed. NUG Appendix B says that names should match the regular expression

([a-zA-Z0-9_]|{MUTF8})([^\\x00-\\x1F/\\x7F-\\xFF]|{MUTF8})*
MUTF8        = <multibyte UTF-8 encoded, NFC-normalized Unicode character>

I suppose we should understand the regular expression to begin with ^ and end with $ i.e. it's the complete name. Do you agree?

Since ASCII is a subset of UTF-8, I think that by "multibyte UTF-8 encoded", the NUG must mean a Unicode character which is encoded in more than one byte by UTF-8. That is, MUTF8 doesn't include one-byte characters, among them the ASCII characters 00-7F. Do you agree?

If that's correct, the NUG does not allow / (which is in the middle of the second [...] expression), or the one-byte characters 00-1F (ASCII control characters) and 7F-FF. Lars's experiments agree that ncgen does not work with the control characters, / and 7F, but apparently it does work with 80-FF.
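The NUG expression can be approximated in Python under the reading given above, i.e. anchored to the whole name and with MUTF8 meaning any character that needs more than one byte in UTF-8. The sketch works on Python `str` objects rather than bytes, so "multibyte" is expressed as "code point U+0080 or above", and the NFC condition is checked separately with `unicodedata.is_normalized` (Python 3.8+). These are assumptions about how to interpret the NUG text, not a statement of what the netCDF library does.

```python
import re
import unicodedata

# First character: ASCII letter, digit, underscore, or a non-ASCII character.
# Later characters: anything except 00-1F, "/" and 7F-FF, or a non-ASCII
# character (the second alternative stands in for {MUTF8}).
NUG_NAME = re.compile(
    r"(?:[a-zA-Z0-9_]|[^\x00-\x7F])"
    r"(?:[^\x00-\x1F/\x7F-\xFF]|[^\x00-\x7F])*"
)


def matches_nug_pattern(name: str) -> bool:
    """Return True if name matches this reading of the NUG expression."""
    return (
        NUG_NAME.fullmatch(name) is not None
        and unicodedata.is_normalized("NFC", name)
    )


for candidate in ("air_temperature", "a/b", "-a", "a b", "a\u00a0b"):
    print(repr(candidate), matches_nug_pattern(candidate))
```

Under this reading, a character such as U+00A0 is encoded by UTF-8 as two bytes and so falls under MUTF8 rather than under the excluded one-byte range 7F-FF, which may be relevant to the ncgen results for 80-FF.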

I think we should explicitly state that we prohibit 00-1F, / and 7F-FF, if I'm correct that NUG doesn't allow them anyway. The CF text is currently vague, but it says CF is "more restrictive" than the netCDF interface, for which it cites the NUG (although Lars's experiment shows that the netCDF interface is more forgiving than the NUG). This implies that CF upholds NUG restrictions, as you'd expect.

Also, the CF working text is inconsistent with the NUG in saying "It is recommended that variable, dimension, attribute and group names begin with a letter". This is not merely a recommendation, because the NUG says that names must begin with a letter, digit, underscore or multi-byte UTF-8 character. We should fix this. Our text currently implies it's OK to start a name with a punctuation mark, for instance, which the NUG prohibits.

Lars's experiment shows that ncgen doesn't allow space (20) or any of the punctuation marks 21-2F except + - and ., nor any of the symbols 3A-3F, 5B-5E, 60, 7B-7F. To put it positively, ncgen allows only letters, digits, _ + - . and @. Thus, ncgen is more restrictive than the NUG.
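The positive form of that summary is easy to encode. The sketch below lists the ASCII characters that were reported as accepted by ncgen in the table above and flags any other ASCII character in a name. It reflects the reported results only, not ncgen's actual grammar, and the names `NCGEN_OK_ASCII` and `ascii_rejected_by_ncgen` are made up for the example.

```python
import string

# ASCII characters reported as OK for ncgen: letters, digits, _ + - . @
NCGEN_OK_ASCII = set(string.ascii_letters + string.digits + "_+-.@")


def ascii_rejected_by_ncgen(name: str) -> list:
    """Return the ASCII characters in name that are outside the OK set."""
    return [ch for ch in name if ord(ch) < 128 and ch not in NCGEN_OK_ASCII]


print(ascii_rejected_by_ncgen("wind@10m"))
print(ascii_rejected_by_ncgen("a:b"))
```

Non-ASCII characters are deliberately ignored here, since the experiment reported them as accepted by ncgen.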

I think it would be reasonable for CF to prohibit all those characters which ncgen doesn't support, and which therefore could not be used in CDL. That would be a backward-incompatible change, which we don't normally make, if in fact any existing data uses any of those characters in netCDF names. Given Lars's experience, however, it seems unlikely anyone would have used them, despite NUG allowing them.

We've decided to allow . and - in attribute names, but not other names. What about + and @, which are the only two one-byte characters so far not considered? NUG and ncgen allow them, so I think CF should continue to allow them. At the moment, we recommend that they should not be used, since they aren't in our current whitelist.

Best wishes

Jonathan

0 replies
