Add support for attributes of type string #141

Open
JimBiardCics opened this issue Jul 23, 2018 · 110 comments · May be fixed by #543
Labels
defect Conventions text meaning not as intended, misleading, unclear, has typos, format or language errors

Comments

@JimBiardCics
Contributor

JimBiardCics commented Jul 23, 2018

Attributes with a type of string are now possible with netCDF-4, and many examples of attributes with this type are "in the wild". As an example of how this is happening, IDL creates an attribute with this type if you select its version of string type instead of char type. It seems that people often assume that string is the correct type to use because they wish to store strings, not characters.

I propose to add verbiage to the Conventions to allow attributes that have a type of string. There are two ramifications to allowing attributes of this type, the second of which impacts string variables as well.

  1. A string attribute can contain 1D atomic string arrays. We need to decide whether or not we want to allow these or limit them (at least for now) to atomic string scalars. Attributes with arrays of strings could allow for cleaner delimiting of multiple parts than spaces or commas do now (e.g. flag_values and flag_meanings could both be arrays), but this would be a significant stretch for current software packages.
  2. A string attribute (and a string variable) can contain UTF-8 Unicode strings. UTF-8 uses variable-length characters, with the standard ASCII characters as the 1-byte subset. According to the Unicode standard, a UTF-8 string can be signaled by the presence of a special non-printing three byte sequence known as a Byte Order Mark (BOM) at the front of the string, although this is not required. IDL (again, for example) writes this BOM sequence at the beginning of every attribute or variable element of type string.

Allowing attributes containing arrays of strings may open up useful future directions, but it will be more of a break from the past than attributes that have only single strings. Allowing attributes (and variables) to contain UTF-8 will free people to store non-English content, but it might pose headaches for software written in older languages such as C and FORTRAN.
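
To make the BOM issue in point 2 above concrete, here is a minimal sketch in Python (illustrative only; the attribute value is invented) of what the three-byte UTF-8 BOM looks like and how a reader might strip it:

    import codecs

    # Bytes as a writer such as IDL might store them: a BOM followed by the text.
    raw = codecs.BOM_UTF8 + "sea_surface_temperature".encode("utf-8")
    assert raw[:3] == b"\xef\xbb\xbf"                  # the three-byte BOM sequence
    text = raw.decode("utf-8-sig")                     # the "utf-8-sig" codec drops a leading BOM
    assert text == "sea_surface_temperature"
    assert raw.decode("utf-8").startswith("\ufeff")    # plain "utf-8" keeps the BOM as U+FEFF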

To finalize the change to support string type attributes, we need to decide:

  1. Do we explicitly forbid string array attributes?
  2. Do we place any restrictions on the content of string attributes and (by extension) variables?

Now that I have the background out of the way, here's my proposal.

Allow string attributes. Specify that the attributes defined by the current CF Conventions must be scalar (contain only one string).

Allow UTF-8 in attribute and variable values. Specify that the current CF Conventions use only ASCII characters (which are a subset of UTF-8) for all terms defined within. That is, the controlled vocabulary of CF (standard names and extensions, cell_methods terms other than free-text elements of comments(?), area type names, time units, etc) is composed entirely of ASCII characters. Free-text elements (comments, long names, flag_meanings, etc) may use any UTF-8 character.
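
As a minimal sketch of what that split would mean in practice (illustrative only; the attribute values and the helper function are invented, and this is not cf-checker code), controlled terms must be pure ASCII while free-text attributes may carry any UTF-8 text:

    def is_ascii(value: str) -> bool:
        return all(ord(ch) < 128 for ch in value)

    controlled = {"standard_name": "air_temperature", "cell_methods": "time: mean"}
    free_text = {"long_name": "Température de l'air", "comment": "données du modèle"}

    assert all(is_ascii(v) for v in controlled.values())   # CF-defined terms: ASCII only
    for v in free_text.values():                            # free text: any UTF-8 character
        assert v.encode("utf-8").decode("utf-8") == v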

Trac ticket: #176 but that's just a placeholder - no discussion took place there (@JonathanGregory 26 July 2024)

@Dave-Allured
Contributor

Dave-Allured commented Jul 23, 2018

I am generally in support of this string attributes proposal, including UTF-8 characters. However, for CF controlled attributes, I recommend an explicit preference for type char rather than string. This is for compatibility with large amounts of existing user code that access critical attributes directly, and would need to be reworked for type string.

I suggest not including a constraint for scalar strings, simply because it seems redundant. I think that existing CF language strongly implies single strings in most cases of CF defined attributes.

@ghost

ghost commented Jul 24, 2018

How different is reading values from a string attribute compared to a string variable? If some software supports string variables shouldn't it support string attributes as well? If the CF is going to recommend char datatype for string-valued attributes, shouldn't the same be done for string-valued variables?

Prefixing the bytes of a UTF-8 encoded string with the BOM sequence is an odd practice. Although it is permitted, afaik, it is not recommended.

Since what gets stored are always the bytes of one string in some encoding, always assuming UTF-8 should take care of the ASCII character set, too. This could cause issues if someone used other one-byte encodings (e.g. the ISO 8859 family), but I don't see how such cases could be easily resolved.

Storing Unicode strings using the string datatype makes more sense, since the number of bytes for such strings in UTF-8 encoding is variable.
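
The conflict with other one-byte encodings can be seen directly; a small sketch (Python, with an invented value):

    text = "Zürich"
    latin1_bytes = text.encode("latin-1")      # b'Z\xfcrich' -- one byte per character
    utf8_bytes = text.encode("utf-8")          # b'Z\xc3\xbcrich' -- 'ü' takes two bytes
    assert latin1_bytes != utf8_bytes
    try:
        latin1_bytes.decode("utf-8")           # a reader assuming UTF-8 fails on the Latin-1 bytes
    except UnicodeDecodeError:
        pass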

@JimBiardCics
Contributor Author

This issue and issue #139 are intertwined. There may be overlapping discussion in both.

@JimBiardCics
Contributor Author

@ajelenak-thg So I did some digging. I wrote a file with IDL and read it with C. There are no BOM prefixes. I guess some languages (such as Python) make assumptions one way or another about string attributes and variables, but it appears that it's all pretty straightforward.

@JimBiardCics
Contributor Author

@ajelenak-thg I agree that we should state that char attributes and variables should contain only ASCII characters.

@JimBiardCics
Contributor Author

@Dave-Allured When you say "CF-controlled attributes", are you referring to any values they may have, or to values that are from controlled vocabularies?
It is true that applications written in C or FORTRAN will require code changes to handle string because the API and what is returned for string attributes and variables is different from that for char attributes and variables.
Would a warning about avoiding string for maximum compatibility be sufficient?

@Dave-Allured
Contributor

Dave-Allured commented Jul 24, 2018

@JimBiardCics, by "CF-controlled attributes", I mean CF-defined attributes within "the controlled vocabulary of CF" as you described above. By implication I am referring to any values they may have, including but not limited to values from controlled vocabularies.

A warning about avoiding data type string is notification. An explicit preference is advocacy. I believe the compatibility issue is important enough that CF should adopt the explicit preference for type char for key attributes.

@Dave-Allured
Contributor

Dave-Allured commented Jul 24, 2018

The restriction that char attributes and variables should contain only ASCII characters is not warranted. The Netcdf-C library is agnostic about the character set of data stored within char attributes and char variables. UTF-8 and other character sets are easily embedded within strings stored as char data.

Therefore I suggest no mention of a character set restriction, outside of the CF controlled vocabulary. Alternatively you could establish the default interpretation of string data (both char and string data types) as the ASCII/UTF-8 conflation.

@DocOtak
Member

DocOtak commented Jul 24, 2018

Hi all, I wasn't quite able to form this into coherent paragraphs, so here are some things to keep in mind re: UTF8 vs other encodings:

  • UTF8 is backwards compatible with ASCII if the following are true: no byte order mark, all code points are between U+0000 and U+007F
  • UTF8 is not backwards compatible with Latin1 (ISO 8859-1), because code points above U+007F need two bytes to represent.
  • There are multiple ways of representing the same grapheme; the netCDF classic format required UTF8 to be in Normalization Form Canonical Composition (NFC).

My personal recommendation is that the only encoding for text in CF netCDF be UTF8 in NFC with no byte order mark. For attributes where there is a desire to restrict what is allowed (through controlled vocabulary or other limitations), the restriction should be specified using Unicode code points, e.g. "only printing characters between U+0000 and U+007F are allowed in controlled attributes".

Text which is in controlled vocabulary attributes should continue to be char arrays. Freeform attributes (mostly those in 2.6.2, Description of file contents) could probably be either string or char arrays.
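
A rough sketch of what those recommendations could look like as checks (illustrative only; the helper functions and example values are invented), using Python's unicodedata module (Python 3.8+ for is_normalized):

    import unicodedata

    def acceptable_free_text(value: str) -> bool:
        # UTF-8 text in NFC, with no byte order mark
        return unicodedata.is_normalized("NFC", value) and not value.startswith("\ufeff")

    def acceptable_controlled(value: str) -> bool:
        # only printing characters in the ASCII range (U+0020 to U+007E)
        return all(0x20 <= ord(ch) <= 0x7E for ch in value)

    assert acceptable_controlled("area: mean where sea_ice")
    assert acceptable_free_text("Température de l'air à 2 m")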

@Dave-Allured
Contributor

@DocOtak, you said "the netCDF classic format required UTF8 to be in Normalization Form Canonical Composition (NFC)". This restriction is only for netCDF named objects, i.e. the names of dimensions, variables, and attributes. There is no such restriction for data stored within variables or attributes.

@DocOtak
Member

DocOtak commented Jul 24, 2018

@Dave-Allured yes, I reread the section, and object names do appear to be what it restricts. Should there be some consideration of specifying a normalization for the purposes of data in CF netcdf?

Text encoding probably deserves its own section in the CF document, perhaps under data types. The topic of text encoding can be very foreign to someone who thinks that "plain text" is a thing that exists in computing.

@Dave-Allured
Contributor

@DocOtak, for general text data, I think UTF-8 normalization is more of a best practice than a necessity for CF purposes. Therefore I suggest that CF remain silent about that, but include it if you feel strongly. Normalization becomes important for efficient string matching, which is why netCDF object names are restricted.

@DocOtak
Member

DocOtak commented Jul 24, 2018

@Dave-Allured I don't know enough about the consequences of requiring a specific normalization. There is some interesting information on the unicode website about normalization, which suggests that over 99% of Unicode text on the web is already in NFC. Also interesting is that combining NFC-normalized strings may not result in a new string that is normalized. It is also stated in the FAQ that "Programs should always compare canonical-equivalent Unicode strings as equal", so it's probably not an issue as long as the controlled vocabulary attributes have values with code points in the U+0000 to U+007F range (control chars excluded).
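
For reference, a short Python sketch of the canonical-equivalence point quoted from the FAQ: the raw strings differ, but they compare equal after normalization.

    import unicodedata

    composed = "\u00e9"        # 'é' as a single code point
    decomposed = "e\u0301"     # 'e' followed by a combining acute accent
    assert composed != decomposed
    assert unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed)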

@hrajagers

@Dave-Allured and @DocOtak,

  1. Most of the character/string attributes in the CF conventions contain a concatenation of sub-strings selected from a standardized vocabulary, variable names, and some numbers and separator symbols. It seems that for those attributes the discussion about the encoding is not so relevant, as these sub-strings contain only a very basic set of characters (assuming that variable names are not allowed to contain extended characters). Even for flag_meanings the CF conventions state "Each word or phrase should consist of characters from the alphanumeric set and the following five: '_', '-', '.', '+', '@'." If the alphanumeric set doesn't include extended characters this again doesn't create any problems for encoding. The only attributes that might contain extended characters (and thus be influenced by this encoding choice) are attributes like long_name, institution, title, history, ... However, CF inherits most of them from the NetCDF User Guide, which explicitly states that they should be stored as character arrays (see NUG Appendix A). So, is it then up to CF to allow strings here? In short, I'm not sure the encoding is important for string/character attributes at this moment.

  2. I initially raised the encoding topic in the related issue Add support for variables of type string #139 because we want our model users to use local names for observation points and they will end up in label variables. In that context I would like to make sure that what I store is properly described.

@JimBiardCics
Contributor Author

@hrajagers Thanks for the pointer to NUG Appendix A. It's interesting to see in that text that character array, character string, and string are used somewhat interchangeably. I'm curious to know if the NUG authors looked at this section in light of allowing string type.

@ghost

ghost commented Jul 25, 2018

I think we are making good progress on this. I checked the Appendix A table of CF attributes and I think the following attributes can be allowed to hold string values as well as char:

  • comment
  • external_variables
  • _FillValue
  • flag_meanings
  • flag_values
  • history
  • institution
  • long_name
  • references
  • source
  • title

All the other attributes should hold char values to maximize backward compatibility.

@JimBiardCics
Contributor Author

@ajelenak-thg Are you suggesting the other attributes must always be of type char, or that they should only contain the ASCII subset of characters?

@ghost

ghost commented Jul 25, 2018

Based on the expressed concern so far for backward compatibility I suggested the former: always be of type char. Leave the character set and encoding unspecified since the values of those attributes are controlled by the convention.

@ghost

ghost commented Jul 25, 2018

On the string encoding issue, CF data can be currently stored in two file formats: NetCDF Classic, and HDF5. String encoding information cannot be directly stored in the netCDF Classic format and the spec defines a special variable attribute _Encoding for that in future implementations. The values of this attribute are not specified so anything could be used.

In the HDF5 case, string encoding is an intrinsic part of the HDF5 string datatype and can only be ASCII or UTF-8. Both char and string datatypes in the context of this discussion are stored as HDF5 strings. This effectively limits what could be allowed values of the (future) _Encoding attribute for maximal data interoperability between the two file formats.
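
For illustration, here is how that intrinsic encoding appears in the HDF5 data model; a small sketch using h5py (the file name and attribute values are invented):

    import h5py

    with h5py.File("example.h5", "w") as f:                  # hypothetical file
        utf8_str = h5py.string_dtype(encoding="utf-8")       # variable-length string, tagged UTF-8
        ascii_str = h5py.string_dtype(encoding="ascii")      # variable-length string, tagged ASCII
        f.attrs.create("title", "Température de surface", dtype=utf8_str)
        f.attrs.create("source", "model v1.0", dtype=ascii_str)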

@Dave-Allured
Contributor

@hrajagers said: However, CF inherits most of them [attributes] from the NetCDF User Guide, which explicitly states that they should be stored as character arrays (see NUG Appendix A). So, is it then up to CF to allow strings here?

Yes, NUG Appendix A literally allows only char type attributes. My sense is that proponents believe that string type is compatible with the intent of the NUG, and also strings have enough advantages to warrant departure from the NUG.

Personally I think string type attributes are fine within collaborations where everyone is ready for any needed code upgrades. For exchanged and published data, char type CF attributes should be preferred explicitly by CF.

@Dave-Allured
Contributor

@ajelenak-thg said: In the HDF5 case, string encoding is an intrinsic part of the HDF5 string datatype and can only be ASCII or UTF-8. Both char and string datatypes in the context of this discussion are stored as HDF5 strings.

Actually the ASCII/UTF-8 restriction is not enforced by the HDF5 library. This is used intentionally by netcdf developers to support arbitrary character sets in netcdf-4 data type char, both attributes and variables. See netcdf issue 298. Therefore, data type char remains fully interoperable between netcdf-3 and netcdf-4 formats.

For example, this netcdf-4 file contains a char attribute and a char variable in an alternate character set. You will need an app or console window enabled for ISO-8859-1 to properly view the ncdump of this file.

@JonathanGregory
Contributor

Dear Jim

Thanks for addressing these issues. In fact you've raised two issues: the use of strings, and the encoding. These can be decided separately, can't they?

On strings, I agree with your proposal and subsequent comments by others that we should allow string, but we should recommend the continued use of char, giving as the reason that char will maximise the usability of the data, because of the existence of software that isn't expecting string. Recommend means that the cf-checker will give a warning if string is used. However, it's not an error, and a given project could decide to use string.

For the attributes whose contents are standardised by CF e.g. coordinates, if string is used we should require a scalar string. This is because software will not expect arrays of strings. These attributes are often critical and so it's essential they can be interpreted. For CF attributes whose contents aren't standardised e.g. comment, is there a strong use-case for allowing arrays of strings?

I recall that at the meeting in Reading the point was made that arrays would be natural for flag_values and flag_meanings. I agree that the argument is stronger in that case because the words in those two attributes correspond one-to-one. Still, it would break existing software to permit it. Is there a strong need for arrays?

Best wishes

Jonathan

@JimBiardCics
Contributor Author

@JonathanGregory I agree with you. I think it would be fine to leave string array attributes out of the running for now. I also prefer the recommendation route.

Regarding the encoding, it seems to me that we could avoid a lot of complexity for now by making a simple requirement that all CF-defined terms and whitespace delimiters in string-valued attributes or variables be composed of characters from the ASCII character set. It wouldn't matter if people used Latin-1 (ISO-8859-1) or UTF-8 for any free text or free-text portions of contents, because they both contain the ASCII character set as a subset. The parts that software would be looking for would be parseable.
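
A small sketch of why that works (Python; the attribute value is invented): because ASCII bytes are identical in Latin-1 and UTF-8, the CF-defined terms and the space delimiters can be found without knowing which of the two encodings the free text uses.

    cell_methods = "time: mean (comment: média diária)"      # CF terms plus free text
    for enc in ("utf-8", "latin-1"):
        raw = cell_methods.encode(enc)                        # the free text encodes differently,
        assert raw.split(b" ")[:2] == [b"time:", b"mean"]     # but the ASCII parts are byte-identical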

@ethanrd
Member

ethanrd commented Jul 26, 2018

@JonathanGregory Another use-case (that I think came up during the Reading meeting) had the history attribute as a string array so that each element could contain the description of an individual processing step. I think easier machine readability was mentioned as a motivation.

@JonathanGregory
Contributor

Regarding the encoding, I agree that for attribute contents which are standardised by CF it is fine to restrict ourselves to ASCII, in both char and string. For these attributes, we prescribe the possible values (they have controlled vocabulary) and so we don't need to make a rule in the convention about it for the sake of the users of the convention. If we put it in the convention, it would be as guidance for future authors of the convention. I don't have a view about whether we should do this. It would be worth noting to users that whitespace, which often appears in a "blank-separated list of words", should be ASCII space. I agree that UTF-8 is fine for contents which aren't standardised.

Regarding arrays of strings, I realise I wasn't thinking clearly yesterday, sorry. As we've agreed, string attributes will not be expected by much existing software. Hence software has to be rewritten to support the use of strings in any case, and support for arrays of strings could be added at the same time, if it's really valuable. I don't see the particular value for the use of string arrays for comment - do other people? For flag_meanings, the argument was that it would allow a meaning to be a string which contained spaces (instead of being joined up with underscores, as is presently necessary); that is, it would be an enhancement to functionality.

Happy weekend - Jonathan

@JonathanGregory
Contributor

I meant to write, I don't see the particular value for the use of string arrays for history, which Ethan reminded us of. Why would this be more machine-readable?

@JimBiardCics
Contributor Author

@JonathanGregory The use of an array of strings for history would simplify denoting where each entry begins and ends as entries are added, appending or prepending a new array element each time rather than the myriad different ways people do it now. This would actually be a good thing to standardize.
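
To illustrate the idea (this is only a sketch, not something CF currently defines; it assumes the netCDF4-python package and an invented file and history entry), appending one array element per processing step might look like:

    import netCDF4

    ds = netCDF4.Dataset("example.nc", "a")                   # hypothetical file, opened for append
    old = ds.getncattr("history") if "history" in ds.ncattrs() else []
    if isinstance(old, str):                                   # a scalar history comes back as one string
        old = [old]
    entry = "2018-07-27T12:00Z: regridded to a 1x1 degree grid"
    ds.setncattr_string("history", list(old) + [entry])        # one element per processing step
    ds.close()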

I think we can just not mention string array attributes right now. The multi-valued attributes (other than history, perhaps) pretty much all specify how they are to be formed and delimited.

@cf-metadata-list

cf-metadata-list commented Jul 27, 2018 via email

@kenkehoe

kenkehoe commented Jul 27, 2018 via email

@JimBiardCics
Contributor Author

@DocOtak @zklaus Sorry if I've pulled the discussion off track. The question of exactly why NUG worded things the way they did is intriguing, but I think Klaus is right that we shouldn't get wrapped around that particular axle in this issue — particularly if we are going to split encoding off into a different issue. I think the take-away is that our baseline is "sane utf-8 unicode" for attributes of type NC_STRING and ASCII for attributes of type NC_CHAR (those created with the C function nc_put_att_text.)

@zklaus

zklaus commented Mar 17, 2020

I agree and would go one small step further: UTF-8 is only an encoding, so we should just say "unicode" for strings. If we need to restrict that, say to disallow underscore at the beginning or to save a separation character like space in attributes right now, we should do so at the character level, possibly using categories as introduced by @ChrisBarker-NOAA above.

@ChrisBarker-NOAA
Contributor

UTF-8 is only an encoding, so we should just say "unicode" for strings.

We could do that if and only if netcdf itself were clear about how Unicode is encoded in files. It is clear for variable names, though I'm not so sure it is anywhere else.

But even so, once the encoding has been specified, then yes, talking about Unicode makes sense.

Agreed, it's not for this discussion, but:

MUTF8 (in that doc) is not quite "any unicode string encoded as normalized UTF-8", because I think they are specifically trying to exclude the ASCII subset, so they can handle that separately. I.e. characters that are excluded, like "/", are indeed unicode strings.

But it's a pretty contorted way to describe it -- but that's netcdf's problem :-)

@zklaus

zklaus commented Mar 18, 2020

Ah yes, I see what you mean, you are right: Always speaking about UTF-8, multi-byte here isn't referring to the possibility of having several bytes encode one code point, but to actual code points with more than one byte, thus excluding the one-byte code points which are exactly the first 128 ASCII characters. Then they allow back in specific ASCII characters.

@JonathanGregory
Contributor

Dear all

The issue was opened in 2018 and has seen a long discussion, but no further contributions since 2020. It has been partly superseded, in that CF now permits string-valued attributes to be either a scalar string or a 1D character array (see Sect 2.2). Apart from that, it seems to me that the discussion was mostly concerned with three subjects:

  1. Should CF allow arrays of strings in attributes? We are currently discussing that question in https://github.com/orgs/cf-convention/discussions/341, which refers back to this issue. Therefore I propose we don't discuss this any further here.

  2. What encoding should be used in string attributes. The consensus was that it should always be Unicode. One reason for this is that netCDF variable names are in Unicode, and many CF attributes contain the names of netCDF variables. CF recommends that only letters, digits and underscores should be used for variable names, but does not prohibit other Unicode characters. Should we insert a statement in the CF convention about strings being Unicode?

  3. Whether to restrict the characters allowed in string-valued attributes. The majority of CF attributes contain the names of netCDF variables and strings which come from a CF controlled vocabulary or a list in an Appendix. The set of characters that can be used in those attributes is thus dictated already by the convention. This question therefore applies only to the attributes that CF defines but whose contents it does not standardise, namely comment, history, institution, references, source, title and long_name. Does anyone wish to pursue this third question? For instance, @ChrisBarker-NOAA, @zklaus and @DocOtak all contributed in 2020.

I propose that this issue should be closed as dormant if no-one resumes discussion on Q2 or Q3 within the next three weeks, before 14th September.

Cheers

Jonathan

@ChrisBarker-NOAA
Contributor

Thanks for trying to close this out :-)

Should we insert a statement in the CF convention about strings being Unicode?

I just looked, and all I see is this under naming:

"...is more restrictive than the netCDF interface which allows almost all Unicode characters encoded as multibyte UTF-8 characters """

So yes, I think it's good to be clear there -- maybe it's well defined by netcdf, but it doesn't hurt to be explicit, if repetitive.

Am I correct to say that all strings in netCDF are Unicode, encoded as UTF-8?

Whether that's true or not for netcdf -- I think it should be true for CF, and we should say that explicitly in any case.

... This question therefore applies only to the attributes that CF defines but whose contents it does not standardise,

I would say that we should not restrict these otherwise not-restricted attributes.

I'm not sure if that's pursuing it or not pursuing it -- I presume the default is no restrictions?

@ChrisBarker-NOAA
Contributor

Hmm -- not sure where this fits, but it's related:

IIUC, CF now allows either the new vlen strings, or the "traditional" char arrays.

The trick is that UTF-8 is not a one-char-per-codepoint encoding.

Could we say that you can only use Unicode (UTF-8) with vlen strings, and char arrays can only hold ASCII? Or is the cat too far out of the bag for that?

Probably -- could we at least encourage vlen strings for non-ASCII text?

@JonathanGregory
Contributor

JonathanGregory commented Aug 27, 2024

@ChrisBarker-NOAA

Am I correct to say that all strings in netCDF are Unicode, encoded as UTF-8 ?

I don't know either.

Whether that's true or not for netcdf -- I think it should be true for CF, and we should say that explicitly in any case.

I think so as well. That would go sensibly in Sect 2.2 "Data types".

We've already said in 2.2 that scalar vlen strings and 1D char arrays are both allowed and are equivalent in variables. We did not say so for attributes, but I expect everyone would assume that the same applies, in which case we should make it explicit. I don't think there's a problem with storing multi-byte character codes in a char array, is there? It would be clearest if we said that a 1D char array should always be interpreted as a Unicode string. An ASCII string is a special case of that, so it's backwards-compatible.

Cheers

Jonathan

@JonathanGregory
Contributor

No-one said they wanted to resume Q1 or Q3 within three weeks, but @ChrisBarker-NOAA and I agreed that it would be useful to clarify that strings stored in variables or attributes should be Unicode characters (Q2). To do that, I propose that we replace the first 1.5 sentences of the second para of sect 2.2 "Data Types", which currently reads

Strings in variables may be represented one of two ways - as atomic strings or as character arrays. An n-dimensional array of strings may be implemented as a variable of type string with n dimensions, or as a variable of type char with n+1 dimensions, where the most rapidly varying dimension ...

with

A text string in a variable or an attribute may be represented either in Unicode characters stored in a string or encoded as UTF-8 and stored in a char array. Since ASCII 7-bit character codes are a subset of UTF-8, a char array of m ASCII characters is equivalent to a string of m ASCII characters. Unicode characters which are not in the ASCII character set require more than one byte each to encode in UTF-8. Hence a string of length m generally requires a UTF-8 char array of size >m to represent it.

An n-dimensional array of strings may be implemented as a variable or attribute of type string with n dimensions (where n<2 for an attribute) or as a variable (but not an attribute) of type char with n+1 dimensions, where the most rapidly varying dimension ...
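
As an aside, the size point in that wording is easy to check; a small Python sketch with invented values:

    s = "Zürich"
    assert len(s) == 6                                  # six Unicode characters
    assert len(s.encode("utf-8")) == 7                  # 'ü' needs two bytes in UTF-8
    assert len("Oslo") == len("Oslo".encode("utf-8"))   # pure ASCII: character and byte counts match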

Also, I suggest inserting the clarification "which has variable length", in this sentence in the first paragraph:

The string type, which has variable length, is only available in files using the netCDF version 4 (netCDF-4) format.

Does that look all right, @ChrisBarker-NOAA, @zklaus, anyone else? I believe this is no change to the convention, just clarification of the text, so I'm going to relabel this issue as a defect. Please speak up if you disagree. Thanks.

@JonathanGregory added the "defect" label (Conventions text meaning not as intended, misleading, unclear, has typos, format or language errors) and removed the "enhancement" label (Proposals to add new capabilities, improve existing ones in the conventions, improve style or format) on Sep 17, 2024
@ChrisBarker-NOAA
Contributor

As PR #543 attempts to clarify a bit about Unicode, I thought I'd post here. I started commenting on the PR, but realized I had way too much to say for a PR, so I'm putting it here.

NOTE: maybe this should be a different issue -- specifically about Unicode in CF -- but I'm putting it here for now -- we can copy it to a new issue if need be.

First some definitions/descriptions discussion about Unicode and strings.

  1. There is no such thing as a Unicode "character". Unicode defines "code points", and each code point is assigned a value. However: "Code points are the numbers assigned by the Unicode Consortium to every character in every writing system." -- so interchanging "code point" and "character" is probably OK and will lead to little confusion. (One difference is how Unicode handles accented characters and the like, so it's not quite one-to-one code-point-to-character.)

  2. There is no such thing as a Unicode String (except where defined by a programming language, e.g. Python). When stored in memory or in a file, strings, Unicode or not, are stored as bytes, and the relationship between the bytes and the code points is defined by an encoding. Without an encoding, there is no clear way to define what a bunch of bytes means, or in reverse, how to store a particular set of code points.

  • ANSI encodings are one (8-bit) byte per character (so easy!), and ASCII is one 7-bit byte per character (so only 128 different chars). But therefore ANSI encodings can store at most 256 different code points.

  • To store all possible Unicode code-points requires a 32 bit integer (4 bytes) -- that's the UCS-4 (UTF-32) encoding -- one-to-one relationship between integer value and code-point.

  • Other encodings that can store all of Unicode are "variable length encodings" -- a given code point can be a variable number of bytes. These allow more compact storage, but also more complexity in interpretation. Examples are UTF-8 (each code point is 8 or more bits) and UTF-16 (each character is 16 or more bits).

UTF-8 is the most common Unicode encoding for storage of text in files, or passing over the internet (via https, or ...). UTF-16 is used internally by Windows and Java (I think).

Anyway -- unless one wants to use UCS-4 (which is what the numpy Unicode type uses), which most folks don't want to use for file storage (it's pretty wasteful of space for virtually all text), then a variable-length encoding is required. And a char array is not ideal for variable-length encodings -- because a char array requires a fixed size, and you don't know what size is needed until you encode the text. So a variable-length string array is the "right" choice for Unicode (non-ANSI) text.

Char arrays and strings in netcdf.

So this brings us to the topic at hand -- in netcdf3 the only way to store text was in arrays of type char. This maps directly to the char* used to store text in C. So a pretty direct mapping to C (and other languages).

With netcdf4, a string type was introduced: Strings are variable length arrays of chars, while char arrays are fixed length.

So: as far as the netcdf spec is concerned, the only difference between a char array and a string is that the length of a char array is fixed. Once you read it -- you have a char*.

That's all I could find in the netCDF docs. Nothing about Unicode or encodings, or ... Which means that as far as the netcdf spec is concerned, you can put anything in either data type.

Note that a char* in C, while used for text (hence the name) is really a generic array of bytes -- it can be used to store any old collection of data.

So enter Unicode: as above, storing a "Unicode String", i.e. a collection of code points, requires that the string be encoded, resulting in a set of bytes that can be stored in, you guessed it, a char*. (On Windows, the standard encoding is UTF-16, so a wchar* ("wide char") is used. But a wchar* can be cast to a char* -- it's still an array of bytes (unsigned eight-bit ints).)

So as far as netcdf is concerned, you can stuff Unicode text into either a char array or a string in netcdf.

Note that I did find this discussion:
Unidata/netcdf-c#402
from May-June 2017 and not closed yet. From the netCDF docs, I don't think it was ever resolved. But it does contain a proposal for using an _Encoding attribute, and it may be kinda-sorta adopted by the netCDF4 Python lib (it does respect the _Encoding attribute of char arrays), but I can't find documentation for how it handles the netcdf string type. And it looks like utf-8 is the default:

def chartostring(b, encoding='utf-8')

def stringtochar(a, encoding='utf-8')

I also don't know what it does for attributes, because they can't have another attribute to store the _Encoding. So ?? However, it does seem to "just work" -- at least if you write the file with Python -- e.g. you can ncdump it and it will correctly show a non-ascii character (on my terminal, which may be utf-8?).
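
For reference, a round trip with those two helper functions (they are real netCDF4-python utilities; the data is invented) looks like this, with utf-8 as the default encoding:

    import numpy as np
    from netCDF4 import chartostring, stringtochar

    names = np.array(["foo", "bar"], dtype="S3")    # array of whole strings
    chars = stringtochar(names)                     # shape (2, 3) array of single characters, dtype 'S1'
    back = chartostring(chars)                      # back to an array of whole strings
    assert list(back) == ["foo", "bar"]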

Anyway -- as this doesn't seem to be defined by any published spec, I hope we can define it for CF. My proposal:

In pretty much any context:

  • char arrays should only be used to store ANSI-encoded text, i.e. one byte per character. Maybe we could restrict that to ASCII or Latin-1? (Latin-1 is a superset of ASCII.)

  • For text that cannot be stored in an ANSI encoding (i.e. Unicode text), the string type should be used.

    • string and string array attributes are stored with the utf-8 encoding. (Note that ASCII is a strict subset of utf-8, so ASCII is also legal.)
    • string and string array variables are stored encoded as utf-8 by default, or in the encoding specified by the _Encoding attribute.

That's it -- pretty simple, really :-)

Points to consider:

  1. Should we restrict char arrays to ASCII, or Latin-1? (Or allow other 1-byte encodings with an _Encoding attribute?)
  2. Should we allow the _Encoding attribute? Or just say "thou shalt use only UTF-8"?

My thought -- as much as I'd love to be fully restrictive to make things simpler for everyone, the cat's probably out of the bag. So we may have to impose as few restrictions as possible (e.g. allow _Encoding), but recommend either ASCII or UTF-8.

So -- enough words for you?

-- back in the day, a char* would be an ASCII- or ANSI-encoded string (null terminated), and all was good and simple.

@JonathanGregory
Contributor

Dear Chris

Thanks for the research and your useful exposition of the complexity of the issue. I was hoping that we could add a couple of sentences on this subject, rather than a new chapter. :-)

NetCDF allows Unicode characters in names (of dimensions, variables and attributes). The relevant text from the NUG v1.1 is as follows. (By the way, this quotation indicates that Unidata also regard it as OK to refer to Unicode "characters" instead of "codepoints", in the interest of easy understanding.)

Beginning with versions 3.6.3 and 4.0, names may also include UTF-8 encoded Unicode characters as well as other special characters, except for the character '/', which may not appear in a name. Names that have trailing space characters are also not permitted.

We've agreed that CF should not prohibit characters permitted by the NUG, although we recommend a more restricted list of characters in sect 2.3:

It is recommended that variable, dimension, attribute and group names begin with a letter and be composed of letters, digits, and underscores. By the word letters we mean the standard ASCII letters uppercase A to Z and lowercase a to z. By the word digits we mean the standard ASCII digits 0 to 9, and similarly underscores means the standard ASCII underscore _.

In the previous discussion on this issue, an important point was made, that many CF attributes identify netCDF variables or attributes by name e.g. coordinates="lat lon". Therefore

  • Any valid character in a netCDF name might appear in one of these CF attributes.

  • Hence CF must allow any Unicode character in a string-valued attribute.

  • Since we allow char arrays as equivalent to strings, we can't restrict char arrays to ASCII only (your final point 1). Any Unicode character must be possible in a char array as well. Non-ASCII characters may already have been used in existing data, so we shouldn't restrict them now (following our usual generous principle).

On your final point 2, in my text above I proposed that we should require UTF-8 encoding for char arrays. We haven't said anything about this before, and we didn't provide a way to record the encoding, so for existing char data the only possibility is to guess what encoding was used, if it's not ASCII. I think we could justifiably do either of the following, but we must do one or the other in order for char data to be properly usable:

  • Require UTF-8.

  • Recommend UTF-8, but provide a new attribute to record the encoding.

Which of these should we do?

For string data, I suppose the encoding isn't our concern, is it? I assume that netCDF strings support Unicode. Any interface to netCDF must therefore do likewise, and we can leave it to the netCDF interface of whatever language we use to deal with the encoding of the string data the user provides in that language.

Best wishes

Jonathan

@ChrisBarker-NOAA
Contributor

ChrisBarker-NOAA commented Sep 18, 2024

I was hoping that we could add a couple of sentences on this subject, rather than a new chapter. :-)

There's still hope :-)

NetCDF allows Unicode characters in names (of dimensions, variables and attributes). The relevant text from the [NUG v1.1]

Darn that google! -- I could have saved a lot of writing if I'd found that.

names may also include UTF-8 encoded Unicode characters

OK -- very good -- UTF-8 it is -- whew!

We've agreed that CF should not prohibit characters permitted by the NUG,

That's clear then.

By the word letters we mean the standard ASCII letters uppercase A to Z and lowercase a to z. By the word digits we mean the standard ASCII digits 0 to 9

So CF recommends, but does not require, ASCII-only for names -- OK then, that helps, but doesn't avoid the issue :-).

Any valid character in a netCDF name might appear in one of these CF attributes.

Hence CF must allow any Unicode character in a string-valued attribute.

Darn -- but it is what it is.

Since we allow char arrays as equivalent to strings, we can't restrict char arrays to ASCII only (your final point 1).

Also darn. :-)

On your final point 2, in my text above I proposed that we should require UTF-8 encoding for char arrays.

Makes sense to me. And, in fact, there is a very strong justification for this:

  • UTF-8 is used for variable names.
  • variable names are often used as attributes (or parts of attributes).
  • Two char arrays will only compare equal, at the binary level, if they are encoded the same way.

This is critical, as many (most?) programming environments (C, FORTRAN) only work natively with raw binary data (e.g. char*). So it's pretty critical that all char (and string) data are encoded the same way.
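
A tiny sketch of that point (Python, invented names): the same text encoded two ways is not byte-equal, which is exactly what C or Fortran code comparing raw char data would see.

    name = "Zürich"
    assert name.encode("utf-8") != name.encode("latin-1")      # same characters, different bytes
    assert "lat".encode("utf-8") == "lat".encode("latin-1")     # pure ASCII is identical in both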

the only possibility is to guess what encoding was used, if it's not ASCII.

And guessing is never good :-(

I think we could justifiably do either of the following, but we must do one or the other in order for char data to be properly usable:
  (a) Require UTF-8.
  (b) Recommend UTF-8, but provide a new attribute to record the encoding.

Which of these should we do?

Requiring UTF-8 is the best way to go -- see the point above about raw char* data.

However, as I noted, an _Encoding attribute was proposed (but not accepted?) years ago, and it seems the Python library is using that attribute [1] (while defaulting to utf-8). So that cat may be out of the bag.
Whether there are files out in the wild with _Encoding set, I don't know -- but if there are we probably don't want to make them invalid.

So, as much as I would like to simply require UTF-8, we probably need to say it's preferred, and the default, but other encodings can be used if defined in the _Encoding attribute.

However, for (global only?) attributes, rather than variable data, there is no way to set an _Encoding attribute. So UTF-8 in that case?

So:

For variables:

UTF-8 is preferred, and the default, but a different encoding can be used if the _Encoding attribute is set

For attributes:
UTF-8 is required.

As for the content of an _Encoding attribute, it would be nice to standardize that -- the best I could find for encodings is:

https://www.unicode.org/reports/tr17/#:~:text=The%20Unicode%20Standard%20has%20seven,32BE%2C%20and%20UTF%2D32LE.

Do we want to specify only those encodings? and only those spellings?

What about non-unicode encodings -- e.g. latin-1 ? If we can, it would be nice to keep it simple and only allow Unicode encodings (which gives you ascii, as a subset of utf-8).

Here's a list of what Python supplies out of the box:

https://docs.python.org/3/library/codecs.html#standard-encodings

The ones in there that are "all languages" (Unicode) are, I think, the same as the official Unicode list :-).

Note that there are big and little endian versions of the multi-byte encodings -- as netcdf "endianness is solved by writing all data in big-endian order" -- I think only the big endian forms should be allowed.

Finally, are the encoding spellings case-sensitive? e.g. the official spelling is "UTF-8" -- but Python, for instance, will accept: "utf-8", "UTF_8", etc.

For string data, I suppose the encoding isn't our concern, is it?

Unfortunately, it is :-(

I assume that netCDF strings support Unicode.

AFAICT, the only difference between a char array and a string is that the length of a char array is fixed -- that is, at the binary level, you get a char* (array of bytes) either way.

Turning that char* into a meaningful string requires that the encoding be known (unless you don't care what it means, and just want to pass it along, which is fine). If you want to compare it with other values you don't need to know the encoding, but you do need to know that the two you are comparing are in the same encoding. Hence why utf-8 everywhere would be easiest.

we can leave it to the netCDF interface of whatever language we use to deal with the encoding of the string data the user provides in that language.

Unfortunately, no -- there is no language-independent concept of a "Unicode String"; there is only a string of bytes, and an encoding. So netcdf strings are no easier (but also no harder) than char arrays in that regard. The encoding must be specified.

The good news is that we can use exactly the same rules for char arrays and strings.

  • I sure wish this wasn't such a mess.

-Chris

[1] -- a note about Python -- internally, Python (v3+) uses a native "Unicode" string data type - a "string" of Unicode code points. The encoding is an internal implementation detail (and quite complex). This makes Unicode very easy to work with in Python, but there is no way to create a Python str from binary data without knowing the encoding. This created a LOT of drama around filenames in Python3 on *nix. On Unix, a filename is a char*, with very few restrictions -- the encoding may not be known (and may be inconsistent within a file system!). Folks writing file processing utilities for Unix wanted to be able to work with these filenames without decoding them -- and if all you need to do is pass them around and compare them, then there is no need to know the encoding. It got ugly, and Python 3.4(?)+ finally introduced a workaround.

@larsbarring
Contributor

Recently, and in particular this week, I have focussed my attention on the CF2024 Workshop that is currently running. Hence I have not followed the conversation in this issue (or any other for that matter). So could you please point me to where

We've agreed that CF should not prohibit characters permitted by the NUG, ....

as @JonathanGregory writes.

I am asking because in Discussion/#323 I suggested that we actually should "blacklist", i.e. prohibit, a short list of ASCII characters that have the potential to cause severe problems -- for example, how about including a space, ' ', in variable names?

This is in fact allowed according to the NUG text @JonathanGregory refers to:

Beginning with versions 3.6.3 and 4.0, names may also include UTF-8 encoded Unicode characters as well as other special characters, except for the character '/', which may not appear in a name.

The potential problem caused by a variable name containing a space has been raised here, here, and here.

@ChrisBarker-NOAA
Contributor

We've agreed that CF should not prohibit characters permitted by the NUG, ....

as @JonathanGregory writes in #141 (comment).

I am asking because in Discussion/#323 I suggested that we actually should "blacklist" i.e. prohibit a short list of ASCII characters,

I had the same thought -- however, while I agree that we may want to blacklist characters -- that's not the point in this context, unless we want to blacklist ALL non-ascii characters.

I would rephrase Jonathan's point as:

"We've agreed that CF should not prohibit characters permitted by the NUG without good reason specific to CF" - :-)

And thinking more on this -- char arrays and strings really aren't all that different (from a CF point of view), so no reason to create different rules for char arrays than strings.

And we're going to need to support Unicode in general, so might as well not restrict it any more than we need to.

@Dave-Allured
Contributor

I suggest that characters to avoid should be listed as a CF preference, not a strict prohibition. This could be worked on as a new issue or separate PR.

@JonathanGregory
Contributor

JonathanGregory commented Sep 19, 2024

Dear @larsbarring

You wrote

Hence could you please point me to where "We've agreed that CF should not prohibit characters permitted by the NUG"

I'm surprised (looking through the records) to see how often we've discussed this point! The most recent occasion was your #468 Small update to text in section 2.3 regarding character sets, which installed the present wording. That went into CF 1.11, and I mentioned it this morning in my presentation.

Before that, the matter was discussed in @Dave-Allured's #237 Remove restrictions on netCDF object names. Despite the title of that issue, we didn't remove any restrictions then; rather, we decided that it was a recommendation, not a requirement, that variable, dimension, attribute and group names begin with a letter and be composed of letters, digits, and underscores. "It is recommended" replaced the previous wording, which had "should", to make it clearer.

This issue in turn referred to @davidhassell's #226 to Correct the wording in the conformance document section 2.3 "Naming Conventions". The conformance document had "must", which we changed to "should" to agree with the convention. There was some discussion of this point, and we decided that "should" was correct, also referring back to Trac Ticket 157 of 2017.

The convention had always said "should", and so does COARDS. My reading is that COARDS definitely intends a recommendation, not a requirement, because for requirements it says "will" rather than "should". It's more disciplined than CF has been, but perhaps the discussion on BCP-14 will change the situation!

Since other characters have never been prohibited, and we've discussed this many times, I don't think we ought to change it now. However, we can certainly introduce new recommendations against certain characters. I think that should be in a different issue from this one, as @Dave-Allured suggests.

Best wishes

Jonathan

@JonathanGregory
Contributor

Dear Chris (reverting to the main subject of this issue)

You're quite right, you cannot have an attribute of an attribute in netCDF4. That would be the most convenient way to record the encoding of an individual char attribute. We could devise an inconvenient way, but let's not do that! I agree that this is a good argument to require char attributes to be encoded as UTF-8.

As for char variables, I agree that netCDF4-python, according to its documentation, allows them to be stored in fixed-length arrays in any of the encodings supported by Python, default UTF-8, and it uses the netCDF _Encoding attribute to record the encoding. The _Encoding attribute isn't in the NUG, though. (Incidentally, I have discovered that we almost agreed to add the _Encoding attribute to CF in Trac ticket 159 in 2017.)

The HDF5 python library documentation says

HDF5 supports two string encodings: ASCII and UTF-8. We recommend using UTF-8 when creating HDF5 files ... If you need to write ASCII for compatibility reasons, you should ensure you only write pure ASCII characters. ... When creating a new dataset or attribute, Python str or bytes objects will be treated as variable-length strings, marked as UTF-8 and ASCII respectively.

This statement applies to both attributes and datasets. (I assume our netCDF-4 variables are HDF5 datasets - is that correct?) If my understanding is correct, it means that netCDF-4 string variables or attributes are encoded in UTF-8 when stored in netCDF-4 HDF5 files.

In summary, char attributes, string attributes and string variables must all be stored as ASCII or UTF-8. Only char variables could be stored with any other encoding of Unicode characters. Although netCDF-4 Python provides the _Encoding attribute to record the encoding, this is not supported by the netCDF library, NUG or CF. This implies that no-one has wanted to record Unicode in char variables in CF-netCDF files with any encoding other than ASCII, which works with netCDF-Classic, or UTF-8, which is what you'd guess for non-ASCII data. Therefore I don't think it would be backward-incompatible, or invalidate any existing data, if we decided that char variables must be ASCII or UTF-8, like the other three classes of string data.

Another reason to do that is because in CF we treat char arrays and strings as different representations of the same thing. You argued earlier, in a different context, that it can sometimes be important for testing equality that the data should be equal bytewise, which suggests the same encoding ought to be used for char and string data.

I conclude that it would be reasonable, as well as simplest, to require and assume UTF-8, of which 7-bit ASCII is a subset, in all cases (as in my earlier comment). But perhaps my reasoning is faulty.

Best wishes

Jonathan

@ChrisBarker-NOAA
Contributor

This implies that no-one has wanted to record Unicode in char

I wish we could assume that!

But I don't think we can. If the Python lib supports '_Encoding', then someone, somewhere may have used it :-(

That being said, it was never compliant with a documented standard.

So yes, I think we should say "UTF-8" everywhere.

(Which allows ascii)

However, there is one more complication: what about other "ANSI" (i.e. one byte per char) encodings, e.g. Latin-1? Back in the day, any null-terminated char* was accepted -- I'm sure there are some of those files in the wild.

Perhaps they were never CF compliant, but should we say something about what to do with them?

Note: non-ascii one-byte encodings will often error out when decoded as utf-8 :-(

-CHB

@JonathanGregory
Contributor

Dear Chris

Yes, someone may have used the netCDF4-python _Encoding attribute with some encoding other than UTF-8, but - as you say - it was never supported by CF, nor the netCDF library. As far as CF is concerned, there is currently no standard for interpreting char variables. To clarify the position for the future, we can state that it must be UTF-8 as a CF convention.

For CF char variable data before 1.12, we can certainly say

  • The data is CF-compliant, since CF didn't prohibit any code, but CF can't help with interpreting it, because there wasn't any CF convention for it.

  • If there is an _Encoding attribute, you could try to use it.

  • Otherwise, it's a good guess that 00-7F are ASCII.

Beyond that, it's less obvious how far we should go in providing advice. Some possibilities are (based on my reading of the UTF-8 page in wikipedia):

  • C0-F7 are probably the first byte of a UTF-8 sequence, but it depends on whether the following byte(s) are consistent with that.

  • F8-FF do not occur in UTF-8 in any position. That's interesting. They could be Latin-1 (ISO/IEC 8859-1).

  • A0-BF might be Latin-1. They are also possible for the second to fourth byte of a UTF-8 sequence.

  • 80-9F are not used by Latin-1, nor by the first byte of a Unicode sequence, though they can occur in subsequent bytes.

I think that's all too complicated though. I suggest we should stick with a simpler final point:

  • If it's 80-FF, it's not ASCII, and must be some other encoding, possibly UTF-8 or Latin-1, but some codes could not be either.

What do you think?

Cheers

Jonathan

@DocOtak
Member

DocOtak commented Sep 23, 2024

The netcdf library docs appendix B says:

Note on char data: Although the characters used in netCDF names must be encoded as UTF-8, character data may use other encodings. The variable attribute “_Encoding” is reserved for this purpose in future implementations.

netcdf4-python is using an _Encoding attribute for things. In my own data there are string variables, and either xarray or netcdf4-python is setting the _Encoding attribute on each of these variables to utf-8. I've never looked into which one.

Charset detection is a minefield that we should stay away from, see all the external links on that wikipedia page. The general advice I've received in the past about encoding is basically "you cannot guess, you must be told".

UTF-8 is a miracle: every Unicode code point represented, backwards compatible with ASCII, self-synchronizing. I think CF should encourage its use in netCDF for string data and strongly recommend against using anything else for new data.

Here is my opinion for the CF recommendation (a reading-side sketch follows the list):

  • Use the _Encoding attribute; its value should come from the IANA registry of character sets. Note: this list is case insensitive; IANA prefers "us-ascii" but seems to allow "ascii" through the RFCs that define this.
  • If there is no _Encoding, assume UTF-8 (and therefore ASCII is included).
  • If there are invalid utf-8 byte sequences (errors when decoding), or the decoded string has mojibake, you need to ask the data producer or start guessing. Note that latin-1 is quite common and a good place to start.
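
A reading-side sketch of those three steps (illustrative only; the helper function is invented and is not proposed convention text):

    def decode_char_data(raw_bytes, encoding_attr=None):
        if encoding_attr:                            # 1. honour _Encoding if it is present
            return raw_bytes.decode(encoding_attr)
        try:
            return raw_bytes.decode("utf-8")         # 2. otherwise assume UTF-8 (which covers ASCII)
        except UnicodeDecodeError:
            return raw_bytes.decode("latin-1")       # 3. last-resort guess; never fails, but may be mojibake

    assert decode_char_data("Ångström".encode("latin-1")) == "Ångström"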

@JonathanGregory
Contributor

Thanks for your comment, @DocOtak. I think we mostly agree. I didn't say, but I do agree, that the construction of UTF-8 is clever in retaining backward-compatibility with 7-bit ASCII. I also agree that it's not our business to recommend in detail how the user could try to work out what character set was used; it's too complicated.

There are some things in what you said that I'm not sure about:

  • For new data, you say that CF should recommend the use of UTF-8 (including 7-bit ASCII). Why not require? Maybe you'd argue there's no point in requiring it because it's unenforceable? (Some non-UTF-8 sequences can be detected because they're not valid UTF-8, but others would be misinterpreted as valid UTF-8.)

  • You say that xarray or netcdf4-python is setting the _Encoding attribute for string variables. Do you mean variables of string type, or char type? If I understand the documentation correctly, netcdf4-python uses _Encoding only for char variables. I can't find a clear statement about the HDF5 file format itself, but the HDF5 python lib seems clear that HDF5 vlen strings are always UTF-8. That must apply to netCDF-4 string variables too, in which case there's no function for the _Encoding attribute with string variables. Have I misunderstood?

I think we agree about the treatment of existing data:

  • There was no CF convention previously. Therefore the user has to do their best to work it out. We recommend:

  • If there is an _Encoding attribute, try to use it (we can give advice on what it should contain). This applies only for variables, since attributes don't have attributes.

  • Otherwise, try UTF-8.

  • If that doesn't work, guess a different character set, such as Latin-1, or ask the data-writer.

Does that look right to you and @ChrisBarker-NOAA?

Cheers

Jonathan

@ChrisBarker-NOAA
Contributor

For new data, you say that CF should recommend the use of UTF-8 (including 7-bit ASCII). Why not require?

I think we should require.

Maybe you'd argue there's no point in requiring it because it's unenforceable?

Nothing is enforceable, is it?

You say that xarray or netcdf4-python is setting the _Encoding attribute for string variables. Do you mean variables of string type, or char type?

I sure as heck hope it's setting it for both string and char. They are both a string of bytes; without an encoding, there's no way to know what's in them.

However, if they are NOT setting it for string, but rather assuming utf-8, then that's OK :-)

the HDF5 python lib seems clear that HDF5 vlen strings are always UTF-8. That must apply to netCDF-4 string variables too,

The HDF5 python lib and netCDF4 libs are independent code, so ???

Anyway, I think this is the story:

  • We should treat string and char the same way

  • Up until now, there was no CF specification for encoding in char or string data -- so anything was CF compliant. Back to the bad old days of guessing, mojibake and the rest. But this means that any encoding is compliant with CF <= 1.11; nothing we can do about that.

  • But only utf-8 is compliant with CF>=1.12 (if we get this in)

So if you put CF version 1.12, you MUST use utf-8 encoding (which is ascii, if you only use ascii characters)

That's all we need to say for folks writing files.

For folks reading CF <= 1.11 -- char and strings can be any encoding. But it's worth mentioning:

  • If there is an _Encoding attribute, then use that.
  • if there isn't, then do whatever you've been doing for years, which might be:
    • try an encoding detection library (e.g. Python chardet)
    • try utf-8 (which will work with ascii)
    • try latin-1 [*]
    • go to the provider and ask.

I'm not sure we need to say anything more than "do what you always did", but maybe it's worth providing a little advice?

[*] I'm a big fan of latin-1 -- at least in Python, it won't error out on ANY data. It might give you garbage, but it won't raise an error. And if it's an ASCII-compatible encoding, it will at least get that part right. Which is kind of helpful.
