Add support for attributes of type string #141
I am generally in support of this. I suggest not including a constraint for scalar strings, simply because it seems redundant. I think that existing CF language strongly implies single strings in most cases of CF-defined attributes. |
How different is reading values from a … Prefixing the bytes of a UTF-8 encoded string with the BOM sequence is an odd practice. Although it is permitted, afaik, it is not recommended. Since what gets stored are always the bytes of one string in some encoding, always assuming UTF-8 should take care of the ASCII character set, too. This could cause issues if someone used other one-byte encodings (e.g. the ISO 8859 family), but I don't see how such cases could be easily resolved. Storing Unicode strings using the … |
This issue and issue #139 are intertwined. There may be overlapping discussion in both. |
@ajelenak-thg So I did some digging. I wrote a file with IDL and read it with C. There are no BOM prefixes. I guess some languages (such as Python) make assumptions one way or another about string attributes and variables, but it appears that it's all pretty straightforward. |
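(For anyone who wants to reproduce that BOM check, a minimal Python sketch; the attribute bytes below are invented for illustration:)
```python
# Hedged sketch: detect and strip a UTF-8 BOM from attribute bytes.
import codecs

raw = b"\xef\xbb\xbfdegrees_north"      # as IDL might write a string attribute

print(raw.startswith(codecs.BOM_UTF8))  # True -> a BOM is present
print(raw.decode("utf-8-sig"))          # 'utf-8-sig' strips a leading BOM
```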
@ajelenak-thg I agree that we should state that char attributes and variables should contain only ASCII characters. |
@Dave-Allured When you say "CF-controlled attributes", are you referring to any values they may have, or to values that are from controlled vocabularies? |
@JimBiardCics, by "CF-controlled attributes", I mean CF-defined attributes within "the controlled vocabulary of CF" as you described above. By implication I am referring to any values they may have, including but not limited to values from controlled vocabularies. A warning about avoiding data type … |
The restriction that … Therefore I suggest no mention of a character set restriction outside of the CF controlled vocabulary. Alternatively you could establish the default interpretation of string data (both … |
Hi all, I wasn't quite able to form this into coherent paragraphs, so here are some things to keep in mind re: UTF-8 vs. other encodings:
My personal recommendation is that the only encoding for text in CF netCDF be UTF-8 in NFC with no byte order mark. For attributes where there is a desire to restrict what is allowed (through controlled vocabulary or other limitations), the restriction should be specified using Unicode code points, e.g. "only printing characters between U+0000 and U+007F are allowed in controlled attributes". Text which is in controlled-vocabulary attributes should continue to be char arrays. Freeform attributes (mostly those in 2.6.2, "Description of file contents") could probably be either string or char arrays. |
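(A minimal Python sketch of that recommendation; the helper name is invented. It checks that a controlled-vocabulary value is already NFC-normalized and uses only printing characters in the U+0000..U+007F block:)
```python
import unicodedata

def ok_for_controlled_attribute(s: str) -> bool:
    """Invented helper: NFC-normalized, printing ASCII only."""
    if unicodedata.normalize("NFC", s) != s:       # must already be in NFC form
        return False
    # printing characters within U+0000..U+007F (control chars excluded)
    return all(0x20 <= ord(c) <= 0x7E for c in s)

print(ok_for_controlled_attribute("air_temperature"))  # True
print(ok_for_controlled_attribute("température"))      # False (non-ASCII)
```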
@DocOtak, you said "the netCDF classic format required UTF8 to be in Normalization Form Canonical Composition (NFC)". This restriction is only for netCDF named objects, i.e. the names of dimensions, variables, and attributes. There is no such restriction for data stored within variables or attributes. |
@Dave-Allured yes, I reread the section; object names do appear to be what it restricts. Should there be some consideration of specifying a normalization for the purposes of data in CF netCDF? Text encoding probably deserves its own section in the CF document, perhaps under data types. The topic of text encoding can be very foreign to someone who thinks that "plain text" is a thing that exists in computing. |
@DocOtak, for general text data, I think UTF-8 normalization is more of a best practice than a necessity for CF purposes. Therefore I suggest that CF remain silent about that, but include it if you feel strongly. Normalization becomes important for efficient string matching, which is why netCDF object names are restricted. |
@Dave-Allured I don't know enough about the consequences of requiring a specific normalization. There is some interesting information on the unicode website about normalization. Which suggests that over 99% of unicode text on the web is already in NFC. Also interesting is that combining NFC normalized strings may not result in a new string that is normalized. It is also stated in the FAQ that "Programs should always compare canonical-equivalent Unicode strings as equal", so it's probably not an issue as long as the controlled vocabulary attributes have values with code points in the U+0000 and U+007F range (control chars excluded). |
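(The canonical-equivalence point is easy to demonstrate in Python with `unicodedata`:)
```python
import unicodedata

composed = "\u00e9"      # 'é' as a single code point (already NFC)
decomposed = "e\u0301"   # 'e' plus a combining acute accent (NFD form)

print(composed == decomposed)                                # False: raw comparison fails
print(unicodedata.normalize("NFC", decomposed) == composed)  # True after normalization
```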
@Dave-Allured and @DocOtak,
|
@hrajagers Thanks for the pointer to NUG Appendix A. It's interesting to see in that text that character array, character string, and string are used somewhat interchangeably. I'm curious to know if the NUG authors looked at this section in light of allowing … |
I think we are making good progress on this. I checked the Appendix A table of CF attributes and I think the following attributes can be allowed to hold …
All the other attributes should hold … |
@ajelenak-thg Are you suggesting the other attributes must always be of type … |
Based on the concern expressed so far for backward compatibility, I suggested the former: always be of type … |
On the string encoding issue, CF data can currently be stored in two file formats: netCDF Classic and HDF5. String encoding information cannot be directly stored in the netCDF Classic format, and the spec defines a special variable attribute, `_Encoding`, for that purpose. In the HDF5 case, string encoding is an intrinsic part of the HDF5 string datatype and can only be ASCII or UTF-8. Both … |
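(For the HDF5 side, the per-string character-set flag can be seen with h5py; a sketch only, with an invented file name and attribute values, assuming h5py >= 2.10:)
```python
import h5py

with h5py.File("example.h5", "w") as f:
    # The HDF5 string datatype carries its own charset flag: ASCII or UTF-8.
    f.attrs.create("title_utf8", "ψ of the day",
                   dtype=h5py.string_dtype(encoding="utf-8"))
    f.attrs.create("title_ascii", b"plain title",
                   dtype=h5py.string_dtype(encoding="ascii"))
```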
Yes, NUG Appendix A literally allows only … Personally I think … |
Actually the ASCII/UTF-8 restriction is not enforced by the HDF5 library. This is used intentionally by the netCDF developers to support arbitrary character sets in the netCDF-4 data type … For example, this netCDF-4 file contains a … |
Dear Jim

Thanks for addressing these issues. In fact you've raised two issues: the use of strings, and the encoding. These can be decided separately, can't they?

On strings, I agree with your proposal and subsequent comments by others that we should allow …

For the attributes whose contents are standardised by CF, e.g. …

I recall that at the meeting in Reading the point was made that arrays would be natural for …

Best wishes

Jonathan |
@JonathanGregory I agree with you. I think it would be fine to leave … Regarding the encoding, it seems to me that we could avoid a lot of complexity for now by making a simple requirement that all CF-defined terms and whitespace delimiters in string-valued attributes or variables be composed of characters from the ASCII character set. It wouldn't matter if people used Latin-1 (ISO-8859-1) or UTF-8 for any free text or free-text portions of contents, because both contain the ASCII character set as a subset. The parts that software would be looking for would be parseable. |
@JonathanGregory Another use-case (that I think came up during the Reading meeting) had the … |
Regarding the encoding, I agree that for attribute contents which are standardised by CF it is fine to restrict ourselves to ASCII, in both …

Regarding arrays of strings, I realise I wasn't thinking clearly yesterday, sorry. As we've agreed, …

Happy weekend - Jonathan |
I meant to write: I don't see the particular value for the use of … |
@JonathanGregory The use of an array of strings for history would simplify denoting where each entry begins and ends as entries are added, appending or prepending a new array element each time rather than the myriad different ways people do it now. This would actually be a good thing to standardize. I think we can just not mention string array attributes right now. The multi-valued attributes (other than history, perhaps) pretty much all specify how they are to be formed and delimited. |
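(For illustration, a hedged netCDF4-python sketch of a string-array history attribute; the file name and entries are invented. `setncattr_string` forces an NC_STRING attribute, so this works only in netCDF-4 files:)
```python
from netCDF4 import Dataset

with Dataset("example.nc", "w") as nc:
    nc.setncattr_string("history", [
        "2018-07-26T12:00Z: created by model Y",     # one entry per element,
        "2018-07-27T09:23Z: regridded with tool X",  # no delimiter convention needed
    ])
```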
I think we can just not mention string array attributes right now.
Do we currently allow array of CHAR (i.e. 2D array) for attributes?
According to the netcdf docs:
The current version treats all attributes as vectors; scalar values are
treated as single-element vectors.
Which makes me think no, that’s not possible.
I think allowing the string type should not change what’s allowable.
BTW, I suspect some client software (e.g. py_netCDF4) treat char and string
the same ....
…-CHB
|
Let me throw a big wrench into this argument about not allowing string
arrays.
1. I would prefer a consistent decision and standard about the use of
char vs. string so a user does not need to know where to use char
array, scalar string, or string arrays.
2. Use of string arrays with flag_meanings (not sure it would be needed
with flag_values?) will solve many problems for my program to actually
merge our standards with CF (see the sketch after this comment).
Currently with char arrays we
need to connect all words for a single flag by underscores for space
delimiting. Many of our variable names and attribute names contain
underscores. So when the flag description is parsed and changed to
be more human readable all the attribute and variable names are not
preserved. Automated tools can no longer replace attribute or
variable names with the attribute or variable value. We do this a
lot. We also have lengthy descriptions for our flag_meanings. I
would prefer to use flag_mask, flag_values and flag_meanings as that
general method is better than the one we currently employ.
3. I do see the benefit of storing history as string arrays. Without
checking date stamps I can see how many times the file has been
modified by checking the list length. It also removes any ambiguity
about separators in the history attribute which differs from the CF
standard of space separation and is often institution defined. The
current definition for history attribute is "List of the
applications that have modified the original data." In the python
world the use of "list" is different than the intended definition.
4. I'm starting to get a lot of more complicated data that are
multidimensional but do not share the same units. We would need to
work with udunits, but Cf/Radial is proposing a new standard for
complex data which often have different units for different index in
a second dimension. If we allowed string arrays in units we could
store complex data or other data structures more native to the
intended use, since udunits interprets space characters as
multiplication, not a delimiter.
5. missing_value or _FillValue currently allow one value. For string
type data, allowing string arrays would permit multiple fill values,
which would let numeric data also have multiple fill values defined.
I'm sure there are many data sets that use multiple fill values but
do not define them correctly in the data file.
6. valid_range can be used with string data type
7. Conventions attribute could group multiple indicators with the same
class of conventions. For example ["CF-1.7", "Cf/Radial
instrument_parameters radar_parameters", "ARM-1.3"]
8. and on and on ....
I'm not suggesting the use of all these use cases, but this relatively
small change can go a long way to improve the standard and the future
use of the data.
OK, I've made my case; I'll be quiet now.
Ken
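(A hedged sketch of Ken's item 2, with invented variable and flag meanings: as a string array, each flag meaning keeps its internal spaces and underscores intact.)
```python
import numpy as np
from netCDF4 import Dataset

with Dataset("example.nc", "w") as nc:
    nc.createDimension("time", None)
    qc = nc.createVariable("qc_temperature", "i1", ("time",))
    qc.flag_values = np.array([0, 1, 2], dtype="i1")
    # Each meaning is one array element; identifiers such as
    # 'sensor_a_mean_temp' survive parsing unchanged.
    qc.setncattr_string("flag_meanings", [
        "good",
        "value below valid_min threshold",
        "sensor_a_mean_temp out of range",
    ])
```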
|
@DocOtak @zklaus Sorry if I've pulled the discussion off track. The question of exactly why NUG worded things the way they did is intriguing, but I think Klaus is right that we shouldn't get wrapped around that particular axle in this issue — particularly if we are going to split encoding off into a different issue. I think the take-away is that our baseline is "sane utf-8 unicode" for attributes of type NC_STRING and ASCII for attributes of type NC_CHAR (those created with the C function nc_put_att_text.) |
I agree and would go one small step further: UTF-8 is only an encoding, so we should just say "Unicode" for strings. If we need to restrict that, say to disallow underscore at the beginning or to reserve a separation character like space in attributes right now, we should do so at the character level, possibly using categories as introduced by @ChrisBarker-NOAA above. |
We could do that if and only if netcdf itself were clear about how Unicode is encoded in files. It is for variable names, though I'm not so sure it is anywhere else. But even so, once the encoding has been specified, then yes, talking about Unicode makes sense. Agreed, it's not for this discussion, but:
It's a pretty contorted way to describe it -- but that's netcdf's problem :-) |
Ah yes, I see what you mean; you are right. Speaking always of UTF-8: "multi-byte" here isn't referring to the possibility of having several bytes encode one code point, but to actual code points encoded with more than one byte, thus excluding the one-byte code points, which are exactly the first 128 ASCII characters. Then they allow back in specific ASCII characters. |
Dear all

The issue was opened in 2018 and has seen a long discussion, but no further contributions since 2020. It has been partly superseded, in that CF now permits string-valued attributes to be either a scalar string or a 1D character array (see Sect 2.2). Apart from that, it seems to me that the discussion was mostly concerned with three subjects: …
I propose that this issue should be closed as dormant if no-one resumes discussion on Q2 or Q3 within the next three weeks, before 14th September. Cheers Jonathan |
Thanks for trying to close this out :-)
I just looked, and all I see is this under naming: "...is more restrictive than the netCDF interface which allows almost all Unicode characters encoded as multibyte UTF-8 characters". So yes, I think it's good to be clear there -- maybe it's well defined by netcdf, but it doesn't hurt to be explicit, even if repetitive. Am I correct to say that all strings in netCDF are Unicode, encoded as UTF-8? Whether that's true or not for netcdf -- I think it should be true for CF, and we should say that explicitly in any case.
I would say that we should not restrict these otherwise not-restricted attributes. I'm not sure if that's pursuing it or not pursuing it -- I presume the default is no restrictions? |
Hmm -- not sure where this fits, but it's related: IIUC, CF now allows either the new vlen strings or the "traditional" char arrays. The trick is that UTF-8 is not a one-char-per-codepoint encoding. Could we say that you can only use Unicode (UTF-8) with vlen strings, and char arrays can only hold ASCII? Or is the cat way too far out of the bag for that? Probably -- could we at least encourage vlen strings for non-ASCII text? |
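(The mismatch described above is easy to see in Python: one element of a char array holds one byte, but a UTF-8 code point may occupy several bytes:)
```python
s = "Temperatur über Grund"   # 21 code points, one of them non-ASCII
b = s.encode("utf-8")

print(len(s))  # 21 code points
print(len(b))  # 22 bytes: 'ü' takes two bytes in UTF-8
```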
I don't know either.
I think so as well. That would go sensibly in Sect 2.2 "Data types". We've already said in 2.2 that scalar …

Cheers, Jonathan |
No-one said they wanted to resume Q1 or Q3 within three weeks, but @ChrisBarker-NOAA and I agreed that it would be useful to clarify that strings stored in variables or attributes should be Unicode characters (Q2). To do that, I propose that we replace the first 1.5 sentences of the second para of Sect 2.2 "Data Types", which currently reads …
with …
Also, I suggest inserting the clarification "which has variable length" in this sentence in the first paragraph: …
Does that look all right, @ChrisBarker-NOAA, @zklaus, anyone else? I believe this is no change to the convention, just clarification of the text, so I'm going to relabel this issue as a … |
As PR #543 attempts to clarify a bit about Unicode, I thought I'd post here. I started commenting on the PR, but realized I had way too much to say for a PR, so I'm putting it here. NOTE: maybe this should be a different issue -- specifically about Unicode in CF -- but I'm putting it here for now -- we can copy to a new issue if need be. First, some definitions/descriptions/discussion about Unicode and strings.
2: There is no such thing as a Unicode String (except where defined by a programming language, e.g. Python). When stored in memory or a file, strings, Unicode or not, are stored as bytes, and the relationship between the bytes and the code points is defined by an encoding. Without an encoding, there is no clear way to define what a bunch of bytes means, or, in reverse, how to store a particular set of code points.
UTF-8 is the most common Unicode encoding for storage of text in files, or for passing over the internet (via https, or ...). UTF-16 is used internally by Windows and Java (I think). Anyway -- unless one wants to use UCS-32 (which is what the numpy Unicode type uses), which most folks don't want to use for file storage (it's pretty wasteful of space for virtually all text), then a variable-length encoding is required.

Char arrays and strings in netcdf. So this brings us to the topic at hand -- in netcdf3 the only way to store text was in arrays of type … With netcdf4, a …

So: as far as the netcdf spec is concerned, the only difference between a … That's all I could find in the netCDF docs. Nothing about Unicode or encodings, or ... Which means that as far as the netcdf spec is concerned, you can put anything in either data type. Note that a …

So enter Unicode: as above, storing a "Unicode String", i.e. a collection of code points, requires that the string be encoded, resulting in a set of bytes that can be stored in, you guessed it, a char*. (On Windows, the standard encoding is UTF-16, so a …) So as far as netcdf is concerned, you can stuff Unicode text into either a …

Note that I did find this discussion: …
I also don't know what it does for attributes, because they can't have another attribute to store the `_Encoding`. Anyway -- as this doesn't seem to be defined by any published spec, I hope we can define it for CF. My proposal, in pretty much any context: …
That's it -- pretty simple, really :-) Points to consider:
My thought -- as much as I'd love to be fully restrictive to make things simpler for everyone, the cat's probably out of the bag. So we may have to impose as few restrictions as possible (e.g. …). So -- enough words for you? -- back in the day, a char* would be an ASCII or ANSI encoded string (null terminated), and all was good and simple. |
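(Chris's point 2 above, that bytes are meaningless without an encoding, can be shown in a couple of lines; the byte string is invented:)
```python
raw = b"caf\xc3\xa9"              # bytes pulled from some file

print(raw.decode("utf-8"))        # 'café'  -- right answer if it was UTF-8
print(raw.decode("latin-1"))      # 'cafÃ©' -- mojibake from a wrong guess
```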
Dear Chris

Thanks for the research and your useful exposition of the complexity of the issue. I was hoping that we could add a couple of sentences on this subject, rather than a new chapter. :-)

NetCDF allows Unicode characters in names (of dimensions, variables and attributes). The relevant text from the NUG v1.1 is as follows. (By the way, this quotation indicates that Unidata also regard it as OK to refer to Unicode "characters" instead of "codepoints", in the interest of easy understanding.)
We've agreed that CF should not prohibit characters permitted by the NUG, although we recommend a more restricted list of characters in sect 2.3:
In the previous discussion on this issue, an important point was made: many CF attributes identify netCDF variables or attributes by name, e.g. …
On your final point 2, in my text above I proposed that we should require UTF-8 encoding for …
Which of these should we do? For …

Best wishes, Jonathan |
There's still hope :-)
Darn that google! -- I could have saved a lot of writing if I'd found that.
OK -- very good -- UTF-8 it is -- whew!
That's clear then.
So CF recommends, but does not require, ASCII-only for names -- OK then, that helps, but doesn't avoid the issue :-).
Darn -- but it is what it is.
Also darn. :-)
Makes sense to me. And, in fact, there is a very strong justification for this:
This is critical, as many (most?) programming environments (C, FORTRAN) only work natively with raw binary data (e.g. char*). So it's pretty critical that all char (and string) data are encoded the same way.
And guessing is never good :-(
Requiring UTF-8 is the best way to go -- see the point above about raw binary data. However, as I noted, an … So, as much as I would like to simply require UTF-8, we probably need to say it's preferred, and the default, but other encodings can be used if defined in the `_Encoding` attribute. However, for (global only?) attributes, rather than variable data, there is no way to set an `_Encoding`. So:

For variables: UTF-8 is preferred, and the default, but a different encoding can be used if the `_Encoding` attribute is set.

For attributes: …

As for the content of an `_Encoding` attribute: Do we want to specify only those encodings? And only those spellings? What about non-Unicode encodings -- e.g. latin-1? If we can, it would be nice to keep it simple and only allow Unicode encodings (which gives you ASCII, as a subset of UTF-8). Here's a list of what Python supplies out of the box: https://docs.python.org/3/library/codecs.html#standard-encodings The ones in there that are "all languages" (Unicode) are, I think, the same as the official Unicode list :-). Note that there are big- and little-endian versions of the multi-byte encodings -- as netcdf "endianness is solved by writing all data in big-endian order", I think only the big-endian forms should be allowed. Finally, are the encoding spellings case-sensitive? E.g. the official spelling is "UTF-8" -- but Python, for instance, will accept "utf-8", "UTF_8", etc.
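(On the spelling question: Python, at least, treats encoding names case-insensitively and maps aliases onto one canonical name:)
```python
import codecs

for name in ("UTF-8", "utf_8", "utf8"):
    print(codecs.lookup(name).name)   # 'utf-8' every time
```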
Unfortunately, it is :-(
AFAICT, the only difference between a … Turning that … |
Unfortunately, no -- there is no language-independent concept of a "Unicode String"; there is only a string of bytes and an encoding. So netcdf strings are no easier (but also no harder) than char arrays in that regard. The encoding must be specified. The good news is that we can use exactly the same rules for … |
-Chris

[1] -- a note about Python: internally, Python (v3+) uses a native "Unicode" string data type -- a "string" of Unicode code points. The encoding is an internal implementation detail (and quite complex). This makes Unicode very easy to work with in Python, but there is no way to create a Python str from binary data without knowing the encoding. This created a LOT of drama around filenames in Python 3 on *nix. On Unix, a filename is a … |
Recently, and in particular this week, I have focussed my attention on the CF2024 Workshop that is currently running, so I have not followed the conversation in this issue (or any other, for that matter). Could you please point me to where … |
I am asking because in Discussion/#323 I suggested that we actually should "blacklist", i.e. prohibit, a short list of ASCII characters that have the potential to cause severe problems -- for example, how about including a space? This is in fact allowed according to the NUG text @JonathanGregory refers to:
The potential problem caused by a variable name containing a space has been raised here, here, and here. |
I had the same thought -- however, while I agree that we may want to blacklist characters, that's not the point in this context, unless we want to blacklist ALL non-ASCII characters. I would rephrase Jonathan's point as: "We've agreed that CF should not prohibit characters permitted by the NUG without good reason specific to CF" :-) And thinking more on this: we're going to need to support Unicode in general, so we might as well not restrict it any more than we need to. |
I suggest that characters to avoid should be listed as a CF preference, not a strict prohibition. This could be worked on as a new issue or separate PR. |
Dear @larsbarring

You wrote: …
I'm surprised (looking through the records) to see how often we've discussed this point! The most recent occasion was your #468 Small update to text in section 2.3 regarding character sets, which installed the present wording. That went into CF 1.11, and I mentioned it this morning in my presentation.

Before that, the matter was discussed in @Dave-Allured's #237 Remove restrictions on netCDF object names. Despite the title of that issue, we didn't remove any restrictions then; rather, we decided that it was a recommendation, not a requirement, that variable, dimension, attribute and group names begin with a letter and be composed of letters, digits, and underscores. "It is recommended" replaced the previous wording, which had "should", to make it clearer.

This issue in turn referred to @davidhassell's #226 to Correct the wording in the conformance document section 2.3 "Naming Conventions". The conformance document had "must", which we changed to "should" to agree with the convention. There was some discussion of this point, and we decided that "should" was correct, also referring back to Trac Ticket 157 of 2017.

The convention had always said "should", and so does COARDS. My reading is that COARDS definitely intends a recommendation, not a requirement, because for requirements it says "will" rather than "should". It's more disciplined than CF has been, but perhaps the discussion on BCP-14 will change the situation!

Since other characters have never been prohibited, and we've discussed this many times, I don't think we ought to change it now. However, we can certainly introduce new recommendations against certain characters. I think that should be in a different issue from this one, as @Dave-Allured suggests.

Best wishes

Jonathan |
Dear Chris (reverting to the main subject of this issue)

You're quite right, you cannot have an attribute of an attribute in netCDF4. That would be the most convenient way to record the encoding of an individual …

As for … The HDF5 python library documentation says
This statement applies to both attributes and datasets. (I assume our netCDF-4 variables are HDF5 datasets - is that correct?) If my understanding is correct, it means that netCDF-4 …

In summary, …

Another reason to do that is that in CF we treat …

I conclude that it would be reasonable, as well as simplest, to require and assume UTF-8, of which 7-bit ASCII is a subset, in all cases (as in my earlier comment). But perhaps my reasoning is faulty.

Best wishes

Jonathan |
I wish we could assume that! But I don't think we can. If the Python lib supports '_Encoding', then someone, somewhere, may have used it :-( That being said, it was never compliant with a documented standard. So yes, I think we should say "UTF-8" everywhere (which allows ASCII). However, there is one more complication: what about other "ANSI" (i.e. one byte per char) encodings, e.g. Latin-1? Back in the day, any null-terminated char* was accepted -- I'm sure there are some of those files in the wild. Perhaps they were never CF compliant, but should we say something about what to do with them? Note: non-ASCII one-byte encodings will often error out when decoded as UTF-8 :-( -CHB |
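(A hedged sketch of the fallback alluded to here; the helper name is invented. Try UTF-8 first, and fall back to latin-1 only because it can never raise, not because the result is guaranteed to be right:)
```python
def decode_legacy(raw: bytes) -> str:
    """Best-effort decoding for pre-existing, possibly non-UTF-8 data."""
    try:
        return raw.decode("utf-8")    # covers ASCII and valid UTF-8
    except UnicodeDecodeError:
        # latin-1 maps every byte to a code point, so it never fails,
        # though other 8-bit charsets will come out as mojibake.
        return raw.decode("latin-1")
```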
Dear Chris

Yes, someone may have used the netCDF4-python `_Encoding` attribute. For CF …
Beyond that, it's less obvious how far we should go in providing advice. Some possibilities are (based on my reading of the UTF-8 page in wikipedia):
I think that's all too complicated though. I suggest we should stick with a simpler final point:
What do you think? Cheers Jonathan |
The netcdf library docs appendix B says:
netcdf4-python is using an _Encoding attribute for things. In my own data there are string variables, and either xarray or netcdf4-python is setting the _Encoding attribute on each of these variables to … Charset detection is a minefield that we should stay away from; see all the external links on that wikipedia page. The general advice I've received in the past about encoding is basically "you cannot guess, you must be told". UTF-8 is a miracle: every Unicode point represented, backwards compatible with ASCII (8-bit), self-synchronizing. I think CF should encourage its use in netCDF for string data and strongly recommend against using anything else for new data. Here is my opinion for the CF recommendation:
|
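(As an illustration of honoring `_Encoding` when reading a char variable with netCDF4-python; a sketch only, with an invented file and variable:)
```python
from netCDF4 import Dataset, chartostring

with Dataset("example.nc") as nc:
    var = nc.variables["station_name"]        # NC_CHAR, dims (station, strlen)
    var.set_auto_chartostring(False)          # keep the raw byte array
    enc = getattr(var, "_Encoding", "utf-8")  # fall back to UTF-8 if unset
    names = chartostring(var[:], encoding=enc)
```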
Thanks for your comment, @DocOtak. I think we mostly agree. I didn't say, but I do agree, that the construction of UTF-8 is clever in retaining backward-compatibility with 7-bit ASCII. I also agree that it's not our business to recommend in detail how the user could try to work out what character set was used; it's too complicated. There are some things in what you said that I'm not sure about:
I think we agree about the treatment of existing data:
Does that look right to you and @ChrisBarker-NOAA? Cheers Jonathan |
I think we should require.
Nothing is enforceable, is it?
I sure as heck hope it's setting it for both … However, if they are NOT setting it for …
Anyway, I think this is the story:
So if your file declares CF version 1.12, you MUST use UTF-8 encoding (which is ASCII, if you only use ASCII characters). That's all we need to say for folks writing files. For folks reading CF <= 1.11, char and strings can be any encoding. But it's worth mentioning:
I'm not sure we need to say anything more than "do what you always did", but maybe it's worth providing a little advice? [*] I'm a big fan of latin-1 -- at least in Python, it won't error out on ANY data. It might give you garbage, but it won't raise an error. And if the data is in an ASCII-compatible encoding, it will at least get that part right. Which is kind of helpful. |
Attributes with a type of string are now possible with netCDF-4, and many examples of attributes with this type are "in the wild". As an example of how this is happening, IDL creates an attribute with this type if you select its version of string type instead of char type. It seems that people often assume that string is the correct type to use because they wish to store strings, not characters.

I propose to add verbiage to the Conventions to allow attributes that have a type of string. There are two ramifications to allowing attributes of this type, the second of which impacts string variables as well.

1. A string attribute can contain 1D atomic string arrays. We need to decide whether or not we want to allow these or limit them (at least for now) to atomic string scalars. Attributes with arrays of strings could allow for cleaner delimiting of multiple parts than spaces or commas do now (e.g. flag_values and flag_meanings could both be arrays), but this would be a significant stretch for current software packages.
2. A string attribute (and a string variable) can contain UTF-8 Unicode strings. UTF-8 uses variable-length characters, with the standard ASCII characters as the 1-byte subset. According to the Unicode standard, a UTF-8 string can be signaled by the presence of a special non-printing three-byte sequence known as a Byte Order Mark (BOM) at the front of the string, although this is not required. IDL (again, for example) writes this BOM sequence at the beginning of every attribute or variable element of type string.

Allowing attributes containing arrays of strings may open up useful future directions, but it will be more of a break from the past than attributes that have only single strings. Allowing attributes (and variables) to contain UTF-8 will free people to store non-English content, but it might pose headaches for software written in older languages such as C and FORTRAN.

To finalize the change to support string type attributes, we need to decide: should we allow arrays of strings, and should we allow UTF-8 in string attributes and (by extension) variables?

Now that I have the background out of the way, here's my proposal.

1. Allow string attributes. Specify that the attributes defined by the current CF Conventions must be scalar (contain only one string).
2. Allow UTF-8 in attribute and variable values. Specify that the current CF Conventions use only ASCII characters (which are a subset of UTF-8) for all terms defined within. That is, the controlled vocabulary of CF (standard names and extensions, cell_methods terms other than free-text elements of comments(?), area type names, time units, etc.) is composed entirely of ASCII characters. Free-text elements (comments, long names, flag_meanings, etc.) may use any UTF-8 character.
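(A hedged sketch of what the proposal's two-part rule could look like to a checker; the attribute lists here are illustrative, not normative:)
```python
CONTROLLED = {"standard_name", "units", "cell_methods", "axis"}
FREE_TEXT = {"comment", "long_name", "flag_meanings", "title"}

def conforms(attr: str, value: str) -> bool:
    if attr in CONTROLLED:
        return value.isascii()   # CF-defined terms: ASCII only (Python 3.7+)
    return True                  # free text: any UTF-8 character is fine

print(conforms("standard_name", "air_temperature"))   # True
print(conforms("long_name", "Température de l'air"))  # True
print(conforms("units", "°C"))                        # False -- not ASCII
```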
Trac ticket: #176 but that's just a placeholder - no discussion took place there (@JonathanGregory 26 July 2024)