Add support for variables of type string #139

JimBiardCics · 2018-07-19T20:26:57Z

This change is the outcome of discussions at the 2018 netCDF/CF Workshop meeting in Reading, UK. The proposal is to change the language of Chapter 2 to include the string type as a valid variable type, and to modify the description of variables containing text to include variables with a type of string.

The changes can be seen in pull requests #140 and cf-convention/Conformance#7
Placeholder Trac ticket at https://cf-trac.llnl.gov/trac/ticket/175#ticket

The text was updated successfully, but these errors were encountered:

Dave-Allured · 2018-07-19T22:22:02Z

It is probably better to debate changes in GitHub issues rather than in pull requests. Therefore I am re-posting my comment about character arrays here (next comment), and removing it from the pull request #138.

Also I was once told that there is a preference for discussing something in issues first, then create the pull request later. This issue was started in the reverse order, thus I was a bit confused. I am a relative newbie to GitHub, so please educate me if I am off base about this. Just trying to do The Right Thing.

Dave-Allured · 2018-07-19T22:23:40Z

Variable length strings can be stored in character arrays by the use of either null padding or blank padding. The storage is fixed length, but the purpose is variable length. The following sentence in the original pull request conflicts with this usage:

"If the character array option is chosen, the strings in a given array are defined to be equal in length."

This sentence is not essential to the description of strings within character type. It is an unneeded constraint. This sentence is also not necessary for the introduction of string data type. Therefore I request this sentence simply be removed. Thanks.

ChrisBarker-NOAA · 2018-07-19T22:52:17Z

If it is discussing a suggested localized suggested change to text -- a PR is the perfect place for it.

Which this is, yes?

Is this a contentious issue? or just trying to come up with the best wording?

-CHB

Dave-Allured · 2018-07-19T23:56:31Z

I will bear that in mind for future issues. On this one, there is substance in my proposed wording. Let's just keep it here and see if there is any objection to my view of variable length string storage.

JonathanGregory · 2018-07-20T13:57:58Z

Dear all

I support this change being made, but I think it may require changes in more than one place in the conventions document. It needs a general statement somewhere early that an array of character data with a trailing dimension of maximum string length is equivalent to an array of strings without that dimension. Anywhere in the text which mentions the use of a character array to contain a string should be modified, and also any examples which show this could mention it as an alternative. Following the discussion at the meeting, I think this applies only to character variables, not character attributes. Is that correct? Possibly changes are needed in the conformance document too - I haven't looked.

Cheers

Jonathan

JimBiardCics · 2018-07-20T14:09:50Z

I was hoping that stating both ways of forming string variables were valid and effectively interchangeable would be sufficient. I'm not sure where string variables show up in the rest of the document. Any assistance with tracking those places down would be appreciated.

I'm working on another change to address string attributes. The issue there is slightly different.

JonathanGregory · 2018-07-20T14:37:56Z

Dear Jim

I think that would make the document somewhat less useful, especially for anyone who looks up something in the middle of it with a presumption that strings should be allowed, and doesn't find them shown in text or examples. Most users of the document don't sit down and read it from start to finish. Moreover I feel that we must check carefully that this won't cause problems - I don't think it will, but it is better to be safe. In fact the work involved in making this change is mostly that of reading the documents to check what might be affected, I suspect.

Best wishes

Jonathan

JimBiardCics · 2018-07-20T14:47:46Z

@Dave-Allured, I've modified the wording. See what you think.

DocOtak · 2018-07-20T18:30:24Z

In regards to the other sections concern raised by @JonathanGregory, the chapter 6 section 1 on "Labels" has some language talking about character arrays. This was the only other part of the document I could find which talks explicitly about strings in variable data.

JimBiardCics · 2018-07-20T18:32:28Z

Appendix H contains quite a few examples using string variables of type char, and chapters 6 and 7 each have one. What, if anything, do people think should be done with those?

JimBiardCics · 2018-07-20T18:33:00Z

There are small changes in chapters 5 and 7 as well.

ChrisBarker-NOAA · 2018-07-20T18:54:04Z

If string is now the "preferred" way to do it, it would be great to update as many examples as we can...

JimBiardCics · 2018-07-20T19:37:30Z

Perhaps alternate examples? Adding alternatives verbiage to each seems excessive.

Dave-Allured · 2018-07-20T20:29:48Z

I agree with the new wording in PR #138. This handles the concept of variable length strings stored within fixed dimension character arrays. Thanks, Jim.

DocOtak · 2018-07-20T20:34:00Z

I agree with @ChrisBarker-NOAA in that I feel updating existing examples is unnecessary unless string will become the preferred data type, examples are examples, not definitions, and you should always try to find the actual text "rules".

Since there are now two ways of representing equivalent data values, maybe just provide a single example of how to represent the same text variable data as both string and character arrays. Other examples are trying to demonstrate different things.

JimBiardCics · 2018-07-20T20:35:22Z

Replaced my first pull request by a more complete one, and added a pull request for the Conformance document.
pull request #140
pull request cf-convention/Conformance#7

JimBiardCics · 2018-07-20T20:36:52Z

Trac is up again, so I made a Trac ticket as a placeholder.

https://cf-trac.llnl.gov/trac/ticket/175#ticket

JonathanGregory · 2018-07-23T12:40:53Z

Dear Jim

Many thanks for doing a thorough job in finding where changes are needed.

In history.adoc, I think this issue number (139) should be mentioned - I believe we've agreed that this is needed to make the changes traceable, as for trac tickets. Since the issue mentions the pull request number, we don't need to put that in the history as well.
In ch07.adoc, in the paragraph where you've made a small change, I notice there are some backticks ` which shouldn't be there. They are markup for verbatim text, as in Trac. Maybe this syntax is wrong for asciidoc? It's nothing to do with your change, but it could be fixed as well.
In ch02.adoc, it could be helpful to include a CDL snippet showing a string array represented in the two different ways. You have changed all the examples to strings. That implies that you're thinking we should recommend the string type - yes? If so, we should say so. That also have to appears as a recommendation in the conformance document, and warnings would be produced for all use of char arrays in this way. Or are they equally acceptable? If so, we should say that.

Best wishes

Jonathan

JonathanGregory · 2018-07-23T12:41:59Z

Sorry, I didn't mean to close it!

By the way, I found it really convenient to look at your changes with the rich-diff comparison that I learned about at the meeting.

Thanks.

JimBiardCics · 2018-07-23T12:50:38Z

@JonathanGregory I have added the issue number to the history and attempted to fixe the backticks. I had changed half the examples, overall, to string from char. I can add an example if you think we still need one.

hrajagers · 2018-07-23T13:44:32Z

Should in ch02.adoc something be stated about the fact that string data type is only available when using netcdf-4? Or can we state that netcdf-4 is now the de facto standard convention by now?

Note: Recent experiments with the string data type learned that some tools, like MATLAB, still don't support that data type via the NetCDF4 interface. The work around is to reopen the file using the HDF5 interface.

With the introduction of the string data type, we move away from the 1-byte characters. So, this opens up a discussion on encodings. The _Encoding attribute has been suggested in the past. We may want to postpone the _Enconding attribute to another issue, but should we something state about the default encoding already?

JonathanGregory · 2018-07-23T14:02:54Z

@JimBiardCics - Thanks for the changes. Yes, I think an illustration of an equivalent string array and char array, with nothing else, would be useful in ch02. You could also comment that the examples in the remainder of the document are some of one and some of the other. Jonathan

JimBiardCics · 2018-07-23T14:27:03Z

@JonathanGregory - How's that?

JonathanGregory · 2018-07-23T14:28:59Z

Lovely. Many thanks. Jonathan

JimBiardCics · 2018-07-23T15:00:38Z

@hrajagers - I've added a sentence addressing the netCDF-4 issue for string type. I'm going to submit another change after this one to address attributes of type string, and that one will get into ASCII vs Unicode encoding possibilities.

ChrisBarker-NOAA · 2018-07-23T20:50:39Z

@hrajagers brings up encoding -- always a challenge!

Any reason we can't say that ALL "things" of the string type are utf-8? period. end of story.

I'd also love to say that all CHAR data is ASCII (or maybe latin-1) -- if you need unicode, use a string.

The odds are very good that if you are dealing with software that can only handle char, it isn't going to handle Unicode well anyway.

Reasoning:

For "over the wire" encoding, utf-8 encoding is the best choice, and has become a defacto standard - (and an actual standard for e.g. JSON). And lots of people think "utf-8" == "Unicode" -- they are wrong, but if we always use utf-8, then people and tools that handle unicode properly will work well, and tools and people that don't will still mostly work.

See: http://utf8everywhere.org/ for a strong opinion. Personally, I think they are wrong about "in memory", but their arguments do apply to "on disk" or "over the wire" -- essentially any interchange situation.

As for ASCII for CHAR -- the char type (at least in arrays) has to be fixed length -- utf-8 is not a fixed-length encoding -- that is, 10 "characters" may require 10 or more bytes to store. And if a string is truncated naively, it could result in an invalid string.

Since netcdf provides a variable length string that's the obvious way to deal with Unicode.

And UTF-8 is the obvious way to encode Unicode.

JimBiardCics · 2018-07-23T21:00:45Z

To discuss of ASCII / UTF-8, please go to issue #141. I'm hoping to address those questions there.

ChrisBarker-NOAA · 2018-07-23T22:09:35Z

I was about to post there, but that seems to be about attributes in particular, which is only a subset of the issue.

but we're cross-referenced now :-)

JimBiardCics · 2018-07-24T13:03:26Z

@ChrisBarker-NOAA It seems to me that the encoding question more directly comes to a head with attributes, but the two are definitely intertwined. I didn't want to do all of it in one giant change, so that's where I decided to make the break.

JimBiardCics · 2018-10-26T13:03:50Z

So, getting back to this issue, I like @ChrisBarker-NOAA's idea of stating that string variables (and attributes too, but I'll talk about that in #141) should be treated as utf-8 and that char variables are to be restricted to ASCII (or perhaps latin-1). If there is demand, we could open a new ticket to add an 'encoding' attribute for other cases, but I feel like we could leave that sleeping dog lie for now.

Thoughts?

ChrisBarker-NOAA · 2018-10-26T16:36:36Z

+1 (obviously).

And I agree -- the "encoding" attribute could be added later if there is real demand, though I'd much rather not allow arbitrary encoding.

We may want to have an encoding attribute that says "utf-8", so that it will be clear to folks reading the file, but that would be the only valid value for CF.

ChrisBarker-NOAA · 2018-10-26T16:52:42Z

One more note: from #141, someone wrote:

"In the HDF5 case, string encoding is an intrinsic part of the HDF5 string datatype and can only be ASCII or UTF-8. "

So that's it -- "strings will be UTF-8" we're done with that part :-).

Dealing with unicode with bare CHAR arrays is a bad idea -- sure you can do it, as a CHAR array can hold anything -- but we don't recommend using char arrays to hold, e.g. floating point data with an attribute saying that it's an IEEE 754 32 bit float -- we could do that, but it would be a really bad idea.

The one use case that makes at least a little bit of sense is to use CHAR arrays to pass encoded data around, if and only if the library doing the passing around is not expected to interpret the data as text in any way. This is why *nix systems have been able to get away with poorly specified encoding of filenames for so long. But do CF-data handling libraries ever do that?

In short, does anyone need Unicode that can't use a string type?

If modern Fortran netcdf libs can't handle strings, I can't imagine they do the right thing with Unicode anyway.

ghost · 2018-10-26T21:43:17Z

One more note: from #141, someone wrote:

"In the HDF5 case, string encoding is an intrinsic part of the HDF5 string datatype and can only be ASCII or UTF-8. "

So that's it -- "strings will be UTF-8" we're done with that part :-).

That would be me and I got called out on this so let me set the record straight. The HDF5 library does not verify or enforce the encoding of bytes stored in HDF5 strings. The HDF5 string datatype has a tag for declaring the encoding and currently supports two values, for ASCII and UTF-8. The application is responsible for setting the correct tag value.

ChrisBarker-NOAA · 2018-10-26T21:58:46Z

OK, so maybe HDF-% doesn't require UTF-* only, but it also doesn't support anything else -- at least for now.

Is seem is would be a really bad idea to do something unsupported for CF.

marqh · 2019-07-18T15:41:53Z

Hello @JimBiardCics CFers

I'm happy to offer to moderate this discussion, if that is helpful.
I have 'assigned' myself to the ticket, in preparation for a positive response to this offer.

@JimBiardCics please may I ask that you update the description of the pull request to use the text
fixes next to the issue number?
(https://help.github.com/en/articles/closing-issues-using-keywords)
this formally links the issue and PR and eases management of the pair

thank you
mark

marqh · 2019-07-18T15:44:30Z

Please may I ask contributors to this ticket to consider whether there are any in-principal objections, concerns, or key factors to consider to this change, that are still outstanding.
If so, please may you re-raise the topic as a reply to this comment?

If there are targeted comments on the text of the proposed change, please target them at #140, as a review of the textual change.

thank you
marqh

JonathanGregory · 2019-07-18T16:28:45Z

I support this change almost as it stands, and am grateful to Jim for preparing it.

I have one textual concern. In ch06, in the "Labels" section, "not" has been deleted in the sentence, "[a scalar coordinate variable] has the same information content and can be used in the same contexts as a string-valued auxiliary coordinate variable of a size one dimension which has not been added to the data variable. This is a convenience feature." The deletion of "not" shows that it wasn't clear what the sentence is trying to say. Rather than trying to explain it here, I suggest ending the sentence at "size one dimension", followed by the next sentence, "This is a convenience feature." It's explained fully in Section 5.7 on scalar coordinate variables, so there's no need for a half-hearted explanation here.

Cheers, Jonathan

JimBiardCics · 2019-07-18T17:44:25Z

@marqh I added a fixes #139 line to the description. Does it look right?

JimBiardCics · 2019-07-18T17:49:00Z

@JonathanGregory I think that's a great idea. I'll take care of it.

martinjuckes · 2019-07-22T11:52:32Z

Hi @JimBiardCics , this looks good. I'm not sure I've understood the pull-request workflow here, but it looks as though there have been 9 updates recorded in history.adoc since your branch was forked. Does this mean some re-basing is needed?

JimBiardCics · 2019-07-22T13:39:52Z

@martinjuckes We'll do a merge. If there are colliding changes, we'll sort them out.

marqh · 2019-08-13T15:45:59Z

thank you to all contributors for the input. There have now been 26 days with no objections.

I think this change proposal is suitable for acceptance.

I suggest that @JimBiardCics and I ensure that the PR #140 is finalised and suitable for merge, then we will confirm that there is no material change from the current commit: 8c7d66d

and implement the adoption of this change into the master version of the conventions

DocOtak · 2019-08-13T16:05:35Z

While not appropriate for this PR/Issue since it is the topic of strings. Perhaps let's also address the current support netcdf4 has for actual unsigned int types in another issue?

JimBiardCics · 2019-08-14T13:14:54Z

@DocOtak It sounds like a great candidate for another issue/PR!

JimBiardCics mentioned this issue Jul 20, 2018

Add support for variables of type string #140

Merged

JimBiardCics mentioned this issue Jul 20, 2018

Add support for variables of type string cf-convention/Conformance#7

Open

JonathanGregory closed this as completed Jul 23, 2018

JonathanGregory reopened this Jul 23, 2018

JimBiardCics mentioned this issue Jul 24, 2018

Add support for attributes of type string #141

Open

davidhassell added the enhancement Proposals to add new capabilities, improve existing ones in the conventions, improve style or format label Oct 29, 2018

martinjuckes mentioned this issue Jul 17, 2019

Updating definition of coordinate variable to account for NUG changes #174

Closed

marqh self-assigned this Jul 18, 2019

dblodgett-usgs added this to the v1.8 milestone Sep 10, 2019

dblodgett-usgs closed this as completed in #140 Sep 10, 2019

davidhassell mentioned this issue Jan 28, 2020

Implement string data-types for CF-1.8 NCAS-CMS/cfdm#12

Closed

RosalynHatcher mentioned this issue Feb 26, 2020

Implement string data-types for CF-1.8 cedadev/cf-checker#79

Open

RosalynHatcher mentioned this issue Jul 24, 2020

Add support for variables of type string to conformance doc #291

Merged

4 tasks

Add support for variables of type string #139

Add support for variables of type string #139

Comments

JimBiardCics commented Jul 19, 2018 • edited Loading

Dave-Allured commented Jul 19, 2018

Dave-Allured commented Jul 19, 2018

ChrisBarker-NOAA commented Jul 19, 2018

Dave-Allured commented Jul 19, 2018

JonathanGregory commented Jul 20, 2018

JimBiardCics commented Jul 20, 2018

JonathanGregory commented Jul 20, 2018

JimBiardCics commented Jul 20, 2018

DocOtak commented Jul 20, 2018 • edited Loading

JimBiardCics commented Jul 20, 2018 • edited Loading

JimBiardCics commented Jul 20, 2018

ChrisBarker-NOAA commented Jul 20, 2018

JimBiardCics commented Jul 20, 2018

Dave-Allured commented Jul 20, 2018

DocOtak commented Jul 20, 2018

JimBiardCics commented Jul 20, 2018

JimBiardCics commented Jul 20, 2018

JonathanGregory commented Jul 23, 2018

JonathanGregory commented Jul 23, 2018

JimBiardCics commented Jul 23, 2018 • edited Loading

hrajagers commented Jul 23, 2018

JonathanGregory commented Jul 23, 2018

JimBiardCics commented Jul 23, 2018

JonathanGregory commented Jul 23, 2018

JimBiardCics commented Jul 23, 2018

ChrisBarker-NOAA commented Jul 23, 2018 • edited Loading

JimBiardCics commented Jul 23, 2018

ChrisBarker-NOAA commented Jul 23, 2018

JimBiardCics commented Jul 24, 2018

JimBiardCics commented Oct 26, 2018

ChrisBarker-NOAA commented Oct 26, 2018

ChrisBarker-NOAA commented Oct 26, 2018

ghost commented Oct 26, 2018

ChrisBarker-NOAA commented Oct 26, 2018

marqh commented Jul 18, 2019

marqh commented Jul 18, 2019

JonathanGregory commented Jul 18, 2019

JimBiardCics commented Jul 18, 2019

JimBiardCics commented Jul 18, 2019

martinjuckes commented Jul 22, 2019

JimBiardCics commented Jul 22, 2019

marqh commented Aug 13, 2019

DocOtak commented Aug 13, 2019

JimBiardCics commented Aug 14, 2019

JimBiardCics commented Jul 19, 2018 •

edited

Loading

DocOtak commented Jul 20, 2018 •

edited

Loading

JimBiardCics commented Jul 20, 2018 •

edited

Loading

JimBiardCics commented Jul 23, 2018 •

edited

Loading

ChrisBarker-NOAA commented Jul 23, 2018 •

edited

Loading