Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for variables of type string #139

Closed
JimBiardCics opened this issue Jul 19, 2018 · 44 comments · Fixed by #140
Closed

Add support for variables of type string #139

JimBiardCics opened this issue Jul 19, 2018 · 44 comments · Fixed by #140
Assignees
Labels
enhancement Proposals to add new capabilities, improve existing ones in the conventions, improve style or format
Milestone

Comments

@JimBiardCics
Copy link
Contributor

JimBiardCics commented Jul 19, 2018

This change is the outcome of discussions at the 2018 netCDF/CF Workshop meeting in Reading, UK. The proposal is to change the language of Chapter 2 to include the string type as a valid variable type, and to modify the description of variables containing text to include variables with a type of string.

The changes can be seen in pull requests #140 and cf-convention/Conformance#7
Placeholder Trac ticket at https://cf-trac.llnl.gov/trac/ticket/175#ticket

@Dave-Allured
Copy link
Contributor

It is probably better to debate changes in GitHub issues rather than in pull requests. Therefore I am re-posting my comment about character arrays here (next comment), and removing it from the pull request #138.

Also I was once told that there is a preference for discussing something in issues first, then create the pull request later. This issue was started in the reverse order, thus I was a bit confused. I am a relative newbie to GitHub, so please educate me if I am off base about this. Just trying to do The Right Thing.

@Dave-Allured
Copy link
Contributor

Variable length strings can be stored in character arrays by the use of either null padding or blank padding. The storage is fixed length, but the purpose is variable length. The following sentence in the original pull request conflicts with this usage:

"If the character array option is chosen, the strings in a given array are defined to be equal in length."

This sentence is not essential to the description of strings within character type. It is an unneeded constraint. This sentence is also not necessary for the introduction of string data type. Therefore I request this sentence simply be removed. Thanks.

@ChrisBarker-NOAA
Copy link
Contributor

If it is discussing a suggested localized suggested change to text -- a PR is the perfect place for it.

Which this is, yes?

Is this a contentious issue? or just trying to come up with the best wording?

-CHB

@Dave-Allured
Copy link
Contributor

I will bear that in mind for future issues. On this one, there is substance in my proposed wording. Let's just keep it here and see if there is any objection to my view of variable length string storage.

@JonathanGregory
Copy link
Contributor

Dear all

I support this change being made, but I think it may require changes in more than one place in the conventions document. It needs a general statement somewhere early that an array of character data with a trailing dimension of maximum string length is equivalent to an array of strings without that dimension. Anywhere in the text which mentions the use of a character array to contain a string should be modified, and also any examples which show this could mention it as an alternative. Following the discussion at the meeting, I think this applies only to character variables, not character attributes. Is that correct? Possibly changes are needed in the conformance document too - I haven't looked.

Cheers

Jonathan

@JimBiardCics
Copy link
Contributor Author

I was hoping that stating both ways of forming string variables were valid and effectively interchangeable would be sufficient. I'm not sure where string variables show up in the rest of the document. Any assistance with tracking those places down would be appreciated.

I'm working on another change to address string attributes. The issue there is slightly different.

@JonathanGregory
Copy link
Contributor

Dear Jim

I think that would make the document somewhat less useful, especially for anyone who looks up something in the middle of it with a presumption that strings should be allowed, and doesn't find them shown in text or examples. Most users of the document don't sit down and read it from start to finish. Moreover I feel that we must check carefully that this won't cause problems - I don't think it will, but it is better to be safe. In fact the work involved in making this change is mostly that of reading the documents to check what might be affected, I suspect.

Best wishes

Jonathan

@JimBiardCics
Copy link
Contributor Author

@Dave-Allured, I've modified the wording. See what you think.

@DocOtak
Copy link
Member

DocOtak commented Jul 20, 2018

In regards to the other sections concern raised by @JonathanGregory, the chapter 6 section 1 on "Labels" has some language talking about character arrays. This was the only other part of the document I could find which talks explicitly about strings in variable data.

@JimBiardCics
Copy link
Contributor Author

JimBiardCics commented Jul 20, 2018

Appendix H contains quite a few examples using string variables of type char, and chapters 6 and 7 each have one. What, if anything, do people think should be done with those?

@JimBiardCics
Copy link
Contributor Author

There are small changes in chapters 5 and 7 as well.

@ChrisBarker-NOAA
Copy link
Contributor

If string is now the "preferred" way to do it, it would be great to update as many examples as we can...

@JimBiardCics
Copy link
Contributor Author

Perhaps alternate examples? Adding alternatives verbiage to each seems excessive.

@Dave-Allured
Copy link
Contributor

I agree with the new wording in PR #138. This handles the concept of variable length strings stored within fixed dimension character arrays. Thanks, Jim.

@DocOtak
Copy link
Member

DocOtak commented Jul 20, 2018

I agree with @ChrisBarker-NOAA in that I feel updating existing examples is unnecessary unless string will become the preferred data type, examples are examples, not definitions, and you should always try to find the actual text "rules".

Since there are now two ways of representing equivalent data values, maybe just provide a single example of how to represent the same text variable data as both string and character arrays. Other examples are trying to demonstrate different things.

@JimBiardCics
Copy link
Contributor Author

Replaced my first pull request by a more complete one, and added a pull request for the Conformance document.
pull request #140
pull request cf-convention/Conformance#7

@JimBiardCics
Copy link
Contributor Author

Trac is up again, so I made a Trac ticket as a placeholder.

https://cf-trac.llnl.gov/trac/ticket/175#ticket

@JonathanGregory
Copy link
Contributor

Dear Jim

Many thanks for doing a thorough job in finding where changes are needed.

  • In history.adoc, I think this issue number (139) should be mentioned - I believe we've agreed that this is needed to make the changes traceable, as for trac tickets. Since the issue mentions the pull request number, we don't need to put that in the history as well.

  • In ch07.adoc, in the paragraph where you've made a small change, I notice there are some backticks ` which shouldn't be there. They are markup for verbatim text, as in Trac. Maybe this syntax is wrong for asciidoc? It's nothing to do with your change, but it could be fixed as well.

  • In ch02.adoc, it could be helpful to include a CDL snippet showing a string array represented in the two different ways. You have changed all the examples to strings. That implies that you're thinking we should recommend the string type - yes? If so, we should say so. That also have to appears as a recommendation in the conformance document, and warnings would be produced for all use of char arrays in this way. Or are they equally acceptable? If so, we should say that.

Best wishes

Jonathan

@JonathanGregory
Copy link
Contributor

Sorry, I didn't mean to close it!

By the way, I found it really convenient to look at your changes with the rich-diff comparison that I learned about at the meeting.

Thanks.

@JimBiardCics
Copy link
Contributor Author

JimBiardCics commented Jul 23, 2018

@JonathanGregory I have added the issue number to the history and attempted to fixe the backticks. I had changed half the examples, overall, to string from char. I can add an example if you think we still need one.

@hrajagers
Copy link

  1. Should in ch02.adoc something be stated about the fact that string data type is only available when using netcdf-4? Or can we state that netcdf-4 is now the de facto standard convention by now?

Note: Recent experiments with the string data type learned that some tools, like MATLAB, still don't support that data type via the NetCDF4 interface. The work around is to reopen the file using the HDF5 interface.

  1. With the introduction of the string data type, we move away from the 1-byte characters. So, this opens up a discussion on encodings. The _Encoding attribute has been suggested in the past. We may want to postpone the _Enconding attribute to another issue, but should we something state about the default encoding already?

@JonathanGregory
Copy link
Contributor

@JimBiardCics - Thanks for the changes. Yes, I think an illustration of an equivalent string array and char array, with nothing else, would be useful in ch02. You could also comment that the examples in the remainder of the document are some of one and some of the other. Jonathan

@JimBiardCics
Copy link
Contributor Author

@JonathanGregory - How's that?

@JonathanGregory
Copy link
Contributor

Lovely. Many thanks. Jonathan

@JimBiardCics
Copy link
Contributor Author

@hrajagers - I've added a sentence addressing the netCDF-4 issue for string type. I'm going to submit another change after this one to address attributes of type string, and that one will get into ASCII vs Unicode encoding possibilities.

@ChrisBarker-NOAA
Copy link
Contributor

ChrisBarker-NOAA commented Jul 23, 2018

@hrajagers brings up encoding -- always a challenge!

Any reason we can't say that ALL "things" of the string type are utf-8? period. end of story.

I'd also love to say that all CHAR data is ASCII (or maybe latin-1) -- if you need unicode, use a string.

The odds are very good that if you are dealing with software that can only handle char, it isn't going to handle Unicode well anyway.

Reasoning:

For "over the wire" encoding, utf-8 encoding is the best choice, and has become a defacto standard - (and an actual standard for e.g. JSON). And lots of people think "utf-8" == "Unicode" -- they are wrong, but if we always use utf-8, then people and tools that handle unicode properly will work well, and tools and people that don't will still mostly work.

See: http://utf8everywhere.org/ for a strong opinion. Personally, I think they are wrong about "in memory", but their arguments do apply to "on disk" or "over the wire" -- essentially any interchange situation.

As for ASCII for CHAR -- the char type (at least in arrays) has to be fixed length -- utf-8 is not a fixed-length encoding -- that is, 10 "characters" may require 10 or more bytes to store. And if a string is truncated naively, it could result in an invalid string.

Since netcdf provides a variable length string that's the obvious way to deal with Unicode.

And UTF-8 is the obvious way to encode Unicode.

@JimBiardCics
Copy link
Contributor Author

To discuss of ASCII / UTF-8, please go to issue #141. I'm hoping to address those questions there.

@ChrisBarker-NOAA
Copy link
Contributor

I was about to post there, but that seems to be about attributes in particular, which is only a subset of the issue.

but we're cross-referenced now :-)

@JimBiardCics
Copy link
Contributor Author

@ChrisBarker-NOAA It seems to me that the encoding question more directly comes to a head with attributes, but the two are definitely intertwined. I didn't want to do all of it in one giant change, so that's where I decided to make the break.

@JimBiardCics
Copy link
Contributor Author

So, getting back to this issue, I like @ChrisBarker-NOAA's idea of stating that string variables (and attributes too, but I'll talk about that in #141) should be treated as utf-8 and that char variables are to be restricted to ASCII (or perhaps latin-1). If there is demand, we could open a new ticket to add an 'encoding' attribute for other cases, but I feel like we could leave that sleeping dog lie for now.

Thoughts?

@ChrisBarker-NOAA
Copy link
Contributor

+1 (obviously).

And I agree -- the "encoding" attribute could be added later if there is real demand, though I'd much rather not allow arbitrary encoding.

We may want to have an encoding attribute that says "utf-8", so that it will be clear to folks reading the file, but that would be the only valid value for CF.

@ChrisBarker-NOAA
Copy link
Contributor

One more note: from #141, someone wrote:

"In the HDF5 case, string encoding is an intrinsic part of the HDF5 string datatype and can only be ASCII or UTF-8. "

So that's it -- "strings will be UTF-8" we're done with that part :-).

Dealing with unicode with bare CHAR arrays is a bad idea -- sure you can do it, as a CHAR array can hold anything -- but we don't recommend using char arrays to hold, e.g. floating point data with an attribute saying that it's an IEEE 754 32 bit float -- we could do that, but it would be a really bad idea.

The one use case that makes at least a little bit of sense is to use CHAR arrays to pass encoded data around, if and only if the library doing the passing around is not expected to interpret the data as text in any way. This is why *nix systems have been able to get away with poorly specified encoding of filenames for so long. But do CF-data handling libraries ever do that?

In short, does anyone need Unicode that can't use a string type?

If modern Fortran netcdf libs can't handle strings, I can't imagine they do the right thing with Unicode anyway.

@ghost
Copy link

ghost commented Oct 26, 2018

One more note: from #141, someone wrote:

"In the HDF5 case, string encoding is an intrinsic part of the HDF5 string datatype and can only be ASCII or UTF-8. "

So that's it -- "strings will be UTF-8" we're done with that part :-).

That would be me and I got called out on this so let me set the record straight. The HDF5 library does not verify or enforce the encoding of bytes stored in HDF5 strings. The HDF5 string datatype has a tag for declaring the encoding and currently supports two values, for ASCII and UTF-8. The application is responsible for setting the correct tag value.

@ChrisBarker-NOAA
Copy link
Contributor

OK, so maybe HDF-% doesn't require UTF-* only, but it also doesn't support anything else -- at least for now.

Is seem is would be a really bad idea to do something unsupported for CF.

@davidhassell davidhassell added the enhancement Proposals to add new capabilities, improve existing ones in the conventions, improve style or format label Oct 29, 2018
@marqh marqh self-assigned this Jul 18, 2019
@marqh
Copy link
Member

marqh commented Jul 18, 2019

Hello @JimBiardCics CFers

I'm happy to offer to moderate this discussion, if that is helpful.
I have 'assigned' myself to the ticket, in preparation for a positive response to this offer.

@JimBiardCics please may I ask that you update the description of the pull request to use the text
fixes next to the issue number?
(https://help.github.com/en/articles/closing-issues-using-keywords)
this formally links the issue and PR and eases management of the pair

thank you
mark

@marqh
Copy link
Member

marqh commented Jul 18, 2019

Please may I ask contributors to this ticket to consider whether there are any in-principal objections, concerns, or key factors to consider to this change, that are still outstanding.
If so, please may you re-raise the topic as a reply to this comment?

If there are targeted comments on the text of the proposed change, please target them at #140, as a review of the textual change.

thank you
marqh

@JonathanGregory
Copy link
Contributor

I support this change almost as it stands, and am grateful to Jim for preparing it.

I have one textual concern. In ch06, in the "Labels" section, "not" has been deleted in the sentence, "[a scalar coordinate variable] has the same information content and can be used in the same contexts as a string-valued auxiliary coordinate variable of a size one dimension which has not been added to the data variable. This is a convenience feature." The deletion of "not" shows that it wasn't clear what the sentence is trying to say. Rather than trying to explain it here, I suggest ending the sentence at "size one dimension", followed by the next sentence, "This is a convenience feature." It's explained fully in Section 5.7 on scalar coordinate variables, so there's no need for a half-hearted explanation here.

Cheers, Jonathan

@JimBiardCics
Copy link
Contributor Author

@marqh I added a fixes #139 line to the description. Does it look right?

@JimBiardCics
Copy link
Contributor Author

@JonathanGregory I think that's a great idea. I'll take care of it.

@martinjuckes
Copy link
Contributor

Hi @JimBiardCics , this looks good. I'm not sure I've understood the pull-request workflow here, but it looks as though there have been 9 updates recorded in history.adoc since your branch was forked. Does this mean some re-basing is needed?

@JimBiardCics
Copy link
Contributor Author

@martinjuckes We'll do a merge. If there are colliding changes, we'll sort them out.

@marqh
Copy link
Member

marqh commented Aug 13, 2019

thank you to all contributors for the input. There have now been 26 days with no objections.

I think this change proposal is suitable for acceptance.

I suggest that @JimBiardCics and I ensure that the PR #140 is finalised and suitable for merge, then we will confirm that there is no material change from the current commit: 8c7d66d

and implement the adoption of this change into the master version of the conventions

@DocOtak
Copy link
Member

DocOtak commented Aug 13, 2019

While not appropriate for this PR/Issue since it is the topic of strings. Perhaps let's also address the current support netcdf4 has for actual unsigned int types in another issue?

@JimBiardCics
Copy link
Contributor Author

@DocOtak It sounds like a great candidate for another issue/PR!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Proposals to add new capabilities, improve existing ones in the conventions, improve style or format
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants