
Localized metadata in NetCDF files #244

Open
turnbullerin opened this issue Jun 1, 2023 · 69 comments
Labels
question Further information is requested or discussion invited

Comments

@turnbullerin

turnbullerin commented Jun 1, 2023

Hi Everyone!

So I work for the Government of Canada and I am working on defining the required metadata fields for us to publish data in NetCDF format. We'll be moving a lot of data into this format, so we are trying to make sure we get the format right the first time. The CF conventions are our starting point for metadata attributes.

As the data will be officially published by the Government of Canada eventually, we will have to make sure the metadata is available in both English and French. If the data contains English or French text (not from a controlled list), it needs to be translated too. I haven't found any efforts towards creating a convention for bilingual (or multilingual) metadata and data in NetCDF formats, so I wanted to reach out here to see if anyone has been working on this so we could collaborate on it.

My initial thought is that the metadata should be included in such a way as to make it easy to programmatically extract each language separately. This would allow applications that use NetCDF files (or tools that draw on the CF conventions like ERDDAP) to display the available language options and let the user select which one they would like to see without additional clutter. It should also be included in a way that does not impact existing applications to ensure compatibility.

Of note though is that some data comes from controlled lists where the values have meaning beyond the English meaning. This data probably shouldn't be translated as it would lose its meaning. For many controlled lists, applications can use their own lookup tables to translate the display if they want, and bigger vocabulary lists (like GCMD keywords) can have translations available on the web.

ISO-19115 handles this by defining "locales" (a mix of a mandatory ISO 639 language code, optional ISO 3166 country code, and optional IANA character set) and using PT_FreeText to define one value per locale for different text fields. I like this approach and I think it can translate fairly cleanly to NetCDF attributes. To align with ISO-19115, I would propose two global attributes, one called locale_default and one called locale_others (I kept the word 'locale' in front instead of at the end like in ISO-19115 since this groups similar attributes and I see this is what CF has usually done). The locale_others could use a prefix system (like what keywords_vocabulary uses) to separate different values. I would propose using the typical standards used in the HTTP protocol for separating the language, country, and encoding, e.g. language-COUNTRY;encoding. Maybe encoding and country are not necessary, I'm not sure, I just know ISO included them.

I would then propose using the prefixes from locale_others as suffixes on existing attribute names to represent the value of that attribute in another locale.

For example, this would give us the following global attributes if we wanted to include English (Canada), French (Canada), and Spanish (Mexico) in our locales and translate the title:

  :locale_default = 'en-CA;utf-8';
  :locale_others = 'fra:fr-CA;utf-8 esp:es-MX;utf-8';
  :title = 'English Title';
  :title_fra = 'Titre française';
  :title_esp = 'Título en español';

I was torn on whether the default locale should define a prefix too; if it did, it would let one use the non-suffixed attribute name for a combination of languages as the default (for applications that don't support localization); for example:

  :locale_default = 'eng:en-CA;utf-8';
  :locale_others = 'fra:fr-CA;utf-8 esp:es-MX;utf-8';
  :title = 'English Title | Titre française';
  :title_eng = 'English Title'
  :title_fra = 'Titre française';
  :title_esp = 'Título en español';

But then this seems like an inaccurate use of locale_default, since the default is actually a combination. Maybe English should be added to locale_others in this case and locale_default changed to something like und;utf-8, or the format could even be shown with a delimiter pattern like [eng] | [fra].

I haven't run into a data variable that needs translating yet, but if so, my thought was to define an attribute on the data variable that would allow an application to identify all the related localized variables (i.e. same data, different locale) and which variable goes with which locale. Something like

  var_name_en:locale = ':var_name';      # locale identified in locale_default
  var_name_fr:locale = 'fra:var_name';   # locale identified in locale_others

Thoughts, feedback, any other suggestions are very welcome!

@turnbullerin turnbullerin added the question Further information is requested or discussion invited label Jun 1, 2023
@czender

czender commented Jun 1, 2023

Interesting idea. If you'd like more input/discussion, this could form the basis for a breakout at the upcoming 2023 CF Workshop.

@turnbullerin
Author

Oh, that's cool - I can't find any info on that yet; I guess more info will be coming later?

@ethanrd
Member

ethanrd commented Jun 2, 2023

Hi Erin - The dates for the 2023 CF Workshop (virtual) were just announced (issue #243). There has also been a call for breakout session proposals (issue #233). Further information will be broadcast here as well so everyone watching this repo will get the updates. A web page for the workshop will be added to the CF meetings page in the next month or two.

@Zeitsperre

Zeitsperre commented Aug 22, 2023

Hi @turnbullerin and others,

I wanted to echo my interest in seeing a metadata translation convention come about from the CF Conventions. My team and I have been developing some implementations of metadata translations to better support our French-speaking users, as well as open the possibility of supporting other language translations for climate metadata.

One of our major open source projects for calculating climate indicators (xclim) has an internationalization module built into it for conditionally providing translated fields based on the ISO 639 Language code found within the running environment's locale or set explicitly. For more information, here is some documentation that better describes our approach:

We would love to take part in this discussion if there happens to be a session in October.

Best,

@Dave-Allured

Erin, I like the general direction of your localization proposal. I would like to suggest a simplified strategy. I do not see a need for those global attributes or the level of indirection represented in them. In short, I suggest simply adding ISO-19115 suffixes to standard CF attribute names, as needed. Here are a few more details.

  • The general form would be attribute_name.lang-country.

  • lang-country is the two-part locale, exactly as prescribed by ISO-19115.

  • country is optional, just like you described.

  • Never use the third element, IANA charset. The entire netCDF name space is fixed on UTF-8, so charset here is unnecessary.

  • The entire ISO suffix is optional. With that, you are left with just three possible basic forms, and no other complications. The ISO suffixes can then easily be machine-recognized by aware software:

    • attribute_name
    • attribute_name.lang
    • attribute_name.lang-country

More details:

  • Define English as the core language of CF controlled vocabulary.
  • Define English as the universal default locale. This is basically saying, keep the status quo as the default.
  • When localization is desired, add the lang suffix.
  • Avoid country except when the data publisher deems it necessary.
  • attribute_name and attribute_name.en are fully redundant and should not both be applied together.
  • This localization strategy is intended equally for both CF-controlled and user-defined attribute names.
  • The primary application is to provide bilingual support. For example, under this scheme you might have long_name.en and long_name.fr on the same data variable.

The choice of the primary delimiter will be controversial. I like period "." for visual flow and general precedent in language design. Some will hold out for underscore as the CF precedent. I think underscore is overused in CF. In particular, the ISO suffix deserves some kind of special character to stand out as a modifier.

The general use of special characters such as "." and "-" is part of proposal cf-convention/cf-conventions#237.
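
To make the machine-recognition point concrete, here is a minimal Python sketch (the regex and helper are purely illustrative, not part of any proposal text, and assume the dot and hyphen delimiters above):

import re

# name, name.lang, or name.lang-country (subtag lengths loosely follow BCP 47)
LOCALIZED = re.compile(r"^(?P<name>[A-Za-z_][A-Za-z0-9_]*)"
                       r"(?:\.(?P<lang>[a-z]{2,3})(?:-(?P<country>[A-Z]{2}))?)?$")

def split_locale_suffix(attr_name):
    """Return (base_name, lang, country); lang/country are None when absent."""
    m = LOCALIZED.match(attr_name)
    if m is None:
        return attr_name, None, None
    return m.group("name"), m.group("lang"), m.group("country")

print(split_locale_suffix("long_name"))        # ('long_name', None, None)
print(split_locale_suffix("long_name.fr"))     # ('long_name', 'fr', None)
print(split_locale_suffix("long_name.fr-CA"))  # ('long_name', 'fr', 'CA')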

@turnbullerin
Author

turnbullerin commented Oct 3, 2023

@Dave-Allured

Thanks for your feedback!

I think there is value in the two attributes.

Defining English (and which English, eng-US, eng-CA, eng-UK, etc.) as the universal default is very Anglo-centric. There is a clear use case for datasets produced in other countries to have a primary language that is not English, and documenting it is valuable to inform locale-aware applications processing CF-compliant files. Not everyone will want to provide an English version of every string. So having an attribute that defines the default locale of the text strings in the file is still useful I feel, but perhaps we could define the default if not present as "eng" (no country specified) so that it can be omitted in many cases.

For the other locales, I think it helps applications and humans reading the metadata to know what languages are in the file. If we did not list them, applications would need to be aware of all ISO-639 codes and check each attribute if it exists with any mix of country/language code suffix to build a list of all languages that exist in the metadata. Having a single attribute list them all has a lot of value in my opinion. In unilingual datasets, it can of course be omitted.

This also raises the question on if we should use ISO 639-1 or ISO 639-2/T or ISO 639-3. ISO 19115 allows users to specify the vocabulary that codes are taken from, but if we were to specify one I would recommend ISO 639-2/T for language and ISO 3166 alpha-3 for country (this aligns with the North American Profile of ISO-19115). Alternatively, we could just specify the delimiter and let people override the vocabulary for language and country codes in attributes if they want.

I am torn on the delimiter - I see the value in what you propose, but I would not want to delay this issue if #237 is not adopted quickly, and I foresee some technical issues adopting it even if it is agreed to (for example, the Python netCDF4 library exposes attributes as Python attributes on the dataset or variable objects, and thus they are restricted to [A-Za-z0-9_]; allowing arbitrary names would require a significant change there before the standard could be adopted; see https://unidata.github.io/netcdf4-python/#attributes-in-a-netcdf-file).

I do like the idea of standardizing the suffixes though and if we can agree on a format, I support that wholeheartedly. I would propose _xxxYYY where xxx is the lower-case ISO 639-2/T code and YYY is the ISO 3166 alpha-3 country code. If #237 is adopted, .xxx-YYY is also a good solution I think. We could include both for compatibility with applications and libraries that won't support #237 right away if adopted.

Also, I fully agree on UTF-8. It supports all natural languages as far as I know, so there should be no issue with using it as the default encoding. However, I do note that the NetCDF standard allows for other character sets - I guess we are then just saying that all text data must be in UTF-8 (i.e. _Encoding="utf-8")?

In terms of display, I agree with you that locale-aware applications (given a country and language code they should display in) should use the attributes in the following order:

  1. attribute_langCOUNTRY
  2. attribute_lang
  3. attribute
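
A minimal sketch of that fallback order (the helper and attribute names below are hypothetical, using the underscore-suffix spelling above):

def lookup_localized(attrs, name, lang, country=None):
    # Try name_langCOUNTRY, then name_lang, then the plain name.
    candidates = ([f"{name}_{lang}{country}"] if country else []) + [f"{name}_{lang}", name]
    for key in candidates:
        if key in attrs:
            return attrs[key]
    return None

attrs = {"title": "English Title", "title_fra": "Titre (fra)", "title_fraCAN": "Titre (fra-CAN)"}
print(lookup_localized(attrs, "title", "fra", "CAN"))  # 'Titre (fra-CAN)'
print(lookup_localized(attrs, "title", "esp"))         # 'English Title' (falls back to the unsuffixed attribute)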

@Dave-Allured

Erin, thank you for your very thoughtful reply.

Anglo-centric: Yes, I was thinking about that when I wrote down my initial thoughts, but I decided to test the waters. I am glad to have triggered that direct conversation. English is a dominant language in the science and business worlds. However, this CF enhancement is a great opportunity to introduce constructs that level the playing field, within the technical context of file metadata.

I agree immediately to the value of a global attribute that sets the default language for the current data file, such that all string attributes with no suffix are interpreted in the specified language. I leave the name of such attribute up to you and others. Yes, keep the default as English if the global attribute is not included.

@larsbarring

I think adding support for multiple languages to selected CF attribute values would be a great addition. As I have absolutely zero insight into the technical aspects, please bear with me if I am asking a stupid question: if this functionality is implemented without a universal default language, does it mean that all string-valued attributes are expected to follow a specified locale? If so, how would CF attributes that can only take values from a controlled vocabulary be treated, e.g. units, standard_name, cell_methods, axis, calendar?

Thanks,
Lars

@Dave-Allured

List of languages present: It really is no problem to scan a file's metadata, pull off all the language specifiers, and sort them into an organized inventory. This is the kind of thing that can be programmed once, added to a convenience library, and then used by everybody. If you have a redundant inventory attribute, you immediately have issues with maintenance and mismatches. Such issues will persist forever.

@Dave-Allured

ISO vocabulary: It would be really nice if CF could settle on single universal choices for the lang and country vocabs. I really like the cadence of [dot] [two] [dash] [three], and no extra steps for alternative vocabularies. Failing that, I would suggest deferring to an ISO 19115 self-identifying scheme if there is such a thing. I suppose there could be a vocabulary identifier global attribute, but I would like to avoid that if possible.

@turnbullerin
Author

@larsbarring I think we would apply this only to natural language attributes, not to those taking their values from a controlled vocabulary.

So title, summary, acknowledgement, etc. are translated; units, standard_name, cell_methods, etc. are not.

Perhaps some form of identification of those would be useful?

@turnbullerin
Author

turnbullerin commented Oct 3, 2023

@Dave-Allured

I think identifying what is and is not a language specifier might be challenging. Assuming a suffix pattern of attribute_xxx[YYY] (country optional), I would write the algorithm as:

  1. Look at every attribute name.
  2. If it has an underscore, take the text from the last underscore to the end of the string and continue. Otherwise, move on to the next attribute name (not a locale).
  3. If the suffix is not either 3 or 6 letters long, or is not all lower case (if 3 letters) or the first three lower case and the last three upper case (if 6 letters), move on (not a locale).
  4. Check that the first three letters are a defined ISO 639-2/T code and the last three (if present) are a defined ISO 3166 alpha-3 code (this requires a list of all valid codes that must be updated as ISO changes those vocabularies). If not, move on (not a locale).
  5. Assemble and deduplicate the results.

Versus, with an attribute, it is:

  1. Read the attribute and split it by spaces.

I think, while it can be done, having an attribute listing all languages in the file greatly simplifies the code for understanding which languages are present (which is the point of some of the metadata; we could calculate geospatial_max_lon and geospatial_min_lon ourselves, but we have those attributes for convenience). It also ensures that attributes which happen to look like valid localized attributes are not actually treated as such.
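
For comparison, a rough sketch of the scan-based approach (the code lists below are tiny placeholders; a real implementation would need the complete, maintained ISO 639-2/T and ISO 3166 alpha-3 registries):

ISO_639_2T = {"eng", "fra", "spa"}       # placeholder; the real registry is much longer
ISO_3166_ALPHA3 = {"CAN", "USA", "MEX"}  # placeholder; the real registry is much longer

def scan_for_locales(attr_names):
    # Guess which attribute-name suffixes look like _xxx or _xxxYYY locale tags.
    found = set()
    for name in attr_names:
        if "_" not in name:
            continue  # step 2: no underscore, not a locale
        suffix = name.rsplit("_", 1)[1]
        if len(suffix) == 3 and suffix.islower() and suffix in ISO_639_2T:
            found.add(suffix)
        elif (len(suffix) == 6 and suffix[:3].islower() and suffix[3:].isupper()
              and suffix[:3] in ISO_639_2T and suffix[3:] in ISO_3166_ALPHA3):
            found.add(f"{suffix[:3]}-{suffix[3:]}")
    return found

print(scan_for_locales(["title", "title_fra", "title_fraCAN", "standard_name"]))
# {'fra', 'fra-CAN'}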

@Dave-Allured

Identifying: Yeah. ;-) Add this to my list of reasons for dot notation.

Suffixes ... We could include both ...

I see great value in settling on a single, optimal syntax up front, and not providing alternative syntaxes. I also value adopting an exact syntax from ISO 19115, rather than having a new CF creation. You already see my preference for dot and dash, and my reasons. I think it is worth holding out for the optimal syntax. I see a growing interest in character set expansion for CF.

The classic netCDF APIs included special character handling from the moment of their creation. Python can adapt.

I like 2-letter ISO 639 language codes, but 3-letter will be okay too. Choose one. I defer to your greater expertise on the various ISO flavors. I am not well studied there.

@Dave-Allured

Erin, take everything I said as mere suggestions. I do not want to bog you down with too much technical detail, right before the upcoming workshop. Good luck!

@turnbullerin
Author

turnbullerin commented Oct 3, 2023

So, after today's workshop on this, here's a rough draft of what I think we should include for the moment. It is still open for discussion.

ADDITION TO 2.5 (prior to 2.5.1 heading, after the existing text)

Files that wish to provide localized (i.e. multilingual) versions of variables shall reference section #TBD for details on
how to do so.

ADDITION TO 2.6 (prior to 2.6.1 heading, after the existing text following 2.6)

Files that wish to provide localized (i.e. multilingual) versions of attributes shall reference section #TBD for details on
how to do so.

NEW SECTION

TBD. Localization

Certain attributes and variables in NetCDF files contain natural language text. Natural language text is written for a specific locale: this defines the language (e.g. English), the country (e.g. Canada), the script (e.g. the Latin alphabet), and other features. This section defines a standard pattern for localizing a file, which means specifying the default locale of the file and providing alternative versions of such attributes or variables in other locales using a suffix. The use of localization is OPTIONAL. If localization information is not provided, applications SHOULD assume the locale of the file is en.

Localization of attributes and variables is limited to natural language string values that are not taken from a controlled vocabulary. See Appendix A for recommendations on localization of CF attributes. Non-CF text attributes that use a natural language may also be localized using these rules. Locales are defined by a "locale string" that follows the format specified in BCP 47.

Localized files MUST define an attribute locale_default containing a locale string. All natural language attributes and variables without a language suffix MUST be written in this language. The default language of a file should be the one with the most complete set of attributes and variables in that particular language and, ideally, the original language the attributes and variables were written in.

Localized files with more than one locale MUST define an attribute locale_others which is a blank separated list of locale strings. Natural language attributes and variables MAY then be localized by creating an attribute or variable with the same name but ending in [LOCALE], replacing LOCALE with the relevant locale string. Any natural language attribute or variable ending in [LOCALE] must be provided in the given locale.

Applications that support localized NetCDF files SHOULD apply BCP 47 in determining the appropriate content to show a user if the requested locale is not available. If one cannot be found, the default value to display MUST be the attribute without suffix if available. Supporting localization is OPTIONAL for applications.

The following is an example of a file with Canadian English (the default), Canadian French, and Mexican Spanish, with the title and summary attributes translated but the Spanish summary missing.

:locale_default = "en-CA";
:locale_others = "fr-CA es-MX";
:title = "English Title";
:title[fr-CA] = "Titre française";
:title[es-MX] = "Título en español";
:summary = "English Summary";
:summary[fr-CA] = "Sommaire française";

An application supporting localization would display the following:

Selected Language    en-CA            fr-CA                es-MX              jp
Title                English Title    Titre française      Título en español  English Title
Summary              English Summary  Sommaire française   English Summary    English Summary

ADDITION TO APPENDIX A

  • Add a column for "Locale-Aware" (Y or N) or maybe add a new data type of S for non-locale-aware string and S-L for locale-aware string?
  • Locale-aware string attributes:
    • comment
    • flag_meanings (? they have underscores but would be helpful ?)
    • history (? I feel like this will be complex since they are automatically updated but having a translated version of the history would be helpful ?)
    • institution
    • long_name
    • references
    • source
    • title

References
https://www.rfc-editor.org/info/bcp47
https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

@DocOtak
Member

DocOtak commented Oct 3, 2023

@turnbullerin I did more research after our post meeting discussion:

  • Here is the W3C Language Tags and Locale Identifiers for the World Wide Web, which basically summarizes BCP 47. Somewhat interestingly, it requires that you cite BCP 47 and not any of the underlying RFCs.
  • In the W3C document, the language subtags (e.g. "en", "fr") come from a controlled list published by the IANA (the IP address authority). This list is extensive: I searched for a few of the languages on Wikipedia's List of endangered languages in Canada and every language I checked had an entry in the IANA vocabulary. Some of the indigenous languages I've personally been exposed to (Kumeyaay, Hawaiian Pidgin, Louisiana Creole) all had entries; even Star Trek's Klingon language has one. The W3C document specifically mentions this list to avoid ambiguity as to which ISO 639 list to use (and, I think, to remain compatible with it).
  • Wikipedia has a nice summary of these IETF language tags and how the various RFCs relate to each other. Extension U might be of interest.

@DocOtak
Member

DocOtak commented Oct 3, 2023

Here is a CDL strawman for what I was asking about regarding namespacing:

netcdf locale {

// global attributes:
		:locale_default = "en-CA" ; // how to interpret non namespace attrs
		:locale_others = "fr-CA, es-MX" ; // format that matches the Accept-Language priority list in HTTP
		:title = "English Title" ;
		string :fr-CA\:title = "Titre française" ; // the : is escaped by nc dump, I made this netCDF file with python
		string :es-MX\:title = "Título en español" ; // and the netCDF4 python library forces string type if non ASCII code points exist
}

I think @ethanrd said there was an attribute namespace discussion, but my quick searching couldn't find it. I would suggest that : become a reserved character in CF for locale use in attribute names.

Happy for more discussion on this at tomorrow's (or Thursday's) session. I also have some code I'd like to share.

@turnbullerin
Author

turnbullerin commented Oct 4, 2023

Will update to cite BCP 47 explicitly - I imagine that's so that if the underlying RFCs change, the reference doesn't have to change. I think the IANA list is fine (I imagine it's what the RFCs refer to) and we can include a link.

Rather than following the Accept-Language in HTTP, I think we should match the current standard for lists in CF (space-delimited, no commas).

Here's a (very old) discussion I found on namespacing: https://cfconventions.org/Data/Trac-tickets/27.html

Personally I find namespacing for languages confusing; namespacing is usually used to group things of a common type rather than to mark a more specific version of a thing. Instead of namespacing at the beginning, maybe we could instead reserve a trailing set of square brackets for containing a locale, like title[fr-CA] (it kind of looks like XPath then)? As long as the Unicode issue is resolved and going in soon - if it's rejected, maybe we can just replace the hyphens with underscores (so title_fr_CA).

@DocOtak
Member

DocOtak commented Oct 4, 2023

@turnbullerin Adding this here so it isn't lost in the zoom chat.

I coded up some examples using Python and xarray (the ncdump CDL is at the bottom): https://github.com/DocOtak/2023_cf_workshop/blob/master/localization/localized_examples.ipynb

My takeaway from the unicode breakout was that the proposal will not be rejected, but details need to be worked out. So we can expect any of the options that use attribute names outside what is currently allowed to be OK in the future.

@turnbullerin
Author

Thanks for the coding example!

I was looking into what ERDDAP supports, and apparently it only supports NetCDF attributes that follow the pattern [A-Za-z_][A-Za-z0-9_]*. I will flag this to them to see if we can gain some traction for updating that fairly quickly, as it will be a showstopper for me personally until it is updated. While maybe we should consider that there might be other libraries that will choke on a full Unicode attribute name, I'm not sure we should be making decisions solely based on what libraries have chosen to do (especially when it doesn't align with what NetCDF allows to start with).

datasets.xml error on line #184
While trying to load datasetID=cnodcPacMSC50test (after 1067 ms)
java.lang.RuntimeException: datasets.xml error on or before line #184: In the combined global attributes,    attributeName="publisher_name[en]" isn't variableNameSafe. It must start with iso8859Letter|_ and contain only iso8859Letter|_|0-9 .
 at gov.noaa.pfel.erddap.dataset.EDD.fromXml(EDD.java:486)
 at gov.noaa.pfel.erddap.LoadDatasets.run(LoadDatasets.java:364)
Caused by: java.lang.RuntimeException: In the combined global attributes, attributeName="publisher_name[en]" isn't variableNameSafe. It must start with iso8859Letter|_ and contain only iso8859Letter|_|0-9 .
 at com.cohort.array.Attributes.ensureNamesAreVariableNameSafe(Attributes.java:1090)
 at gov.noaa.pfel.erddap.dataset.EDD.ensureValid(EDD.java:829)
 at gov.noaa.pfel.erddap.dataset.EDDTable.ensureValid(EDDTable.java:677)
 at gov.noaa.pfel.erddap.dataset.EDDTableFromFiles.<init>(EDDTableFromFiles.java:1915)
 at gov.noaa.pfel.erddap.dataset.EDDTableFromNcFiles.<init>(EDDTableFromNcFiles.java:131)
 at gov.noaa.pfel.erddap.dataset.EDDTableFromFiles.fromXml(EDDTableFromFiles.java:501)
 at gov.noaa.pfel.erddap.dataset.EDD.fromXml(EDD.java:472)
 ... 1 more

That said, they have to consider other metadata formats as well, so there might be restrictions in those.

@MathewBiddle

From the ERDDAP docs

destinationNames MUST start with a letter (A-Z, a-z) and MUST be followed by 0 or more characters (A-Z, a-z, 0-9, and _). ('-' was allowed before ERDDAP version 1.10.) This restriction allows data variable names to be the same in ERDDAP, in the response files, and in all the software where those files will be used, including programming languages (like Python, Matlab, and JavaScript) where there are similar restrictions on variable names.

@turnbullerin
Author

@MathewBiddle yeah, that's going to be an issue - that said, cf-convention/cf-conventions#237 has identified several very good use cases where these restrictions are not reasonable for the description of scientific variables (notably some chemistry names that include apostrophes, dashes, and commas) so I don't think that is going to block this change.

@MathewBiddle

I see you created an issue in the ERDDAP repo, so I'll comment over there on the specifics for ERDDAP.

I just need to say that this is a fantastic proposal and I'm glad to see such a robust conversation here.

@turnbullerin
Author

After discussions with the ERDDAP people, I think a full Unicode implementation is going to take a long time and I suspect there are other applications out there who will also struggle to adapt to the new standard. There are a lot of special characters out there that have special meanings ([] is used as a hyperslab operator in DAP for example) and I'm concerned about interoperability if we do something that greatly changes how names usually work.

I would propose that we then stick to the current naming convention for attributes and variables in making a proposal for localization (possibly using the double underscore to make it clearly a separate thing) for now since it would maximize interoperability with other systems that use NetCDF files. We could keep the prefix system or we could just use the locale but replacing hyphens with underscores (so title_en_CA and title_fr_CA).

@Dave-Allured

Dave-Allured commented Oct 9, 2023

Here are some further suggestions.

  • BCP 47 is an excellent choice for the referred standard. It was designed for data applications such as netCDF, among other things.

  • In the CF spec, the first mention should use a full reference such as "IETF BCP 47 language tags", as demonstrated in the Wikipedia title. Subsequent references should use the shorthand BCP 47.

  • BCP 47 references other standards for elements such as language and region. Therefore, do not mention these other standards in the proposed CF text. Including them would be distracting and might conflict with future evolution of BCP 47.

  • By all means, show a few examples.

  • BCP 47, RFC 5646, 4.1 Choice of Language Tag includes recommendations for minimizing tags. IMO this is important enough to be paraphrased in CF.

    "A subtag SHOULD only be used when it adds useful distinguishing
    information to the tag. Extraneous subtags interfere with the
    meaning, understanding, and processing of language tags."

  • Use the exact BCP 47 syntax in the content of the locale attribute and related attributes. These are string contents, not the attribute names themselves. In particular, keep using the ASCII hyphens, as prescribed. E.g. locale = "fr-CA".

  • Tags attached to attribute names continue to be a difficult topic. Let us consider that to be a side conversation.

@turnbullerin
Author

turnbullerin commented Oct 11, 2023

@Dave-Allured excellent points. I will rewrite as suggested and will shift the text to its own repo here so we can do a pull request when we're done.

After thinking about this a lot, I think I'm seeing some good real use cases for why one might not want to follow a particular naming convention - in certain contexts, some characters might be more challenging to use and predicting them all is difficult (see my post on the Unicode thread for reserved characters in different contexts). Making what I think of as a fairly core feature of metadata (multilingualism) dependent on Unicode support or even broader US-ASCII support is maybe not the best choice. Downstream applications relying on NetCDF files might specify their own standard. That said, using an alternative naming structure like [en] or .en makes it fairly clear that it isn't part of the variable name and follows NetCDF core rules, so I do like it. I just am concerned about the interoperability.

My suggestion to resolve this would be to define default suffixes like [en] or .en and allow users to alter the suffixes by providing a map instead of a list in locale_others. In support of that, I would ban colons and spaces in suffixes and locales (which I don't think BCP 47 allows anyway) for clarity. So, for example (using the [en] pattern as the default, without prejudice), these three configurations would be valid:

# Example 1
:locale_others = "fr";
:title[fr] = "French Title";

# Example 2
:locale_others = ".fr: fr";
:title.fr = "French Title";

# Example 3
:locale_others = "_fr: fr";
:title_fr = "French Title";

The code for it, in Python, would be something like

import typing

def parse_locale_others(other_locales: str) -> dict[str, str]:
    # Build a map of suffix -> locale string from the locale_others attribute.
    locale_map = {}
    pieces = [x for x in other_locales.split(' ') if x != '']
    i = 0
    while i < len(pieces):
        if pieces[i][-1] == ":":
            # Explicit mapping: a "suffix:" entry followed by its locale string.
            locale_map[pieces[i][:-1]] = pieces[i + 1]
            i += 2
        else:
            # Bare locale string: use the default "[locale]" suffix.
            locale_map[f"[{pieces[i]}]"] = pieces[i]
            i += 1
    return locale_map


def localized_title(metadata: dict[str, typing.Any]) -> dict[str, typing.Optional[str]]:
    # Return a map of locale string -> title, with None where a title is missing.
    default_locale = metadata['locale_default'] if 'locale_default' in metadata else 'en'
    other_locales = parse_locale_others(metadata['locale_others']) if 'locale_others' in metadata else {}
    titles = {
        default_locale: metadata['title'] if 'title' in metadata else None
    }
    for locale_suffix in other_locales:
        localized_title_key = f"title{locale_suffix}"
        titles[other_locales[locale_suffix]] = metadata[localized_title_key] if localized_title_key in metadata else None
    return titles

Edit: We can also add text strongly suggesting people use the default unless there is a good reason not to.

@turnbullerin
Author

turnbullerin commented Oct 23, 2023

@larsbarring for clarity, in Option 1 the format of the suffix is entirely up to the originator of the file and is specified completely in locale_others. All of the following files would be valid ways of specifying the French title

:locale_others = "_fr: fr-CA";
:title_fr = "French Title";

OR

:locale_others = "_fr_CA: fr-CA";
:title_fr_CA = "French Title";

OR

:locale_others = ".fr-CA: fr-CA";
:title.fr-CA = "French Title";

OR 

:locale_others = "[fr-CA]: fr-CA";
:title[fr-CA] = "French Title";

OR

:locale_others = "--foobar: fr-CA";
:title--foobar = "French Title";

The list of valid suffixes can then be determined from the locale_others attribute, and any attribute or variable ending in a valid suffix is then considered to be a localized version of the non-suffixed attribute or variable. The downside of this option is the potential for confusion and a more complex parsing algorithm, but the upside is that data originators can define the scheme that works best for them and their use case and that doesn't conflict with any other names in their file. So if following ERDDAP conventions is important, they can specify suffixes that meet ERDDAP conventions. If they use periods to mean something in their variable names, they can use the square bracket syntax instead. Or they can invent their own.

@turnbullerin
Author

How about an approach that uses meta variables to contain localization information? This approach is inspired by how geometry_containers work in CF. I've coded up an example that takes this to the extreme, as I extended the idea all the way to localizing the data itself; I'll try to explain it here.

  • We reserve 2 or 3 new attribute names that apply to global (and potentially variable) attributes:

    • locale - a string containing a single BCP 47 locale identifier
    • localizations - a string containing a space-separated list of variables containing localized attributes for this scope, global or variable.
    • (optionally) localized_data - only on variables; indicates that the data itself should be localized.

For the attributes:

  • All CF attributes (and ACDD ones or whatever) continue to use the standardized English attribute names. The locale of the values of those attributes is given in the new attribute locale, which must contain a BCP 47 locale tag.
  • If other localizations are available, the attribute localizations must contain a space-separated list of other variable names (like the coordinates attribute in data variables) in the dataset.
  • On a data variable, the special attribute localized_data may be present with some truthy value (I used 1) that indicates the localization-providing meta variable also contains localized data that should replace the data.

Localization providing meta variable:

  • The actual variable names of the meta variables are not controlled, but must follow other naming restrictions already in CF or your environment (ERDDAP, MATLAB, etc.) so they may appear in that space-separated list.
  • A variable referenced by the localizations attribute is a localization meta variable
  • This variable contains a locale attribute with a BCP 47 locale tag with the locale of the attribute values
  • All other attributes on this meta variable are localized versions of the attributes in the referencing scope (global or variable), e.g. title would still be title. Not all attributes of the referencing variable must also be present on the meta variable, only the localized ones; i.e. the meta variable's attributes must be a strict subset of the referencing variable's attributes.
  • If the localized_data attribute of the referencing variable is set, then this meta variable must contain data with the same shape as the referencing variable.

Other notes:

  • I intentionally omitted the locale tags in the localizations attribute and opted for it to only contain variable names that themselves have a locale attribute.
  • I wouldn't specify a default locale if one is not defined in the dataset, but rather separate datasets into "aware" or "naive" indicated by the presence of the locale attribute in the globals or on a variable.

Advantages I found with this approach:

  • No need to come up with attribute name mangling conventions
  • No need to allow different characters than already allowed
  • The global/primary attributes remain uncluttered, only two/three additional attributes with well defined/controlled values
  • I was able to immediately extend this idea to the data itself without much work, even if we don't want to allow it now.
  • ERDDAP can ignore the localization-providing variables since, IIRC, you need to configure ERDDAP with the specific variables you want to expose (even if the underlying netCDF file has more)

Disadvantages:

  • Many new variables in the file; in my example with many languages, this looks cluttered. I would expect most real-world usage to be English + one other locale
  • Probably other things I cannot think of due to being a pythonista

Edit: formatting of the Disadvantages

Just to make sure I understand, this is proposing basically one extra variable with no data per language that would have the global attributes set as variable attributes? And one extra variable per language per variable with both localized metadata (i.e. long_name) and, if applicable, the actual data localized? Then tracking all of that with attributes to connect the dots?

This feels inefficient to me but I'll let others weigh in as well :).

@larsbarring

larsbarring commented Oct 23, 2023

@turnbullerin thanks for explaining how you envisage your option 1.

I will here continue my previous comment that I had to pause. As I wrote, I think that we should be very careful in overloading the underscore with conceptually new roles. I interpret earlier comments from @turnbullerin and @aulemahal (and possibly others) as saying that this is seen as a necessity to meet restrictions from downstream systems and applications, rather than something desirable in its own right.

I do think that interoperability is a key concept for CF (essentially that is why we have CF in the first place...). But there will always be software somewhere for which some new functionality or concept will not be possible to implement at all, or just not practical for some reason. Hence I think that we have to be concrete and specific when using concerns for interoperability as an argument.

In this issue ERDDAP has been used as a use case of an important downstream application. Thank you @rmendels for your comment regarding ERDDAP, and for the link to the ERDDAP/erddap#114 issue!

When browsing through that issue I see that the conversation soon expanded to deal with the implications for ERDDAP if all (or at least a large set of) unicode characters were to be allowed in attribute names, and in variable names. In that respect it pretty much mirrors what is going on in cf-convention/cf-conventions#237. This was maybe where we were at in our conversation here a couple of weeks ago when "your issue" was initiated. Since then the conversation here has developed so that now only two or three additional characters are needed to implement localization. And these are hyphen -, as well as either period . or the two square brackets [ ], all from the good old ASCII character set. @rmendels do you think this in any way makes it more tractable for ERDDAP?

@DocOtak
Member

DocOtak commented Oct 23, 2023

@turnbullerin Yes, basically one extra variable per language/locale. I didn't want to use the term "namespace", but this is a mechanism I saw in the netCDF-LD ODC proposal. My example is a little busy as I hadn't thought about variable localization at all yet in these discussions, and when I realized I could localize data, it felt really powerful and I immediately tried it. The localization variables could be shared between data variables, so perhaps not every data variable would need an independent localization variable (e.g. if it uses entirely controlled attributes).

The intent of my proposal was to avoid all the attribute name convention arguments. My proposal uses some pretty well established CF mechanisms to keep our proposal from conflicting with what might already be in the file. One of the other issues that the ERDDAP team raised was how to parse attribute names and how that might conflict in existing datasets.

Using a non-standard BCP 47 locale tag (i.e. one that has had dashes replaced with underscores) would, I think, be bad. Even though I disagree strongly with using netCDF variable and attribute names as programming language symbols... I'm now more hesitant about introducing any sort of parsing grammar for the attribute names themselves, given the concerns expressed by the ERDDAP team (@Dave-Allured ?). So my most recent proposal completely does away with needing to parse attribute names, other than matching them exactly within the same file.

I suspect that, given what I know about how ERDDAP is configured, the extra variable proposal would allow localizations to be added to an existing dataset on ERDDAP today and it would ignore the extra variables, unless reconfigured to be aware of them. The extra attributes in the data variables and global attributes would not have anything ERDDAP breaking in them. ERDDAP would continue to be unaware of localizations in the dataset until that functionality is added.

I would like to prepare a "real" data file with only French and English to see what it would actually look like. @turnbullerin do you have anything that could be used for this example?

@larsbarring

From the recent conversation over at ERDDAP/erddap#114 I think the position of the ERDDAP folks is clear. They are pretty dependent on a character set limited to [A-Za-z0-9_] for variables and attributes.

From my side I do not have much more to contribute regarding how to introduce localization into CF than what I stated before. If the top priority is to support existing software (irrespective of age and provenance), then using underscore to implement localization seems to be the only option. The drawback is that this introduces a new role for the underscore as a delimiter between attribute and locale. Moreover, and importantly, CF would then become even more locked into the current character set restrictions, while the general netCDF community goes to the other extreme by allowing almost all of Unicode - and this at the same time as there are (and will be) more and more well-motivated requests from various communities for relaxing the restrictions. But that is a conversation better suited to cf-convention/cf-conventions#237.

If the conclusion is that the CF community should go ahead with underscore to implement localisation, I will not be the one to block it.

@DocOtak
Member

DocOtak commented Oct 26, 2023

@larsbarring I've attempted an option that eliminates the use of underscore or any attribute name parsing (only attribute values) in this comment. Please take a look. If my kitchen sink example is too busy or hard to understand, I could make a simpler one.


PS: @turnbullerin Don't be discouraged by this long process, actual changes to the conventions take time and everyone here is a volunteer

@larsbarring

Hi @DocOtak, it took me a little while and some experimentation to get into what you suggest. To me it looks like a general and powerful approach, but also a bit awkward in requiring one variable per locale and per variable that has localized attributes (as @turnbullerin notes). This might be a possible solution. At the same time I was looking back with (at least somewhat) fresh eyes on other suggested solutions:

  • The solution I was pushing for (e.g. title.fr_CA) has the drawback that it reserves the period for a specific function. If the aim is to have the same rules for attribute names and variable names, which I am not sure we have to, this goes against what has been requested elsewhere. And this problem will remain the same irrespective of which character(s) we might select to delimit the locale. Because of that we might as well stay within the current character set, which means that I am [reluctantly!] accepting underscore as delimiter for this particular purpose.

  • Looking at @turnbullerin's second alternative ("suffix starting with an underscore and replacing hyphens with underscores (i.e. title_fr_CA)": I think this alternative is complicated by the fact that the attribute may have any number of underscores and the locale identifier, if at all present, might have zero or one underscore. This is a complication when decoding at which underscore to make the break between attribute and its locale.

  • Turning to the first alternative ("if you had locale_others = "_fr: fr-CA"; your French title would be in title_fr"), I think the flexibility is more of a complication than a help.

Based on these comments, here is another simplified alternative inspired by Erin's first and second alternative. Only the one global attribute locales is needed:

  1. If it is not present there is no information as to the language used in the relevant attributes, and there are no localized attributes. This is the present situation.
  2. If present it will contain a space separated list of <key>:<tag> pairs.
  3. The <key> is either the [reserved] word default, or a string beginning with an underscore, and no underscores elsewhere.
  4. The <tag> is a known IETF BCP 47 language tag.
  5. If the key is default then the <tag> is supposed to inform, without any guarantee, about the language used in the relevant attributes that do not have a language tag as suffix. If this key is not present there is no information as to the language used in these attributes.
  6. Localized attributes are identified by the suffix formed by attaching a key from the list at the end of the attribute name.
  7. The localization will be applied to all attributes (throughout the file) that have a suffix that is among the keys listed in locales.
  8. If a key is not used as a suffix in any attribute name then nothing happens. If the suffix of an attribute is not a key listed in locales then the attribute (incl. the suffix) is basically not a CF attribute.

An example:

// global attributes:
        :locales = "default:en-US _sv:sv _esmx:es-MX" ;
		:title = "English Title" ;
		:title_sv = "Svensk titel" ;
		:title_esmx = "Título en español" ;

In this example I have used "mangled" language tags as keys, but this is not required (though perhaps good practice?). This has the advantage of easy reading for humans, and still simple decoding for software. If one wants to restrict the freedom in choosing keys, an alternative is to allow only "loc1", "loc2", "loc3", ..., but I do not think this is necessary.

This suggestion has the following advantages: it requires only one global attribute, the keys (suffixes) only have an underscore as the first character, the tags follow the established format, and it is "lightweight".

It seems so simple that I wonder if I have overlooked something?
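
To show what I mean by simple decoding, here is a minimal Python sketch (a hypothetical helper, just my reading of the rules above):

def parse_locales(locales_attr):
    # Split 'default:en-US _sv:sv _esmx:es-MX' into (default_tag, {suffix: tag}).
    default_tag = None
    suffix_map = {}
    for entry in locales_attr.split():
        key, _, tag = entry.partition(":")
        if key == "default":
            default_tag = tag
        else:
            suffix_map[key] = tag
    return default_tag, suffix_map

print(parse_locales("default:en-US _sv:sv _esmx:es-MX"))
# ('en-US', {'_sv': 'sv', '_esmx': 'es-MX'})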

@turnbullerin
Author

turnbullerin commented Oct 27, 2023

@larsbarring Your concept seems very similar to what I was proposing and I think it is great. I think there's some value in separating out the "default" tag, but I'm not married to the idea of it being in a separate attribute; having it with a "magic" suffix seems fine too. I'm just not a fan of "magic number" type things, and adding an extra attribute made more sense to me. As an alternative to a magic suffix, we could say that locales must follow the format DEFAULT_TAG SUFFIX1:TAG1 SUFFIX2:TAG2 ... and just omit a suffix entirely for the default?

Is there value in restricting the character set like this though? From what I was reading, CF doesn't tend to make things mandatory without a good cause. I appreciate the "underscore = space" argument but I think that's actually a good reason to not make it REQUIRED so others can make their own decisions on how to mangle and what characters to use or omit.

Instead, I would suggest we RECOMMEND starting suffixes with an underscore followed by ASCII letters (A-Z and a-z) for maximum compatibility. To avoid parsing complications, I would suggest though that we say that colons MUST NOT be part of the suffix (and they won't be part of the language tag by BCP 47), which makes it very easy to parse and to identify the default tag if we use my suggestion above (it is the one without a colon after splitting on spaces).

@turnbullerin
Author

turnbullerin commented Oct 27, 2023

For consistency, we could also give it a "blank" suffix, which I like better than a keyword like "default" (so it would be locales = ":en-US _sv:sv _esmx:es-MX"; in your example)

@turnbullerin
Author

PS: @turnbullerin Don't be discouraged by this long process, actual changes to the conventions take time and everyone here is a volunteer

Thanks for the pick-me-up :) I'm not too discouraged; I work for the Government, lol. Change takes time, and even if I'm usually more of the "well, try something and take good notes, then do it better next time" approach, I recognize that a major addition like this to a significant and widely used standard will be both contentious and lengthy to agree on. But it's so worth it :). Plus I get paid to have these discussions at work, which is nice.

@turnbullerin
Author

If the top priority is to support existing software (irrespective of age and provenance), then using underscore to implement localization seems to be the only option. The drawback is that this introduces a new role for the underscore as a delimiter between attribute and locale. Moreover, and importantly, CF would then become even more locked into the current character set restrictions, while the general netCDF community goes to the other extreme by allowing almost all of Unicode.

I think this is good cause to RECOMMEND but not REQUIRE the A-Za-z0-9_ limitation for localization. It lets groups move forward with a more modern version of the attributes where their technology supports it, but gives them the information they need to understand the impact. It also lets them pick a delimiter that isn't misunderstood by whatever other packages they're using if they don't like what we decide as a recommendation/default.

@larsbarring

@turnbullerin a couple of comments and questions

@larsbarring Your concept seems very similar to what I was proposing and I think it is great.

Yes, it is your idea, no doubt, I was just making some minor adjustments here and there: credit where credit's due.

Regarding which of the following is best I am not sure:
locales = "default:en-US ....";
locales = ":en-US ....";
locales = "en-US ....";
I guess the upper one is easier for humans and the middle is more consistent with the fact that for the non-localized attribute there is simply nothing. The lower one I feel is less attractive, but I can live with either. Anyway, do you suggest that the default is mandatory or optional?

I am not sure that I follow when you write:

... suggest we RECOMMEND starting suffixes with an underscore followed by ASCII letters (A-Z and a-z) for maximum compatibility.

Given the current CF limitation to [A-Za-z0-9_] for variable and attribute names, which I think might take some time to change, should I understand that you suggest that any of the other characters is acceptable (although not RECOMMENDED), e.g. titleZfrca or title9en as a localized title? (I don't think ... :-) Anyway, I take your point regarding the general direction.

@turnbullerin
Author

turnbullerin commented Oct 30, 2023

Given the current CF limitation to [A-Za-z0-9_] for variable and attribute names, which I think might take some time to change, should I understand that you suggest that any of the other characters is acceptable (although not RECOMMENDED), e.g. titleZfrca or title9en as a localized title? (I don't think ... :-) Anyway, I take your point regarding the general direction.

@larsbarring other than the ":" character, I think it would be acceptable but not recommended practice. It doesn't affect a programmatic interpretation of the attributes; it's just more confusing to human readers. I think people would avoid that anyways. But it would allow things like title__en or title_en_ca to be used. And, if CF ever changes its limitation from A-Za-z0-9_, we won't have to rewrite this paragraph to let people use things like .en-CA as a suffix (but also we aren't dependent on changing that limitation).

Maybe it would be better to say:

we RECOMMEND starting suffixes with an underscore followed by ASCII letters (A-Z and a-z) for maximum compatibility, but suffixes MUST consist of characters allowed for CF attribute and variable names and MUST NOT contain a colon.

Though the last is redundant and perhaps confusing as long as CF doesn't allow colons in attribute/variable names anyways.

As an analogy for why I feel this way, I would note CF doesn't restrict people from doing confusing things in other areas - for example, I can name my variables var1, var2, var3, var4, etc. and this is perfectly CF legal. I don't even need to follow the "spaces are underscores" convention, I can name my variable RelativeHumidity or relativehumidity or relhumid or whatever I feel like (as long as my standard_name is right). I wouldn't say it's a recommended best practice but it doesn't make a file non-compliant.

@turnbullerin
Author

turnbullerin commented Oct 30, 2023

I'd also add quickly that a REQUIRED format of _[A-Za-z]* still leaves them with lots of room to do silly things - so we're still relying on common sense for human readability. Like the following would still be CF compliant:

locales = "_fr:es";
title_fr = "Spanish Title";

OR

locales = "_suffixOne:fr";
title_suffixOne = "French Title";

We are relying on people to choose suffixes that clearly represent the locale with any system where we let them define a suffix, so my thought is to leave it as open as possible and trust them to do something sensible for human readability (as long as we can parse it).

@Dave-Allured

I have started CF #477 to enable period (.) and hyphen (-) in attribute names only. This is in support of my recommended strategy, attribute.lang-country where lang-country is any BCP 47 language tag. This is proposal 3 in Erin's summary above.

#477 is intended to remove one roadblock to adopting proposal 3, or similar strategies that need either the period or hyphen characters. #477 is not intended to express preference or foreclose on any other localization strategies. If you agree with adding these two characters for attribute names only, please post a supporting comment on #477.

@larsbarring

Hi Erin @turnbullerin,
Now that proposal cf-conventions/#477 is accepted, would you be willing, perhaps together with @Dave-Allured, to prepare an enhancement issue and pull request in the cf-conventions repo based on your good start and the comments in this thread?

I think this would be a very useful extension of the CF Conventions.

Many thanks,
Lars

@DocOtak
Member

DocOtak commented Mar 4, 2024

Hi All, Just getting back into all the CF things after my long expedition (and Ocean Science meeting).

In my opinion, CF should strongly resist adding something to the standard that requires any programmatic parsing and interpretation of the attribute keys themselves. Complexities of parsing attributes aside, I'm also concerned about "breaking" ERDDAP. At the Ocean Sciences meeting, all the talks/town halls I went to about the technical implementation of the goals of the UN Ocean Decade featured ERDDAP somewhat heavily (if any data system was mentioned at all), and I think it is set to become the recommended way of serving data in national systems.

@rmendels

rmendels commented Mar 4, 2024

@DocOtak I didn't know we had become so popular!!!! :-)

More seriously, if I remember the lengthy discussion related to this (on a different list), about which Bob Simons knows a lot more than I do, part of the discussion had to do with problems in ERDDAP code and part with breaking clients (mostly ones traversing some structure), as well as with reading CDL files; I believe there were a few more examples.

@DocOtak
Member

DocOtak commented Mar 4, 2024

@rmendels Kevin O'Brien is quite the advocate.

I didn't really want to say "look at my proposal again" since I'm not too attached to it, but my feeling is that this discussion got stuck on what the best way to mangle attributes is, rather than on the possibility of alternatives.

Would folks (@turnbullerin @larsbarring @Dave-Allured others?) be willing to find time for a call to discuss/make progress?

@turnbullerin
Author

@DocOtak I am happy to make time for a call!

@larsbarring I'm also happy to work on the enhancement and pull request.

My thoughts haven't changed too much, but I agree with a number of key points made, which I'll outline below as a starting point:

  1. We generally agree the goal of the proposal is worthy: a mechanism for internationalizing attribute values at least is of value (data values seem to have fewer good use cases but may also be necessary)
  2. ERDDAP is growing in usage and popularity, and from discussions with them, making major changes in their supported character set seems challenging at best. From this, I infer that using a mechanism that can be supported by ERDDAP as it is today would be beneficial (enabling downstream work on ERDDAP to focus on localization rather than on expanding the character set).
  3. Localized attributes can fairly easily be extracted for programmatic use to localize a display of the underlying dataset

The discussion is now focused on the technical issues of how to implement this.

With this in mind, mangling the names in any way that requires expanding the character set beyond CF 1.10 is probably a no-go, as it goes against point 2 - ERDDAP won't be able to support these without significant issues.

This leaves us with two options for implementation for attributes:

A. Using a suffix or other alteration of the attribute name to identify them using existing character sets.
B. Another approach, such as the one proposed by @DocOtak

Personally, using variables to group together locale-related global attributes seems counter-intuitive to me - structurally they're in the wrong place, and for someone not familiar with CF's use of them it could be confusing. I wonder if a reasonable alternative would be to store triples that map an attribute name to a new attribute name in a given locale, e.g. as follows:

    :locale = "en";
    :localizations = "title fr title_fr;summary fr summary_fr;long_name fr long_name_fr";
    :title = "Title";
    :title_fr = "Titre";
    :summary = "Summary";
    :summary_fr = "Sommaire";

Maybe it would make the localizations attribute too long though? I don't know if there's a maximum length - could also compress it a bit by saying "locale1 en_attr1 fr_attr1 en_attr2 fr_attr2 ; locale2..."?
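
As a rough illustration of how that triple mapping could be consumed, here is a minimal Python sketch, assuming the attributes have already been read into a plain dict (the function name is just illustrative, not a proposed API):

    # Illustrative only: parse the semicolon-separated triples in "localizations"
    # into {locale: {base_attribute_name: localized_attribute_name}}.
    def parse_localization_triples(localizations):
        mapping = {}
        for triple in localizations.split(";"):
            base, locale, localized = triple.split()
            mapping.setdefault(locale, {})[base] = localized
        return mapping

    attrs = {
        "locale": "en",
        "localizations": "title fr title_fr;summary fr summary_fr;long_name fr long_name_fr",
        "title": "Title",
        "title_fr": "Titre",
    }
    table = parse_localization_triples(attrs["localizations"])
    print(attrs[table["fr"]["title"]])  # -> "Titre"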

I'm open to other ideas too! Maybe we can brainstorm other solutions, but I'm still leaning towards a clean and backwards-compatible mangling approach as the easiest to manage.

@larsbarring

Yes, I am happy to participate in a call. /Lars

@turnbullerin
Author

turnbullerin commented Jul 5, 2024

I am way behind on making the call happen, sorry :). Life of a national manager.

I looked a bit more at the parsing and processing side of things, though, and I am now leaning more strongly towards the suffix-based approach, but with user-defined suffixes - I think mandating .en-CA as the format for suffixes is what is most likely to break ERDDAP and other platforms, and I think there's little sense in us mandating how names are mangled (since that is what we are caught up in). There were no objections to a suffix-based approach in the ERDDAP thread in terms of complexity (in fact, Bob more strongly favored title_fr_CA as the format). ERDDAP isn't really set up to handle meta variables, from what I can see of its source code, and I think they would be far more difficult to implement on their end: ERDDAP basically only supports attributes directly on the global dataset or on a variable being output to the user, and a variable without data isn't well supported in the XML configuration options.

I think that by specifying the allowable suffixes and their meanings in an attribute itself, we aren't interpreting the attribute names themselves, merely the presence or absence of specific names (e.g. title_fr_CA only has meaning if _fr_CA is listed in the locales attribute as a valid suffix, so we are parsing the attribute content for that suffix and its meaning). I would agree that specific mangling patterns that have to be untangled by the application processing the file, without a supporting attribute, are too prone to confusion and parsing difficulties (e.g. title_fr_CA with no locales attribute to explain what _fr_CA means).

So, given the challenges I foresee ERDDAP having with meta variables, I would propose we move forward with an update based on suffixes. I'll prepare some sample text.
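
To sketch the limited parsing burden I have in mind (purely illustrative; the helper name is made up), an application only treats a name as localized when its suffix is declared in the locales attribute:

    # Illustrative only: a name counts as a localized variant only when its
    # suffix has been declared in the file's locales/localizations attribute.
    def declared_suffixes(locales_attr):
        # e.g. "default: en _fr_CA: fr-CA" -> {"_fr_CA": "fr-CA"}
        tokens = locales_attr.split()
        pairs = zip(tokens[0::2], tokens[1::2])
        return {key.rstrip(":"): tag for key, tag in pairs if key.rstrip(":") != "default"}

    suffixes = declared_suffixes("default: en _fr_CA: fr-CA")
    print(any("title_fr_CA".endswith(s) for s in suffixes))         # True  -> localized title
    print(any("wind_from_direction".endswith(s) for s in suffixes))  # False -> ordinary name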

@turnbullerin
Author

turnbullerin commented Jul 5, 2024

ADDITION TO 2.5 (prior to 2.5.1 heading, after the existing text)

Files that wish to provide localized (i.e. multilingual) versions of the content of variables shall reference section #TBD for details on how to do so.

ADDITION TO 2.6 (prior to 2.6.1 heading, after the existing text following 2.6)

Files that wish to provide localized (i.e. multilingual) versions of the content of attributes shall reference section #TBD for details on how to do so.

NEW SECTION

TBD. Localization

Certain attributes and variables in NetCDF files contain natural language text. Natural language text is written for a specific locale: this defines the language (e.g. English), the country (e.g. Canada), the script (e.g. the English alphabet), and/or other features of the natural language. Locales are defined by a "language tag" that follows the format specified in BCP 47 //link to BCP47 here//, such as en-CA for Canadian English in the default script. This section defines the standard pattern for localizing the contents of a NetCDF file.

Localization of attributes and variables is limited to natural language string values that are not taken from a controlled vocabulary. See Appendix A for recommendations on localization of CF attributes. Non-CF text attributes that use a natural language may also be localized using these rules. To localize an attribute or variable, an alternative version of it is supplied using a suffix for its name that is associated with the language tag.

TBD.1 Localized Files

A "localized file" is one that provides the global attribute localizations. If present, the attribute must contain a space-delimited list words in the format suffix: language_tag. For example, the string default: en _fr: fr-CA _es: es-MX specifies that the default locale of the file is en, that the suffix _fr indicates content in the fr-CA locale, and that the suffix _es indicates content in the es-MX locale. Suffixes may be any text string allowed in an attribute or variable name, but it is strongly recommended that they be chosen for clarity by making them clearly associated with the locale they represent.

The default locale should be chosen to represent the most complete set of attributes and variables; if only some of the natural language text attributes have localized versions, then the more complete language should be chosen as the default. Where there are two or more complete sets, the predominant language that the content was originally written in should be chosen.

An attribute or a variable in a localized file must not have a name ending with a locale suffix unless it is used to indicate the locale as per this section.

Applications that process NetCDF files are encouraged to apply BCP 47 in determining which content to show a user when localized content is available. When content is not available in a suitable locale for the user, the default locale should be used.
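
As an illustration only (not proposed normative text), a minimal Python sketch of parsing the localizations attribute described above might look like this:

    # Illustrative only: split "localizations" into the default language tag
    # and a mapping of suffix -> language tag.
    def parse_localizations(value):
        tokens = value.split()
        default_tag, suffix_map = None, {}
        for key, tag in zip(tokens[0::2], tokens[1::2]):
            key = key.rstrip(":")
            if key == "default":
                default_tag = tag
            else:
                suffix_map[key] = tag
        return default_tag, suffix_map

    default_tag, suffix_map = parse_localizations("default: en _fr: fr-CA _es: es-MX")
    print(default_tag)  # en
    print(suffix_map)   # {'_fr': 'fr-CA', '_es': 'es-MX'}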

TBD.2 Localized Attributes

Localized attributes are created by appending a locale suffix to the usual attribute name. For example:


variables:
    :localizations = "default: en-CA _fr: fr-CA _es: es-MX";
    :title = "English Title";        // English title 
    :title_fr = "Titre française";   // French title
    :title_es = "Título en español"; // Spanish title
    :summary = "English Summary";
    :summary_fr = "Sommaire français";
    // omitted Spanish summary means English will be used instead

    double salinity(i);
    salinity:long_name = "Salinity";
    salinity:long_name_fr = "Salinité";
    salinity:long_name_es = "Salinidad";
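
For illustration only (not proposed normative text): a Python sketch of resolving an attribute for a preferred locale, with fallback to the default locale, assuming the attributes have been read into a plain dict and using a simple primary-subtag comparison in place of full BCP 47 matching:

    # Illustrative only: return the attribute value in the preferred locale,
    # falling back to the default locale when no localized version exists.
    def localized_attr(attrs, name, preferred_tag, suffix_map):
        for suffix, tag in suffix_map.items():
            if tag == preferred_tag or tag.split("-")[0] == preferred_tag.split("-")[0]:
                if name + suffix in attrs:
                    return attrs[name + suffix]
        return attrs.get(name)  # default locale

    attrs = {"title": "English Title", "title_fr": "Titre français",
             "summary": "English Summary", "summary_fr": "Sommaire français"}
    suffix_map = {"_fr": "fr-CA", "_es": "es-MX"}
    print(localized_attr(attrs, "title", "fr-CA", suffix_map))    # Titre français
    print(localized_attr(attrs, "summary", "es-MX", suffix_map))  # English Summary (fallback)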

TBD.3 Localized Variables

Localized variables are created by appending a locale suffix to the variable name; note that this is only necessary where the data stored in the variable itself is localized and does not come from a controlled vocabulary. Natural language attributes for a localized variable should be provided in the locale of that variable. Localized versions of a variable must be of the same data type and dimensions and must contain the same number of elements appearing in the same order (i.e. weather_obs[0] is the English text and weather_obs_fr[0] is the French text of the same value).


variables:
    :localizations = "default: en-CA _fr: fr-CA _es: es-MX";

    string weather_obs(i);
    weather_obs:long_name = "Weather Conditions";

    string weather_obs_fr(i);
    weather_obs_fr:long_name = "Observations Météorologiques";

    string weather_obs_es(i);
    weather_obs_es:long_name = "Observaciones Meteorológicas";

data:
    weather_obs = "sunny", "rainy", ...;
    weather_obs_fr = "ensoleillé", "pluvieux", ...;
    weather_obs_es = "soleado", "lluvioso", ...;

ADDITION TO APPENDIX A

  • Add a column for "Locale-Aware" (Y or N) or maybe add a new data type of S for non-locale-aware string and S-L for locale-aware string?
  • Locale-aware string attributes:
    • comment
    • flag_meanings (? they have underscores but would be helpful ?)
    • history (? I feel like this will be complex since they are automatically updated but having a translated version of the history would be helpful ?)
    • institution
    • long_name
    • references
    • source
    • title

References
https://www.rfc-editor.org/info/bcp47
https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

EDIT NOTES:

  1. I added a note that variables must be the same type, dimensions, and size as each other
  2. I noted the format of cell_methods and updated mine to match

@turnbullerin
Author

I'd like to especially draw people's attention to the change in Appendix A I put above as there are some open questions there still that have not been answered.

@JonathanGregory
Contributor

JonathanGregory commented Jul 8, 2024

Dear Erin @turnbullerin

Thanks for your proposal. Although this issue started as a discussion, you're now making a definite proposal to change the convention. Therefore I think it would be appropriate if you began a new issue with this in the conventions repo.

Best wishes

Jonathan

@turnbullerin
Author

Will do!

@JonathanGregory
Contributor

Thanks, @turnbullerin. All interested in Erin's proposal, please comment on #528, and thanks for the discussion up to now.
