Tweak language content to address remaining checklist issues #136

fsteeg · 2023-09-12T14:22:07Z

First move language-related subsections into common section: 0da547f
Then the actual language content tweaks to address remaining checklist issues (in particular on valid / well-formed BCP 47 tags and IANA registry): d2a49fd

For details see #125 (where each item links to the relevant part of the W3C i18n docs).

awagner-mainz · 2023-09-14T09:53:51Z

I am sorry for being late to comment here. I am a bit worried that a use case of mine that has been possible to solve in past versions may end up no longer being so:

I have used a reconciliation service for a multilingual SKOS vocabulary to get not only the identifier for a concept, but also the preferredLabel in a language that I would specify, independently of the language of the data I had. This made it possible to normalize data where one field has been supplied in different languages. Say I have a field "subject matter" and it contains both German and Danish values for the same concept. I would ask the service to return reconciliation results in English. The service happily found the relevant entries among the labels in different languages it has for its concepts, and it returned the concept identifier and the preferredLabel in English, if available (and an empty label field if there was no English label). Besides the identifiers themselves, I could thus fill a new column in OpenRefine with the English preferred label for all the rows.
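The behaviour described above can be sketched roughly as follows. This is only a toy illustration, not how any particular reconciliation service is implemented: the `CONCEPTS` data, the `reconcile` function and the shape of the result are all hypothetical.

```python
# Hypothetical toy vocabulary: concept id -> {language tag: preferred label}.
CONCEPTS = {
    "c1": {"de": "Geschichte", "da": "historie", "en": "history"},
    "c2": {"de": "Musik", "da": "musik"},  # no English label available
}


def reconcile(term: str, label_language: str) -> list[dict[str, str]]:
    """Match `term` against labels in all languages; return labels in one language."""
    needle = term.casefold()
    results = []
    for concept_id, labels in CONCEPTS.items():
        # The language of the query term is not needed: every label is a candidate.
        if any(label.casefold() == needle for label in labels.values()):
            # Empty label if the concept has none in the requested language.
            results.append({"id": concept_id, "name": labels.get(label_language, "")})
    return results
```

Here `reconcile("Geschichte", "en")` and `reconcile("historie", "en")` would both yield the concept `c1` with the English label, while a match on `c2` would come back with an empty label.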

If I want to reproduce this now, I will obviously set the Accept-Language header to "en", but what and where should I specify as the text processing language?
In fact I want the query to compare the query term against text fields in more or less all the languages that the authority database has. I guess I have to set a text-processing language somewhere, be it only to avoid any one default language eventually defined by the authority data publisher.

Can I specify "*" as the text processing language along with my query term?
Would it make sense (and be legitimate) for the spec to declare that the default text processing language (if none is specified) for processing queries is "all the languages"?

Is it just me or is the W3C i18n Best Practices geared very much towards data publication rather than querying?

wetneb · 2023-09-14T10:17:50Z

How about not specifying any text processing language?

The Accept-Language: en header does not imply that the values you are supplying to the service are in the same language, I think.

awagner-mainz · 2023-09-14T10:34:47Z

> The Accept-Language: en header does not imply that the values you are supplying to the service are in the same language, I think.

But I want all results to supply the English label. Which labels would I get if I did not set the Accept-Language header?

> How about not specifying any text processing language?

The spec currently says: "If no explicit text-processing language is given, the metadata language (the language of the intended audience) provided first (see service definition) is considered the default text-processing language." If I did provide the "en" metadata language tag (see above), then that would make the reconciliation service consider only the English labels for matching, no?

fsteeg · 2023-09-14T12:05:56Z

> I am a bit worried that a use case of mine that has been possible to solve in past versions may end up no longer being so.

Anything that worked before should still be possible, since none of the language-related changes are mandatory. These are all SHOULD or MAY. Maybe we need to be clearer about that in the spec?

> If I did provide the "en" metadata language tag (see above), then that would make the reconciliation service consider only the English labels for matching, no?

No, it only means that the service should assume that the language of the intended audience is English (metadata language) and that the provided labels are in English (default text-processing language, if none is set). What the service does with that information, or if it needs it at all, is up to the service.
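As a rough sketch of the rule as worded in the draft quoted in this thread, a service could derive both kinds of language information like this. The names `parse_accept_language`, `language_info` and `LanguageInfo` are made up for illustration, and the header parsing is deliberately simplified.

```python
from typing import NamedTuple, Optional


class LanguageInfo(NamedTuple):
    metadata_languages: list[str]            # intended audience, may be several
    text_processing_language: Optional[str]  # language of the supplied strings


def parse_accept_language(header: str) -> list[str]:
    """Return the language tags of an Accept-Language header, best q-value first."""
    weighted = []
    for position, item in enumerate(header.split(",")):
        item = item.strip()
        if not item:
            continue
        tag, _, params = item.partition(";")
        quality = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        # Sort by descending quality, then by original position.
        weighted.append((-quality, position, tag.strip()))
    return [tag for _, _, tag in sorted(weighted)]


def language_info(accept_language: str, explicit_lang: Optional[str] = None) -> LanguageInfo:
    """Assumed rule: without an explicit lang, default to the first audience language."""
    audience = parse_accept_language(accept_language)
    default = audience[0] if audience else None
    return LanguageInfo(audience, explicit_lang or default)
```

The result is only information about the request. Whether a service restricts matching to that language, or ignores it and searches labels in all languages, remains up to the service.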

wetneb · 2023-09-14T12:14:58Z

> No, it only means that the service should assume that the language of the intended audience is English (metadata language) and that the provided labels are in English (default text-processing language, if none is set)

Maybe it makes sense to remove this last assumption, no?
In the context of OpenRefine, I would expect that we set the Accept-Language header to the language used by the user for the interface (or any other language set specifically for that service, if we have the UI for that), but that does not mean that the data they are working on is in that language. So I would prefer that services do not assume that this header is a sensible text-processing language.

fsteeg · 2023-09-14T12:36:03Z

> Maybe it makes sense to remove this last assumption, no?

The reason for that was basically this requirement from the checklist (#125):

> If there is only one language declaration for a resource, and it has more than one language tag as a value, it must be possible to identify the default text-processing language for the resource. -- #lang_mixing

So that's for the case of more than one language in the header, but it also addresses this:

> The specification should indicate how to define the default text-processing language for the resource as a whole. -- #lang_whole_res

The latter could be solved with a lang attribute in the manifest (like dir in #137), but from the former it seems like we would still have to address the case where we only have a header with more than one language.

I think we can just consider that default text-processing language as a hint. The service can use that information to process the passed values, but if it does not actually need info on the text-processing language, it can simply ignore the fact that there is a default text-processing language.

awagner-mainz · 2023-09-14T16:51:35Z

I am very sorry for not showing up in today's meeting. I was mistakenly under the impression that we would be meeting tomorrow. I apologize!

Having just read the minutes of today's meeting, I think that no longer interpreting the intended-audience language as the default text-processing language will definitely ease my worries. Everything else, like the pros and cons of presuming a text-processing language, how to reflect the intended-audience language in the query results, or example scenarios and best practices, is maybe better discussed on a wiki page or something like that.

But as we are already discussing this: Would it make sense to also reconsider the sentence "The lang value MUST be a single well-formed [BCP 47] language tag." in the beginning of section 8.3? Why should a query not indicate that it intends the query term to be processed in two languages? Again, this is much more about query terms than other fields. (And I acknowledge that I can either send two queries for the same term with two different languages, or not specify a text processing language in the request at all, thereby (hopefully, depending on the service) falling back to "all the languages".) Sorry if this is beating a dead horse.
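For reference, here is a simplified sketch of what "a single well-formed BCP 47 language tag" allows. It follows the `langtag` grammar of BCP 47 but leaves out grandfathered tags and tags that consist only of private-use subtags; the function name `is_well_formed` is made up for this example.

```python
import re

# Simplified BCP 47 "langtag" syntax (no grandfathered or private-use-only tags).
_LANGTAG = re.compile(
    r"""
    ^
    (?:[a-z]{2,3}(?:-[a-z]{3}){0,3} | [a-z]{4} | [a-z]{5,8})  # language (+ extlang)
    (?:-[a-z]{4})?                                            # script
    (?:-(?:[a-z]{2}|[0-9]{3}))?                               # region
    (?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*                  # variants
    (?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*                      # extensions
    (?:-x(?:-[a-z0-9]{1,8})+)?                                # private use
    $
    """,
    re.IGNORECASE | re.VERBOSE,
)


def is_well_formed(tag: str) -> bool:
    """Check syntax only; this does not look up subtags in the IANA registry."""
    return _LANGTAG.match(tag) is not None
```

Under this check, values such as "en" or "de-DE" pass, while "*" or a comma-separated list such as "de,da" do not. A tag can pass and still not be valid, since validity also requires that each subtag is listed in the IANA registry.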

fsteeg · 2023-09-15T11:08:16Z

> Why should a query not indicate that it intends the query term to be processed in two languages?

I think the main misunderstanding here is that the text-processing language is not an instruction of any kind that tells the service how or what to process, but information about the language that a specific string is in. A service can always decide not to care about the language of a given string, and e.g. search for matches in all languages. To quote from the W3C docs:

> So we are, by necessity, talking about associating a single language with the text, or some range of text, within the resource. Whereas the intended audience can be speakers of more than one language, a specific range of text can only be in one language at a time. -- W3C: Types of language declaration

We should probably make clear in the spec what the text-processing language actually is. Assigning myself ~~and switching this to a draft PR~~ (seems no longer possible, probably since it's been reviewed) for that, and for the removal of the default statement (plus an alternative for setting the text-processing language globally).

thadguidry · 2023-09-15T12:09:44Z

"Text-processing" sounds so ambiguous. We should maybe say "String language" or "human language of the string represented". Ideally, and more formally (since we're describing an API spec), I would rather call it "byte string", since for example a UTF-8 character can take up anywhere between 1 and 4 bytes.

fsteeg · 2023-12-12T12:28:44Z

Addressed the remaining issues here:

Make default text direction an explicit setting 70eb13f
Clarify requirements for language headers d2ca1e6
Define the two types of language declaration b400ac9

> "Text-processing" sounds so ambiguous. We should maybe say "String language" or "human language of the string represented".

I think we should stick with the terminology from the W3C best practice docs. I hope with the added definitions the ambiguity is gone and people reading the spec will understand what that is about.

fsteeg added 2 commits September 12, 2023 14:39

Move language-related subsections into common i18n section (#125)

0da547f

Tweak language content to address remaining checklist issues (#125)

d2a49fd

In particular on valid / well-formed BCP 47 tags and IANA registry

fsteeg requested a review from wetneb September 12, 2023 14:22

wetneb approved these changes Sep 12, 2023

fsteeg self-assigned this Sep 15, 2023

fsteeg added 3 commits December 12, 2023 11:55

Make default text direction an explicit setting in the manifest (#136)

70eb13f

Don't use the first language of the intended audience; see discussion at: #136 (comment)

Clarify requirements for language headers (#136)

d2ca1e6

Based on our discussion in the September meeting: https://etherpad.lobid.org/p/reconciliation-september-2023

Add section to define the two types of language declaration (#136)

b400ac9

Based on the W3C docs: https://www.w3.org/TR/international-specs/ & https://www.w3.org/International/questions/qa-text-processing-vs-metadata

wetneb approved these changes Dec 14, 2023

Merge remote-tracking branch 'origin/master' into 125-language

bf8446d

fsteeg merged commit 6cd7ce2 into master Dec 18, 2023
1 check passed