Tweak language content to address remaining checklist issues #136
In particular on valid / well-formed BCP 47 tags and IANA registry
I am sorry for being late to comment here. I am a bit worried that a use case of mine, which past versions could handle, may no longer be possible: I have used a reconciliation service for a multilingual SKOS vocabulary to get not only the identifier for a concept, but also the preferredLabel in a language that I would specify, independently of the language of the data I had. This made it possible to normalize data where one field has been supplied in different languages. Say I have a field "subject matter" that contains both German and Danish values for the same concept. I would ask the service to return reconciliation results in English. The service happily found the relevant entries among the labels it has for its concepts in different languages, and it returned the concept identifier and the preferredLabel in English, if available (and an empty label field if there was no English label). Besides the identifiers themselves, I could thus fill a new column in OpenRefine with the English preferred label for all the rows. If I want to reproduce this now, I will obviously set the
Is it just me, or are the W3C i18n Best Practices geared very much towards data publication rather than querying?
How about not specifying any text processing language? The
But I want all results to supply the English label. Which labels would be returned if Accept-Language is not set?
The spec currently says: "If no explicit text-processing language is given, the metadata language (the language of the intended audience) provided first (see service definition) is considered the default text-processing language." If I did provide the "en" metadata language tag (see above), then that would make the reconciliation service consider only the English labels for matching, no?
Anything that worked before should still be possible, since none of the language-related changes are mandatory. These are all SHOULD or MAY. Maybe we need to be clearer about that in the spec?
No, it only means that the service should assume that the language of the intended audience is English (metadata language) and that the provided labels are in English (default text-processing language, if none is set). What the service does with that information, or if it needs it at all, is up to the service.
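To make the use case from this thread concrete, here is a minimal sketch of a service that treats the text-processing language purely as a hint. It is not taken from the spec or from any existing service: the `Concept` class, the toy `VOCAB`, and the `reconcile` function with its parameter names are all hypothetical. It matches the query against labels in every language and returns the label in the language of the intended audience, or an empty string if there is none.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Concept:
    """Hypothetical vocabulary entry: an identifier plus one preferred label per language tag."""
    id: str
    labels: Dict[str, str] = field(default_factory=dict)


# Toy vocabulary, invented for illustration only.
VOCAB: List[Concept] = [
    Concept("c1", {"de": "Recht", "da": "Jura", "en": "Law"}),
    Concept("c2", {"de": "Chemie", "da": "Kemi"}),  # no English label
]


def reconcile(query: str, audience_lang: str,
              text_lang: Optional[str] = None) -> List[dict]:
    """Match `query` against the labels in all languages.

    `text_lang` only describes the language of the query string; this
    sketch deliberately ignores it and searches every language.
    `audience_lang` decides which label is returned with each match.
    """
    needle = query.casefold()
    results = []
    for concept in VOCAB:
        if any(label.casefold() == needle for label in concept.labels.values()):
            # Empty string when no label exists in the audience language.
            results.append({"id": concept.id,
                            "name": concept.labels.get(audience_lang, "")})
    return results
```

With this sketch, `reconcile("Recht", "en")` and `reconcile("Jura", "en", text_lang="da")` both resolve to the same concept with its English label, while `reconcile("Kemi", "en")` returns the identifier with an empty name.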
Maybe it makes sense to remove this last assumption, no?
The reason for that was basically this requirement from the checklist (#125):
So that's for the case of more than one language in the header, but it also addresses this:
The latter could be solved with a I think we can just consider that default text-processing language as a hint. The service can use that information to process the passed values, but if it does not actually need info on the text-processing language, it can simply ignore the fact that there is a default text-processing language.
I am very sorry for not showing up in today's meeting. I was mistakenly under the impression that we would be meeting tomorrow. I apologize! Having just seen the minutes of today's meeting, I think that no longer interpreting the intended audience language as the default text-processing language will definitely ease my worries. Everything else, like the pros and cons of presuming a text-processing language, how to reflect the intended audience language in the query results, or example scenarios and best practices, is maybe better discussed on a wiki page or something like that. But as we are already discussing this: would it make sense to also reconsider the sentence "The lang value MUST be a single well-formed [BCP 47] language tag." at the beginning of section 8.3? Why should a query not indicate that it intends the query term to be processed in two languages? Again, this is much more about query terms than other fields. (And I acknowledge that I can either send two queries for the same term with two different languages, or not specify a text-processing language in the request at all, thereby (hopefully, depending on the service) falling back to "all the languages".) Sorry if this is beating a dead horse.
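The first workaround mentioned in the parenthesis above (one query per language for the same term) could be sketched as follows. The batch layout and the field names `query` and `lang` are assumptions made for illustration and are not checked against the spec text; the helper `per_language_queries` is hypothetical.

```python
import json
from typing import Dict, List


def per_language_queries(term: str, langs: List[str]) -> Dict[str, dict]:
    """Build one query per language tag, so that each query carries a single tag.

    The keys (q0, q1, ...) and the field names are assumed, not taken from the spec.
    """
    return {f"q{i}": {"query": term, "lang": lang}
            for i, lang in enumerate(langs)}


batch = per_language_queries("Recht", ["de", "da"])
print(json.dumps(batch, indent=2))
```

The client would then have to merge the candidates returned for `q0` and `q1` itself, which is the extra effort a multi-language `lang` value would avoid.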
I think the main misunderstanding here is that the text-processing language is not an instruction of any kind that tells the service how or what to process, but a piece of information about the language that a specific string is in. A service can always decide not to care about the language of a given string and, for example, search for matches in all languages. To quote from the W3C docs:
We should probably make clear in the spec what the text-processing language actually is. Assigning myself
"Text-processing" sounds so ambiguous. Maybe we should say "string language" or "human language of the string represented". Ideally, and more formally (since we are describing an API spec), I would rather call it "byte string", since in UTF-8, for example, a character can take up anywhere from 1 to 4 bytes.
Don't use the first language of the intended audience; see discussion at: #136 (comment)
Based on our discussion in the September meeting: https://etherpad.lobid.org/p/reconciliation-september-2023
Addressed the remaining issues here:
I think we should stick with the terminology from the W3C best practice docs. I hope that with the added definitions the ambiguity is gone and people reading the spec will understand what that is about.
For details see #125 (where each item links to the relevant part of the W3C i18n docs).