exactly 0 or 1 language(s) #213
Comments
Actually, that should probably read: We do not expect that it makes sense to have more than one language value, since we assume that the language property indicates what we call the 'text-processing' language for the annotation body, rather than the language(s) of the intended reader. (The text-processing language is what is used for automatic font assignment, spellchecking, hyphenation, line-breaking in CJK, etc.) |
The language SHOULD is usually interpreted as "do something different at your peril". The MAY language means that there isn't a formal requirement for any specific number. So, there's nothing wrong with the language from the formal perspective. The suggested "read better" would not be an editorial change. The second proposed alternative is even worse, because it makes it equally recommended to have no language associated. The example given before is an mp3 file. What if the file contains a text followed by its interpretation? That would seem to be a case where associating it with a single language may not be the best alternative. |
In conclusion, I suggest closing the issue without change. |
Having exchanged brief emails with Ivan, i think we may need to take a step back and come at this from a wider perspective. Ivan told me
So the i18n WG's initial assumptions were incorrect. Let me try to outline why i'm concerned. If an application is going to use the language value provided to perform an operation on the text, it often needs to know what language the text is actually in. For example, such an operation might be running a spellchecker, pronouncing the text in a voice browser, applying hyphenation, case conversion, line breaking and other language-sensitive actions, applying fonts, etc. In these cases it's problematic if you have a list of languages as the value of your The i18n WG tends to refer to another type of language annotation as 'metadata'. This typically indicates the intended linguistic audience of the resource as a whole, and it's possible to imagine that this could, for a multilingual resource, involve a property value that is a list of languages. It may be that the 'language' property when referring to a target is of the metadata kind (since it's informative, the target is not being operated on, and the target ought to have its own text-processing language declarations), whereas it may be more useful to see the language of the body as of the text-processing kind, since that kind of information can be used to indicate to a voice browser how to pronounce the annotation, or to a graphical browser how to break lines of text when displaying the annotation, etc.(?) In order to know how to specify the content of the values for the Hopefully that clarifies our frame of reference, although it doesn't yet provide a clear way forward. (There will, of course, be an additional question wrt text-processing language declarations, in that a content author may need to indicate that parts of the annotation are in different languages, though i'm not clear how much of an issue it will be for annotations if that level of detail is not provided. It may not be a common use case, or one that causes major difficulties if missing(?) 
However, it is usually important to have at least a default idea of which language to assume for purposes of processing the text of the annotation, in order to manage the text when it comes to display or use.) |
If language annotation can be (is required to be) fine grained enough For bulk content (a whole book, a whole movie) limiting the annotation A./ |
well ... there are many documents that have multiple languages; in Europeana there are ~4 million records out of 53 that are marked as containing multiple languages. I think targets and bodies must support multiple languages, even if there are some drawbacks, given that it is not clear which parts of the text are written in one language and which in another. |
to complete an action from the i18n WG, i wrote a summary of what i understand wrt use cases for language information in web annotations (with help from @fsasaki ). See |
Thanks @r12a for writing all this down. And, at the moment, I am torn.
I believe the way forward is to say something like: "an annotation SHOULD have zero or one language terms, and MAY have more than 1 in exceptional cases." @gsergiu's use case may be quoted in an informal note where the MAY comes into effect, but we should also note that implementations/users should really try to use one language, because otherwise problems may occur. And stop there... |
On 5/23/2016 9:27 AM, Ivan Herman wrote:
SHOULD already implies that one has a good reason for a different The original language had it right: "The Body or Target SHOULD have exactly 1 language associated with it, As this seemed contradictory to some, perhaps what is needed is an "The Body or Target SHOULD have exactly 1 language associated with it, (My example for "0" may not be what was intended, so just fix accordingly). |
My 2c:
Then if there's the case when there are multiple languages and there's a need to specify which one to use for text processing, there's somewhere to do it. However for the simple (and frequent) case of a single language, then the client knows it should use the language property rather than repeat it in both fields. Thoughts? |
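A minimal sketch of how a client might apply this proposal, assuming a body is represented as a dict with a `language` value (a string or a list) and an optional `processingLanguage` string. The property names come from this thread; the function name and the fallback rule are illustrative assumptions, not wording from the draft.

```python
from typing import Optional


def text_processing_language(body: dict) -> Optional[str]:
    """Pick the language a client would use for spellchecking, hyphenation, etc.

    Hypothetical helper: returns None when the choice is ambiguous.
    """
    # An explicit processingLanguage takes priority when present.
    explicit = body.get("processingLanguage")
    if explicit:
        return explicit

    languages = body.get("language", [])
    # Accept a bare string as well as a list of language tags.
    if isinstance(languages, str):
        languages = [languages]

    # Exactly one language: no need to repeat it in processingLanguage.
    if len(languages) == 1:
        return languages[0]

    # Zero languages, or several without a hint: leave the decision to the client.
    return None
```

With this rule, `{"language": ["en"]}` resolves to `"en"`, while `{"language": ["en", "ru"]}` stays unresolved until a `processingLanguage` is supplied.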
That is an acceptable compromise.
|
Dear all,

I would like to give a simple synthesis of the problem from the implementation point of view.

Facts:
· There are many web resources that use multiple languages (and of course we want everything to be annotatable)
· Many of these resources do not even use metadata or markup to advertise the languages used

As the goal is to be able to annotate everything, we can even take into consideration the worst-case scenario, in which resources include texts in multiple languages but we don’t know which languages are used.

Expected user behavior:
· I think that the majority of users would agree to add the list of languages used when creating annotations (mainly for retrieval purposes)
· I don’t think there will be many users willing to mark all parts of the texts with the correctly identified language, but there will be use cases in which this is needed
· Audio browsers might be nice and important, especially for blind people, but I doubt that they are able to correctly read texts in any language, especially old languages (I’m not sure if we have readers that are able to read Latin or Old German, for example, which are frequently used in Europeana resources). See http://www.europeana.eu/portal/record/92080/FCBC03581F63DA47F920E30CF3000212D7A476F1.html

Analysis:
Proposed Approach:
a. Open question, do we really need text direction if we have the script code? Cannot the text direction be derived from the script code #224 ?
a. As written above, I think that the best way is to have a special (robust?) selector for adding the missing i18n information! Just let the body have a clean representation, which is human and machine friendly (as opposed to browser friendly and human/machine unfriendly; the JSON representation should be JSON and not HTML or other markup). Br, Sergiu |
PS: personally I would prefer to have the script code in a separate field (for normalization purposes), but it seems this is not the RFC way of doing it... (However, implementations can perform this normalization, especially for search purposes) |
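The normalization mentioned in the PS could look roughly like the sketch below, which pulls the script subtag out of a BCP 47 language tag. It is a simplified illustration: the function name is hypothetical, and the code assumes a well-formed tag, relying on the fact that the script subtag is the only four-letter alphabetic subtag that may follow the primary language before any extension or private-use singleton. It does not validate the tag against the IANA registry.

```python
from typing import Optional


def script_subtag(tag: str) -> Optional[str]:
    """Return the script subtag of a BCP 47 tag (e.g. 'Hant'), or None if absent."""
    subtags = tag.split("-")
    # A tag that starts with a singleton (e.g. private use 'x-...') has no script.
    if len(subtags[0]) == 1:
        return None
    for sub in subtags[1:]:
        # A singleton starts an extension or private-use sequence: stop looking.
        if len(sub) == 1:
            return None
        # Script subtags are exactly four ASCII letters; normalize to title case.
        if len(sub) == 4 and sub.isascii() and sub.isalpha():
            return sub.title()
    return None
```

For example, `script_subtag("zh-Hant-TW")` gives `"Hant"`, while `script_subtag("en-US")` gives `None` because the tag carries no script subtag.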
@azaroth42 's proposal cuts to the centre of what i see as the problem, which is that given a list of languages it's ambiguous which to use for the default text-processing language, so if it's workable to have the processingLanguage property i think that would probably solve the issue. Just a suggestion: for additional clarity, it may help to add some wording along the lines proposed by Asmus, such as: |
@r12a @azaroth42 We should first make clear what this is used for. Is it used only to be able to select "some" text-processing algorithms to apply to the text, or is it intended to select the "correct" text-processing algorithms? Who should add this information to the annotation? Is it the end user (in the general case I doubt that this is the user's responsibility), or the implementation (we might think that the first entry that matches some rules from the language property can be copied into processingLanguage), or both? If you have ~40% of the text in Russian and ~60% in English, what do we do? Should we say that the text should be processed only by Russian NLP or only by English NLP? I would expect it to be processed by both. |
Hi all, I believe that "processingLanguage" will not solve the issue, as it is still necessary to choose one of the languages (if more than one exists), which may not be possible. If the issue is for the client application to decide whether the text fits the language of the display, then it can just check whether that language is one of the languages in dc:language and accept that part of the text may be in a different language from the one the user has selected. If the issue is for software processing the text to apply a specific NLP, then it can either still try to apply it and accept that the results may not be the best, or just ignore it, as there is not sufficient information to apply it. Best regards, Hugo |
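The client-side check described above can be sketched as follows. This is an assumption-laden illustration: the function name is hypothetical, the annotation's languages are taken to be a plain list of BCP 47 tags, and matching is done on the primary language subtag only, which is a much cruder rule than full BCP 47 matching.

```python
from typing import List


def matches_display_language(display_language: str, annotation_languages: List[str]) -> bool:
    """True if the user's display language is among the annotation's languages.

    Compares primary language subtags only, case-insensitively, so 'en-GB'
    matches 'en'. Part of the text may still be in another listed language.
    """
    wanted = display_language.split("-")[0].lower()
    return any(tag.split("-")[0].lower() == wanted for tag in annotation_languages)
```

A client showing an English interface would accept an annotation tagged `["ru", "en"]` under this rule, knowing that some of the text is Russian.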
Correct. Given that we have a list of languages provided by the annotation creator, why do we need a "default processing language"? Isn't it the responsibility of the clients to decide which of the given languages should/can be used by the NLP algorithms, which are known only to the client? I don't think that the annotation creator should normatively enforce a processing language (in any case, including the one-language scenario). |
For the record, I won't object against the resolution proposed at |
Discussed on a joint call with the I18N WG on 2016-05-26; the resolution is to accept the proposal in: #213 (comment) |
@hugomanguinhas, @gsergiu: It is also important to remember that there's the possibility of using HTML or other serialization that can record language, fonts, and so forth within the target or body resource. Then it is up to the rendering client to process that as specified by the format's specification. |
Well ... if the solution is to add some redundancy because some people/scenarios need it, I have no problem with that, given that these fields are not mandatory. As I indicated above, the "script code" part of RFC 5646 is the key information needed by these algorithms. While this bit of information is still valid to add in the "language" property (at least according to the current specifications), this is not the recommended way to do it. Was this aspect discussed? Following the other things that got their own fields, like "text direction", I would claim that the "script code" should also be explicitly represented in the annotations. If no decision/recommendation was taken in this direction, I would be glad to create a new ticket. BR, |
We did discuss, on a slightly more general level, that this solution will not cover all the possible cases and, because the format of the body and target is completely open-ended, it is impossible to cover them. It was agreed that this solution covers the vast majority of the cases (the magic 80/20 cut…) and we would stop there. |
Ok .. thanks for the answer. Does this imply that the WA recommendation will say "use the script code in the language tag, if you need it"? ... I can live with this, but it is against the i18n recommendations, and I would say it is worth writing an informative note on it. Br, |
I don't think we need to say that explicitly, it's part of BCP47 and hence is available for use along with all of the other features. |
well ... this is a kind of implicit recommendation, which it would make sense to add as a note, especially given the 20% mentioned by @iherman that are not covered by the current solution of using processingLanguage and textDirection. These 20% are covered by placing the script code in the language tag, but currently there is no mention of the script code in the WA draft (I suppose). I would propose adding a note for the language like: http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry |
https://www.w3.org/TR/2016/WD-annotation-model-20160331/#bodies-and-targets
The i18n WG thought this would read better as "The Body or Target SHOULD have exactly 0 or 1 language(s) associated with it." Current wording seems a little odd.