Base direction for annotations #224

Closed
r12a opened this issue May 17, 2016 · 25 comments
Labels
i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response.

Comments

@r12a

r12a commented May 17, 2016

In addition to language information, each annotation may need an optional indicator of overall base direction.

For example, the following annotation will not display correctly unless the application doing the display knows that the base direction needs to be rtl. (As it is, the 'W3C' will appear to the right, as shown here, rather than to the left of the Hebrew.)

{
  "@context": "http://www.w3.org/ns/anno.jsonld",
  "id": "http://example.org/anno5",
  "type":"Annotation",
  "body": {
    "type" : "TextualBody",
    "text" : "<p>פעילות הבינאום, W3C</p>",
    "format" : "text/html",
    "language" : "he",
    "direction" : "rtl"
  },
  "target": "http://example.org/photo1"
}
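
A minimal sketch of how a displaying application might use such a property, assuming it renders the body by wrapping it in an HTML element. The `wrap_body` helper is hypothetical and not part of any specification; it only shows the base direction being handed to the renderer as a `dir` attribute.

```python
import html
import json

# The annotation from the example above, with a valid comma after "language".
ANNOTATION = json.loads("""
{
  "@context": "http://www.w3.org/ns/anno.jsonld",
  "id": "http://example.org/anno5",
  "type": "Annotation",
  "body": {
    "type": "TextualBody",
    "text": "<p>פעילות הבינאום, W3C</p>",
    "format": "text/html",
    "language": "he",
    "direction": "rtl"
  },
  "target": "http://example.org/photo1"
}
""")


def wrap_body(body: dict) -> str:
    """Wrap the body markup in a <div> carrying lang and dir attributes.

    Hypothetical helper: the body text is inserted as-is because its format
    is text/html; a real application would sanitize it first.
    """
    attrs = []
    if "language" in body:
        attrs.append(f'lang="{html.escape(body["language"], quote=True)}"')
    if body.get("direction") in ("ltr", "rtl", "auto"):
        attrs.append(f'dir="{body["direction"]}"')
    attr_text = (" " + " ".join(attrs)) if attrs else ""
    return f"<div{attr_text}>{body['text']}</div>"


print(wrap_body(ANNOTATION["body"]))
```

Without the `direction` value the wrapper has nothing to put in `dir`, and the renderer falls back to its own default base direction.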
@iherman
Member

iherman commented May 17, 2016

Can't we rely on the (HTML) content of the "text" to provide anything that is needed? (Discussed on the F2F meeting 17.05.2016)

@azaroth42
Collaborator

Can we add the information to additional html tags in the text field? e.g.

<p><span xml:lang="...">...</span> <span xml:lang="en">W3C</span></p>

@csarven
Member

csarven commented May 17, 2016

@azaroth42 That's interesting to explore, but I think it might put additional stress (a requirement?) on normalisation of the data. HTML would go with lang="en", XHTML would go with xml:lang="en", and polyglot markup would perhaps need both. This is on top of the point that perhaps rdf:HTML and rdf:XMLLiteral blocks should not be preserved as such. The same goes for the direction.

There is a tradeoff somewhere :) I have some preference to: "don't touch the source" but "enrich" via adding language and direction to the annotation. I realise that comes across a bit clumsy.

@csarven
Member

csarven commented May 17, 2016

I think it is worthwhile to also keep in mind future hashing of the original content and matching that with what's in the annotation. They won't match if the annotation is adding information into the text that's not at the source. In that case, it would require integrity checks to also normalize before comparing the hashes.

@iherman
Member

iherman commented May 18, 2016

Discussed F2F 18.05.2016: accept by adding the relevant term for directionality.

RESOLUTION: Add a direction property to the vocabulary, to be associated with any content resource (body or target) with three possible values, auto, rtl and ltr (in JSON-LD) and define URIs to identify the concepts. Refer back to HTML5 document for the definitions.

See: http://www.w3.org/2016/05/18-annotation-irc#T07-56-24
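
The resolution above could be checked by a consumer along these lines. This is only a sketch: the `check_direction` helper and its error wording are hypothetical, and the only things taken from the resolution are the property name `direction` and its three JSON-LD values.

```python
# The three values named in the resolution.
ALLOWED_DIRECTIONS = frozenset({"auto", "ltr", "rtl"})


def check_direction(resource: dict) -> list[str]:
    """Return a list of problems with a body or target's direction property."""
    if "direction" not in resource:
        return []  # the property is optional
    value = resource["direction"]
    # Values are compared exactly; "RTL" is not treated as "rtl" here.
    if not isinstance(value, str) or value not in ALLOWED_DIRECTIONS:
        return [f"unexpected direction value: {value!r}"]
    return []


print(check_direction({"type": "TextualBody", "direction": "rtl"}))
print(check_direction({"type": "TextualBody", "direction": "RTL"}))
```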

@azaroth42
Collaborator

Also CSS for the same values: https://developer.mozilla.org/en/docs/Web/CSS/direction

CSVW defines the ltr/rtl instances: https://www.w3.org/ns/csvw#instance-definitions
And an unusable predicate: https://www.w3.org/ns/csvw#textDirection (due to the overly restrictive domain in the ontology). So we need to duplicate it.

So I propose additions to context and vocab:

"csvw" : "https://www.w3.org/ns/csvw#"
"direction": "wa:textDirection",
"ltr" : "csvw:ltr",
"rtl": "csvw:rtl",
"auto": "csvw:auto"

and to decrement 🍻 owed to @gkellogg by one :)

Also, please note there is a discrepancy between: https://www.w3.org/TR/tabular-metadata/ which says the range of textDirection is a string, and the actual vocabulary, which defines a range of csvw:Direction (of which rtl etc are instances).

🍻 owed-- again

@iherman
Member

iherman commented May 19, 2016

On 19 May 2016, at 18:22, Rob Sanderson [email protected] wrote:

Also CSS for the same values: https://developer.mozilla.org/en/docs/Web/CSS/direction
CSVW defines the ltr/rtl instances: https://www.w3.org/ns/csvw#instance-definitions
And an unusable predicate: https://www.w3.org/ns/csvw#textDirection (due to the overly restrictive domain in the ontology). So we need to duplicate it.

So I propose additions to context and vocab:

"csvw" : "https://www.w3.org/ns/csvw#"
"direction": "wa:textDirection",
"ltr" : "csvw:ltr",
"rtl": "csvw:rtl",
"auto": "csvw:auto"

This is a bit of bikeshedding, but… I am not sure it is a good idea to bring in a new namespace for these three. Yes, I know: reuse namespaces if necessary, it does not count for the JSON-LD, etc., etc. Nevertheless, I am not 100% sure it is worth bringing a new one into the Turtle. I have a slight (emphasis: slight) preference for duplicating these three values in our own namespace.

Ivan


@azaroth42
Collaborator

I'm easy either way. If we didn't have to mint a new predicate I would be more strongly in favor of reuse, but as we need our own copy of textDirection, we can just as easily redefine the values as well.

@azaroth42 azaroth42 self-assigned this May 20, 2016
@r12a
Author

r12a commented May 20, 2016

[removed incorrect comment]

CSVW defines the ltr/rtl instances: https://www.w3.org/ns/csvw#instance-definitions

i don't remember seeing this before. I have some issues with the definitions (apart from the typos 'Determins' and 'Indiects'). Do you need me to elaborate them here?

@iherman
Member

iherman commented May 20, 2016

On 20 May 2016, at 09:36, Rob Sanderson [email protected] wrote:

I'm easy either way. If we didn't have to mint a new predicate I would be more strongly in favor of reuse, but as we need our own copy of textDirection, we can just as easily redefine the values as well.

Then I would definitely propose to do that.

@iherman
Member

iherman commented May 20, 2016

On 20 May 2016, at 10:27, r12a [email protected] wrote:

CSVW defines the ltr/rtl instances: https://www.w3.org/ns/csvw#instance-definitions
i don't remember seeing this before. I have some issues with the definitions (apart from the typos 'Determins' and 'Indiects'). Do you need me to elaborate them here?

Well… if there are issues with the CSVW stuff, then this should be an erratum on the CSVW errata page. But I think it is now accepted that we would use our own terms in the case of annotation, so it is irrelevant for this thread...

@iherman
Member

iherman commented May 20, 2016

On 20 May 2016, at 11:05, Ivan Herman [email protected] wrote:

On 20 May 2016, at 10:27, r12a [email protected] wrote:

CSVW defines the ltr/rtl instances: https://www.w3.org/ns/csvw#instance-definitions
i don't remember seeing this before. I have some issues with the definitions (apart from the typos 'Determins' and 'Indiects'). Do you need me to elaborate them here?

Well… if there are issues with the CSVW stuff, then this should be an erratum on the CSVW errata page. But I think it is now accepted that we would use our own terms in the case of annotation, so it is irrelevant for this thread...

Oops, I just see that you have done that, sorry for the noise!

@gkellogg
Member

@azaroth42 said:

Also, please note there is a discrepancy between: https://www.w3.org/TR/tabular-metadata/ which says the range of textDirection is a string, and the actual vocabulary, which defines a range of csvw:Direction (of which rtl etc are instances).

Note that the CSVW metadata document is JSON, compatible with JSON-LD, so in that document, the values for direction must be specified as strings. However, this is not inconsistent with the RDF interpretation being an object; thus the instance definitions and range limitation.
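
The point about strings in JSON versus objects in RDF can be sketched as follows. This is a deliberately simplified, hypothetical `expand_term` helper, not a JSON-LD processor: a real processor does considerably more, and whether a string value is expanded to an IRI at all depends on how the property is typed in the context. The `csvw` mappings are the ones proposed earlier in this thread and are used here only to illustrate the mechanism.

```python
# Prefix and value terms as proposed earlier in the thread.
CONTEXT = {
    "csvw": "https://www.w3.org/ns/csvw#",
    "ltr": "csvw:ltr",
    "rtl": "csvw:rtl",
    "auto": "csvw:auto",
}


def expand_term(term: str, context: dict) -> str:
    """Expand a plain JSON string to a full IRI via a prefix-style context.

    Simplified assumption: a term maps to "prefix:local", and the prefix
    maps to a namespace IRI. Unknown terms are returned unchanged.
    """
    mapped = context.get(term, term)
    prefix, sep, local = mapped.partition(":")
    if sep and prefix in context:
        return context[prefix] + local
    return mapped


# The JSON document carries the string "rtl"; the RDF reading is a resource.
print(expand_term("rtl", CONTEXT))
```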

@gsergiu

gsergiu commented May 24, 2016

@r12a shouldn't the text direction be easily identified from the script codes?
http://unicode.org/iso15924/iso15924-codes.html

@gsergiu

gsergiu commented May 24, 2016

It seems to me that this ticket started from a wrong example:

    "text" : "<p>פעילות הבינאום, W3C</p>",
    "format" : "text/html",
    "language" : "en",
    "direction" : "rtl"

In the provided example, the language is not "en" according to my understanding, but Hebrew, with some English text (unless the provided Hebrew text means W3C ... in which case we would have 50% en and 50% Hebrew).

I think it is dangerous to make standardization decisions starting from wrong examples, and proposing incomplete solutions.

  1. Probably you want to say that W3C should be understood as "en" text, even if it would match any Latin-based language+script, but that doesn't change the true nature of the text!
  2. The text is written in 2 languages and 2 scripts, and the proposed "base direction" change doesn't solve the problem of not being able to correctly represent the text. See also the analysis I submitted to "exactly 0 or 1 language(s)" #213.

@r12a
Author

r12a commented May 24, 2016

the use of 'en' in the example was a mistake. I have corrected it (to 'he').

@r12a
Author

r12a commented May 24, 2016

@r12a shouldn't the text direction be easily identified from the script codes?

It's a question we're asked a lot, and the answer is no. As a hack it sometimes works to infer the base direction from the language information when no real direction information is available, but it's unreliable. Note in particular that BCP47 strongly encourages you not to use script tags unless necessary to distinguish usages, and in fact has a mechanism to indicate that script should not be used with certain language tags. Moreover, language and direction are not the same thing. For example, how would you express 'auto' with language tags?
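
To make the 'auto' case concrete: with auto, the base direction is guessed from the content itself rather than from any language tag. A minimal sketch of one such heuristic, first-strong detection over plain text, is below. The function name is hypothetical, and this is a simplification of what HTML does for dir="auto" (it ignores markup and embedded elements, so it should be given text content rather than the HTML string).

```python
import unicodedata


def first_strong_direction(text: str, default: str = "ltr") -> str:
    """Guess a base direction from the first strongly directional character.

    Characters with bidi class L give "ltr"; classes R and AL give "rtl".
    Digits, spaces and punctuation are neutral or weak and are skipped.
    """
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "L":
            return "ltr"
        if bidi in ("R", "AL"):
            return "rtl"
    return default  # no strong character found


print(first_strong_direction("פעילות הבינאום, W3C"))
print(first_strong_direction("W3C, פעילות הבינאום"))
```

The two calls use the same words in a different order and give different answers, which is why a heuristic like this cannot replace an explicit rtl or ltr value supplied by the author.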

@gsergiu

gsergiu commented May 24, 2016

if the language is "he" don't we already know that the text is RTL?
http://www.i18nguy.com/temp/rtl.html

Are there any languages that use both RTL and LTR?
Probably not, but even if ... wouldn't this be solved by the script code?

@gsergiu

gsergiu commented May 24, 2016

  1. well .. obviously there are some languages for which you need to know the "script" for correct representation. Probably Japanese is the best known example. Meaning that ... there will be implementations that will use the script part of the language encoding. I think this is a fact. We cannot and we should not prevent this.
  2. I do recognize that I don't like adding the script in the language either. One could consider whether it makes more sense to use only language and country codes in the language field and to put the script in its own field.
  3. However my basic question is whether the script code is not the "richer" information for correct representation of the texts?

If the language + script code clearly indicates the writing direction, I would suggest adding the script code information to the annotation and not the "text direction", which will, in this case, be redundant information.

@r12a
Author

r12a commented May 24, 2016

if the language is "he" don't we already know that the text is RTL?

Not if the text is transcribed in Latin script or some other script (the authors of BCP47 were explaining how that works to someone just last week, as it happens). Yes, people should use a script tag in that case, but nothing forces them to.

Are there any languages that use both RTL and LTR?

Yes. For example, Azerbaijani.

Language is not the same thing semantically as direction: there are different parameters to its use, and the places where you need to use it are different. We have been going around this tree for years, please just trust us.

@gsergiu

gsergiu commented May 24, 2016

well ... I recognize that I'm not the expert in the field, that's why I ask questions.
I agree with you, that the language is not the one that dictates the direction, but I assume that the scripts are.

If I understood it correctly ... we have languages which use both RTL scripts (mainly based on the Arabic alphabet) and LTR scripts (probably all others).
https://en.wikipedia.org/wiki/Azerbaijani_alphabet

So .. this is my basic question. Does the language + script clearly identify the direction of the text?
If yes, I consider the textDirection to be redundant information. I don't really mean that we shouldn't have such a field, but I want the standard to state clearly what the relationship is, and what to do if both exist but are inconsistent.

My personal preference would be that language+script should be "the master" as this is already an ISO standard.

@gsergiu

gsergiu commented May 25, 2016

@r12a .. just to conclude the analysis, not the solution .. I would like to ask the following question:

Are the language + script codes (e.g. az-Arab | az-Cyrl | az-Latn) sufficient for correct representation of the text, including the font selection and the direction of the text?
(if yes ... then the main question of this issue turns into: Do we need redundancy for easier processing of the annotations?)

(here some references for others that want to provide feedback:
Script codes: http://unicode.org/iso15924/iso15924-codes.html
i18n QA: http://www.i18nguy.com/temp/rtl.html
W3C i18n script subtag recommendations: https://www.w3.org/International/articles/language-tags/#script
IANA subtag registry: http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
W3C i18n recommendations/QA on language tags: https://www.w3.org/International/questions/qa-choosing-language-tags )

@r12a
Author

r12a commented May 25, 2016

@gsergiu please see my earlier comments. Here are four reasons i mentioned why you can't conflate language and direction; i may be able to come up with more: (1) you can't produce the auto value with language tags, (2) BCP47 recommends that you do not use script tags for languages like Hebrew (Suppress-Script: Hebr), (3) you won't be able to rely on people supplying script tags as part of the language information in order to influence direction, (4) these are semantically separate concepts. Another reason that i alluded to but didn't expand on is that if you apply direction to inline content, it becomes even clearer that we are dealing with different things, because the usage patterns don't overlap.

Are the language + script codes (e.g. az-Arab | az-Cyrl | az-Latn) sufficient for correct representation of the text, including the font selection and the direction of the text?

that would be a no, then.
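
Reasons (1) to (3) above can be illustrated with a small sketch. It assumes a hypothetical `direction_from_tag` helper that looks only for a script subtag, plus a tiny hand-written set of right-to-left scripts that is illustrative rather than complete. Treating every other script as ltr is a crude assumption made for brevity.

```python
# Illustrative subset of right-to-left scripts; not a complete list.
RTL_SCRIPTS = {"Arab", "Hebr"}


def direction_from_tag(tag: str):
    """Return "rtl" or "ltr" if the tag carries a script subtag, else None.

    Simplified reading of a language tag: after the primary language,
    skip 3-letter extended language subtags, then accept a 4-letter
    alphabetic script subtag. Anything else ends the search.
    """
    for subtag in tag.split("-")[1:]:
        if not (subtag.isascii() and subtag.isalpha()):
            break
        if len(subtag) == 3:
            continue  # extended language subtag
        if len(subtag) == 4:
            return "rtl" if subtag.title() in RTL_SCRIPTS else "ltr"
        break
    return None


for tag in ("az-Arab", "az-Latn", "az", "he"):
    print(tag, direction_from_tag(tag))
```

Tags such as "az" and "he" carry no script subtag, so the lookup has nothing to go on, and no tag at all can produce the value auto.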

@gsergiu

gsergiu commented May 25, 2016

Well .. I just tried to make the analysis of the issue.

  1. So .. with the current specification we have a way to correctly represent the text, but you claim that this is not the recommended way of doing it (on which I agree: it was a bad idea from the beginning to mix the language with the script concepts).
  2. However, in the case that we don't want to include the script tag in the language, I would say that the script tag should have its own property, in which case the textDirection is redundant, as it can be unambiguously derived from the script code (and possibly the language information). Additionally the script code can help clients to choose the correct fonts for representing the text, while the text direction does not help in this matter.
    Moreover, I think that the default script for each language can be derived from the "Suppress-Script" field in the IANA registry: http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

PS:
I'm not trying to impose a solution, I just wanted to analyze the problem and existing solutions/standards (and I'm trying myself to derive my own unbiased opinion).
The community can adopt the most appropriate solution, but I claim that the solution must solve both problems: text direction and font selection. It should also take into account the reuse of standards and best practices.

@iherman
Member

iherman commented May 26, 2016

At the meeting with the I18N WG (2016-05-26) the Anno WG reiterated that it intends to follow the direction as advised by the I18N WG. The decision in #224 (comment) sticks.

See http://www.w3.org/2016/05/26-i18n-irc#T15-13-10


7 participants