Add zh_CN.yaml #6904

Closed
taotieren
Contributor

No description provided.

@jgm
Owner

jgm commented Dec 1, 2020

Most of the translations don't have a country code; would it make sense to make this zh.yaml? Are there other zh localizations that differ significantly from zh_CN?

@ickc
Contributor

ickc commented Dec 1, 2020

Chinese is at least divided into traditional and simplified, and this PR is for simplified.

Moreover, even within Traditional Chinese, usage in Taiwan is quite different from usage in Hong Kong.

In this case we probably don't need to localize that much (say HK vs. Taiwan), so simply distinguishing between traditional and simplified may be good enough.

One standard I saw is zh-hant and zh-hans, where the Han stands for 漢, and t and s are just short for traditional and simplified.

@jgm
Owner

jgm commented Dec 2, 2020

Will these names actually work with pandoc? I think pandoc is expecting a standard language + country code. I guess that the _Hant or _Hans is actually part of the language code (not country code), but I'm not sure all the tooling around pandoc can handle a lang with an underscore in it.

At https://stackoverflow.com/questions/4892372/language-codes-for-simplified-chinese-and-traditional-chinese it says that

There are several countries where Chinese is the main written language. The major difference between them is whether they use simplified or traditional characters, but there are also minor regional differences (in vocabulary, etc). The standard way to distinguish these would be with a country code, e.g. zh_CN for mainland China, zh_SG for Singapore, zh_TW for Taiwan, or zh_HK for Hong Kong.

Mainland China and Singapore both use simplified characters, and the others use traditional characters. Since China and Taiwan are the two with the biggest populations, just zh_CN and zh_TW are often used to distinguish the simplified and traditional character versions of a website.

More technically correct but not commonly used in practice, however, would be to use zh_HANS for (generic) simplified Chinese characters, and zh_HANT for traditional Chinese characters, except for rare cases when it is meaningful to distinguish different countries.

I think it would create fewer issues to use zh_CN, zh_TW, zh_HK, zh_SG -- or maybe just zh for the most generic version (e.g. if there are no important differences in this localization between CN and SG), and zh_TW for the traditional version?
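As a rough illustration of the region-versus-script distinction in the quoted answer, here is a minimal sketch that maps region-based tags to script-based ones. The table and the function name `script_tag_for` are hypothetical, cover only the four regions named above, and are not part of pandoc.

```python
# Hypothetical lookup table, limited to the regions named in the quoted answer.
REGION_TO_SCRIPT = {
    "CN": "Hans",  # mainland China: simplified characters
    "SG": "Hans",  # Singapore: simplified characters
    "TW": "Hant",  # Taiwan: traditional characters
    "HK": "Hant",  # Hong Kong: traditional characters
}


def script_tag_for(lang):
    """Map a region-based Chinese tag (e.g. 'zh_CN') to 'zh-Hans' or 'zh-Hant'.

    Tags that are not 'zh' plus a known region are returned unchanged,
    apart from replacing underscores with hyphens.
    """
    parts = lang.replace("_", "-").split("-")
    if len(parts) == 2 and parts[0].lower() == "zh":
        script = REGION_TO_SCRIPT.get(parts[1].upper())
        if script is not None:
            return "zh-" + script
    return "-".join(parts)


print(script_tag_for("zh_CN"))  # zh-Hans
print(script_tag_for("zh-HK"))  # zh-Hant
```

Bare zh is deliberately left alone here, since it names no script or region.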

@mb21
Collaborator

mb21 commented Dec 2, 2020

From https://pandoc.org/MANUAL.html#language-variables

(following the BCP 47 standard), such as en or en-GB.
The Language subtag lookup tool can look up or verify these tags.

Note that Mandarin vs. Cantonese etc. refers to the languages, while traditional vs. simplified Chinese refers to the scripts usually used to write those languages.

Language

From the Language subtag lookup, we see for example for Mandarin:

cmn: For use with the zh primary language subtag, ie. as the sequence zh-cmn. However it is usually preferable to replace that sequence with just the cmn primary language subtag. On the other hand, the primary language subtag zh is often preferred by legacy applications for Mandarin Chinese, rather than cmn or zh-cmn.

Script

from https://tools.ietf.org/html/bcp47

Many of these registered tags were made redundant by the advent of
either RFC 4646 or this document. A redundant tag is a grandfathered
registration whose individual subtags appear with the same semantic
meaning in the registry. For example, the tag "zh-Hant" (Traditional
Chinese) can now be composed from the subtags 'zh' (Chinese) and
'Hant' (Han script traditional variant). These redundant tags are
maintained in the registry as records of type 'redundant', mostly as
a matter of historical curiosity.


Interestingly, the Mozilla docs, for example, use zh-CN instead of cmn (also in the html lang attribute when inspecting the source): https://developer.mozilla.org/zh-CN/

The whole "Using Extended Language Subtags" section is worth a read: https://tools.ietf.org/html/bcp47#section-4.1.2, but basically:

This presents a choice of language tags where previously none existed:

  • Each encompassed language's subtag SHOULD be used as the primary
    language subtag. For example, a document in Mandarin Chinese
    would be tagged "cmn" (the subtag for Mandarin Chinese) in
    preference to "zh" (Chinese).

  • If compatibility is desired or needed, the encompassed subtag MAY
    be used as an extended language subtag. For example, a document
    in Mandarin Chinese could be tagged "zh-cmn" instead of either
    "cmn" or "zh".

  • The macrolanguage or prefixing subtag MAY still be used to form
    the tag instead of the more specific encompassed language subtag.
    That is, tags such as "zh-HK" or "sgn-RU" are still valid.

@ickc
Contributor

ickc commented Dec 2, 2020

A bit more info:

As said above, there are differences in both the characters and the spoken languages. Simplified vs. traditional is about the characters. In this case the file is in Simplified Chinese and would be incorrect for Traditional Chinese. (I.e., we'd need both.)

But it is not entirely true that the spoken languages have no "localizations". The matter is quite complicated.

First we can talk about formal written Chinese; in that case Mainland China's, Hong Kong's and Taiwan's are quite different. (The first is in Simplified Chinese, the last two in Traditional. Perhaps Singaporean Chinese is also different, but I don't have experience with that.)

Then there are less formal varieties of Chinese, which can also be written, such as Cantonese in Hong Kong and the surrounding areas, and 台語 (Taiwanese) in Taiwan, which is different from Mandarin and is roughly the second most spoken Chinese language there.

And each of these spoken Chinese languages can be totally different in written form compared to the "formal Chinese" mentioned above.

A very good example comes from Chinese Bible translations. The CUV is like the "Chinese King James" and has simplified and traditional variants. There's another, the 文理和合本, which is in ancient Chinese, sort of like Shakespearean English. Then there's a Cantonese Bible, which is completely different, and also a 台語聖經 (Taiwanese Bible) in that Taiwanese language.

For simplicity, I think traditional vs. simplified is good enough for a start, because the lookup table is very simple here.

I can look into how pandoc should handle it a bit more tomorrow.

@taotieren
Contributor Author

taotieren commented Dec 2, 2020 via email

@jgm
Owner

jgm commented Dec 2, 2020

A few more details:

I checked, and the functions in Text.Pandoc.BCP47 do allow language variants.
So in principle you could use zh-Hans. (Note: use a hyphen, not an underscore.)
However, these translations would only be used if you set lang in metadata to zh-Hans (not if you had zh-CN).
In any case, if a country code is used, the file should use a hyphen rather than an underscore to separate it from the language code.

It occurred to me that citeproc might not be expecting the variant tags (e.g. -Hans), but I think that can be fixed.
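To make the hyphen-versus-underscore point concrete, here is a minimal sketch of splitting a tag into language, script, and region. It is not the code in Text.Pandoc.BCP47; the function name `parse_lang` and the regular expression are assumptions, and the pattern handles only the simple `language[-Script][-REGION]` shape, not extended language subtags such as zh-cmn or the rest of BCP 47.

```python
import re

# Simplified pattern (an assumption, not the full BCP 47 grammar):
# a 2-3 letter language, an optional 4-letter script,
# and an optional 2-letter or 3-digit region.
_TAG = re.compile(
    r"^([A-Za-z]{2,3})(?:-([A-Za-z]{4}))?(?:-([A-Za-z]{2}|[0-9]{3}))?$"
)


def parse_lang(tag):
    """Split a tag such as 'zh-Hans' or 'zh_CN' into its subtags.

    Underscores are treated as hyphens so that legacy 'zh_CN' style
    input is accepted; the canonical form uses hyphens.
    """
    match = _TAG.match(tag.replace("_", "-"))
    if match is None:
        raise ValueError("unsupported language tag: " + repr(tag))
    language, script, region = match.groups()
    return {
        "language": language.lower(),
        "script": script.title() if script else None,
        "region": region.upper() if region else None,
    }


print(parse_lang("zh-Hans"))  # script is 'Hans', region is None
print(parse_lang("zh_CN"))    # script is None, region is 'CN'
```

The script subtag has four letters and the region has two, which is why -Hans belongs to a different slot than -CN.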

@ickc
Contributor

ickc commented Dec 2, 2020

So let's stick with zh-Hans? I can make a zh-Hant PR.

@jgm
Owner

jgm commented Dec 2, 2020

However, these translations would only be used if you set lang in metadata to zh-Hans (not if you had zh-CN).

To clarify: if lang is zh-CN, then pandoc will first look for zh-CN.yaml and then fall back to using zh.yaml.
That's the drawback of using zh-Hans and zh-Hant -- neither will be used for users who specify lang: zh-CN, lang: zh, or lang: zh-HK, for example.
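The lookup order described here can be sketched as follows. This is only an illustration of the described behavior (exact tag first, then the bare language), not pandoc's actual implementation; `candidate_files` and `pick_translation` are hypothetical names.

```python
def candidate_files(lang):
    """Return translation file names to try, most specific first."""
    primary = lang.split("-")[0]
    candidates = [lang + ".yaml"]
    if primary != lang:
        # Fall back to the bare language, e.g. zh.yaml for zh-CN.
        candidates.append(primary + ".yaml")
    return candidates


def pick_translation(lang, available):
    """Return the first candidate present in `available`, or None."""
    for name in candidate_files(lang):
        if name in available:
            return name
    return None


shipped = {"zh-Hans.yaml", "zh-Hant.yaml"}
print(pick_translation("zh-Hans", shipped))             # zh-Hans.yaml
print(pick_translation("zh-CN", shipped))               # None
print(pick_translation("zh-CN", shipped | {"zh.yaml"})) # zh.yaml
```

With only the two script-based files present, lang: zh-CN finds nothing, which is the drawback mentioned above.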

@ickc ickc mentioned this pull request Dec 2, 2020
@ickc
Contributor

ickc commented Dec 2, 2020

I personally have only used zh-Hant and zh-Hans in the past. I think for written Chinese it is the simplest thing people will do.

The reason is that simplified and traditional Chinese have different "character sets" in Unicode (in the past they had their own encodings, such as Big5 and even Hong Kong variants of Big5, and the simplified side had confusingly many GB-* variants).

For example, when choosing Chinese fonts in LaTeX, matching the font to the script you type in (zh-Hant or zh-Hans) is very important, as many Chinese fonts cover only traditional or only simplified Chinese.

So for simplicity, maybe start with only zh-Hant and zh-Hans, and add more only when there's demand.

@jgm
Owner

jgm commented Dec 3, 2020

Well, if we use zh-Han(ts) then we might want to name one of them just zh, so zh-CN will fall back to it.
(Or we could symlink.)

@ickc
Contributor

ickc commented Dec 3, 2020

A quick search on the internet doesn't settle what zh alone means.

Maybe someone else has defined it already, but if not, and we declare here that zh = zh-Hans, it is political and controversial.

Simplified Chinese is the work of the Chinese Communist Party in "recent" history which simplified the characters somewhat, aimed to be easier to learn but proven to show no real advantage in literacy rate; and at the same time it destroys the history around those characters (like the Chinese version of studying etymology etc.) The rest of the Chinese world, except probably only Singapore, still uses Traditional Chinese (not only other Chinese countries but Chinese in other countries.)

Practically speaking, the zh-Hant to zh-Hans mapping is many-to-one (not injective) as far as I know. Basically, when simplified Chinese was designed, multiple traditional Chinese characters were mapped to the same simplified Chinese character. So zh-Hant carries more information.

Hence, one possible approach would be to have zh-Hant only and use a library to convert it automatically. One example is OpenCC.
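A tiny illustration of why the conversion only works cleanly in one direction. The three-pair table below is a hand-picked toy example, not OpenCC's data or API, and the function names are hypothetical.

```python
# Toy table: several traditional characters share one simplified form.
T2S = {
    "發": "发",  # "to emit"
    "髮": "发",  # "hair"
    "後": "后",  # "after"
    "后": "后",  # "queen"
    "麵": "面",  # "noodles"
    "面": "面",  # "face"
}


def to_simplified(text):
    """Traditional -> simplified: a plain per-character lookup."""
    return "".join(T2S.get(ch, ch) for ch in text)


def traditional_candidates(simplified_char):
    """Simplified -> traditional: may return more than one candidate."""
    return {t for t, s in T2S.items() if s == simplified_char}


print(to_simplified("發髮"))              # 发发
print(traditional_candidates("发"))       # two candidates: 發 and 髮
```

Going from traditional to simplified is a simple lookup in this sketch, while the reverse needs context to choose among the candidates.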

P.S. Of course, by popular vote zh-Hans will win, just because the PRC has more Chinese speakers than anywhere else.

@ickc
Contributor

ickc commented Dec 3, 2020

e.g. from https://tools.ietf.org/html/bcp47:

To provide compatibility, Chinese languages encompassed by the 'zh'
subtag are in the registry both as primary language subtags and as
extended language subtags. For example, the ISO 639-3 code for
Cantonese is 'yue'. Content in Cantonese might historically have
used a tag such as "zh-HK" (since Cantonese is commonly spoken in
Hong Kong), although that tag actually means any type of Chinese as
used in Hong Kong. With the availability of ISO 639-3 codes in the
registry, content in Cantonese can be directly tagged using the 'yue'
subtag. The content can use it as a primary language subtag, as in
the tag "yue-HK" (Cantonese, Hong Kong). Or it can use an extended
language subtag with 'zh', as in the tag "zh-yue-Hant" (Chinese,
Cantonese, Traditional script).

For example, the macrolanguage Chinese ('zh') encompasses a number of
languages. For compatibility reasons, each of these languages has
both a primary and extended language subtag in the registry. A few
selected examples of these include Gan Chinese ('gan'), Cantonese
Chinese ('yue'), and Mandarin Chinese ('cmn'). Each is encompassed
by the macrolanguage 'zh' (Chinese). Therefore, they each have the
prefix "zh" in their registry records. Thus, Gan Chinese is
represented with tags beginning "zh-gan" or "gan", Cantonese with
tags beginning either "yue" or "zh-yue", and Mandarin Chinese with
"zh-cmn" or "cmn". The language subtag 'zh' can still be used
without an extended language subtag to label a resource as some
unspecified variety of Chinese, while the primary language subtag
('gan', 'yue', 'cmn') is preferred to using the extended language
form ("zh-gan", "zh-yue", "zh-cmn").

Chinese ('zh') provides a useful illustration of this. In the past,
various content has used tags beginning with the 'zh' subtag, with
application-specific meaning being associated with region codes,
private use sequences, or grandfathered registered values. This is
because historically only the macrolanguage subtag 'zh' was available
for forming language tags. However, the languages encompassed by the
Chinese subtag 'zh' are, in the main, not mutually intelligible when
spoken, and the written forms of these languages also show wide
variation in form and usage.

As noted above, applications can choose to use the macrolanguage
subtag to form the tag instead of using the more specific encompassed
language subtag. For example, an application with large quantities
of data already using tags with the 'zh' (Chinese) subtag might
continue to use this more general subtag even for new data, even
though the content could be more precisely tagged with 'cmn'
(Mandarin), 'yue' (Cantonese), 'wuu' (Wu), and so on.
Similarly, an application already using tags that start with the 'ar' (Arabic)
subtag might continue to use this more general subtag even for new
data, which could be more precisely tagged with 'arb' (Standard
Arabic).

From these passages it isn't clear what zh alone means. The bold sentence in particular seems to suggest that bare "zh" should be used only for historical reasons. Since this is a new "feature" here that no one has relied on in pandoc before, maybe we should just expect people to use the more precise variants (with language subtags).

@jgm jgm closed this in #6909 Dec 3, 2020
jgm pushed a commit that referenced this pull request Dec 3, 2020
@jgm
Owner

jgm commented Dec 3, 2020

OK, I'll go with zh-Hans and zh-Hant then. Thanks!

4 participants