Add zh_CN.yaml #6904

taotieren · 2020-12-01T10:55:57Z

No description provided.

jgm · 2020-12-01T17:10:25Z

Most of the translations don't have a country code; would it make sense to make this zh.yaml? Are there other zh localizations that differ significantly from zh_CN?

ickc · 2020-12-01T19:59:20Z

Chinese is at least divided into traditional and simplified, and this PR is for simplified.

Moreover, even within Traditional Chinese, usage in Taiwan is quite different from usage in Hong Kong.

In this case we probably don't need to localize that finely (say, HK vs. Taiwan), so simply distinguishing between traditional and simplified may be good enough.

One standard I have seen is zh-hant and zh-hans, where Han stands for 漢, and t and s are just short for traditional and simplified.

jgm · 2020-12-02T05:00:54Z

Will these names actually work with pandoc? I think pandoc is expecting a standard language + country code. I guess that the _Hant or _Hans is actually part of the language code (not country code), but I'm not sure all the tooling around pandoc can handle a lang with an underscore in it.

At https://stackoverflow.com/questions/4892372/language-codes-for-simplified-chinese-and-traditional-chinese it says that

There are several countries where Chinese is the main written language. The major difference between them is whether they use simplified or traditional characters, but there are also minor regional differences (in vocabulary, etc). The standard way to distinguish these would be with a country code, e.g. zh_CN for mainland China, zh_SG for Singapore, zh_TW for Taiwan, or zh_HK for Hong Kong.

Mainland China and Singapore both use simplified characters, and the others use traditional characters. Since China and Taiwan are the two with the biggest populations, just zh_CN and zh_TW are often used to distinguish the simplified and traditional character versions of a website.

More technically correct but not commonly used in practice, however, would be to use zh_HANS for (generic) simplified Chinese characters, and zh_HANT for traditional Chinese characters, except for rare cases when it is meaningful to distinguish different countries.

I think it would create fewer issues to use zh_CN, zh_TW, zh_HK, zh_SG -- or maybe just zh for the most generic version (e.g. if there are no important differences in this localization between CN and SG), and zh_TW for the traditional version?
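The region-to-script correspondence described in the quoted answer can be sketched as a small lookup table. This is only an illustration: the names `REGION_TO_SCRIPT` and `script_for` are hypothetical and are not part of pandoc, and the table covers only the four regions named in the quote.

```python
# Region-to-script table taken from the quoted Stack Overflow answer.
# REGION_TO_SCRIPT and script_for are hypothetical names, not pandoc identifiers.
REGION_TO_SCRIPT = {
    "CN": "Hans",  # mainland China: simplified characters
    "SG": "Hans",  # Singapore: simplified characters
    "TW": "Hant",  # Taiwan: traditional characters
    "HK": "Hant",  # Hong Kong: traditional characters
}


def script_for(tag):
    """Return 'Hans', 'Hant', or None for a tag such as 'zh_CN' or 'zh-Hant'."""
    parts = tag.replace("_", "-").split("-")
    if parts[0].lower() != "zh":
        return None
    for part in parts[1:]:
        # An explicit script subtag wins over a region-based guess.
        if part.title() in ("Hans", "Hant"):
            return part.title()
        if part.upper() in REGION_TO_SCRIPT:
            return REGION_TO_SCRIPT[part.upper()]
    return None  # bare "zh" or an unknown region: the script is unspecified
```

Note that a bare `zh` yields `None` here, which reflects the ambiguity under discussion rather than a decision about what `zh` should mean.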

mb21 · 2020-12-02T08:08:18Z

From https://pandoc.org/MANUAL.html#language-variables

(following the BCP 47 standard), such as en or en-GB.
The Language subtag lookup tool can look up or verify these tags.

Note that Mandarin vs. Cantonese etc. refers to the languages, while traditional vs. simplified Chinese refers to the scripts usually used to write those languages.

Language

From the Language subtag lookup, we see for example for Mandarin:

cmn: For use with the zh primary language subtag, ie. as the sequence zh-cmn. However it is usually preferable to replace that sequence with just the cmn primary language subtag. On the other hand, the primary language subtag zh is often preferred by legacy applications for Mandarin Chinese, rather than cmn or zh-cmn.

Script

from https://tools.ietf.org/html/bcp47

Many of these registered tags were made redundant by the advent of
either RFC 4646 or this document. A redundant tag is a grandfathered
registration whose individual subtags appear with the same semantic
meaning in the registry. For example, the tag "zh-Hant" (Traditional
Chinese) can now be composed from the subtags 'zh' (Chinese) and
'Hant' (Han script traditional variant). These redundant tags are
maintained in the registry as records of type 'redundant', mostly as
a matter of historical curiosity.

Interestingly, the Mozilla docs, for example, use zh-CN instead of cmn (also in the html lang attribute when inspecting the source): https://developer.mozilla.org/zh-CN/

The whole "Using Extended Language Subtags" section is worth a read: https://tools.ietf.org/html/bcp47#section-4.1.2, but basically:

This presents a choice of language tags where previously none existed:

Each encompassed language's subtag SHOULD be used as the primary
language subtag. For example, a document in Mandarin Chinese
would be tagged "cmn" (the subtag for Mandarin Chinese) in
preference to "zh" (Chinese).

If compatibility is desired or needed, the encompassed subtag MAY
be used as an extended language subtag. For example, a document
in Mandarin Chinese could be tagged "zh-cmn" instead of either
"cmn" or "zh".

The macrolanguage or prefixing subtag MAY still be used to form
the tag instead of the more specific encompassed language subtag.
That is, tags such as "zh-HK" or "sgn-RU" are still valid.
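The preference described in the quoted section (use the encompassed language as the primary subtag, while leaving tags like "zh-HK" valid) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, and the set lists only a few of the languages that BCP 47 describes as encompassed by 'zh', not the full registry.

```python
# A few languages that BCP 47 describes as encompassed by the macrolanguage
# 'zh'. The real registry contains more; this set is only for illustration.
ENCOMPASSED_BY_ZH = {"cmn", "yue", "gan", "wuu"}


def prefer_primary_subtag(tag):
    """Rewrite 'zh-cmn-...' as 'cmn-...'; leave every other tag unchanged."""
    parts = tag.split("-")
    if (
        len(parts) >= 2
        and parts[0].lower() == "zh"
        and parts[1].lower() in ENCOMPASSED_BY_ZH
    ):
        # Drop the 'zh' prefix and promote the extended language subtag.
        return "-".join([parts[1].lower()] + parts[2:])
    return tag
```

Tags such as `zh-HK` or `zh-Hant` pass through untouched, since their second subtag is a region or script rather than an extended language subtag.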

ickc · 2020-12-02T09:37:22Z

A bit more info:

As said above, there are differences in both the characters and the spoken languages. Simplified vs. traditional is about the characters. In this case the file is in Simplified Chinese and would be incorrect as Traditional Chinese. (I.e., we'd need both.)

But it is not entirely true that the spoken languages have no "localizations". The matter is quite complicated.

First, we can talk about formal written Chinese; in that case Mainland China's, Hong Kong's, and Taiwan's are quite different. (The first is in Simplified Chinese, the last two in Traditional. Perhaps Singaporean Chinese is also different, but I don't have experience with that.)

Then there are "less formal" varieties of Chinese, which can also be written, such as Cantonese in Hong Kong and the surrounding areas, and 台語 (Taiwanese) in Taiwan, which is different from Mandarin and is something like the second most spoken Chinese language in Taiwan.

And each of these spoken Chinese languages can be totally different in written form compared to the "formal Chinese" mentioned above.

A very good example comes from Chinese Bible translations. The CUV is like the "Chinese King James" and has simplified and traditional variants. There's another, 文理和合本, which is in ancient Chinese, sort of like Shakespearean English. Then there's a Cantonese Bible, which is completely different, and also a 台語聖經 in Taiwanese.

For simplicity, I think traditional vs. simplified is good enough for a start, because the lookup table is very simple here.

I can look into how pandoc should handle it a bit more tomorrow.

taotieren · 2020-12-02T16:52:55Z

Thanks. If necessary, I can contact Traditional Chinese (zh_TW, zh_HK) and Cantonese speakers to improve these translations together.

jgm · 2020-12-02T17:42:10Z

A few more details:

I checked, and the functions in Text.Pandoc.BCP47 do allow language variants.
So in principle you could use zh-Hans. (Note: use a hyphen, not an underscore.)
However, these translations would only be used if you set lang in metadata to zh-Hans (not if you had zh-CN).
In any case, if a country code is used, the file should use a hyphen rather than an underscore to separate it from the language code.
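The hyphen-versus-underscore point can be illustrated with a deliberately simplified tag parser. This sketch is not the behaviour of Text.Pandoc.BCP47 and does not implement full BCP 47; it only recognises a language, an optional four-letter script, and an optional region, and the names `parse_tag` and `format_tag` are hypothetical.

```python
def parse_tag(tag):
    """Split a tag like 'zh_Hans' or 'zh-hant-tw' into language, script, region.

    Simplified on purpose: extended language subtags, variants, and
    extensions are ignored.
    """
    parts = tag.replace("_", "-").split("-")  # accept underscores on input
    result = {"language": parts[0].lower(), "script": None, "region": None}
    for part in parts[1:]:
        is_script = len(part) == 4 and part.isalpha()
        is_region = (len(part) == 2 and part.isalpha()) or (
            len(part) == 3 and part.isdigit()
        )
        if is_script and result["script"] is None:
            result["script"] = part.title()  # e.g. 'hans' becomes 'Hans'
        elif is_region and result["region"] is None:
            result["region"] = part.upper()  # e.g. 'cn' becomes 'CN'
    return result


def format_tag(parsed):
    """Join the recognised subtags with hyphens, as the file name should be."""
    order = ("language", "script", "region")
    return "-".join(parsed[key] for key in order if parsed[key])
```

For example, `format_tag(parse_tag("zh_Hans"))` gives `zh-Hans`, which is the form a translation file name would need under the convention described above.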

It occurred to me that citeproc might not be expecting the variant tags (e.g. -Hans), but I think that can be fixed.

ickc · 2020-12-02T23:29:16Z

So let's stick with zh-Hans? I can make a zh-Hant PR.

jgm · 2020-12-02T23:36:22Z

However, these translations would only be used if you set lang in metadata to zh-Hans (not if you had zh-CN).

To clarify: if lang is zh-CN, then pandoc will first look for zh-CN.yaml and then fall back to using zh.yaml.
That's the drawback of using zh-Hans and zh-Hant -- neither will be used for users who specify lang: zh-CN, lang: zh, or lang: zh-HK, for example.
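The two-step lookup described here can be modelled in a few lines. This is a sketch of the behaviour as stated in this comment only (exact tag first, then the bare language), not pandoc's actual implementation; `candidate_files` and `pick_translation` are hypothetical names.

```python
def candidate_files(lang):
    """Translation files to try, most specific first, for a given lang value."""
    primary = lang.split("-")[0]
    names = [lang + ".yaml"]
    if primary != lang:
        names.append(primary + ".yaml")  # fall back to the bare language
    return names


def pick_translation(lang, available):
    """Return the first candidate present in `available`, or None."""
    for name in candidate_files(lang):
        if name in available:
            return name
    return None


# With only script-tagged files shipped, region-tagged users get no match.
shipped = {"zh-Hans.yaml", "zh-Hant.yaml"}
```

Under this model, `pick_translation("zh-CN", shipped)` finds nothing, because neither `zh-CN.yaml` nor `zh.yaml` exists, while `pick_translation("zh-Hans", shipped)` succeeds.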

ickc · 2020-12-02T23:42:34Z

I personally have only used zh-Hant and zh-Hans in the past. I think for written Chinese it is the simplest thing people will do.

The reason is that simplified Chinese and traditional Chinese have different "character sets" in Unicode (in the past they had their own character sets, such as Big5 and even Hong Kong variants of big5-*, and the simplified ones have confusingly many gb-* variants).

E.g., in choosing Chinese fonts in LaTeX, matching the font to the text you type in zh-Han(t|s) is very important, as many Chinese fonts cover only traditional or only simplified Chinese.

So for simplicity, maybe start with only zh-Hant and zh-Hans, and add more only when there's demand.

jgm · 2020-12-03T00:10:22Z

Well, if we use zh-Han(ts) then we might want to name one of them just zh, so zh-CN will fall back to it.
(Or we could symlink.)

ickc · 2020-12-03T00:33:13Z

A quick search on the internet can't determine what zh alone means.

Maybe someone else has declared it already, but if not, and we declare here that zh = zh-Hans, it is political and controversial.

Simplified Chinese is the work of the Chinese Communist Party in "recent" history, which simplified the characters somewhat, aiming to make them easier to learn but proven to show no real advantage in literacy rate; at the same time it destroys the history around those characters (like the Chinese version of studying etymology, etc.). The rest of the Chinese world, except probably only Singapore, still uses Traditional Chinese (not only other Chinese countries but also Chinese in other countries).

Practically speaking, the zh-Hant to zh-Hans mapping is many-to-one (surjective but not injective) as far as I know. Basically, when simplified Chinese was designed, multiple traditional Chinese characters were mapped to the same simplified Chinese character. So zh-Hant carries more information.

Hence, one possible approach would be to have zh-Hant only and use a library to convert it automatically. One example is OpenCC.
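To illustrate why the conversion is lossy in one direction, here is a tiny sketch with a hand-picked sample table. It does not use OpenCC and is not how OpenCC works internally; the table and function names are hypothetical, and a real converter relies on far larger tables plus phrase-level rules.

```python
from collections import defaultdict

# Hand-picked sample: each pair of traditional characters shares one
# simplified form. Real conversion tables are much larger.
T2S_SAMPLE = {
    "發": "发", "髮": "发",  # "to emit" and "hair"
    "後": "后", "后": "后",  # "after" and "queen"
    "麵": "面", "面": "面",  # "noodles" and "face"
}


def to_simplified(text):
    """Character-by-character traditional-to-simplified conversion."""
    return "".join(T2S_SAMPLE.get(ch, ch) for ch in text)


def reverse_candidates(table):
    """For each simplified character, collect every traditional source."""
    reverse = defaultdict(set)
    for traditional, simplified in table.items():
        reverse[simplified].add(traditional)
    return dict(reverse)
```

Going from traditional to simplified is a plain lookup, whereas the reverse direction yields several candidates per character and needs context to pick one, which is the sense in which zh-Hant carries more information.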

P.S. Of course, by popular vote zh-Hans will win, just because the PRC has more Chinese than anywhere else.

ickc · 2020-12-03T00:41:52Z

e.g. from https://tools.ietf.org/html/bcp47:

To provide compatibility, Chinese languages encompassed by the 'zh'
subtag are in the registry both as primary language subtags and as
extended language subtags. For example, the ISO 639-3 code for
Cantonese is 'yue'. Content in Cantonese might historically have
used a tag such as "zh-HK" (since Cantonese is commonly spoken in
Hong Kong), although that tag actually means any type of Chinese as
used in Hong Kong. With the availability of ISO 639-3 codes in the
registry, content in Cantonese can be directly tagged using the 'yue'
subtag. The content can use it as a primary language subtag, as in
the tag "yue-HK" (Cantonese, Hong Kong). Or it can use an extended
language subtag with 'zh', as in the tag "zh-yue-Hant" (Chinese,
Cantonese, Traditional script).

For example, the macrolanguage Chinese ('zh') encompasses a number of
languages. For compatibility reasons, each of these languages has
both a primary and extended language subtag in the registry. A few
selected examples of these include Gan Chinese ('gan'), Cantonese
Chinese ('yue'), and Mandarin Chinese ('cmn'). Each is encompassed
by the macrolanguage 'zh' (Chinese). Therefore, they each have the
prefix "zh" in their registry records. Thus, Gan Chinese is
represented with tags beginning "zh-gan" or "gan", Cantonese with
tags beginning either "yue" or "zh-yue", and Mandarin Chinese with
"zh-cmn" or "cmn". The language subtag 'zh' can still be used
without an extended language subtag to label a resource as some
unspecified variety of Chinese, while the primary language subtag
('gan', 'yue', 'cmn') is preferred to using the extended language
form ("zh-gan", "zh-yue", "zh-cmn").

Chinese ('zh') provides a useful illustration of this. In the past,
various content has used tags beginning with the 'zh' subtag, with
application-specific meaning being associated with region codes,
private use sequences, or grandfathered registered values. This is
because historically only the macrolanguage subtag 'zh' was available
for forming language tags. However, the languages encompassed by the
Chinese subtag 'zh' are, in the main, not mutually intelligible when
spoken, and the written forms of these languages also show wide
variation in form and usage.

As noted above, applications can choose to use the macrolanguage
subtag to form the tag instead of using the more specific encompassed
language subtag. For example, an application with large quantities
of data already using tags with the 'zh' (Chinese) subtag might
continue to use this more general subtag even for new data, even
though the content could be more precisely tagged with 'cmn'
(Mandarin), 'yue' (Cantonese), 'wuu' (Wu), and so on. Similarly, an
application already using tags that start with the 'ar' (Arabic)
subtag might continue to use this more general subtag even for new
data, which could be more precisely tagged with 'arb' (Standard
Arabic).

From these texts you can't see what zh alone can mean. In particular, the bold sentence seems to suggest that bare "zh" should be used only for historical purposes. Since this is a new "feature" here that no one else has relied on in pandoc before, maybe we should just expect people to use the more precise variants (with language subtags).

jgm · 2020-12-03T05:01:43Z

OK, I'll go with zh-Hans and zh-Hant then. Thanks!

Add zh_CN.yaml

9a9775c

taotieren added 2 commits December 2, 2020 10:21

Update zh_Hans.yaml

7fee12c

Update zh_Hans.yaml

abecbf8

ickc mentioned this pull request Dec 2, 2020

Closes #6904 #6909

Merged

Update zh-Hans.yaml

c8fb7b8

jgm closed this in #6909 Dec 3, 2020

jgm pushed a commit that referenced this pull request Dec 3, 2020

Add translations zh-Hans.yaml and zh-Hant.yaml

aab54c4

Closes #6904, closes #6909. Co-authored-by: taotieren <[email protected]>

stephen-huan mentioned this pull request Jul 5, 2024

Automatically update data/translations #9946

Merged
