Refine Sort Ordering for zh-CN #1

Open
drmacro opened this issue Oct 18, 2016 · 12 comments

@drmacro
Contributor

drmacro commented Oct 18, 2016

From Toshihiko Makita in a post to the DITA User's list:

Here is the excerpt from org.dita-community.i18n-develop/src/main/resources/lookup-zh-cn.xml.

<?xml version="1.0" encoding="UTF-8"?>
<lookupTable>
   <item key="公安" value="gong1 an1"/>
   <item key="公安官员" value="gong1 an1 guan1 yuan2"/>
   <item key="公安机关" value="gong1 an1 ji1 guan1"/>
   <item key="公安局" value="gong1 an1 ju2"/>
   <item key="公案" value="gong1 an4"/>
   <item key="功败垂成" value="gong1 bai4 chui2 cheng2"/>
   <item key="公办" value="gong1 ban4"/>
   <item key="宫保鸡丁" value="gong1 bao3 ji1 ding1"/>
   <item key="公报" value="gong1 bao4"/>
   <item key="宫爆鸡丁" value="gong1 bao4 ji1 ding1"/>
   <item key="宫爆肉丁" value="gong1 bao4 rou4 ding1"/>
   <item key="公报私仇" value="gong1 bao4 si1 chou2"/>
   <item key="弓背" value="gong1 bei4"/>
   <item key="公倍式" value="gong1 bei4 shi4"/>
   <item key="公倍数" value="gong1 bei4 shu4"/>
   <item key="工笔" value="gong1 bi3"/>
   <item key="攻砭" value="gong1 bian1"/>
   <item key="工兵" value="gong1 bing1"/>
   <item key="公秉" value="gong1 bing3"/>
   <item key="公布" value="gong1 bu4"/>
   <item key="功不可没" value="gong1 bu4 ke3 mo4"/>
   <item key="公布栏" value="gong1 bu4 lan2"/>
   <item key="供不应求" value="gong1 bu4 ying4 qiu2"/>
</lookupTable>

This data contains the pinyin from "gong1 an1" to "gong1 bu4". I converted it into a DITA topic.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topic
  PUBLIC "-//OASIS//DTD DITA Topic//EN" "topic.dtd">
<topic id="topic_dxr_ymr_qx" xml:lang="zh-CN">
   <title>indexterm testing</title>
   <prolog>
      <metadata>
         <keywords>
            <indexterm>公安</indexterm>
            <indexterm>公安官员</indexterm>
            <indexterm>公安机关</indexterm>
            <indexterm>公安局</indexterm>
            <indexterm>公案</indexterm>
            <indexterm>功败垂成</indexterm>
            <indexterm>公办</indexterm>
            <indexterm>宫保鸡丁</indexterm>
            <indexterm>公报</indexterm>
            <indexterm>宫爆鸡丁</indexterm>
            <indexterm>宫爆肉丁</indexterm>
            <indexterm>公报私仇</indexterm>
            <indexterm>弓背</indexterm>
            <indexterm>公倍式</indexterm>
            <indexterm>公倍数</indexterm>
            <indexterm>工笔</indexterm>
            <indexterm>攻砭</indexterm>
            <indexterm>工兵</indexterm>
            <indexterm>公秉</indexterm>
            <indexterm>公布</indexterm>
            <indexterm>功不可没</indexterm>
            <indexterm>公布栏</indexterm>
            <indexterm>供不应求</indexterm>
         </keywords>
      </metadata>
   </prolog>
   <body>
      <p>Hello World!</p>
   </body>
</topic>

The indexterm elements are already sorted in pinyin order, so I guess that you will get the same index order in your DITA-OT publishing result. However, this order is not accurate, because entries that begin with the same character are separated from each other in the output. In my implementation the order follows this rule: "If the (Chinese) character is the same, it should be sorted by pinyin/strokes/radical/GB0 code". The original collation rule for Simplified Chinese (zh-CN-sort-rules.txt) is based on this rule.

@drmacro
Contributor Author

drmacro commented Oct 18, 2016

Can you point to some authority, either online or a book I can buy, that explains this rule? Doing a Google search I couldn't find anything more detailed than "Sort by pinyin".

I'm also not sure I understand the error you're asserting: All the words in the sample data above have distinct pinyin, so they will all sort appropriately--I'm not seeing any ambiguity that would need to then be resolved by stroke count or radical. But maybe I'm missing what you're seeing.

@ToshihikoMakita

ToshihikoMakita commented Oct 18, 2016

The following definition will be useful:

Han Collation
[http://site.icu-project.org/design/alphabetic-index]

  1. Pinyin: compare according to the pinyin for each character. If the pinyin is the same, compare by stroke order.
  2. Stroke: compare according to the total strokes for each character. If the total strokes are the same, compare by radical-stroke order.
  3. Radical-Stroke: compare according to the radical-stroke for each character. If these are the same, compare by code point order.
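The fallback chain above can be sketched as a chained comparator over single characters. This is only an illustration, not ICU's implementation: the `HanChar` record and `HanCharOrder` class are hypothetical names, and the stroke counts and Kangxi radical numbers are hand-entered for four characters that share the reading "gong1", so they should be checked against a real data source before use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical per-character record; the values used below are hand-entered for illustration.
class HanChar {
    final String ch;
    final String syllable;      // pinyin without the tone digit
    final int tone;             // 1 to 4
    final int totalStrokes;
    final int radical;          // Kangxi radical number
    final int residualStrokes;  // strokes beyond the radical

    HanChar(String ch, String syllable, int tone, int totalStrokes, int radical, int residualStrokes) {
        this.ch = ch;
        this.syllable = syllable;
        this.tone = tone;
        this.totalStrokes = totalStrokes;
        this.radical = radical;
        this.residualStrokes = residualStrokes;
    }
}

class HanCharOrder {
    // pinyin, then total strokes, then radical-stroke, then code point
    static final Comparator<HanChar> ORDER =
        Comparator.comparing((HanChar c) -> c.syllable)
            .thenComparingInt(c -> c.tone)
            .thenComparingInt(c -> c.totalStrokes)
            .thenComparingInt(c -> c.radical)
            .thenComparingInt(c -> c.residualStrokes)
            .thenComparingInt(c -> c.ch.codePointAt(0));

    // Returns the characters concatenated in sorted order.
    static String sortChars(List<HanChar> chars) {
        List<HanChar> copy = new ArrayList<>(chars);
        copy.sort(ORDER);
        StringBuilder sb = new StringBuilder();
        for (HanChar c : copy) {
            sb.append(c.ch);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<HanChar> chars = Arrays.asList(
            new HanChar("功", "gong", 1, 5, 19, 3),
            new HanChar("弓", "gong", 1, 3, 57, 0),
            new HanChar("公", "gong", 1, 4, 12, 2),
            new HanChar("工", "gong", 1, 3, 48, 0));
        System.out.println(sortChars(chars)); // prints 工弓公功
    }
}
```

All four characters tie on pinyin, so the stroke count decides most of the order; 工 and 弓 also tie on stroke count, so the radical number separates them.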

@drmacro drmacro self-assigned this Oct 18, 2016
@drmacro
Contributor Author

drmacro commented Oct 18, 2016

I understand those rules to be for determining group ordering and grouping, not collation of the words.

In particular, you can't order words by comparing the individual characters in isolation.

So maybe we're talking about different things?

I can only see a need to fall back to stroke ordering or radical-stroke in the case where the same sequence of characters produces exactly the same pinyin and so far I haven't seen that case.

One thing I haven't worked out yet is simply how to set up primary, secondary, and tertiary sort keys for the ICU4J API (or if that's even sensible--it may not be).
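As background for the question about primary, secondary, and tertiary levels, here is a small sketch of what the strength setting does. It uses the JDK's `java.text.Collator` with an English locale as a stand-in so that it runs without extra libraries; ICU4J's `com.ibm.icu.text.Collator` also has `setStrength` and constants with the same names. It does not show how to supply custom per-level keys, and `StrengthDemo` is a hypothetical class name.

```java
import java.text.Collator;
import java.util.Locale;

class StrengthDemo {
    // Returns true when the two strings compare as equal at the given strength.
    static boolean equalAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(strength);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // Case is a tertiary difference: ignored at PRIMARY, significant at TERTIARY.
        System.out.println(equalAt(Collator.PRIMARY, "abc", "ABC"));  // prints true
        System.out.println(equalAt(Collator.TERTIARY, "abc", "ABC")); // prints false
    }
}
```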

@ToshihikoMakita

ToshihikoMakita commented Oct 19, 2016

> I understand those rules to be for determining group ordering and grouping, not collation of the words.
> In particular, you can't order words by comparing the individual characters in isolation.
> So maybe we're talking about different things?

It will be better to show you an actual example. Please download the picture from the following URL:

https://1drv.ms/i/s!AkbL99fLhxKU1B99yCUl-NOsj0mC

This picture is taken from "A Chinese-English-Japanese-Korean Computer Dictionary '2001" published in China by ZHONGHUA BOOK COMPANY (中華書局).

You can see that all of the words are sorted character by character, not by the pinyin of the whole word. This is the natural order for Chinese or Japanese.

> One thing I haven't worked out yet is simply how to set up primary, secondary, and tertiary sort keys for the ICU4J API (or if that's even sensible--it may not be).

In my understanding, this is used mainly for languages written with alphabets, such as Latin. For Han ideographs, every character is distinguished by a primary difference.

If you want to confirm this, the following Java code writes the Java and ICU collation rules to UTF-8 text files. Contrary to what I said above, I found that tertiary differences are used in the zh-CN collation rules in ICU4J.

import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Locale;

public class CollatorRuleOutput {

    /**
     * Writes the JDK and ICU4J collation rules for a locale to UTF-8 text files.
     *
     * @param args args[0]: language code, for example "zh" or "zh-CN"
     */
    public static void main(String[] args) {
        if (args.length < 1) {
            System.err.println("Usage: CollatorRuleOutput <language-code>");
            return;
        }
        String prmLang = args[0]; // Language code
        // forLanguageTag handles both "zh" and "zh-CN"
        Locale locale = Locale.forLanguageTag(prmLang);

        // getInstance is declared to return Collator; the cast assumes a rule-based implementation.
        java.text.RuleBasedCollator javaCollator = (java.text.RuleBasedCollator) java.text.Collator.getInstance(locale);
        writeToFile(prmLang + "-sort-rules-java.txt", javaCollator.getRules());

        com.ibm.icu.text.RuleBasedCollator icuCollator = (com.ibm.icu.text.RuleBasedCollator) com.ibm.icu.text.Collator.getInstance(locale);
        writeToFile(prmLang + "-sort-rules-icu.txt", icuCollator.getRules());
    }

    public static void writeToFile(String fileName, String content) {
        // try-with-resources closes the writer even when write() fails
        try (Writer out = Files.newBufferedWriter(Paths.get(fileName), StandardCharsets.UTF_8)) {
            out.write(content);
        } catch (IOException e) {
            // Report the failure instead of silently ignoring it
            System.err.println("Could not write " + fileName + ": " + e.getMessage());
        }
    }

}

@drmacro
Contributor Author

drmacro commented Oct 24, 2016

So if I'm understanding you correctly, you're saying that the sort order for zh-CN is based only on the first character of a multi-character sequence, even when the character has one pinyin transliteration when used in isolation and another when it is the first character of a multi-character word?

That would make sense in that it then allows for non-dictionary-based collation but it is inconsistent with what I understood to be the rules for zh-CN sorting.

I wonder if part of the issue with the current code is that I'm getting the pinyin for the longest sequence of characters that match something in the dictionary, rather than using the actual words in the text. That could definitely result in different sort orders.

Now that I have the zh-CN word break iterator working I can try using that to get the pinyin by individual words. In many cases that would have the same effect as using the first character in isolation. Of course, it is also dependent on the correctness of the word-recognition algorithm, which I have no ability to evaluate.
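The "longest sequence of characters that match something in the dictionary" approach mentioned above can be sketched as a greedy longest-match lookup. This is an illustration of the idea, not the project's code: `LongestMatchPinyin` and its methods are hypothetical names, the table holds only four entries copied from the lookup-zh-cn.xml excerpt, and the loop indexes by `char`, so it assumes text without supplementary-plane characters.

```java
import java.util.HashMap;
import java.util.Map;

class LongestMatchPinyin {
    // A few entries copied from the lookup-zh-cn.xml excerpt.
    static final Map<String, String> LOOKUP = new HashMap<>();
    static int maxKeyLength = 0;

    static {
        add("公安", "gong1 an1");
        add("公安局", "gong1 an1 ju2");
        add("公布", "gong1 bu4");
        add("公布栏", "gong1 bu4 lan2");
    }

    private static void add(String key, String value) {
        LOOKUP.put(key, value);
        maxKeyLength = Math.max(maxKeyLength, key.length());
    }

    // Replaces each longest dictionary match with its pinyin; unknown characters pass through.
    static String toPinyin(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int matched = 0;
            String value = null;
            // Try the longest candidate first, then shorter ones.
            for (int len = Math.min(maxKeyLength, text.length() - i); len > 0; len--) {
                value = LOOKUP.get(text.substring(i, i + len));
                if (value != null) {
                    matched = len;
                    break;
                }
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            if (matched == 0) {
                out.append(text.charAt(i)); // not in the dictionary: keep as-is
                matched = 1;
            } else {
                out.append(value);
            }
            i += matched;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toPinyin("公安局公布")); // prints gong1 an1 ju2 gong1 bu4
    }
}
```

Because the match is greedy, a longer dictionary entry always wins over the actual word boundaries in the text, which is one way this approach and a word-break-based approach can produce different sort keys.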

@ToshihikoMakita

> So if I'm understanding you correctly, you're saying that the sort order for zh-CN is based only on the first character of a multi-character sequence, even when the character has one pinyin transliteration when used in isolation and another when it is the first character of a multi-character word?
> That would make sense in that it then allows for non-dictionary-based collation but it is inconsistent with what I understood to be the rules for zh-CN sorting.

No, that is not what I meant to describe. The proposed steps for comparing two zh-CN strings (multi-character sequences) are as follows:

  1. Get the correct pinyin reading for every character. You can use dictionary-based analysis to get the correct pinyin.
  2. Make the sort key for every character based on [pinyin]/[stroke]/[radical]/[GB0 code]. This step may need another dictionary to get the stroke count, radical, and GB0 code for each character.
  3. Compare the two strings character by character using the above key.

This method differs from using the ICU RuleBasedCollator. The ICU collator gives the same collation order based on the above key, but the pinyin for each character is fixed to its most common reading. As a result, we sometimes receive complaints from Chinese readers that the pinyin sort order is not accurate.

Using dictionary based approach for getting correct pinyin will solve this problem.
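The three steps can be sketched as follows, under several assumptions: the per-character readings are already known (here they are copied from the lookup table excerpt), the key is reduced to pinyin, stroke count, and code point (the radical and GB0 steps are omitted), and the stroke counts are hand-entered only for the two characters that tie on pinyin. `Entry` and `CharByCharOrder` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A term with one pinyin reading per character (step 1 is assumed to be done already).
class Entry {
    final String term;
    final String[] readings;

    Entry(String term, String pinyin) {
        this.term = term;
        this.readings = pinyin.split(" ");
    }
}

class CharByCharOrder {
    // Hand-entered stroke counts, only for the characters that tie on pinyin in this sample.
    static final Map<Character, Integer> STROKES = new HashMap<>();
    static {
        STROKES.put('公', 4);
        STROKES.put('功', 5);
    }

    // Compares "an1" and "an4" style readings: syllable letters first, then tone digit.
    static int compareReadings(String a, String b) {
        int c = a.substring(0, a.length() - 1).compareTo(b.substring(0, b.length() - 1));
        return c != 0 ? c : Character.compare(a.charAt(a.length() - 1), b.charAt(b.length() - 1));
    }

    // Steps 2 and 3: per-character key of pinyin, then strokes, then code unit.
    static int compare(Entry a, Entry b) {
        int n = Math.min(a.term.length(), b.term.length());
        for (int i = 0; i < n; i++) {
            char ca = a.term.charAt(i);
            char cb = b.term.charAt(i);
            int c = compareReadings(a.readings[i], b.readings[i]);
            if (c == 0) {
                c = Integer.compare(STROKES.getOrDefault(ca, 0), STROKES.getOrDefault(cb, 0));
            }
            if (c == 0) {
                c = Character.compare(ca, cb);
            }
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(a.term.length(), b.term.length()); // shorter term first
    }

    static List<String> sort(List<Entry> entries) {
        List<Entry> copy = new ArrayList<>(entries);
        copy.sort(CharByCharOrder::compare);
        List<String> terms = new ArrayList<>();
        for (Entry e : copy) {
            terms.add(e.term);
        }
        return terms;
    }

    public static void main(String[] args) {
        List<Entry> entries = Arrays.asList(
            new Entry("公安", "gong1 an1"),
            new Entry("公案", "gong1 an4"),
            new Entry("功败垂成", "gong1 bai4 chui2 cheng2"),
            new Entry("公办", "gong1 ban4"));
        System.out.println(sort(entries)); // prints [公安, 公案, 公办, 功败垂成]
    }
}
```

The input list is in whole-word pinyin order, as in the lookup table. Comparing character by character moves 公办 ahead of 功败垂成, because 公 and 功 share the reading "gong1" and 公 has fewer strokes, so all terms that start with 公 stay together.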

> Now that I have the zh-CN word break iterator working I can try using that to get the pinyin by individual words.

I have seen your code. Using ICU for this purpose may be effective. I also want to try this method when I have time.

@drmacro
Contributor Author

drmacro commented Oct 25, 2016

I have implemented use of the word break iterator to build the pinyin sort key (it's in the 0.2.2 release I posted yesterday).

With this change my client's glossary sort is closer to what they specified but not completely correct. I'm waiting for feedback from them on why it's wrong; it may be because they have specialized terminology that is not in the dictionary.

One challenge here seems to be selecting words based on usage frequency, which the ICU documentation claims to have in their dictionary, but that still won't always give the correct answer.

I think in that case the only general solution is to have authors provide a sort-as value as for Japanese.

@ToshihikoMakita

ToshihikoMakita commented Oct 25, 2016

> I think in that case the only general solution is to have authors provide a sort-as value as for Japanese.

More than ten years ago I implemented this idea in the I18n Index Library.

https://www.antennahouse.com/antenna1/i18n-index-library/

– You can use the sortas attribute to correct Simplified Chinese index orders.
For example the Chinese word "粘贴" belongs to "N" index group because the most common reading of "粘" is "nian2". However the correct reading is "zhan1" for this word.

<indexterm><primary>粘贴</primary></indexterm>

You can correct this problem by specifying the correct reading (pinyin) to the sortas attribute value.
The fix will place "粘贴" into the "Z" index group.

<indexterm><primary sortas="zhan1 tie1">粘贴</primary></indexterm>

But it may be a tedious job to author @sort-as for every pinyin exception.
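The override behaviour described above can be sketched like this. It is not the I18n Index Library's code: `SortAsDemo`, `sortKey`, and `indexGroup` are hypothetical names, the default readings are the two mentioned in this comment, and the term is assumed to be non-empty.

```java
import java.util.HashMap;
import java.util.Map;

class SortAsDemo {
    // Most common reading per character, as a fixed-reading collator would use.
    static final Map<Character, String> DEFAULT_READING = new HashMap<>();
    static {
        DEFAULT_READING.put('粘', "nian2");
        DEFAULT_READING.put('贴', "tie1");
    }

    // Uses the author's sort-as value when present, else the default readings.
    static String sortKey(String term, String sortAs) {
        if (sortAs != null && !sortAs.isEmpty()) {
            return sortAs;
        }
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < term.length(); i++) {
            if (i > 0) {
                key.append(' ');
            }
            String reading = DEFAULT_READING.get(term.charAt(i));
            key.append(reading != null ? reading : String.valueOf(term.charAt(i)));
        }
        return key.toString();
    }

    // The index group is the upper-cased first letter of the sort key.
    static char indexGroup(String term, String sortAs) {
        return Character.toUpperCase(sortKey(term, sortAs).charAt(0));
    }

    public static void main(String[] args) {
        System.out.println(indexGroup("粘贴", null));         // prints N
        System.out.println(indexGroup("粘贴", "zhan1 tie1")); // prints Z
    }
}
```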

@drmacro
Contributor Author

drmacro commented Oct 25, 2016

The alternative would be to provide some kind of override or extension for the word break iterator. I haven't yet had time to look into what facilities the ICU4J library provides for that.

@drmacro
Contributor Author

drmacro commented Oct 25, 2016

Basically, sorting zh-CN correctly is hard, bordering on impossible, to automate.

@drmacro
Contributor Author

drmacro commented Sep 26, 2017

This post:

http://pinyin.info/news/2012/pinyin-sort-order/

Provides a different set of rules.

In particular, it does not rely on stroke count for disambiguating homophones but on "frequency of use" (although that requires a separate data set and I have no idea where that might be available for free).

In re-reading the ICU page you pointed to above it looks to me like they are describing the two different ways that Chinese is sorted:

  • pinyin for Simplified Chinese
  • Radical and stroke count for Traditional Chinese

But the writeup is not very clearly written.

@ToshihikoMakita

> http://pinyin.info/news/2012/pinyin-sort-order/

I think this rule defines the sort order for an English-to-Chinese dictionary. The sample link shows this.

http://www.uhpress.hawaii.edu/books/defrancisChinese.pdf

However, one of my Chinese dictionaries (© 東方書店+北京・商務印書館 精選日中・中日辞典 1999) does not follow this rule. Its pinyin order is 一声 (first tone), 二声 (second tone), 三声 (third tone), 四声 (fourth tone), 軽声 (fifth tone). But this rule says that the order should be 軽声, 一声, 二声, 三声, 四声.

I think this is because 軽声 has no special accent mark, so it may be natural for English speakers to place 軽声 first.
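The two tone orders differ only in where the neutral tone goes, so a sort routine could make that a switch. A minimal sketch, assuming numbered pinyin in which the neutral tone is written as 5 (as in "ma5"); `ToneOrder` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ToneOrder {
    // Maps the trailing tone digit to a rank; the neutral tone (5) goes first or last.
    static int toneRank(String syllable, boolean neutralFirst) {
        int tone = syllable.charAt(syllable.length() - 1) - '0';
        if (tone == 5) {
            return neutralFirst ? 0 : 5;
        }
        return tone;
    }

    // Sorts by syllable letters first, then by tone rank.
    static List<String> sort(List<String> syllables, boolean neutralFirst) {
        List<String> copy = new ArrayList<>(syllables);
        copy.sort(Comparator
            .comparing((String s) -> s.substring(0, s.length() - 1))
            .thenComparingInt(s -> toneRank(s, neutralFirst)));
        return copy;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("ma3", "ma5", "ma1", "ma4", "ma2");
        System.out.println(sort(input, false)); // prints [ma1, ma2, ma3, ma4, ma5]
        System.out.println(sort(input, true));  // prints [ma5, ma1, ma2, ma3, ma4]
    }
}
```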
