Refine Sort Ordering for zh-CN #1
Can you point to some authority, either online or a book I can buy, that explains this rule? Doing a Google search I couldn't find anything more detailed than "sort by pinyin". I'm also not sure I understand the error you're asserting: all the words in the sample data above have distinct pinyin, so they will all sort appropriately--I'm not seeing any ambiguity that would need to then be resolved by stroke count or radical. But maybe I'm missing what you're seeing.
The following definition will be useful: Han Collation
I understand those rules to be for determining group ordering and grouping, not collation of the words. In particular, you can't order words by comparing the individual characters in isolation. So maybe we're talking about different things? I can only see a need to fall back to stroke ordering or radical-stroke in the case where the same sequence of characters produces exactly the same pinyin, and so far I haven't seen that case. One thing I haven't worked out yet is simply how to set up primary, secondary, and tertiary sort keys for the ICU4J API (or if that's even sensible--it may not be).
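For what it's worth, comparison levels are usually controlled through the collator's strength rather than separate key objects. A minimal sketch using the JDK's `java.text.Collator` (ICU4J's `com.ibm.icu.text.Collator` exposes the same `PRIMARY`/`SECONDARY`/`TERTIARY` constants); the locale and strings are illustrative:

```java
import java.text.Collator;
import java.util.Locale;

public class StrengthDemo {
    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.SIMPLIFIED_CHINESE);
        // PRIMARY compares base characters only; SECONDARY also considers
        // accents; TERTIARY also considers case and other minor variants.
        collator.setStrength(Collator.PRIMARY);
        // Case is a tertiary difference, so at PRIMARY strength Latin
        // strings that differ only in case compare as equal.
        System.out.println(collator.compare("abc", "ABC") == 0);
    }
}
```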
It will be better to show you an actual example. Please download the picture from the following URL: https://1drv.ms/i/s!AkbL99fLhxKU1B99yCUl-NOsj0mC This picture is taken from "A Chinese-English-Japanese-Korean Computer Dictionary '2001", published in China by ZHONGHUA BOOK COMPANY (中華書局). You can see that all of the words are sorted character by character, not by the pinyin of the whole word. This is the natural order for Chinese or Japanese.
In my understanding, this is used mainly for Latin-script languages that use alphabets. For Han ideographs, every character is defined using a primary difference. If you want to confirm, the following Java code writes the Java and ICU collation rules for a given locale into UTF-8 text files. Contrary to the above, I found that tertiary differences are used in the zh-CN collation rule in ICU4J.

```java
import java.util.Locale;
import java.io.Writer;
import java.io.FileOutputStream;
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;

public class CollatorRuleOutput {
    /**
     * @param args args[0]: language code, e.g. "zh-CN"
     */
    public static void main(String[] args) {
        String prmLang = args[0];
        Locale locale;
        if (prmLang.contains("-")) {
            String lang = prmLang.substring(0, 2);
            String country = prmLang.substring(3);
            locale = new Locale(lang, country);
        } else {
            locale = new Locale(prmLang);
        }
        // Dump the JDK collation rules for the locale.
        java.text.RuleBasedCollator javaCollator =
            (java.text.RuleBasedCollator) java.text.Collator.getInstance(locale);
        writeToFile(prmLang + "-sort-rules-java.txt", javaCollator.getRules());
        // Dump the ICU4J collation rules for the same locale.
        com.ibm.icu.text.RuleBasedCollator icuCollator =
            (com.ibm.icu.text.RuleBasedCollator) com.ibm.icu.text.Collator.getInstance(locale);
        writeToFile(prmLang + "-sort-rules-icu.txt", icuCollator.getRules());
    }

    public static void writeToFile(String fileName, String content) {
        try (Writer out = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(fileName), "UTF-8"))) {
            out.write(content);
        } catch (Exception e) {
            e.printStackTrace(); // don't swallow write failures silently
        }
    }
}
```
So if I'm understanding you correctly, you're saying that the sort order for zh-CN is based only on the first character of a multi-character sequence, even when the character has one pinyin transliteration when used in isolation and another when it is the first character of a multi-character word? That would make sense in that it then allows for non-dictionary-based collation, but it is inconsistent with what I understood to be the rules for zh-CN sorting. I wonder if part of the issue with the current code is that I'm getting the pinyin for the longest sequence of characters that matches something in the dictionary, rather than using the actual words in the text. That could definitely result in different sort orders. Now that I have the zh-CN word break iterator working, I can try using that to get the pinyin by individual words. In many cases that would have the same effect as using the first character in isolation. Of course, it is also dependent on the correctness of the word-recognition algorithm, which I have no ability to evaluate.
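The per-word approach can be sketched with a word break iterator: segment the text into words first, then look up pinyin per word. This uses the JDK's `java.text.BreakIterator` for self-containment; ICU4J's `com.ibm.icu.text.BreakIterator` has the same iteration API (and its zh dictionary data is what the discussion above is about). The sample string is illustrative:

```java
import java.text.BreakIterator;
import java.util.Locale;

public class WordSegmentDemo {
    public static void main(String[] args) {
        String text = "中文分词示例";
        // Word boundaries for Chinese come from dictionary-based
        // segmentation in the underlying break-iterator data.
        BreakIterator words = BreakIterator.getWordInstance(Locale.SIMPLIFIED_CHINESE);
        words.setText(text);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            // Each segment would be looked up in the pinyin dictionary.
            System.out.println(text.substring(start, end));
        }
    }
}
```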
No, that is not the understanding I wanted to convey. The proposed zh-CN comparison steps between two strings (multi-character sequences) are as follows:
This method is different from using the ICU RuleBasedCollator. That offers the same collation order based on the above key, but the pinyin for each character is fixed to the most-used one. As a result, we sometimes receive complaints from Chinese readers that the pinyin sorting order is not accurate. Using a dictionary-based approach to get the correct pinyin will solve this problem.
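A toy illustration of the dictionary-based idea: look the whole word up in a pinyin table and use that reading as the sort key, falling back to the raw text when the word is unknown. The two-entry map here is a stand-in for a real dictionary, and the key format ("gong1 an1") is just one possible encoding:

```java
import java.util.Map;

public class PinyinKeyDemo {
    // Toy lookup table; a real implementation would consult a full
    // pinyin dictionary keyed by whole words.
    static final Map<String, String> PINYIN = Map.of(
        "公安", "gong1 an1",
        "公布", "gong1 bu4");

    static String sortKey(String word) {
        // Unknown words fall back to the raw text as their own key.
        return PINYIN.getOrDefault(word, word);
    }

    public static void main(String[] args) {
        System.out.println(sortKey("公安")); // dictionary hit
        System.out.println(sortKey("天气")); // fallback: not in the toy table
    }
}
```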
I have seen your code. Using ICU for this purpose may be effective. I also want to try this method when I have free time.
I have implemented the word break iterator approach to build the pinyin sort key (it's in the 0.2.2 release I posted yesterday). With this change, my client's glossary sort is closer to what they specified, but not completely correct. I'm waiting for feedback from them on why it's wrong--it may be because they have specialized terminology that is not in the dictionary. One challenge here seems to be selecting words based on usage frequency, which the ICU documentation claims to have in their dictionary, but that still won't always give the correct answer. I think in that case the only general solution is to have authors provide a sort-as value, as for Japanese.
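The sort-as mechanism reduces to an override map consulted before the computed key. A minimal sketch, assuming a hypothetical glossary term 行距 (where 行 is polyphonic, xing2/hang2) and a naive lookup that picked the wrong reading; both readings here are illustrative, not taken from any real glossary:

```java
import java.util.Map;

public class SortAsDemo {
    // Hypothetical author-supplied @sort-as overrides, keyed by term.
    static final Map<String, String> SORT_AS = Map.of("行距", "hang2 ju4");

    static String sortKey(String term, String computedPinyin) {
        // An explicit @sort-as value wins over whatever the
        // frequency-based dictionary lookup produced.
        return SORT_AS.getOrDefault(term, computedPinyin);
    }

    public static void main(String[] args) {
        // Assume the dictionary lookup chose the more common reading xing2.
        System.out.println(sortKey("行距", "xing2 ju4")); // override applies
        System.out.println(sortKey("拼音", "pin1 yin1")); // no override needed
    }
}
```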
Over ten years ago I implemented this idea in the I18n Index Library. https://www.antennahouse.com/antenna1/i18n-index-library/
But it may be a tedious job to author @sort-as for every pinyin exception.
The alternative would be to provide some kind of override or extension for the word break iterator. I haven't yet had time to look into what facilities the ICU4J library provides for that.
Basically, sorting zh-CN is hard, bordering on impossible, to automate correctly.
This post provides a different set of rules: http://pinyin.info/news/2012/pinyin-sort-order/ In particular, it does not rely on stroke count for disambiguating homophones but on "frequency of use" (although that requires a separate data set, and I have no idea where one might be available for free). In re-reading the ICU page you pointed to above, it looks to me like they are describing the two different ways that Chinese is sorted:
But the write-up is not very clear.
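As an aside, both Chinese orderings can be requested explicitly through BCP 47 collation keywords (`zh-u-co-pinyin` and `zh-u-co-stroke`). Whether a given runtime actually honors the keyword depends on its CLDR data, so treat this as a sketch rather than a guarantee; ICU4J's `com.ibm.icu.text.Collator` accepts the same language tags:

```java
import java.text.Collator;
import java.util.Locale;

public class VariantDemo {
    public static void main(String[] args) {
        // Pinyin-based ordering vs. stroke-count-based ordering.
        Collator pinyin = Collator.getInstance(Locale.forLanguageTag("zh-u-co-pinyin"));
        Collator stroke = Collator.getInstance(Locale.forLanguageTag("zh-u-co-stroke"));
        // The two collators may order the same pair differently
        // when the runtime's collation data supports both tailorings.
        System.out.println(pinyin.compare("中", "文"));
        System.out.println(stroke.compare("中", "文"));
    }
}
```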
I think this rule defines the sort order for an English-to-Chinese dictionary; the sample link shows this. http://www.uhpress.hawaii.edu/books/defrancisChinese.pdf However, one of my Chinese dictionaries (© 東方書店+北京・商務印書館 精選日中・中日辞典 1999) does not follow this rule. Its pinyin order is first tone, second tone, third tone, fourth tone, neutral tone (軽声). But this rule says the order should be neutral tone, first tone, second tone, third tone, fourth tone. I think this is because the neutral tone has no accent mark, so it may be natural for English speakers to place the neutral tone first.
From Toshihiko Makita in a post to the DITA User's list:
Here is the excerpt from org.dita-community.i18n-develop/src/main/resources/lookup-zh-cn.xml.
This data contains the pinyin from "gong1 an1" to "gong1 bu4". I converted it into a DITA topic.
The indexterm is already sorted in pinyin order, so I guess that you will get the same index order in your DITA-OT publishing result. However, this is not accurate, because the same character is isolated in the output. In my implementation, the order should follow the rule: "If the (Chinese) character is the same, it should be sorted by pinyin/strokes/radical/GB0 code." The original collation rule for Simplified Chinese (zh-CN-sort-rules.txt) is based on this rule.