Hangul Jamo vowels and trailing consonants should probably be 0 width #32

Open
ninjalj opened this issue Dec 27, 2021 · 3 comments

@ninjalj

ninjalj commented Dec 27, 2021

U+1160..U+11FF and U+D7B0..U+D7FF should have 0 width.

Korean Hangul is a writing system that uses syllable blocks built from alphabetic components (jamo). A syllable consists of one or more leading consonants, one or more vowels, and zero or more trailing consonants.

Unicode has precomposed syllable blocks at U+AC00..U+D7A3 (11172 code points).
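
To make the relationship between precomposed syllables and component jamo concrete, here is a minimal sketch of the arithmetic decomposition described in the Unicode Standard. The function name `decompose_syllable` is made up for illustration; the constants are the standard ones, and 19 × 21 × 28 gives the 11172 syllables mentioned above.

```python
# Constants from the Unicode Standard's Hangul syllable decomposition.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28  # T_COUNT includes "no trailing consonant"
S_COUNT = L_COUNT * V_COUNT * T_COUNT   # number of precomposed syllables


def decompose_syllable(ch: str) -> str:
    """Split one precomposed syllable (U+AC00..U+D7A3) into conjoining jamo."""
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    lead = L_BASE + s_index // (V_COUNT * T_COUNT)
    vowel = V_BASE + (s_index % (V_COUNT * T_COUNT)) // T_COUNT
    trail = T_BASE + s_index % T_COUNT
    jamo = chr(lead) + chr(vowel)
    if trail != T_BASE:  # T_BASE itself means "no trailing consonant"
        jamo += chr(trail)
    return jamo
```

For example, `decompose_syllable("\ud55c")` should yield a choseong, a jungseong, and a jongseong, which is the same sequence that NFD normalization produces.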

There are also component Jamos:

  • Hangul Jamo (U+1100..U+11FF).
    • U+1100..U+115F Choseong (initial, Leading Consonants) have East_Asian_Width=Wide and Hangul_Syllable_Type=Leading_Jamo
    • U+1160..U+11A7 Jungseong (medial, Vowels) have East_Asian_Width=Neutral and Hangul_Syllable_Type=Vowel_Jamo
    • U+11A8..U+11FF Jongseong (final, Trailing consonants) have East_Asian_Width=Neutral and Hangul_Syllable_Type=Trailing_Jamo
  • U+A960..U+A97F Hangul Jamo Extended-A (choseong) have East_Asian_Width=Wide
  • U+D7B0..U+D7FF Hangul Jamo Extended-B (jungseong and jongseong) have East_Asian_Width=Neutral
  • U+3130..U+318F Hangul Compatibility Jamo have no conjoining behavior
  • U+FFA0..U+FFDF half-width forms have no conjoining behavior.
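
The ranges above can be sketched as a small lookup. This is only an illustration: `syllable_type` and `is_syllable_block` are hypothetical helpers, and the split of Hangul Jamo Extended-B into vowels (U+D7B0..U+D7C6) and trailing consonants (U+D7CB..U+D7FB) is taken from the glibc change quoted further down in this comment. The other ranges are copied as listed, so they may include a few unassigned code points.

```python
import re

# (first, last, type) for the conjoining jamo ranges listed in this issue.
RANGES = [
    (0x1100, 0x115F, "L"),  # Hangul Jamo, choseong
    (0x1160, 0x11A7, "V"),  # Hangul Jamo, jungseong
    (0x11A8, 0x11FF, "T"),  # Hangul Jamo, jongseong
    (0xA960, 0xA97F, "L"),  # Hangul Jamo Extended-A, choseong
    (0xD7B0, 0xD7C6, "V"),  # Hangul Jamo Extended-B, jungseong
    (0xD7CB, 0xD7FB, "T"),  # Hangul Jamo Extended-B, jongseong
]


def syllable_type(cp: int) -> str:
    """Return "L", "V" or "T" for conjoining jamo, "-" for anything else."""
    for first, last, kind in RANGES:
        if first <= cp <= last:
            return kind
    return "-"


def is_syllable_block(text: str) -> bool:
    """True if text is a jamo sequence of the form L+V+T*."""
    kinds = "".join(syllable_type(ord(ch)) for ch in text)
    return re.fullmatch(r"L+V+T*", kinds) is not None
```

Compatibility jamo such as U+3131 fall through to "-", matching the note that they have no conjoining behavior.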

U+1100..U+11FF, U+A960..U+A97F, and U+D7B0..U+D7FF have conjoining behavior: a sequence matching L+V+T* is rendered as a single syllable block. wcwidth() implementations tend to give U+1100..U+115F width 2 and U+1160..U+11FF width 0, so the resulting syllable block has the correct total width of 2.
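
A rough sketch of the width rule being proposed, assuming a simplified model in which every code point outside the listed ranges has width 1. The names `char_width` and `string_width` are hypothetical and are not the API of this library or of any wcwidth() implementation.

```python
# Ranges from this issue. Everything else defaults to width 1, which is an
# assumption made only to keep the sketch short.
ZERO_WIDTH = [(0x1160, 0x11FF), (0xD7B0, 0xD7FF)]  # jungseong and jongseong
WIDE = [(0x1100, 0x115F), (0xA960, 0xA97F), (0xAC00, 0xD7A3)]  # choseong, syllables


def in_ranges(cp: int, ranges) -> bool:
    return any(first <= cp <= last for first, last in ranges)


def char_width(cp: int) -> int:
    """Column width of a single code point under the proposed rule."""
    if in_ranges(cp, ZERO_WIDTH):
        return 0
    if in_ranges(cp, WIDE):
        return 2
    return 1


def string_width(text: str) -> int:
    """Sum of per-code-point widths."""
    return sum(char_width(ord(ch)) for ch in text)
```

With this rule a decomposed L+V+T sequence and its precomposed syllable both sum to width 2, which is the property the issue is after.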

U+D7B0..U+D7FF should also have width 0, for the same reason.

glibc gave width 0 to conjoining jungseong and jongseong in these commits:

 commit 7a79e321c6f85b204036c33d85f6b2aa794e7c76
Author: Thorsten Glaser <[email protected]>
Date:   Fri Jul 14 14:02:50 2017 +0200

    Refresh generated charmap data and ChangeLog

            [BZ #21750]
            * charmaps/UTF-8: Refresh.

diff --git a/localedata/ChangeLog b/localedata/ChangeLog
index 04ef5ad071..9e05b4a652 100644
--- a/localedata/ChangeLog
+++ b/localedata/ChangeLog
@@ -1,3 +1,17 @@
+2017-07-14  Thorsten Glaser  <[email protected]>
+
+       [BZ #21750]
+       * charmaps/UTF-8: Refresh.
+       * unicode-gen/utf8_gen.py (U+00AD): Set width to 1.
+       * unicode-gen/utf8_gen.py (U+1160..U+11FF): Set width to 0.
+       * unicode-gen/utf8_gen.py (U+3248..U+324F): Set width to 2.
+       * unicode-gen/utf8_gen.py (U+4DC0..U+4DFF): Likewise.
+       * unicode-gen/utf8_gen.py: Treat category Me and Mn as combining.
+       [BZ #19852]
+       * unicode-gen/utf8_gen.py: Process EastAsianWidth lines before
+       UnicodeData lines so the latter have precedence; remove hack
+       to group output by EastAsianWidth ranges.
+

[ ... snip ...]

commit 6e540caa21616d5ec5511fafb22819204525138e
Author: Mike FABIAN <[email protected]>
Date:   Tue Jun 16 08:29:40 2020 +0200

    Set width of JUNGSEONG/JONGSEONG characters from UD7B0 to UD7FB to 0 [BZ #26120]
Reviewed-by: Carlos O'Donell <[email protected]>

diff --git a/localedata/charmaps/UTF-8 b/localedata/charmaps/UTF-8
index 14c5d4fa33..8cce47cd97 100644
--- a/localedata/charmaps/UTF-8
+++ b/localedata/charmaps/UTF-8
@@ -48920,6 +48920,8 @@ WIDTH
 <UABE8>        0
 <UABED>        0
 <UAC00>...<UD7A3>      2
+<UD7B0>...<UD7C6>      0
+<UD7CB>...<UD7FB>      0
 <UF900>...<UFA6D>      2
 <UFA70>...<UFAD9>      2
 <UFB1E>        0
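
The second commit lists two sub-ranges rather than all of U+D7B0..U+D7FF, presumably because the block contains unassigned code points. A small sketch to check this against Python's bundled Unicode database; `assigned_runs` is a hypothetical helper, and the result depends on the Unicode version shipped with the interpreter.

```python
import unicodedata


def assigned_runs(first: int, last: int):
    """Return (start, end) runs of assigned code points within first..last."""
    runs, start = [], None
    for cp in range(first, last + 1):
        # General category "Cn" marks an unassigned code point.
        assigned = unicodedata.category(chr(cp)) != "Cn"
        if assigned and start is None:
            start = cp
        elif not assigned and start is not None:
            runs.append((start, cp - 1))
            start = None
    if start is not None:
        runs.append((start, last))
    return runs


for start, end in assigned_runs(0xD7B0, 0xD7FF):
    print(f"U+{start:04X}..U+{end:04X}")
```

On a reasonably recent Python this should print the same two ranges that appear in the diff above.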

@christianparpart
Member

Hey @ninjalj. Sorry for the late reply. I want to take care of this ASAP, but my time has been limited recently, so unless someone else gets to it first, I'll do it as soon as I can. Many thanks for the very detailed information here as well. :)

@christianparpart added the "help wanted" and "ucd" labels Jan 30, 2022
@data-man self-assigned this Jan 31, 2022
@data-man
Contributor

data-man commented Feb 3, 2022

Interesting: utf8proc has a printproperty binary (enabled by the -DUTF8PROC_ENABLE_TESTING=ON option).
Output for some code points:
$ printproperty 1110

U+1110: ᄐ
category = Lo
combining_class = 0
bidi_class = 1
decomp_type = 0
uppercase_mapping = 1110 (seqindex ffff)
lowercase_mapping = 1110 (seqindex ffff)
titlecase_mapping = 1110 (seqindex ffff)
casefold = ᄐ
comb_index = 65535
bidi_mirrored = 0
comp_exclusion = 0
ignorable = 0
control_boundary = 0
boundclass = 6
charwidth = 2

$ printproperty 1160

U+1160: ᅠ
category = Lo
combining_class = 0
bidi_class = 1
decomp_type = 0
uppercase_mapping = 1160 (seqindex ffff)
lowercase_mapping = 1160 (seqindex ffff)
titlecase_mapping = 1160 (seqindex ffff)
casefold = ᅠ
comb_index = 65535
bidi_mirrored = 0
comp_exclusion = 0
ignorable = 1
control_boundary = 0
boundclass = 7
charwidth = 1

$ printproperty 11A8

U+11A8: ᆨ
category = Lo
combining_class = 0
bidi_class = 1
decomp_type = 0
uppercase_mapping = 11a8 (seqindex ffff)
lowercase_mapping = 11a8 (seqindex ffff)
titlecase_mapping = 11a8 (seqindex ffff)
casefold = ᆨ
comb_index = 65535
bidi_mirrored = 0
comp_exclusion = 0
ignorable = 0
control_boundary = 0
boundclass = 8
charwidth = 1

$ printproperty A960

U+A960: ꥠ
category = Lo
combining_class = 0
bidi_class = 1
decomp_type = 0
uppercase_mapping = a960 (seqindex ffff)
lowercase_mapping = a960 (seqindex ffff)
titlecase_mapping = a960 (seqindex ffff)
casefold = ꥠ
comb_index = 65535
bidi_mirrored = 0
comp_exclusion = 0
ignorable = 0
control_boundary = 0
boundclass = 6
charwidth = 2

$ printproperty D7B0

U+D7B0: ힰ
category = Lo
combining_class = 0
bidi_class = 1
decomp_type = 0
uppercase_mapping = d7b0 (seqindex ffff)
lowercase_mapping = d7b0 (seqindex ffff)
titlecase_mapping = d7b0 (seqindex ffff)
casefold = ힰ
comb_index = 65535
bidi_mirrored = 0
comp_exclusion = 0
ignorable = 0
control_boundary = 0
boundclass = 7
charwidth = 1
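
To summarize the output above, here is a small sketch that compares the reported charwidth values with the widths proposed in this issue. The `reported` values are copied from the printproperty output; `proposed_width` is a hypothetical helper that is only meaningful for these five sampled code points.

```python
# charwidth values copied from the printproperty output above.
reported = {0x1110: 2, 0x1160: 1, 0x11A8: 1, 0xA960: 2, 0xD7B0: 1}


def proposed_width(cp: int) -> int:
    """Width proposed in this issue, valid only for the sampled code points."""
    if 0x1160 <= cp <= 0x11FF or 0xD7B0 <= cp <= 0xD7FF:
        return 0  # conjoining jungseong and jongseong
    return 2      # the remaining samples are choseong (East_Asian_Width=Wide)


mismatches = [cp for cp, width in reported.items() if width != proposed_width(cp)]
for cp in mismatches:
    print(f"U+{cp:04X}: reported {reported[cp]}, proposed {proposed_width(cp)}")
```

The three vowel and trailing-consonant samples come out as mismatches, while both choseong samples already agree.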

@ninjalj
Author

ninjalj commented Feb 6, 2022

I will have to open another issue at utf8proc.

Some further discussion:

https://lists.gnu.org/archive/html/bug-libunistring/2021-12/msg00006.html (and replies)
ridiculousfish/widecharwidth#16
