Sinhala character tables

This document lists the per-character shaping information needed to shape Sinhala text.

Sinhala character table

Sinhala characters should be classified as in the following table. Codepoints in the Sinhala block with no assigned meaning are designated as unassigned in the Unicode category column.

Assigned codepoints with a null in the Shaping class column evoke no special behavior from the shaping engine. Note that this includes some valid codepoints, such as currency marks, punctuation, and other symbols.

Note: the NUMBER and SYMBOL Shaping classes are important during syllable identification, but generally evoke no further special behavior during the rest of the shaping process.

The Mark-placement subclass column indicates mark-placement positioning for codepoints in the Mark category. Assigned, non-mark codepoints have a null in this column and evoke no special mark-placement behavior. Marks tagged with [Mn] in the Unicode category column are categorized as non-spacing; marks tagged with [Mc] are categorized as spacing-combining.

Some codepoints in the following table use a Shaping class that differs from the codepoint's Unicode General Category. The Shaping class takes precedence during OpenType shaping, as it captures more specific, script-aware behavior.

Codepoint Unicode category Shaping class Mark-placement subclass Glyph
U+0D80 unassigned
U+0D81 Mark [Mn] BINDU TOP_POSITION ඁ Candrabindu
U+0D82 Mark [Mc] BINDU RIGHT_POSITION ං Anusvara
U+0D83 Mark [Mc] VISARGA RIGHT_POSITION ඃ Visarga
U+0D84 unassigned
U+0D85 Letter VOWEL_INDEPENDENT null අ A
U+0D86 Letter VOWEL_INDEPENDENT null ආ Aa
U+0D87 Letter VOWEL_INDEPENDENT null ඇ Ae
U+0D88 Letter VOWEL_INDEPENDENT null ඈ Aae
U+0D89 Letter VOWEL_INDEPENDENT null ඉ I
U+0D8A Letter VOWEL_INDEPENDENT null ඊ Ii
U+0D8B Letter VOWEL_INDEPENDENT null උ U
U+0D8C Letter VOWEL_INDEPENDENT null ඌ Uu
U+0D8D Letter VOWEL_INDEPENDENT null ඍ Vocalic R
U+0D8E Letter VOWEL_INDEPENDENT null ඎ Vocalic Rr
U+0D8F Letter VOWEL_INDEPENDENT null ඏ Vocalic L
U+0D90 Letter VOWEL_INDEPENDENT null ඐ Vocalic Ll
U+0D91 Letter VOWEL_INDEPENDENT null එ E
U+0D92 Letter VOWEL_INDEPENDENT null ඒ Ee
U+0D93 Letter VOWEL_INDEPENDENT null ඓ Ai
U+0D94 Letter VOWEL_INDEPENDENT null ඔ O
U+0D95 Letter VOWEL_INDEPENDENT null ඕ Oo
U+0D96 Letter VOWEL_INDEPENDENT null ඖ Au
U+0D97 unassigned
U+0D98 unassigned
U+0D99 unassigned
U+0D9A Letter CONSONANT null ක Ka
U+0D9B Letter CONSONANT null ඛ Kha
U+0D9C Letter CONSONANT null ග Ga
U+0D9D Letter CONSONANT null ඝ Gha
U+0D9E Letter CONSONANT null ඞ Nga
U+0D9F Letter CONSONANT null ඟ Nnga
U+0DA0 Letter CONSONANT null ච Ca
U+0DA1 Letter CONSONANT null ඡ Cha
U+0DA2 Letter CONSONANT null ජ Ja
U+0DA3 Letter CONSONANT null ඣ Jha
U+0DA4 Letter CONSONANT null ඤ Nya
U+0DA5 Letter CONSONANT null ඥ Jnya
U+0DA6 Letter CONSONANT null ඦ Nyja
U+0DA7 Letter CONSONANT null ට Tta
U+0DA8 Letter CONSONANT null ඨ Ttha
U+0DA9 Letter CONSONANT null ඩ Dda
U+0DAA Letter CONSONANT null ඪ Ddha
U+0DAB Letter CONSONANT null ණ Nna
U+0DAC Letter CONSONANT null ඬ Nndda
U+0DAD Letter CONSONANT null ත Ta
U+0DAE Letter CONSONANT null ථ Tha
U+0DAF Letter CONSONANT null ද Da
U+0DB0 Letter CONSONANT null ධ Dha
U+0DB1 Letter CONSONANT null න Na
U+0DB2 unassigned
U+0DB3 Letter CONSONANT null ඳ Nda
U+0DB4 Letter CONSONANT null ප Pa
U+0DB5 Letter CONSONANT null ඵ Pha
U+0DB6 Letter CONSONANT null බ Ba
U+0DB7 Letter CONSONANT null භ Bha
U+0DB8 Letter CONSONANT null ම Ma
U+0DB9 Letter CONSONANT null ඹ Mba
U+0DBA Letter CONSONANT null ය Ya
U+0DBB Letter CONSONANT null ර Ra
U+0DBC unassigned
U+0DBD Letter CONSONANT null ල La
U+0DBE unassigned
U+0DBF unassigned
U+0DC0 Letter CONSONANT null ව Va
U+0DC1 Letter CONSONANT null ශ Sha
U+0DC2 Letter CONSONANT null ෂ Ssa
U+0DC3 Letter CONSONANT null ස Sa
U+0DC4 Letter CONSONANT null හ Ha
U+0DC5 Letter CONSONANT null ළ Lla
U+0DC6 Letter CONSONANT null ෆ Fa
U+0DC7 unassigned
U+0DC8 unassigned
U+0DC9 unassigned
U+0DCA Mark [Mn] VIRAMA TOP_POSITION ් Virama
U+0DCB unassigned
U+0DCC unassigned
U+0DCD unassigned
U+0DCE unassigned
U+0DCF Mark [Mc] VOWEL_DEPENDENT RIGHT_POSITION ා Sign Aa
U+0DD0 Mark [Mc] VOWEL_DEPENDENT RIGHT_POSITION ැ Sign Ae
U+0DD1 Mark [Mc] VOWEL_DEPENDENT RIGHT_POSITION ෑ Sign Aae
U+0DD2 Mark [Mn] VOWEL_DEPENDENT TOP_POSITION ි Sign I
U+0DD3 Mark [Mn] VOWEL_DEPENDENT TOP_POSITION ී Sign Ii
U+0DD4 Mark [Mn] VOWEL_DEPENDENT BOTTOM_POSITION ු Sign U
U+0DD5 unassigned
U+0DD6 Mark [Mn] VOWEL_DEPENDENT BOTTOM_POSITION ූ Sign Uu
U+0DD7 unassigned
U+0DD8 Mark [Mc] VOWEL_DEPENDENT RIGHT_POSITION ෘ Sign Vocalic R
U+0DD9 Mark [Mc] VOWEL_DEPENDENT LEFT_POSITION ෙ Sign E
U+0DDA Mark [Mc] VOWEL_DEPENDENT TOP_AND_LEFT_POSITION ේ Sign Ee
U+0DDB Mark [Mc] VOWEL_DEPENDENT LEFT_POSITION ෛ Sign Ai
U+0DDC Mark [Mc] VOWEL_DEPENDENT LEFT_AND_RIGHT_POSITION ො Sign O
U+0DDD Mark [Mc] VOWEL_DEPENDENT TOP_LEFT_AND_RIGHT_POSITION ෝ Sign Oo
U+0DDE Mark [Mc] VOWEL_DEPENDENT LEFT_AND_RIGHT_POSITION ෞ Sign Au
U+0DDF Mark [Mc] VOWEL_DEPENDENT RIGHT_POSITION ෟ Sign Vocalic L
U+0DE0 unassigned
U+0DE1 unassigned
U+0DE2 unassigned
U+0DE3 unassigned
U+0DE4 unassigned
U+0DE5 unassigned
U+0DE6 Number NUMBER null ෦ Digit Zero
U+0DE7 Number NUMBER null ෧ Digit One
U+0DE8 Number NUMBER null ෨ Digit Two
U+0DE9 Number NUMBER null ෩ Digit Three
U+0DEA Number NUMBER null ෪ Digit Four
U+0DEB Number NUMBER null ෫ Digit Five
U+0DEC Number NUMBER null ෬ Digit Six
U+0DED Number NUMBER null ෭ Digit Seven
U+0DEE Number NUMBER null ෮ Digit Eight
U+0DEF Number NUMBER null ෯ Digit Nine
U+0DF0 unassigned
U+0DF1 unassigned
U+0DF2 Mark [Mc] VOWEL_DEPENDENT RIGHT_POSITION ෲ Sign Vocalic Rr
U+0DF3 Mark [Mc] VOWEL_DEPENDENT RIGHT_POSITION ෳ Sign Vocalic Ll
U+0DF4 Punctuation null null ෴ Kunddaliya
U+0DF5 unassigned
U+0DF6 unassigned
U+0DF7 unassigned
U+0DF8 unassigned
U+0DF9 unassigned
U+0DFA unassigned
U+0DFB unassigned
U+0DFC unassigned
U+0DFD unassigned
U+0DFE unassigned
U+0DFF unassigned
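
As a minimal sketch of how the Sinhala table above could be consulted at run time, the following Python transcribes the Shaping class column into inclusive codepoint ranges and expands them into a lookup dictionary. The names `SINHALA_RANGES`, `SINHALA_HOLES`, and `sinhala_shaping_class` are hypothetical and chosen only for illustration; a real shaping engine may store this data quite differently.

```python
# Inclusive codepoint ranges and their shaping classes, transcribed from the
# Sinhala table above. All names here are illustrative, not an engine API.
SINHALA_RANGES = [
    (0x0D81, 0x0D82, "BINDU"),
    (0x0D83, 0x0D83, "VISARGA"),
    (0x0D85, 0x0D96, "VOWEL_INDEPENDENT"),
    (0x0D9A, 0x0DC6, "CONSONANT"),
    (0x0DCA, 0x0DCA, "VIRAMA"),
    (0x0DCF, 0x0DDF, "VOWEL_DEPENDENT"),
    (0x0DE6, 0x0DEF, "NUMBER"),
    (0x0DF2, 0x0DF3, "VOWEL_DEPENDENT"),
]

# Unassigned codepoints that fall inside the ranges above.
SINHALA_HOLES = {0x0DB2, 0x0DBC, 0x0DBE, 0x0DBF, 0x0DD5, 0x0DD7}

# Expand the ranges into a per-codepoint dictionary, skipping the holes.
SINHALA_SHAPING_CLASS = {
    cp: cls
    for lo, hi, cls in SINHALA_RANGES
    for cp in range(lo, hi + 1)
    if cp not in SINHALA_HOLES
}


def sinhala_shaping_class(ch):
    """Return the shaping class of a character, or None if the table
    lists it as unassigned or gives it a null shaping class."""
    return SINHALA_SHAPING_CLASS.get(ord(ch))
```

In this sketch, unassigned codepoints and assigned codepoints with a null shaping class (such as U+0DF4) both map to `None`; an engine that needs to tell the two apart would have to track that separately.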

Sinhala Archaic Numbers character table

Sinhala text runs may also include glyphs from the Sinhala Archaic Numbers block. These characters should be classified as follows.

Codepoint Unicode category Shaping class Mark-placement subclass Glyph
U+111E0 unassigned
U+111E1 Number NUMBER null 𑇡 Archaic Digit One
U+111E2 Number NUMBER null 𑇢 Archaic Digit Two
U+111E3 Number NUMBER null 𑇣 Archaic Digit Three
U+111E4 Number NUMBER null 𑇤 Archaic Digit Four
U+111E5 Number NUMBER null 𑇥 Archaic Digit Five
U+111E6 Number NUMBER null 𑇦 Archaic Digit Six
U+111E7 Number NUMBER null 𑇧 Archaic Digit Seven
U+111E8 Number NUMBER null 𑇨 Archaic Digit Eight
U+111E9 Number NUMBER null 𑇩 Archaic Digit Nine
U+111EA Number NUMBER null 𑇪 Archaic Number Ten
U+111EB Number NUMBER null 𑇫 Archaic Number 20
U+111EC Number NUMBER null 𑇬 Archaic Number 30
U+111ED Number NUMBER null 𑇭 Archaic Number 40
U+111EE Number NUMBER null 𑇮 Archaic Number 50
U+111EF Number NUMBER null 𑇯 Archaic Number 60
U+111F0 Number NUMBER null 𑇰 Archaic Number 70
U+111F1 Number NUMBER null 𑇱 Archaic Number 80
U+111F2 Number NUMBER null 𑇲 Archaic Number 90
U+111F3 Number NUMBER null 𑇳 Archaic Number 100
U+111F4 Number NUMBER null 𑇴 Archaic Number 1000
U+111F5 unassigned
U+111F6 unassigned
U+111F7 unassigned
U+111F8 unassigned
U+111F9 unassigned
U+111FA unassigned
U+111FB unassigned
U+111FC unassigned
U+111FD unassigned
U+111FE unassigned
U+111FF unassigned
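
Because the NUMBER class matters during syllable identification, it can be convenient to test for it across both blocks at once. The sketch below assumes only the two NUMBER ranges listed in the tables above; the function name `is_number_class` is hypothetical.

```python
def is_number_class(ch):
    """Return True if the tables above assign the NUMBER shaping class
    to this character (Sinhala digits or Sinhala Archaic Numbers)."""
    cp = ord(ch)
    # U+0DE6..U+0DEF: Sinhala digits zero through nine.
    # U+111E1..U+111F4: Sinhala Archaic Numbers.
    return 0x0DE6 <= cp <= 0x0DEF or 0x111E1 <= cp <= 0x111F4
```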

Vedic Extensions character table

Sanskrit runs written in the Sinhala script may also include characters from the Vedic Extensions block. These characters should be classified as follows.

Note: See the Vedic Extensions document for additional information.

Codepoint Unicode category Shaping class Mark-placement subclass Glyph
U+1CD0 Mark [Mn] CANTILLATION TOP_POSITION ᳐ Tone Karshana
U+1CD1 Mark [Mn] CANTILLATION TOP_POSITION ᳑ Tone Shara
U+1CD2 Mark [Mn] CANTILLATION TOP_POSITION ᳒ Tone Prenkha
U+1CD3 Punctuation null null ᳓ Sign Nihshvasa
U+1CD4 Mark [Mn] CANTILLATION OVERSTRUCK ᳔ Tone Midline Svarita
U+1CD5 Mark [Mn] CANTILLATION BOTTOM_POSITION ᳕ Tone Aggravated Independent Svarita
U+1CD6 Mark [Mn] CANTILLATION BOTTOM_POSITION ᳖ Tone Independent Svarita
U+1CD7 Mark [Mn] CANTILLATION BOTTOM_POSITION ᳗ Tone Kathaka Independent Svarita
U+1CD8 Mark [Mn] CANTILLATION BOTTOM_POSITION ᳘ Tone Candra Below
U+1CD9 Mark [Mn] CANTILLATION BOTTOM_POSITION ᳙ Tone Kathaka Independent Svarita Schroeder
U+1CDA Mark [Mn] CANTILLATION TOP_POSITION ᳚ Tone Double Svarita
U+1CDB Mark [Mn] CANTILLATION TOP_POSITION ᳛ Tone Triple Svarita
U+1CDC Mark [Mn] CANTILLATION BOTTOM_POSITION ᳜ Tone Kathaka Anudatta
U+1CDD Mark [Mn] CANTILLATION BOTTOM_POSITION ᳝ Tone Dot Below
U+1CDE Mark [Mn] CANTILLATION BOTTOM_POSITION ᳞ Tone Two Dots Below
U+1CDF Mark [Mn] CANTILLATION BOTTOM_POSITION ᳟ Tone Three Dots Below
U+1CE0 Mark [Mn] CANTILLATION TOP_POSITION ᳠ Tone Rigvedic Kashmiri Independent Svarita
U+1CE1 Mark [Mc] CANTILLATION RIGHT_POSITION ᳡ Tone Atharavedic Independent Svarita
U+1CE2 Mark [Mn] AVAGRAHA OVERSTRUCK ᳢ Sign Visarga Svarita
U+1CE3 Mark [Mn] null OVERSTRUCK ᳣ Sign Visarga Udatta
U+1CE4 Mark [Mn] null OVERSTRUCK ᳤ Sign Reversed Visarga Udatta
U+1CE5 Mark [Mn] null OVERSTRUCK ᳥ Sign Visarga Anudatta
U+1CE6 Mark [Mn] null OVERSTRUCK ᳦ Sign Reversed Visarga Anudatta
U+1CE7 Mark [Mn] null OVERSTRUCK ᳧ Sign Visarga Udatta With Tail
U+1CE8 Mark [Mn] AVAGRAHA OVERSTRUCK ᳨ Sign Visarga Anudatta With Tail
U+1CE9 Letter SYMBOL null ᳩ Sign Anusvara Antargomukha
U+1CEA Letter null null ᳪ Sign Anusvara Bahirgomukha
U+1CEB Letter null null ᳫ Sign Anusvara Vamagomukha
U+1CEC Letter SYMBOL null ᳬ Sign Anusvara Vamagomukha With Tail
U+1CED Mark [Mn] AVAGRAHA BOTTOM_POSITION ᳭ Sign Tiryak
U+1CEE Letter SYMBOL null ᳮ Sign Hexiform Long Anusvara
U+1CEF Letter null null ᳯ Sign Long Anusvara
U+1CF0 Letter null null ᳰ Sign Rthang Long Anusvara
U+1CF2 Letter CONSONANT_DEAD null ᳲ Sign Ardhavisarga
U+1CF3 Letter CONSONANT_DEAD null ᳳ Sign Rotated Ardhavisarga
U+1CF3 Mark [Mc] VISARGA null ᳳ Sign Rotated Ardhavisarga
U+1CF4 Mark [Mn] CANTILLATION TOP_POSITION ᳴ Tone Candra Above
U+1CF5 Letter CONSONANT_WITH_STACKER null ᳵ Sign Jihvamuliya
U+1CF6 Letter CONSONANT_WITH_STACKER null ᳶ Sign Upadhmaniya
U+1CF7 Mark [Mc] null null ᳷ Sign Atikrama
U+1CF8 Mark [Mn] CANTILLATION null ᳸ Tone Ring Above
U+1CF9 Mark [Mn] CANTILLATION null ᳹ Tone Double Ring Above
U+1CFA Letter PLACEHOLDER null ᳺ Sign Double Anusvara Antargomukha
U+1CFB unassigned
U+1CFC unassigned
U+1CFD unassigned
U+1CFE unassigned
U+1CFF unassigned

Miscellaneous character table

Other important characters that may be encountered when shaping runs of Sinhala text include the dotted-circle placeholder (U+25CC), the zero-width joiner (U+200D) and zero-width non-joiner (U+200C), and the no-break space (U+00A0).

The dotted-circle placeholder is frequently used when displaying a dependent vowel (matra) or a combining mark in isolation. Real-world text syllables may also use other characters, such as hyphens or dashes, in a similar placeholder fashion; shaping engines should cope with this situation gracefully.

Codepoint Unicode category Shaping class Mark-placement subclass Glyph
U+00A0 Separator PLACEHOLDER null   No-break space
U+200C Other NON_JOINER null ‌ Zero-width non-joiner
U+200D Other JOINER null ‍ Zero-width joiner
U+2010 Punctuation PLACEHOLDER null ‐ Hyphen
U+2011 Punctuation PLACEHOLDER null ‑ No-break hyphen
U+2012 Punctuation PLACEHOLDER null ‒ Figure dash
U+2013 Punctuation PLACEHOLDER null – En dash
U+2014 Punctuation PLACEHOLDER null — Em dash
U+25CC Symbol DOTTED_CIRCLE null ◌ Dotted circle
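
The miscellaneous table above is small enough to express directly as a dictionary. The following sketch does so and adds a helper that reports whether a character can stand in as a base for an isolated mark or matra, which is one way of coping with hyphens and dashes used as placeholders. The names `MISC_SHAPING_CLASS` and `can_carry_isolated_mark` are hypothetical.

```python
# Shaping classes from the miscellaneous table above.
MISC_SHAPING_CLASS = {
    0x00A0: "PLACEHOLDER",    # No-break space
    0x200C: "NON_JOINER",     # Zero-width non-joiner
    0x200D: "JOINER",         # Zero-width joiner
    0x2010: "PLACEHOLDER",    # Hyphen
    0x2011: "PLACEHOLDER",    # No-break hyphen
    0x2012: "PLACEHOLDER",    # Figure dash
    0x2013: "PLACEHOLDER",    # En dash
    0x2014: "PLACEHOLDER",    # Em dash
    0x25CC: "DOTTED_CIRCLE",  # Dotted circle
}


def can_carry_isolated_mark(ch):
    """Return True if the character is classed as a placeholder or as the
    dotted circle, and so may act as a base for an isolated mark."""
    return MISC_SHAPING_CLASS.get(ord(ch)) in ("PLACEHOLDER", "DOTTED_CIRCLE")
```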

The zero-width joiner (ZWJ) is used to request the subjoined form of a consonant. The sequence "Consonant_1,Halant,ZWJ,Consonant_2" is used to specify the subjoined form of "Consonant_2".
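
As an illustration of the sequence described above, the following sketch assembles "Consonant_1,Halant,ZWJ,Consonant_2" as a Python string, taking the Halant to be the Sinhala Virama (U+0DCA) from the table. The helper name `request_subjoined` is hypothetical; whether a subjoined form is actually rendered depends on the font.

```python
HALANT = "\u0DCA"  # Sinhala Virama, referred to as "Halant" in the text
ZWJ = "\u200D"     # Zero-width joiner


def request_subjoined(consonant_1, consonant_2):
    """Build the sequence that requests the subjoined form of consonant_2."""
    return consonant_1 + HALANT + ZWJ + consonant_2


# Example: Ka (U+0D9A) followed by a requested subjoined Ya (U+0DBA).
seq = request_subjoined("\u0D9A", "\u0DBA")
```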

A secondary usage of the zero-width joiner is to explicitly request the formation of "Reph". An initial "Ra,Halant,ZWJ" sequence should produce a "Reph".
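
A simple check for the explicit "Reph" request could look like the sketch below. It assumes that "initial" means the very start of the string passed in; a real engine would apply the test at the start of each syllable. The function name is hypothetical.

```python
RA = "\u0DBB"      # Sinhala letter Ra
HALANT = "\u0DCA"  # Sinhala Virama, referred to as "Halant" in the text
ZWJ = "\u200D"     # Zero-width joiner


def starts_with_explicit_reph(syllable):
    """Return True if the syllable begins with the "Ra,Halant,ZWJ" sequence."""
    return syllable.startswith(RA + HALANT + ZWJ)
```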

The zero-width non-joiner (ZWNJ) is not used in shaping runs of Sinhala text. The ZWNJ is referenced below in various regular expressions and shaping rules, however, because it is used by other Indic scripts.

The no-break space (NBSP) is primarily used to display those codepoints that are defined as non-spacing (marks, dependent vowels (matras), below-base consonant forms, and post-base consonant forms) in an isolated context, as an alternative to displaying them superimposed on the dotted-circle placeholder. These sequences will match "NBSP,ZWJ,Halant,Consonant", "NBSP,mark", or "NBSP,matra".
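
The three NBSP sequences can be sketched as a single regular expression. The character classes below are assumptions drawn from the Sinhala table: "Consonant" covers the CONSONANT rows, "matra" covers the VOWEL_DEPENDENT rows, and "mark" is taken here to mean only the BINDU and VISARGA rows (U+0D81 to U+0D83). The pattern name `ISOLATED_FORM` is hypothetical.

```python
import re

NBSP = "\u00A0"
ZWJ = "\u200D"
HALANT = "\u0DCA"  # Sinhala Virama, referred to as "Halant" in the text

# Character classes transcribed from the Sinhala table (assumed scope).
CONSONANT = "[\u0D9A-\u0DB1\u0DB3-\u0DBB\u0DBD\u0DC0-\u0DC6]"
MATRA = "[\u0DCF-\u0DD4\u0DD6\u0DD8-\u0DDF\u0DF2\u0DF3]"
MARK = "[\u0D81-\u0D83]"

# "NBSP,ZWJ,Halant,Consonant", "NBSP,mark", or "NBSP,matra".
ISOLATED_FORM = re.compile(
    f"{NBSP}(?:{ZWJ}{HALANT}{CONSONANT}|{MARK}|{MATRA})"
)


def is_isolated_form(text):
    """Return True if the whole string is one of the NBSP sequences."""
    return ISOLATED_FORM.fullmatch(text) is not None
```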