[Myanmar] Syllable reordering #165

Open
wezm opened this issue Jul 17, 2024 · 10 comments

@wezm

wezm commented Jul 17, 2024

I've encountered an issue with the reordering description that results in MYANMAR DOT BELOW being moved when it probably should stay where it is.

The problem is that "All right-side and above-base dependent-vowel (matra) signs are tagged POS_AFTER_SUBJOINED." but DOT BELOW ends up with POS_AFTER_MAIN, which comes before POS_AFTER_SUBJOINED in the sort order.

Perhaps DOT BELOW is supposed to be picked up in step 7 but then in a syllable like "မော့" (U+1019 U+1031 U+102C U+1037) it's unclear what that step would set the pos to.
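The conflict can be reproduced with a minimal sketch of the sort. The `Pos` names and numeric values below are illustrative assumptions, not the document's actual constants. Only the relative order (POS_AFTER_MAIN sorting before POS_AFTER_SUBJOINED) and the tags on U+102C and U+1037 come from the description above; the base and pre-base tags are assumed.

```python
from enum import IntEnum


class Pos(IntEnum):
    # Hypothetical values; only the relative order matters here.
    PREBASE_MATRA = 0
    BASE = 1
    AFTER_MAIN = 2
    AFTER_SUBJOINED = 3


# "မော့" as typed: MA, VOWEL SIGN E, VOWEL SIGN AA, DOT BELOW
tagged = [
    (0x1019, Pos.BASE),
    (0x1031, Pos.PREBASE_MATRA),    # left-side matra (assumed tag)
    (0x102C, Pos.AFTER_SUBJOINED),  # right-side matra, per the doc
    (0x1037, Pos.AFTER_MAIN),       # where DOT BELOW currently ends up
]


def reorder(syllable):
    """Stable sort of (codepoint, position) pairs by position tag."""
    return [cp for cp, _ in sorted(syllable, key=lambda item: item[1])]


result = reorder(tagged)
print(" ".join(f"U+{cp:04X}" for cp in result))
```

With these tags the sort places U+1037 ahead of U+102C, which is the unwanted move of DOT BELOW away from the end of the syllable.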

@n8willis n8willis self-assigned this Jul 23, 2024
@n8willis
Owner

This is definitely something else that has needed a fresh look. The doc currently does not address the tone marks ... I evidently left a comment to that effect and had commented out the _SMVD_ position (which would be last) at that time. There was also a rewrite of the Indic2 "closest non-mark character" language that would probably help here. Since U+102C is spacing, that probably means it attracts / collects the Dot Below.

Complicating that is that HarfBuzz treats all Myanmar tone marks as _NUKTA_. From what I have read so far, Tamil also permits a Nukta to attach to a spacing vowel mark, so that might be where there is some shared logic. I'm still taking a look; sorry to not have a clear reply.

@wezm
Author

wezm commented Jul 23, 2024

> I'm still taking a look; sorry to not have a clear reply.

No worries. No doubt it's all quite complicated to work out.

@n8willis
Owner

I'm going to go ahead and push some SVG image changes that I was intermittently working on before; in theory they're better than PNGs for zooming in to see details, but mainly I don't want to attempt to fix the text and then try to re-do the merge.

@n8willis
Owner

Okay; went on a little tangent about the NUKTA, apparently. There's a good rationale for the way HarfBuzz classifies U+1037 as NUKTA: UCD's DerivedCombiningClass.txt groups it there. So that's good.
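The grouping can be checked from Python, since `unicodedata.combining()` exposes the canonical combining class that DerivedCombiningClass.txt is derived from. This is only a quick sanity check, not part of the shaping logic; the Devanagari nukta (U+093C) is included purely for comparison.

```python
import unicodedata

DOT_BELOW = "\u1037"         # MYANMAR SIGN DOT BELOW
DEVANAGARI_NUKTA = "\u093C"  # DEVANAGARI SIGN NUKTA, for comparison
VOWEL_SIGN_AA = "\u102C"     # MYANMAR VOWEL SIGN AA

# Canonical combining class 7 is the Nukta class.
ccc_dot_below = unicodedata.combining(DOT_BELOW)
ccc_nukta = unicodedata.combining(DEVANAGARI_NUKTA)

# General category "Mc" marks U+102C as a spacing combining mark.
category_aa = unicodedata.category(VOWEL_SIGN_AA)

print(ccc_dot_below, ccc_nukta, category_aa)
```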

It just happens to complicate inspecting test fonts. (This issue against Noto Myanmar (Sans and Serif) looks to be related, but I think it is actually caused by Noto ligating 102f,1037, which HarfBuzz calls unsafe ... and has some apparent workarounds (?) involving inserting narrow spaces. Padauk doesn't do that, and seems like a better reference.)

That being a little clearer to me now, I think that HarfBuzz is taking a different approach to the post-base tagging in step 2 (being more selective with what gets POS_AFTER_SUBJOINED, leaving more things POS_AFTER_MAIN) and maybe that's worth a try. With fewer things tagged _AFTER_SUB, there should be fewer conflicts.

@n8willis
Owner

n8willis commented Aug 4, 2024

I think I've got a way to untangle the reordering problem sorted out (initially, at least) in a way that makes sense to me.

Please have a look at stage 2 in #168 and let me know both whether it reads correctly and whether it sounds like it applies. It's basically just streamlining the post-base logic. Admittedly, it can't quite work in isolation from fixing up the regular expressions for syllable-matching, but they kind of go together in this case, I think. Meaning that, because the syllable matching is so complex with the post-base clusters, valid syllables that match do not need as much reordering on the post-base subsequences. They'll match the GSUB/GPOS rules because they wouldn't get identified as syllables otherwise.

The logic in the PR basically splits things at "has post-base matras" or "doesn't have post-base matras", in new steps 5 and 6, but it also adds in treatment of variation selectors, which I had not addressed previously.

(Less importantly, this change also removes a reference to "final reordering" earlier on that was clearly an artifact from reduplicating Indic2 doc structure.)

@wezm
Author

wezm commented Aug 5, 2024

I think the changes in #168 sound like they apply. Skimming over my code it seems to match up pretty well. Some small comments:

> Stage 2, step 4: Pre-base matras
> Fourth, all left-side dependent-vowel (matra) signs must be tagged to be moved to the beginning of the syllable

It would be good if this step used the shaping classes/mark-placement subclass/regex classes defined earlier to make it clear which things it should match.

> Fifth, if the syllable contains no below-base dependent-vowel (matra) signs

Similar for this description.

@n8willis
Owner

n8willis commented Aug 7, 2024

Sounds good.

One thing I'm still a little less clear about is what's desirable if there are multiple below-base vowels in a row and one in the middle has an anusvara on it.... If the anusvaras get moved together, that sounds like something is getting lost, but I'm not sure if I understand how it would change pronunciation on a string of multiple vowels (or, indeed, if it actually happens).

Is there tooling for searching the text corpus for something like that?

@n8willis
Owner

n8willis commented Aug 7, 2024

> how it would change pronunciation on a string of multiple vowels

This meaning "if a single anusvara would automatically apply to the entire vowel subsequence, then it's fine to move as long as it stays with the below-base vowels generally."

@wezm
Author

wezm commented Aug 8, 2024

> Is there tooling for searching the text corpus for something like that?

Not that I know of, but I cobbled something together to see if I could find multiple below-base vowels in a row. As far as I can tell, none of them had Anusvara in the middle, only after the last vowel. Such as:

  • သေုံး
  • သျှံး
  • လူုံ
  • ဥူုံ
  • ခြုုံ
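A search along these lines could be sketched as follows. This is not the tool used above, just one possible approach. It assumes the only below-base vowels of interest are U+102F and U+1030 (a simplification), and the helper name `classify` is hypothetical.

```python
import re

# Assumption: only VOWEL SIGN U and UU count as below-base vowels here.
BELOW_VOWELS = "\u102F\u1030"
ANUSVARA = "\u1036"

# A run of two or more below-base vowels and/or anusvaras.
RUN = re.compile(f"[{BELOW_VOWELS}{ANUSVARA}]{{2,}}")


def classify(word):
    """Report where an anusvara sits relative to a run of 2+ below-base vowels.

    Returns "middle", "after", or "absent" for the first such run,
    or None if the word has no run of two or more below-base vowels.
    """
    for match in RUN.finditer(word):
        run = match.group()
        vowels = [i for i, ch in enumerate(run) if ch in BELOW_VOWELS]
        if len(vowels) < 2:
            continue
        first, last = vowels[0], vowels[-1]
        if ANUSVARA in run[first:last]:
            return "middle"
        if ANUSVARA in run[last:]:
            return "after"
        return "absent"
    return None


# LA + UU + U + ANUSVARA: the anusvara follows the last vowel.
print(classify("\u101C\u1030\u102F\u1036"))
```

Running `classify` over each word of a corpus and counting the "middle" results would show whether an anusvara ever appears between two below-base vowels.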

@n8willis
Owner

I was a little surprised that Python's unicodedata doesn't include ISC and some of those other properties.... But it seems like the sort of thing that, if you were to volunteer to add it, you'd shortly afterward be saddled with maintaining the module in perpetuity....
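In the meantime, one workaround is to parse the UCD data file directly. The sketch below assumes the usual UCD layout of `range ; value # comment` lines; the sample lines are written in the format of IndicSyllabicCategory.txt for illustration and are not a complete copy of the file, and `parse_ucd_property` is a hypothetical helper name.

```python
# Sample lines in the format of IndicSyllabicCategory.txt (illustrative subset).
SAMPLE = """\
# Indic_Syllabic_Category sample
102B..102C    ; Vowel_Dependent
102D..1030    ; Vowel_Dependent
1036          ; Bindu
"""


def parse_ucd_property(text):
    """Map each code point to its property value from UCD-style lines."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code_range, value = (part.strip() for part in line.split(";", 1))
        start, _, end = code_range.partition("..")
        low = int(start, 16)
        high = int(end or start, 16)  # a single code point has no ".." part
        for cp in range(low, high + 1):
            table[cp] = value
    return table


ISC = parse_ucd_property(SAMPLE)
print(ISC.get(0x102F), ISC.get(0x1036))
```

In practice the text would come from the real data file rather than an inline sample.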

I made an update to the text to reference those classes; I'm still looking at whether I need to do that with the _consonant_/CONSONANT references, though, as per the other issue.
