Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fine-grained DWrite text analysis based on text complexity #9156

Closed
skyline75489 opened this issue Feb 14, 2021 · 10 comments · Fixed by #9202
Closed

Fine-grained DWrite text analysis based on text complexity #9156

skyline75489 opened this issue Feb 14, 2021 · 10 comments · Fixed by #9202
Labels
Area-Rendering Text rendering, emoji, complex glyph & font-fallback issues Issue-Task It's a feature request, but it doesn't really need a major design. Needs-Tag-Fix Doesn't match tag requirements Product-Conhost For issues in the Console codebase Product-Terminal The new Windows Terminal.
Milestone

Comments

@skyline75489
Copy link
Collaborator

skyline75489 commented Feb 14, 2021

Description of the new feature/enhancement

Inspired by microsoft/cascadia-code#411, certain ASCII characters sometimes break the simplicity of the entire text, depending on the font being used. The current implementation skips dwrite analysis when the entire text is simple:

if (!_isEntireTextSimple)
{
    // Call each of the analyzers in sequence, recording their results.
    RETURN_IF_FAILED(_fontRenderData->Analyzer()->AnalyzeLineBreakpoints(this, 0, textLength, this));
    RETURN_IF_FAILED(_fontRenderData->Analyzer()->AnalyzeBidi(this, 0, textLength, this));
    RETURN_IF_FAILED(_fontRenderData->Analyzer()->AnalyzeScript(this, 0, textLength, this));
    RETURN_IF_FAILED(_fontRenderData->Analyzer()->AnalyzeNumberSubstitution(this, 0, textLength, this));
    // Perform our custom font fallback analyzer that mimics the pattern of the real analyzers.
    RETURN_IF_FAILED(_AnalyzeFontFallback(this, 0, textLength));
}

With for example Fira Code, in most cases the optimization only applies to lines with 120 spaces, which is not good.

Proposed technical implementation details (optional)

GetTextComplexity can provide a breakdown report of the text, showing which specific range of the text is simple, we should be able to utilize it like this:

for (auto range : complexRanges)
{
    // Call each of the analyzers in sequence, recording their results.
    RETURN_IF_FAILED(_fontRenderData->Analyzer()->AnalyzeLineBreakpoints(this, range, this));
    RETURN_IF_FAILED(_fontRenderData->Analyzer()->AnalyzeBidi(this, range , this));
    RETURN_IF_FAILED(_fontRenderData->Analyzer()->AnalyzeScript(this, range , this));
    RETURN_IF_FAILED(_fontRenderData->Analyzer()->AnalyzeNumberSubstitution(this, range, this));
    // Perform our custom font fallback analyzer that mimics the pattern of the real analyzers.
    RETURN_IF_FAILED(_AnalyzeFontFallback(this, range));
}

See #6695 for the introduction of text complexity analysis.

@skyline75489 skyline75489 added the Issue-Feature Complex enough to require an in depth planning process and actual budgeted, scheduled work. label Feb 14, 2021
@ghost ghost added Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting Needs-Tag-Fix Doesn't match tag requirements labels Feb 14, 2021
@skyline75489
Copy link
Collaborator Author

This should also help users who use non-English locales, for example avoid analyze entirely:

版权所有 (C) Microsoft Corporation。保留所有权利。

@skyline75489
Copy link
Collaborator Author

/cc @miniksa for both sanity & technical check

@skyline75489
Copy link
Collaborator Author

I've done some experiment and I found that the text complexity is not the same as run splitting. For example with the following text:

版权所有 (C) Microsoft Corporation。保留所有权利。

The text complexity analysis reports (a, b is pos, length pair) :

  • 0, 4: Complex
  • 4, 26: Simple
  • 30, 8: Complex
  • 38, 70: Simple

The run analysis split it into the following runs:

  • 0, 6
  • 6, 25
  • 31, 77

We might also need some sort of RLE implementation to find it a run is entire simple and then optimize the shaping process for the run.

@zadjii-msft zadjii-msft added Area-Rendering Text rendering, emoji, complex glyph & font-fallback issues Issue-Task It's a feature request, but it doesn't really need a major design. Product-Conhost For issues in the Console codebase Product-Terminal The new Windows Terminal. and removed Issue-Feature Complex enough to require an in depth planning process and actual budgeted, scheduled work. labels Feb 16, 2021
@ghost ghost removed the Needs-Tag-Fix Doesn't match tag requirements label Feb 16, 2021
@zadjii-msft zadjii-msft added this to the Terminal Backlog milestone Feb 16, 2021
@miniksa
Copy link
Member

miniksa commented Feb 16, 2021

I agree that we should make use of the additional analysis information to improve performance in this way.

I do think that we could just further split the Runs and give them an additional simple-or-not parameter (bool) during the initial _AnalyzeTextComplexity that is just picked up during _AnalyzeRuns to determine the full analysis or skip and again during _ShapeGlyphRuns to determine the quick-mapping or slow-mapping to glyphs. In lieu of the whole thing being simple, a Run would be simple or not.

I'm not quite sure why your example maps as it does. Are some of those characters UTF-16 surrogate pairs?

@skyline75489
Copy link
Collaborator Author

those are just normal Chinese characters. Originally I thought text complexity analysis would split the text the same way as run splitting. Just want to add an example to show that it’s not.

a Run would be simple or no

This is likely undetermined. In the example above:

“版权所有 (”

This is a Run. But according to text complexity, the first 4 characters are complex, the last 2 characters are simple. This is what frustrates me. We can’t just simply know a Run is simple or not easily and optimize based on that.

@miniksa
Copy link
Member

miniksa commented Feb 17, 2021

Yeah but what I'm saying is that we can just call _SetCurrentRun and _SplitCurrentRun inside of _AnalyzeTextComplexity when we start listening to the length of the complexity and add the additional data.

So then you have a [0,4) complex run. [6,8) simple run. [8, 26) simple run. etc. etc.

@skyline75489
Copy link
Collaborator Author

skyline75489 commented Feb 17, 2021 via email

@miniksa
Copy link
Member

miniksa commented Feb 17, 2021

To your questions: oh probably. It's worth a try though to see if it just works. Sometimes the simple answer is "good enough". If it turns out to not be, we can refine further from there. Feel free to try/dig!

@DHowett DHowett removed the Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting label Feb 18, 2021
@ghost ghost added the In-PR This issue has a related PR label Feb 19, 2021
@ghost ghost closed this as completed in #9202 Apr 28, 2021
@ghost ghost removed the In-PR This issue has a related PR label Apr 28, 2021
ghost pushed a commit that referenced this issue Apr 28, 2021
This PR aims to optimize the text analysis process by breaking the text
into simple & complex runs according to the result of
`GetTextComplexity`. For simple runs, we can skip certain processing
steps to improve the analysis performance.

Previous to this PR, we rely on the result of `AnalyzeBidi`,
`AnalyzeScript` and `AnalyzeNumberSubstitution` to both break the text
into different runs and attach the corresponding
bidi/script/number_substitution information to the run. Thanks to #6695
we have the chance to skip the expensive analysis process when we found
the *entire text* is determined to be simple.

Inspired by microsoft/cascadia-code#411 and
discussions in #9156, I found that the "entire text simplicity" is often
hard to meet. In order to fully utilize the complexity information of
the text, we need to first break the text into simple & complex ranges.
These ranges are also the initial runs prior to the
bidi/script/number_substitution analysis. This way we can skip the text
analysis for simple runs to speed up the process.

VALIDATION
Build & run cmatrix, cacafire, cat big.txt with it.

Initial simple run PR: #6695
Closes #9156
@ghost ghost added the Resolution-Fix-Committed Fix is checked in, but it might be 3-4 weeks until a release. label Apr 28, 2021
@skyline75489
Copy link
Collaborator Author

Can we reopen this? #9202 was reverted.

#10036 is a unsuccessful attempt to patch #9202.

@zadjii-msft zadjii-msft reopened this Jul 20, 2021
@zadjii-msft zadjii-msft removed the Resolution-Fix-Committed Fix is checked in, but it might be 3-4 weeks until a release. label Jul 20, 2021
@lhecker
Copy link
Member

lhecker commented Oct 12, 2022

AtlasEngine does this! 💖

@lhecker lhecker closed this as completed Oct 12, 2022
@ghost ghost added the Needs-Tag-Fix Doesn't match tag requirements label Oct 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-Rendering Text rendering, emoji, complex glyph & font-fallback issues Issue-Task It's a feature request, but it doesn't really need a major design. Needs-Tag-Fix Doesn't match tag requirements Product-Conhost For issues in the Console codebase Product-Terminal The new Windows Terminal.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants