An attempt to improve performance of terminal/parser #17336

flyingcat · 2024-05-30T23:18:38Z

Description of the new feature/enhancement

Currently StateMachine use many if-else branches which may not an efficient way to implement a state machine. It could be improved by using a table approach.

Proposed technical implementation details (optional)

A quick and dirty implement lays on https://github.com/flyingcat/terminal/tree/vtparser

v1
- Use codegen.ps1 to generate a function pointer table to execute state transition.
- Only need to modify stateMachine.hpp|cpp. It has a demo build on vtparser_apply branch.
v2
- Merge OutputStateMachineEngine and use CRTP to reduce virtual calls, as the separation seems only for test and actually no need for runtime dispatch.
- Instead of using switch-case to dispatch actions on ActionExecute and ActionCsiDispatch, directly dispatch them on corresponding char.
- Unfortunately performance improvement is not as much as I expected and it affects many code. So may not worth it.

Some surprise:

Indirect functions calls are slower on x64 than ARM64. A possible optimization is eagerly pulling chars if easy branch prediction like _ActionParam.
On ARM64, the cost of tracing is heavy. Remove TraceOnEvent and TraceCharInput will reduce a lot of time.

Benchmarks (*_P for plain text, *_V for nvim vt output):

x64                                     ARM64                                   ARM64 no heavy tracing
-------------------------------         -------------------------------         -------------------------------
> VT_EN_P : 2.56 MB     Passed.         > VT_EN_P : 2.56 MB     Passed.         > VT_EN_P : 2.56 MB     Passed.
parser 0: 4907.81 us                    parser 0: 5061.98 us                    parser 0: 5052.18 us
parser 1: 4869.68 us, +0.8%             parser 1: 5110.27 us, -0.9%             parser 1: 4376.73 us, +15.4%
parser 2: 4450.93 us, +10.3%            parser 2: 4330.88 us, +16.9%            parser 2: 3803.58 us, +32.8%
> VT_EN_V : 11.36 MB    Passed.         > VT_EN_V : 11.36 MB    Passed.         > VT_EN_V : 11.36 MB    Passed.
parser 0: 150354.27 us                  parser 0: 167648.47 us                  parser 0: 165199.61 us
parser 1: 104640.22 us, +43.7%          parser 1: 122205.31 us, +37.2%          parser 1: 88889.78 us, +85.8%
parser 2: 98848.33 us, +52.1%           parser 2: 106223.27 us, +57.8%          parser 2: 83710.12 us, +97.3%
> VT_CN_P : 2.12 MB     Passed.         > VT_CN_P : 2.12 MB     Passed.         > VT_CN_P : 2.12 MB     Passed.
parser 0: 2746.15 us                    parser 0: 3009.11 us                    parser 0: 2982.51 us
parser 1: 2735.91 us, +0.4%             parser 1: 3085.54 us, -2.5%             parser 1: 2829.58 us, +5.4%
parser 2: 2613.93 us, +5.1%             parser 2: 2759.70 us, +9.0%             parser 2: 2793.15 us, +6.8%
> VT_CN_V : 11.29 MB    Passed.         > VT_CN_V : 11.29 MB    Passed.         > VT_CN_V : 11.29 MB    Passed.
parser 0: 214276.04 us                  parser 0: 238064.93 us                  parser 0: 236772.92 us
parser 1: 146150.74 us, +46.6%          parser 1: 175675.84 us, +35.5%          parser 1: 127224.79 us, +86.1%
parser 2: 140112.63 us, +52.9%          parser 2: 154611.41 us, +54.0%          parser 2: 121005.83 us, +95.7%

Execute .\bc.exe -v -vc .\bc_data.txt

x64                                     ARM64
-------------------------------         -------------------------------
before                                  before
129.715MB, 9.118s, 14.227MB/s           129.713MB, 9.246s, 14.029MB/s
after                                   after
129.713MB, 8.384s, 15.471MB/s           129.711MB, 8.327s, 15.578MB/s

Sorry for the bad English and messy code. Hope the idea is clear enough.

The text was updated successfully, but these errors were encountered:

lhecker · 2024-05-31T22:46:10Z

This is some of the most impressive stuff I've seen in a while lol. You're using a vim script to turn the typescript compiler into test VT output. The entire branch has so cool ideas and helpful bits! It'll take me a while to go through it and test it out. 🙂

Hope the idea is clear enough.

Very clear actually! Using lookup tables for the parser is something I wanted to do for a very long time. I'm extremely happy that you did it.

I feel like that tables are still superior even if they aren't faster, because it's easier to verify their correctness and debug them. It may take a little while for me to get a chance to look at your code in more detail, as I'm currently working on some difficult changes to ConPTY which will also improve performance a lot (it'll make Windows Terminal >10x faster!).

That aside, these performance numbers look wrong:

129.715MB, 9.118s, 14.227MB/s

You should get at least 100MB/s with our current parser (without lookup tables). If those numbers are from OpenConsole, please ensure you have built it in Release mode and that you don't have node.js running in the background. node.js subscribes to console events which slows down all console applications by 10x. (Even those that don't have anything to do with node.js!)

lhecker · 2024-09-14T14:52:40Z

I was experimenting with something like this today and kind of realized that LUTs for dispatch during parsing isn't quite right: We should instead defer parsing to specialized subroutines to consume as much input as they can. For instance, the CSI parser should be its own subroutine and parse all parameters at once before returning. This gives the largest performance boost for me.

Afterwards, we could dispatch the resulting type of sequence (e.g. based on the terminator) using a LUT of function pointers.

flyingcat · 2024-09-15T12:29:07Z

A bit embarrassing that I'm not sure what you mean. If something like this:

            for (auto end = _currentString.data() + _currentString.size(); p != end; p++)
            {
                if (*p >= L'0' && *p <= '9')
                {
                    _AccumulateTo(*p, currentParameter);
                }
                else
                {
                    break;
                }
            }

I think you are right. The less back to state machine, the more performance gain. At least it's true on my x64 system. Without this, it may even run slower than on ARM64 (AMD 3600 vs 8cx Gen 3). At first I wanted to parse the whole param and subparam list like a regular string parser, but that need writing much my own code, so... I gave up.

Also, good to know you're interested in this. You give too much praise for me. In fact, I dig into the code for another reason and after accidentally seeing this, I can't stop myself from trying to do it. It's literally "quick and dirty" and I'm sure you can do much better. I should post these after your first comment, just too lazy to that, as I need a translation tool to translate my thoughts.

lhecker · 2024-09-16T15:20:06Z

A bit embarrassing that I'm not sure what you mean. If something like this: [...]

Yeah, pretty much exactly like that. I wrote a small tokenizer for ENABLE_VIRTUAL_TERMINAL_INPUT over the weekend: https://gist.github.com/lhecker/8553eb80569900dcf1f9e83e88360cd5#file-token-c-L79-L217

It doesn't support many of our StateMachine features yet (I only support 3 out of these for instance), but it runs at ~1.5GB/s on my CPU so I think it may be the right path.

flyingcat added the Issue-Feature Complex enough to require an in depth planning process and actual budgeted, scheduled work. label May 30, 2024

microsoft-github-policy-service bot added Needs-Tag-Fix Doesn't match tag requirements Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting labels May 30, 2024

j4james mentioned this issue Jun 13, 2024

Add support for Sixel images in conhost #17421

Merged

lhecker self-assigned this Jun 13, 2024

lhecker added this to the Terminal v1.22 milestone Jun 13, 2024

carlos-zamora added Product-Conhost For issues in the Console codebase Area-VT Virtual Terminal sequence support and removed Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting Needs-Tag-Fix Doesn't match tag requirements labels Jul 10, 2024

lhecker modified the milestones: Terminal v1.22, Terminal v1.23 Aug 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

An attempt to improve performance of terminal/parser #17336

An attempt to improve performance of terminal/parser #17336

flyingcat commented May 30, 2024

lhecker commented May 31, 2024

lhecker commented Sep 14, 2024 •

edited

Loading

flyingcat commented Sep 15, 2024

lhecker commented Sep 16, 2024

An attempt to improve performance of terminal/parser #17336

An attempt to improve performance of terminal/parser #17336

Comments

flyingcat commented May 30, 2024

Description of the new feature/enhancement

Proposed technical implementation details (optional)

lhecker commented May 31, 2024

lhecker commented Sep 14, 2024 • edited Loading

flyingcat commented Sep 15, 2024

lhecker commented Sep 16, 2024

lhecker commented Sep 14, 2024 •

edited

Loading