Very slow *editing* of large files when tree-sitter is used #3072

Open
kchibisov opened this issue Jul 14, 2022 · 21 comments
Labels
A-tree-sitter Area: Tree-sitter C-bug Category: This is a bug

Comments

@kchibisov

kchibisov commented Jul 14, 2022

Summary

Certain markdown files are very slow to edit, in particular when they contain a lot of lists.
For example, the CHANGELOG.md file from the helix repo is enough to trigger it.

The problem is likely in the grammar here, since neovim with tree-sitter works fine on the same file.

This particular problem could be solved with a slightly optimized but less accurate grammar. However, the issue also manifests itself on large files of any other type; it just requires a much larger file than ~600 LoC (I've used a 50K LoC C++ file that I have around).

To avoid delay and input lag in those scenarios it would make sense to either run tree-sitter only on a part of the buffer (I'm not familiar with how it works) or update it only after typing has ended. Right now you get a latency of around 2 seconds per keystroke on a 50K-line C++ file, which is impossible to work with.
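
For illustration only, a minimal sketch of the "update after typing has ended" idea (the names and the delay are made up; this is not helix code):

use std::time::{Duration, Instant};

// Hypothetical debounce helper: trigger a full tree-sitter re-parse only once
// no edit has arrived for `delay`.
struct ParseDebouncer {
    last_edit: Option<Instant>,
    delay: Duration,
}

impl ParseDebouncer {
    fn new(delay: Duration) -> Self {
        Self { last_edit: None, delay }
    }

    // Call on every keystroke/edit.
    fn note_edit(&mut self) {
        self.last_edit = Some(Instant::now());
    }

    // Poll from the event loop; `true` means it is time to re-run the parser.
    fn should_reparse(&mut self) -> bool {
        match self.last_edit {
            Some(t) if t.elapsed() >= self.delay => {
                self.last_edit = None;
                true
            }
            _ => false,
        }
    }
}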

Reproduction Steps

Try to edit CHANGELOG.md from the helix repo or any large file.

Helix log

No response

Platform

Linux

Terminal Emulator

alacritty

Helix Version

helix 22.05 (3cced1e)

@kchibisov kchibisov added the C-bug Category: This is a bug label Jul 14, 2022
@MDeiml
Contributor

MDeiml commented Jul 14, 2022

Author of the markdown grammar here. It seems that, at the moment, language injections cannot be parsed incrementally. The markdown grammar uses injections a lot, so the problem becomes quite apparent with it. This is what I see at line 846 here:

let layer = self
.layers
.iter_mut()
.find(|(_, layer)| {
layer.depth == depth && // TODO: track parent id instead
layer.config.language == config.language && layer.ranges == ranges
})
.map(|(id, _layer)| id);

Notice how a layer can only be reused if its ranges did not change, but ranges often do change when editing a file. The test here could probably be more lenient.

I didn't look too deeply, but (from my knowledge of tree-sitter) any layer should be fine to "reuse" here, as long as none is reused twice. So something akin to

  1. choosing the layer whose ranges "match best" in some way, then
  2. changing those ranges to the new ones

should work; a rough sketch follows below.
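
Something like this self-contained sketch, where the types and helper names are made up for illustration (this is not helix's actual code):

#[derive(Clone, Copy)]
struct ByteRange {
    start: usize,
    end: usize,
}

// Hypothetical notion of "match best": how many bytes two sets of ranges overlap.
fn overlap_bytes(a: &[ByteRange], b: &[ByteRange]) -> usize {
    let mut total = 0;
    for ra in a {
        for rb in b {
            let start = ra.start.max(rb.start);
            let end = ra.end.min(rb.end);
            total += end.saturating_sub(start);
        }
    }
    total
}

// Pick the index of the not-yet-reused layer whose old ranges overlap the new
// ranges the most; the caller would then overwrite that layer's ranges.
fn best_layer(
    old_layers: &[Vec<ByteRange>],
    reused: &[bool],
    new_ranges: &[ByteRange],
) -> Option<usize> {
    old_layers
        .iter()
        .enumerate()
        .filter(|(i, _)| !reused[*i])
        .max_by_key(|(_, old)| overlap_bytes(old.as_slice(), new_ranges))
        .map(|(i, _)| i)
}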

@archseer
Member

We actually follow tree-sitter-cli/atom logic to update all the markers according to user edits beforehand: https://github.com/helix-editor/helix/blob/8681fb6d9ee06610fac2baafb4ca34f93f9c7f4d/helix-core/src/syntax.rs#L644-L726

So the injection layers are tracked reliably.

We also seem to only use HTML injections & code blocks: https://github.com/helix-editor/helix/blob/master/runtime/queries/markdown/injections.scm

@MDeiml
Contributor

MDeiml commented Jul 15, 2022

Ah ok, I guess you are still using the old version of my parser then (the main branch). You could try updating to the split_parser branch. That one has some major performance improvements, but uses injection very extensively.

@archseer
Member

Ah, good point! I'll see if I can update us to the new branch 👍

kirawi added a commit to kirawi/helix that referenced this issue Jul 19, 2022
@the-mikedavis the-mikedavis added the A-tree-sitter Area: Tree-sitter label Jul 31, 2022
@asb

asb commented Aug 7, 2022

On an 8.4MiB markdown file (I keep daily notes in a single big document), helix 22.05 takes >10s to open it and uses 1GiB of memory after doing so. I'm assuming this is tree-sitter related and the same issue described here, but let me know if you think otherwise.

@asb

asb commented Aug 8, 2022

The following script generates a ~5MiB file that takes >5 seconds to open with Helix. After opening, memory usage (reported via ps_mem) is ~660MiB.

#!/bin/sh

for i in $(seq 10000); do
  cat <<EOF >> big_markdown.md
# Title
## Subtitle
Some notes about things.

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor
incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis
nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore
eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt
in culpa qui officia deserunt mollit anim id est laborum

* A list
  * With a sub-item
* Item 2
* Item 3
* Item 4
* Item 5
  * Item 5.1

EOF
done

@kchibisov
Author

@asb if you disable the grammar for markdown, does that still happen?

@asb

asb commented Aug 8, 2022

@kchibisov I'm not sure what option to change to disable the grammar (I was trying out helix for the first time). Certainly, if I mv big_markdown.md big_markdown.txt and open the version of the file with a .txt extension, it opens instantly and helix consumes a more reasonable 18MiB of RAM.

@MDeiml
Contributor

MDeiml commented Aug 8, 2022

I think the simplest way to fix this is to only run the grammar for injected languages if they are visible (i.e. run it lazily).

Markdown is split into two grammars: block and inline. The block-level grammar is quite fast and produces the ranges on which the inline-level grammar is injected. So the computational cost mostly grows with the amount of inline content parsed.

I'd guess that the memory cost is also from the parsed inline-level trees, so that problem would also be solved in the same way, but I'd have to check that.
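
To make the "only if visible" part concrete, a tiny illustrative sketch (the type and function are made up for this comment, not helix's actual code): keep only the injection ranges that intersect the viewport and hand just those to the injected parser.

#[derive(Clone, Copy)]
struct ByteSpan {
    start: usize,
    end: usize,
}

// Return only the injection ranges that intersect the visible part of the
// buffer; parsing (and memory for the resulting trees) would then be limited
// to what is on screen.
fn visible_injections(injections: &[ByteSpan], viewport: ByteSpan) -> Vec<ByteSpan> {
    injections
        .iter()
        .copied()
        .filter(|r| r.start < viewport.end && r.end > viewport.start)
        .collect()
}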

@MDeiml
Contributor

MDeiml commented Aug 8, 2022

I'm not sure how to best implement something like this, though. Currently the parsing is handled by the Document impl, but for lazy parsing we would need information from the Views that point to that document.

@Yevgnen
Contributor

Yevgnen commented Aug 15, 2022

Having trouble opening rust.org.gz, a 3000-line .org file; it never finishes. After renaming it to abc.xyz to avoid syntax highlighting, opening is instant.

[Screenshot: Screen Shot 2022-08-15 at 14 23 26]

@kirawi
Member

kirawi commented Sep 7, 2022

> I think the simplest way to fix this is to only run the grammar for injected languages if they are visible (i.e. run it lazily).
>
> Markdown is split into two grammars: block and inline. The block-level grammar is quite fast and produces the ranges on which the inline-level grammar is injected. So the computational cost mostly grows with the amount of inline content parsed.
>
> I'd guess that the memory cost is also from the parsed inline-level trees, so that problem would also be solved in the same way, but I'd have to check that.

I think that this could build upon the work in #2857 to keep track of the nodes and lazily highlight injections.

@MDeiml
Contributor

MDeiml commented Oct 13, 2022

Just wanted to mention #4146 here. Maybe that fixed some problems, but probably not all of them.

@kchibisov
Author

That indeed improved the situation, and editing markdown is at least possible with tree-sitter.

Large files still blow up helix, leading to seconds of input lag. Not sure where that should be tracked.

@midnightcodr

With helix 22.12 (34be71fb) I am seeing about 1 second delay per keystroke when doing either insert or search. The file I am editing is a 129MB mysql dump file with 150 very long lines.

The machine I am working on is a MacBook Pro M1 with 16GB of RAM.

@dead10ck
Member

Wondering if #6218 fixed this

@archseer
Member

It probably fixed or improved viewing the file, but insert mode will still be just as slow on large files.

@archseer
Member

> The file I am editing is a 129MB mysql dump file with 150 very long lines.

Yeah, both of these are an issue: you're parsing a very large file, but performance is also not optimized for long lines. Or at least it wasn't before, but maybe @pascalkuthe's rewrite improved things slightly.

@pascalkuthe
Member

pascalkuthe commented Mar 10, 2023

> The file I am editing is a 129MB mysql dump file with 150 very long lines.

> Yeah, both of these are an issue: you're parsing a very large file, but performance is also not optimized for long lines. Or at least it wasn't before, but maybe @pascalkuthe's rewrite improved things slightly.

Right now it shouldn't make much of a difference. However, performance for huge lines will be improved in a future follow-up PR to #5420, where I will implement line chunking. This was omitted from the MVP, but the architecture lends itself well to this optimization (and was designed with it in mind; I even had a mostly finished implementation, but the changes were accidentally deleted by an overwrite from a second helix instance).
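
For anyone curious what line chunking could look like, here is a purely illustrative sketch (not helix's implementation): split one overly long line into bounded, UTF-8-safe chunks so later passes never have to touch the whole line at once.

fn chunk_line(line: &str, max_bytes: usize) -> Vec<&str> {
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < line.len() {
        let mut end = (start + max_bytes).min(line.len());
        // Back up to a char boundary so a UTF-8 code point is never split.
        while !line.is_char_boundary(end) {
            end -= 1;
        }
        if end == start {
            // `max_bytes` was smaller than one code point; take the whole char.
            end = start
                + line[start..]
                    .chars()
                    .next()
                    .map(char::len_utf8)
                    .unwrap_or(line.len() - start);
        }
        chunks.push(&line[start..end]);
        start = end;
    }
    chunks
}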

@kchibisov
Author

While my original issue with markdown is resolved, I thought I'd leave a real-world example of a big file that is well formatted and not some logging-utility dump, since I mentioned a C++ file in the original report but never showed it.

This is the file I was referring to originally with my C++ remark:

https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/X86/X86ISelLowering.cpp

Running :set-language text restores a normal editing experience (same performance as a file of any other size).

@kchibisov kchibisov changed the title Slow tree-sitter performance on certain files Very slow *editing* of large files when tree-sitter is used Apr 7, 2023
@mvdan
Contributor

mvdan commented May 3, 2023

I use git commit --verbose, which causes the COMMIT_EDITMSG file to contain the full diff of what's being committed. At work, I just happened to commit changes weighing about ~400k lines of diff, all generated, and I experienced this slowness: each insert keystroke lagged by about a second. @kchibisov's suggestion of :set-language text works like a charm as a workaround, for now.

Still happening with 23.03, for what it's worth.
