
Super slow tokenization of long strings containing multi-byte characters #45

Closed
alexdima opened this issue Nov 30, 2015 · 1 comment

@alexdima
Contributor

From microsoft/vscode#94

I think this also impacts Atom on Mac/Linux. Possible repro steps for Atom / first-mate:

  • in first-mate/benchmark/large.min.js, on a long line (e.g. line 4), insert a multi-byte character such as ü
  • run the benchmark
  • observe super slow times

I believe the root cause of the slowness is the charOffset <-> byteOffset conversion.
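
For context, a hypothetical sketch in JavaScript (names are mine, not the library's actual C++ code): Oniguruma matches against UTF-8 bytes, while JavaScript strings are UTF-16, so every match position must be converted between the two encodings. A naive conversion walks the string from the start on each call, which is O(line length) per match, and doing that for every match on a long line makes tokenization quadratic:

// Hypothetical sketch: a naive byteOffset -> charOffset conversion has to
// walk the string from the beginning, so each call is O(n); one call per
// match on a long line makes tokenization O(n^2).
function byteOffsetToCharOffset(str, byteOffset) {
  var bytes = 0;
  for (var i = 0; i < str.length; i++) {
    if (bytes >= byteOffset) {
      return i;
    }
    var code = str.codePointAt(i);
    // UTF-8 encodes a code point in 1 to 4 bytes
    bytes += code < 0x80 ? 1 : code < 0x800 ? 2 : code < 0x10000 ? 3 : 4;
    if (code > 0xffff) {
      i++; // astral code points occupy two UTF-16 code units
    }
  }
  return str.length;
}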

In my opinion, the only way to make this conversion fast is to compute it once up-front and memoize the result. However, the current caching mechanism is suboptimal, since each string is cached per OnigScanner: if a line needs 20 OnigScanners to tokenize, the conversion still happens 20 times. The current caching mechanism is also "leakish", as each OnigScanner keeps a reference to the last matched string. Moreover, it forces users of node-oniguruma to reuse a given OnigScanner instance as much as possible in order to get any caching at all.
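
As a sketch of the "compute once up-front" idea (hypothetical code, names are mine): a single O(n) pass builds a char -> byte offset table, after which charOffset -> byteOffset is an array lookup and byteOffset -> charOffset a binary search, no matter how many scanners touch the line:

// Hypothetical sketch: precompute the conversion in a single O(n) pass.
// charToByte[i] is the UTF-8 byte offset at which UTF-16 code unit i begins.
function buildCharToByteTable(str) {
  var charToByte = new Array(str.length + 1);
  var bytes = 0;
  var i = 0;
  while (i < str.length) {
    charToByte[i] = bytes;
    var code = str.codePointAt(i);
    bytes += code < 0x80 ? 1 : code < 0x800 ? 2 : code < 0x10000 ? 3 : 4;
    if (code > 0xffff) {
      charToByte[i + 1] = charToByte[i]; // low surrogate: same code point
      i += 2;
    } else {
      i += 1;
    }
  }
  charToByte[str.length] = bytes;
  return charToByte;
}

// Reverse lookup: binary search for the largest i with charToByte[i] <= byteOffset.
function byteToChar(charToByte, byteOffset) {
  var lo = 0, hi = charToByte.length - 1;
  while (lo < hi) {
    var mid = (lo + hi + 1) >> 1;
    if (charToByte[mid] <= byteOffset) { lo = mid; } else { hi = mid - 1; }
  }
  return lo;
}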

I think that the users of node-oniguruma know best when and how they want to match a certain string.

I would like, when I get some free time, to work on a PR that does the following:

  • introduces & exposes to JS a new type called OnigString
  • the OnigString will compute the conversion up-front in its ctor.
  • the OnigString will provide cache slots for all OnigRegExp.
  • the OnigString has its lifecycle controlled by v8, so when javascript dereferences it, all cache slots and cached conversion is GCed.
  • make OnigRegExp::Search cached by only accepting an OnigString. We will therefore get that each individual regular expression is using the cache (no more need to do the trick with a OnigScanner with precisely one regular expression)
  • make OnigScanner::FindNextMatchSync accept a v8String or an OnigString. If it is called with a v8String it immediately constructs an OnigString, i.e. not doing any caching across calls.
  • remove all other caching done through OnigStringContext
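
To make the proposed shape concrete (a JavaScript mock with invented names; the real type would be a native object): the OnigString owns the precomputed conversion table plus one cache slot per regular expression, so OnigRegExp::Search can consult its own slot regardless of which OnigScanner wraps it, and everything is released together when JavaScript drops its last reference to the string:

// Hypothetical JS mock of the proposed OnigString. In the native version,
// V8 controls the lifecycle, so the conversion table and all cache slots
// are GCed together with the string.
function OnigString(content) {
  this.content = content;
  this.charToByte = buildCharToByteTable(content); // computed once, in the ctor
  this.cacheSlots = new Map(); // one slot per OnigRegExp, keyed by regexp id
}

OnigString.prototype.getCacheSlot = function (regExpId) {
  return this.cacheSlots.get(regExpId);
};

OnigString.prototype.setCacheSlot = function (regExpId, result) {
  this.cacheSlots.set(regExpId, result);
};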

This would change the semantics & performance characteristics of node-oniguruma, requiring a new major version. After this change, JavaScript users will be able to use the library with or without caching:

  • with caching
var scanner = new OnigScanner(...);
var str = new OnigString("my string");
var results = scanner.FindNextMatchSync(str, 0);
results = scanner.FindNextMatchSync(str, 5);
...
  • without caching
var scanner = new OnigScanner(...);
var results = scanner.FindNextMatchSync("my string", 0);
results = scanner.FindNextMatchSync("my string", 5);
...

I wanted to check with you, @zcbenz @kevinsawicki, whether this change takes node-oniguruma in a good direction and whether you would agree with it ... before I invest time in it.

Thanks!

@alexdima
Contributor Author

alexdima commented Jan 9, 2016

@zcbenz Thank you!

alexdima closed this as completed on Jan 9, 2016