The regex `.{32768}` ended up running pretty slow. How can I make it go faster?
#2530
My goal: find all lines in markup exceeding 32767 characters. My naïve solution was the regex `.{32768}` from the title. Could I have done it better? Could ripgrep be better optimised for this unusual use case? (My data is the corpus of books at https://github.com/standardebooks/, so a typical line length might be between 15 and a few thousand characters. There’s about 1.3GB of markup in total to scan. I didn’t log a specific runtime, but it was around 15 minutes on a recentish Intel MacBook Air, sitting at 600-700% CPU.)
Replies: 2 comments 1 reply
The problem is that when you write something like … Otherwise, the main thing you can probably do is use … So, that's …
Generally speaking, using a regex with crazy high bounded repeats like this is not a good idea. Regex engines just don't tend to handle them that well, because a bounded repeat is really just terse syntax for writing a bigger regex. The bigger the regex, the longer the search takes, generally speaking. And when it gets so big that it crosses internal heuristic thresholds for optimizations, the speed difference can become quite noticeable.
With that said, for simple cases like this, we can probably do better. And when the regex is your only interface to a tool, there isn't much choice other than writing your own quick little program to do what you want. That's what this issue is for in the regex crate: rust-lang/regex#802
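For the "quick little program" route mentioned above, here is a minimal sketch in Python. It is not from the thread; the names `long_lines` and `scan` are made up for illustration, and it assumes the markup is UTF-8 and that "characters" means Unicode codepoints.

```python
import os

LIMIT = 32767  # threshold from the question: report lines longer than this


def long_lines(path, limit=LIMIT):
    """Yield (line_number, length) for each line longer than `limit` characters."""
    with open(path, "rb") as f:
        for number, raw in enumerate(f, start=1):
            # Drop the line terminator, then count codepoints rather than bytes.
            # Assumption: files are UTF-8; invalid bytes are replaced, not fatal.
            text = raw.rstrip(b"\r\n").decode("utf-8", errors="replace")
            if len(text) > limit:
                yield number, len(text)


def scan(root, limit=LIMIT):
    """Walk `root` and collect (path, line_number, length) for every long line."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            hits.extend((path, n, length) for n, length in long_lines(path, limit))
    return hits
```

Something like `for path, n, length in scan("."): print(f"{path}:{n}: {length} characters")` would then list the offending lines. No regex is involved, so the cost is one pass over the data.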
Whenever possible, please either share the actual data or the steps required to get the data. Otherwise this particular line doesn't really help me much. :-)
While late to the party, another suggestion is to add an anchor (`^`) to the beginning of the pattern. This should prevent ripgrep from continuing to match additional characters on the same line after it has already reported at least one match there.
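To illustrate what the anchor changes, here is a small sketch using Python's `re` module with a repeat count of 4 standing in for 32768. Python's engine is a backtracking one and differs from ripgrep's, so this only shows the matching semantics and says nothing about speed.

```python
import re

N = 4  # small stand-in for 32768 so the example stays readable
line = "abcdefghij"  # a single 10-character line

unanchored = re.compile(r".{%d}" % N)
anchored = re.compile(r"^.{%d}" % N, re.MULTILINE)

# Without the anchor, the search resumes after each match on the same line.
unanchored_matches = unanchored.findall(line)

# With the anchor, a match can only start at the beginning of a line.
anchored_matches = anchored.findall(line)

print(unanchored_matches)  # ['abcd', 'efgh']
print(anchored_matches)    # ['abcd']
```

On a line much longer than the threshold, the unanchored pattern can match several times, while the anchored one matches at most once per line.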
By default, ripgrep will display the contents of the match, so for every match, you are getting 32768+ characters returned to the terminal (or whatever process is consuming ripgrep's stdout). Try this:
The problem is that when you write something like `.{5}`, it is quite literally translated as `.....`. So when you do `.{32768}`, it turns into a very giant regex. `.` is also further complicated by the fact that it is itself a small little state machine that matches the UTF-8 encoding of any Unicode scalar value (sans `\n`). It's small in the sense that it's only about 12x bigger than `(?-u:.)` (which matches any byte value except for `\n`), but when you repeat it a large number of times, that small increase can add up. So you could try using `(?-u:.)` instead if your data set is mostly ASCII, or if you can abide codepoints matching multiple `.`.
Otherwise, the main thing you can probably do is use --dfa-…
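The two points above (a bounded repeat is shorthand for a longer regex, and a byte-oriented `.` can need several repeats per codepoint) can be sketched with Python's `re` module. This is only an analogy: a Python `str` pattern stands in for the Unicode-aware `.`, and a `bytes` pattern stands in for `(?-u:.)`. It shows nothing about how ripgrep compiles either one.

```python
import re

# `.{5}` and five literal dots accept exactly the same strings.
same_language = bool(re.fullmatch(r".{5}", "hello")) and bool(re.fullmatch(r".....", "hello"))

word = "na\u00efve"          # "naïve": 5 codepoints
data = word.encode("utf-8")  # 6 bytes, because "ï" encodes as 2 bytes

# str pattern: each `.` consumes one codepoint.
codepoint_match = re.fullmatch(r".{5}", word)

# bytes pattern: each `.` consumes one byte, so "ï" needs two of them.
byte_match_5 = re.fullmatch(rb".{5}", data)
byte_match_6 = re.fullmatch(rb".{6}", data)

print(same_language)                 # True
print(codepoint_match is not None)   # True
print(byte_match_5 is not None)      # False
print(byte_match_6 is not None)      # True
```

The practical consequence for the original question is that a byte-oriented `.{32768}` flags lines of at least 32768 bytes, which can be fewer than 32768 characters when the text contains non-ASCII codepoints.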