Regex::captures forces an allocation on every call #219

BurntSushi · 2016-04-29T15:10:59Z

Every time one calls captures, a new allocation for storing the location of captures is created. This allocation has size proportional to the number of captures in the regex.

This is also true for captures_iter, where every iteration results in a new allocation. An iterator could reuse the allocation in theory, but ownership of the captures is transferred to the caller. Even if we could reuse the capture locations, we couldn't give the caller a mutable borrow, since that immediately puts us in the "streaming iterator" conundrum.

The most sensible API I can think of is to:

Permit the caller to build empty Captures values from a given Regex such that it has the right size.
Pass a mutable borrow to a Captures to a call to captures, which lets the caller control the allocation.

It's not quite clear how to apply this to captures_iter while still implementing Iterator. I suspect we should probably borrow from the io::BufReader::read_line style methods. e.g.,

impl Regex {
    // Returns empty storage for captures for use with read_captures.
    fn new_captures(&self) -> Captures { ... }

    // On successful match, returns true and sets capture locations.
    // Otherwise returns false.
    fn read_captures(&self, caps: &mut Captures, text: &str) -> bool { ... }
    fn read_captures_iter<'r, 't>(&'r self, text: &'t str) -> ReadCapturesIter<'r, 't> { ... }
}

struct ReadCapturesIter<'r, 't> { ... }

impl<'r, 't> ReadCapturesIter<'r, 't> {
    // On successful match, returns true and sets capture locations.
    // Otherwise returns false.
    fn captures(&mut self, caps: &mut Captures) -> bool { ... }
}

And I think this would work well.

Main questions:

Any alternatives?
Do we replace the existing API with the one from above (in 1.0)? Or do we add it? My inclination is to add it, but I really hate expanding the API with more choices.

The text was updated successfully, but these errors were encountered:

BurntSushi · 2016-04-30T13:18:38Z

Well, this doesn't quite work. new_captures really has to be defined like:

fn new_captures<'r, 't>(&'r self, text: &'t str) -> Captures { ... }

The reason being that all Captures values can report not just the offsets of a match, but the actual text as well. Therefore, it is tied to the haystack.

Another alternative is to move ownership of Captures, but this makes for a more awkward API. I think this would work though:

impl Regex {
    // Returns empty storage for captures for use with read_captures.
    fn new_captures(&self) -> Captures<'static> { ... }

    // Sets capture locations from searching over text.
    // If the match failed, the captures returned will report a failed match.
    fn read_captures<'a, 't>(&self, caps: Captures<'a>, text: &'t str) -> Captures<'t> { ... }
    fn read_captures_iter<'r, 't>(&self, text: &'t str) -> ReadCapturesIter<'r, 't> { ... }
}

type ReadCapturesIter<'r, 't> { ... }

impl<'r, 't> ReadCapturesIter<'r, 't> {
    // Sets capture locations from searching over text.
    // If the match failed, the captures returned will reported a failed match.
    fn captures<'a>(&mut self, caps: Captures<'a>) -> Captures<'t> { ... }
}

And we'll also need:

impl<'t> Captures<'t> {
    // Returns true if and only if the most recent search using these captures
    // returned a match.
    fn is_match(&self) -> bool { ... }
}

If we do go this route, I think it at least answers the question of whether this should replace the existing API or not (it shouldn't, because it's clunky).

I don't think I've seen this kind of API used much elsewhere, which also concerns me. I guess the lifetime pattern here is a bit weird. Basically, we want to enable the caller to reuse a particular allocation, but it's being reused inside a larger structure that offers more conveniences, and those conveniences require borrowing the searched text. And the searched text can vary while the allocation is reused.

Another alternative is to create a new captures type (Groups maybe?) which only stores and reports the offset information. This would enable the first style of API, which is a bit nicer, but introducing two types for captures seems like a smell.

Instead of defining a new type, we could do something like exposing the underlying representation, e.g., &mut [Option<usize>], but this is harder to use, and it's totally possible that we may want to change Option<usize> to something else some day. Hmm.

This commit exposes two new areas of API surface: 1. A new `captures_read` method which provides a way to access the offsets of submatches while amortizing the allocation of the space required to store those offsets. Callers should still of course prefer to use the higher level `captures` method, but if performance dictates, this lower level API may be useful. 2. New "at" variants of shortest_match/is_match/find/captures/captures_read that permit controlling where the start of a search begins within a slice. This is typically useful for controlling the match semantics of look-around operators such as `^` and `$`, and are necessary for implementing non-overlapping iterators. Fixes rust-lang#219

This commit exposes two new areas of API surface: 1. A new `captures_read` method which provides a way to access the offsets of submatches while amortizing the allocation of the space required to store those offsets. Callers should still of course prefer to use the higher level `captures` method, but if performance dictates, this lower level API may be useful. 2. New "at" variants of shortest_match/is_match/find/captures/captures_read that permit controlling where the start of a search begins within a slice. This is typically useful for controlling the match semantics of look-around operators such as `^` and `$`, and are necessary for implementing non-overlapping iterators. Fixes #219

BurntSushi added enhancement question labels Apr 29, 2016

BurntSushi added this to the 1.0 milestone Apr 29, 2016

BurntSushi removed this from the 1.0 milestone May 27, 2016

BurntSushi mentioned this issue Jun 25, 2018

regex: expose lower level search APIs #493

Merged

BurntSushi closed this as completed in #493 Jul 18, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regex::captures forces an allocation on every call #219

Regex::captures forces an allocation on every call #219

BurntSushi commented Apr 29, 2016 •

edited

Loading

BurntSushi commented Apr 30, 2016 •

edited

Loading

Regex::captures forces an allocation on every call #219

Regex::captures forces an allocation on every call #219

Comments

BurntSushi commented Apr 29, 2016 • edited Loading

BurntSushi commented Apr 30, 2016 • edited Loading

BurntSushi commented Apr 29, 2016 •

edited

Loading

BurntSushi commented Apr 30, 2016 •

edited

Loading