Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regex::captures forces an allocation on every call #219

Closed
BurntSushi opened this issue Apr 29, 2016 · 1 comment
Closed

Regex::captures forces an allocation on every call #219

BurntSushi opened this issue Apr 29, 2016 · 1 comment

Comments

@BurntSushi
Copy link
Member

BurntSushi commented Apr 29, 2016

Every time one calls captures, a new allocation for storing the location of captures is created. This allocation has size proportional to the number of captures in the regex.

This is also true for captures_iter, where every iteration results in a new allocation. An iterator could reuse the allocation in theory, but ownership of the captures is transferred to the caller. Even if we could reuse the capture locations, we couldn't give the caller a mutable borrow, since that immediately puts us in the "streaming iterator" conundrum.

The most sensible API I can think of is to:

  1. Permit the caller to build empty Captures values from a given Regex such that it has the right size.
  2. Pass a mutable borrow to a Captures to a call to captures, which lets the caller control the allocation.

It's not quite clear how to apply this to captures_iter while still implementing Iterator. I suspect we should probably borrow from the io::BufReader::read_line style methods. e.g.,

impl Regex {
    // Returns empty storage for captures for use with read_captures.
    fn new_captures(&self) -> Captures { ... }

    // On successful match, returns true and sets capture locations.
    // Otherwise returns false.
    fn read_captures(&self, caps: &mut Captures, text: &str) -> bool { ... }
    fn read_captures_iter<'r, 't>(&'r self, text: &'t str) -> ReadCapturesIter<'r, 't> { ... }
}

struct ReadCapturesIter<'r, 't> { ... }

impl<'r, 't> ReadCapturesIter<'r, 't> {
    // On successful match, returns true and sets capture locations.
    // Otherwise returns false.
    fn captures(&mut self, caps: &mut Captures) -> bool { ... }
}

And I think this would work well.

Main questions:

  1. Any alternatives?
  2. Do we replace the existing API with the one from above (in 1.0)? Or do we add it? My inclination is to add it, but I really hate expanding the API with more choices.
@BurntSushi
Copy link
Member Author

BurntSushi commented Apr 30, 2016

Well, this doesn't quite work. new_captures really has to be defined like:

fn new_captures<'r, 't>(&'r self, text: &'t str) -> Captures { ... }

The reason being that all Captures values can report not just the offsets of a match, but the actual text as well. Therefore, it is tied to the haystack.

Another alternative is to move ownership of Captures, but this makes for a more awkward API. I think this would work though:

impl Regex {
    // Returns empty storage for captures for use with read_captures.
    fn new_captures(&self) -> Captures<'static> { ... }

    // Sets capture locations from searching over text.
    // If the match failed, the captures returned will report a failed match.
    fn read_captures<'a, 't>(&self, caps: Captures<'a>, text: &'t str) -> Captures<'t> { ... }
    fn read_captures_iter<'r, 't>(&self, text: &'t str) -> ReadCapturesIter<'r, 't> { ... }
}

type ReadCapturesIter<'r, 't> { ... }

impl<'r, 't> ReadCapturesIter<'r, 't> {
    // Sets capture locations from searching over text.
    // If the match failed, the captures returned will reported a failed match.
    fn captures<'a>(&mut self, caps: Captures<'a>) -> Captures<'t> { ... }
}

And we'll also need:

impl<'t> Captures<'t> {
    // Returns true if and only if the most recent search using these captures
    // returned a match.
    fn is_match(&self) -> bool { ... }
}

If we do go this route, I think it at least answers the question of whether this should replace the existing API or not (it shouldn't, because it's clunky).

I don't think I've seen this kind of API used much elsewhere, which also concerns me. I guess the lifetime pattern here is a bit weird. Basically, we want to enable the caller to reuse a particular allocation, but it's being reused inside a larger structure that offers more conveniences, and those conveniences require borrowing the searched text. And the searched text can vary while the allocation is reused.

Another alternative is to create a new captures type (Groups maybe?) which only stores and reports the offset information. This would enable the first style of API, which is a bit nicer, but introducing two types for captures seems like a smell.

Instead of defining a new type, we could do something like exposing the underlying representation, e.g., &mut [Option<usize>], but this is harder to use, and it's totally possible that we may want to change Option<usize> to something else some day. Hmm.

@BurntSushi BurntSushi removed this from the 1.0 milestone May 27, 2016
BurntSushi added a commit to BurntSushi/regex that referenced this issue Jun 25, 2018
This commit exposes two new areas of API surface:

  1. A new `captures_read` method which provides a way to access the
     offsets of submatches while amortizing the allocation of the
     space required to store those offsets. Callers should still of
     course prefer to use the higher level `captures` method, but if
     performance dictates, this lower level API may be useful.
  2. New "at" variants of
     shortest_match/is_match/find/captures/captures_read that permit
     controlling where the start of a search begins within a slice.
     This is typically useful for controlling the match semantics of
     look-around operators such as `^` and `$`, and are necessary for
     implementing non-overlapping iterators.

Fixes rust-lang#219
BurntSushi added a commit to BurntSushi/regex that referenced this issue Jun 25, 2018
This commit exposes two new areas of API surface:

  1. A new `captures_read` method which provides a way to access the
     offsets of submatches while amortizing the allocation of the
     space required to store those offsets. Callers should still of
     course prefer to use the higher level `captures` method, but if
     performance dictates, this lower level API may be useful.
  2. New "at" variants of
     shortest_match/is_match/find/captures/captures_read that permit
     controlling where the start of a search begins within a slice.
     This is typically useful for controlling the match semantics of
     look-around operators such as `^` and `$`, and are necessary for
     implementing non-overlapping iterators.

Fixes rust-lang#219
BurntSushi added a commit that referenced this issue Jul 18, 2018
This commit exposes two new areas of API surface:

  1. A new `captures_read` method which provides a way to access the
     offsets of submatches while amortizing the allocation of the
     space required to store those offsets. Callers should still of
     course prefer to use the higher level `captures` method, but if
     performance dictates, this lower level API may be useful.
  2. New "at" variants of
     shortest_match/is_match/find/captures/captures_read that permit
     controlling where the start of a search begins within a slice.
     This is typically useful for controlling the match semantics of
     look-around operators such as `^` and `$`, and are necessary for
     implementing non-overlapping iterators.

Fixes #219
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant