Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regex: enable support for partial matches #33688

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

Conversation

rfourquet
Copy link
Member

@rfourquet rfourquet commented Oct 26, 2019

String str is a partial match for regex r when str is a potential prefix of a match for r. See for example https://www.pcre.org/current/doc/html/pcre2partial.html for a motivating example. Another example where I may need it is for #33617
PCRE supports that, so it's easy to expose this functionality, only the API has to be decided.

There are two options for partial matches: PCRE_PARTIAL_SOFT and PCRE_PARTIAL_HARD. With the soft option, a full match is prefered over a partial one, the hard option is the opposite.

The proposed API here is:

  • Regex constructor: add the p flag to mean PCRE_PARTIAL_SOFT, e.g. r"abc"p
  • match: add a partial field to mean that the match was partial
  • eachmatch: seems to work automatically (only the last element of the iterator can be partial), but we may forbid the presence of a partial flag
  • findnext: still needs to be updated, but I guess the choice of supporting partial matches should match that for eachmatch
  • occursin/startswith: with the current implementation, occursin(r, str) returns missing if str is a partial match (implemenation detail leak). I don't find it so bad: if we consider that the string str is "incomplete" when a partial flag is passed, and there is a partial match, we can only conclude that the information needed to say whether there is a match (the yet unvavailable end of the string) is missing. Should this return true instead? (then we can't discriminate between a full & partial match).
  • endswith fails with the "partial" options (incompatible options).

@rfourquet rfourquet added strings "Strings!" needs docs Documentation for this change is required needs news A NEWS entry is required for this change needs tests Unit tests are required for this change labels Oct 26, 2019
@rfourquet
Copy link
Member Author

Alternative to #33646 which I hadn't seen earlier unfortunately.

@StefanKarpinski
Copy link
Sponsor Member

The bit I'm not sure about is the p flag. The other flags are standard and taken from Perl, Ruby, etc. I don't want to end up in a situation where we use a p flag to mean one thing and another language uses it to mean something else.

@StefanKarpinski
Copy link
Sponsor Member

Indeed, p already means something as a regex flag in Perl, according to the perlre man page:

p: Preserve the string matched such that ${^PREMATCH} , ${^MATCH} , and ${^POSTMATCH} are available for use after matching.

Now, that's probably not something we're ever going to support, so maybe that's a good thing? At least we know that Perl won't introduce the p flag meaning something that we could support.

Ruby on the other hand only has these modifiers according to the Regexp docs:

/pat/i - Ignore case
/pat/m - Treat a newline as a character matched by .
/pat/x - Ignore whitespace and comments in the pattern
/pat/o - Perform #{} interpolation only once
/pat/u - UTF-8
/pat/e - EUC-JP
/pat/s - Windows-31J
/pat/n - ASCII-8BIT

Anyway, it seems safest to only allow the option initially and we can consider the p flag later.

@rfourquet
Copy link
Member Author

rfourquet commented Oct 28, 2019

EDIT: ouch, I hadn't seen your second answer, I kept this comment opened for a while before posting!

The other flags are standard and taken from Perl, Ruby, etc.

Oh I didn't know! So for Perl5 it looks like "p" modifier meant something, but is ignored since version 5.20 (cf. https://perldoc.perl.org/perlre.html for example). So p will unlikely be given another meaning. On the other hand, in Perl6, p or pos is a "matching adverb" which takes an argument ("anchor the match at a specific position in the string").

I'm not attached to the p flag, but without a flag, it's currently cumbersome to specify a matching option. They can be passed to the Regex constructor, but only after compile options. So without upgrading the API, you have to do somthing like: Regex("str", Base.DEFAULT_COMPILER_OPTS, Base.DEFAULT_MATCH_OPTS | Base.PCRE.PARTIAL_SOFT)

Side note: actually, in Perl6, (it looks like) "matching adverbs" can be used only when matching, not when defining a regex independantly of a matching context. This raises the question in Julia of whether a matching option, like "partial match", should really be part of the Regex object, or be passed to matching functions, like e.g. match(rx, str, partial=true) or match(rx, str, options=PCRE.PARTIAL_SOFT), similar for occursin etc. But anyways, in 1.x, matching options are (also) part of the Regex object.

missing
else
rc >= 0
end
Copy link
Member

@NHDaly NHDaly Oct 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused about why you're returning missing here. Is it as a way to indicate that there was a partial match?

If you want to do that, maybe it's better just to return rc instead and let the caller check the return value? It seems like a weird pun to use missing to mean "partial match".

Or maybe we define an enum for the return values or something.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it as a way to indicate that there was a partial match?

Here it's an implementation detail, but it gives directly the property that occursin(rx, str) can return missing for a partial match (cf. the OP). Partial match can be interpreted as "the end of the string is still not available, can I know yet whether I have a match?" In the context, missing means I don't know yet. And this allows to use three valued logic, which makes sense to me here. But in this function here indeed it is only an implementation detail which I don't mind to change, but I'm interested also to discuss the API for occursin.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. In my original PR, #33646, I simply returned true if the match is a partial match and you explicitly requested partial matching, OR if it's a complete match.

I guess the downside is that you lose the information about which type of match you got.

I like the change you made in this PR to propagate whether or not it was a partial match by adding a partial field to the RegexMatch. I just don't love the pun on three-valued logic. In my view, the match isn't missing at all -- it's partially matched, which is what you were asking it to check for. I would just change this to either return rc directly, and check it down below, or even better would be to define an enum here, and use that below.

Could you make that change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for such a delay! I made the change now, by using an Enum (no, constants, as @enum doesn't work in "pcre.jl"). I think I agree that a partial match in occursin should return true. Maybe we can consider later to add an option to get the information, e.g. occursin(rx, str, partial=missing) to mean: if the match is partial, return missing.

Copy link
Member

@NHDaly NHDaly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @StefanKarpinski about the above. It's unfortunate, but i don't think we can use p to mean partial, because of weird conflicting flag names. We can always add it back later, as he suggests, but it would be nice to have this PR merged! :)

I've left another follow up comment on your use of missing. Sorry it took me so long to reply! :)

missing
else
rc >= 0
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. In my original PR, #33646, I simply returned true if the match is a partial match and you explicitly requested partial matching, OR if it's a complete match.

I guess the downside is that you lose the information about which type of match you got.

I like the change you made in this PR to propagate whether or not it was a partial match by adding a partial field to the RegexMatch. I just don't love the pun on three-valued logic. In my view, the match isn't missing at all -- it's partially matched, which is what you were asking it to check for. I would just change this to either return rc directly, and check it down below, or even better would be to define an enum here, and use that below.

Could you make that change?

@rfourquet
Copy link
Member Author

I trimmed down this PR to the minimum, i.e. it doesn't add any new official API, and I followed @NHDaly 's suggestion to not use missing as a return value of PCRE.exec or occursin.

Introducing a "match flag" in the Regex stringmacro constructor for partial matches can be added later (besides p, I thought about ?, but I didn't check whether this conflict with flags in other systems).

I also didn't add the possibility to specify partial matches in the Regex(string, [flags]) constructor: it requires discussions and can be added later.

So only the undocumented Regex(string, integer, integer) can be used to speficy partial matches. I would really like that we get a documented API in a follow-up PR, but it would be cool if there is a chance to first get this in, possibly for 1.6.

@rfourquet rfourquet changed the title [WIP] Regex: enable support for partial matches Regex: enable support for partial matches Oct 19, 2020
@kshyatt
Copy link
Contributor

kshyatt commented May 7, 2022

Whatever happened to this PR? Should we close or can it be necro-ed?

@StefanKarpinski StefanKarpinski added the triage This should be discussed on a triage call label May 26, 2022
@LilithHafner LilithHafner removed the triage This should be discussed on a triage call label Oct 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs docs Documentation for this change is required needs news A NEWS entry is required for this change needs tests Unit tests are required for this change strings "Strings!"
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants