Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RegExp - add case insensitive matching option #1541

Merged
merged 10 commits into from
Jul 8, 2020

Conversation

markharwood
Copy link
Contributor

Relates to Jira issue 9386

Added a new CASE_INSENSITIVE option to the existing flags.
The RegExp class is a little strange because instances represent either the parser or the parsed objects it nests in a tree. The flags field is only relevant to the root parser and was left blank in all parsed nodes. This PR's changes require that the flags int is propagated to all nodes so that they can see if it includes the case insensitive option (all other bits in the flag represent parsing options so there was no need to propagate before).

@markharwood markharwood self-assigned this May 28, 2020
@markharwood markharwood requested a review from jimczi May 28, 2020 13:41
Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some minor comments but looks great @markharwood

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if the flag should be renamed UNICODE_CASE_INSENSITIVE since it handles code points and not only US-ASCII charset like the java's Pattern flag.

@jpountz
Copy link
Contributor

jpountz commented May 28, 2020

The flags that can be passed to the constructor are about the supported operators. Case insentivity is not an operator so it feels wrong to ask users to configure it this way too? I think it's fine to record this information in the flags internally, but maybe we should make the constructor take an additional boolean instead of expecting users to configure case insensitivity via a syntax flag?

@markharwood
Copy link
Contributor Author

maybe we should make the constructor take an additional boolean instead of expecting users to configure case insensitivity via a syntax flag?

Does it change things if we consider Java's case insensitivity is also a bit mask flag passed to the constructor?

@markharwood
Copy link
Contributor Author

markharwood commented Jun 1, 2020

On reflection, you're right - the single flag is trappy.
I'd like to refactor this class to make this simpler. The root problem we have is propagating parser state (flags/options) down to the objects that represent clauses in the parse tree. This is made difficult by the fact that RegExp is a single class representing both the parser and the parsed nodes.
I suggest refactoring so that :

  1. RegExp remains the user-facing class with the public constructor and has the parsing logic
  2. We use a new private class RegExpClause to hold clause state, but being an inner class it has access to the flags in the outer RegExp instance that contains it.

This should solve the problem of propagating settings and give us a sounder footing to build on.
Should I do this refactor as part of this PR or another @jpountz ?

One issue is that this would technically be a breaking change as https://issues.apache.org/jira/projects/LUCENE/issues/LUCENE-9371 opened up the internal state of the parser and we will change the class of nodes.

@jpountz
Copy link
Contributor

jpountz commented Jun 2, 2020

@markharwood Java's case-insensitivity flag is indeed a bit mask flag passed to the constructor, but my comment was more about the fact that the current flags on RegExp are about what operators are supported, while the bit you are adding controls how matching works. So in my opinion, it should be a different constructor argument (boolean or other bit set depending on whether we're seeing other desirable ways to control how matching works). We could merge both bitsets internally if that makes things easier, my concern is only about the API.

I don't have an opinion about the RegExp split, but I don't feel bad about propagating information recursively like in your PR.

@markharwood
Copy link
Contributor Author

I updated the constructor on this @jpountz - good to go?

@markharwood markharwood force-pushed the fix/9386 branch 3 times, most recently from 7bc5c1d to def0f8e Compare June 16, 2020 13:01
@markharwood markharwood force-pushed the fix/9386 branch 2 times, most recently from 990c3d0 to 750d612 Compare June 24, 2020 16:30
@markharwood
Copy link
Contributor Author

Thanks for the reviews @jpountz and @jimczi
I think I've addressed all the review comments now if you have a chance to take another look

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left one minor comment, LGTM otherwise

/**
* Allows case insensitive matching of ASCII characters.
*/
public static final int ASCII_CASE_INSENSITIVE = 0x0100;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's call it CASE_INSENSITIVE since we want to leave the door for another flag that would control if unicode should be handled fully ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it might be useful if the flag name reflected the current limitations?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then what would be the other flag to allow unicode support ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UNICODE_CASE_INSENSITIVE or just CASE_INSENSITIVE?
Either way it would cover ASCII and all other UNICODE characters.
Admittedly slightly odd that the two flags overlap but the alternative is people may assume that a non-qualified name like "CASE_INSENSITIVE" would cover all the bases when we only currently support ASCII.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok fine by me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we ensure that the flag is not used in the syntax_flags since we merge the two internally ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My assumption was this class was lenient to syntax flags > 0xff before this change so should remain so for BWC reasons

@markharwood
Copy link
Contributor Author

OK to merge this @jpountz ?
I had one last comment re flag validation

*/
public RegExp(String s, int syntax_flags, int match_flags) throws IllegalArgumentException {
// (for BWC reasons we don't validate invalid bits, just trim instead)
syntax_flags = syntax_flags & 0xff;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to maintain bw compat for this, is there any test that fails if you remove this line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can see, no. The change would be to remove that line and replace with

if (syntax_flags >  ALL) {
  throw new IllegalArgumentException("Illegal syntax flag");
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants