set qualifiers - feature idea #11

Closed
mrabarnett opened this issue Jun 2, 2011 · 15 comments
Labels: bug, minor

Comments

@mrabarnett
Owner

Original report by Anonymous.


Some background: I've been working with very large REs in CPython and IronPython. We generate the RE pattern from lists, like lists of cities or lists of names, somewhat like this:

import re

namelist = open("names.txt").read().split()
pattern = re.compile("|".join(namelist))

The one I'm working with now is just a pattern for finding substrings that look like the name of a person. It's overflowing the System::Text::RegularExpressions buffers on IronPython, but works OK with CPython 2.6 on 64-bit Ubuntu.

One of the things I've been thinking is that this kind of pattern should be handled differently. Suppose there was some syntax like

pattern = re.compile("(?S<names>)", names=ImmutableSet(namelist))

where (?S indicates a named ImmutableSet, the members of that set to be drawn from the keyword argument of that name. The compiler would generate a reasonably fast pattern from that set, say a character class built from the union of all characters in all the strings in the set, with minimum and maximum repeat counts taken from the shortest and longest elements of the set. When the engine runs, it would match that fast pattern and, if it matches, check whether the matched text is a member of the named set. If so, the match would be confirmed; if not, it would fail.

Seems like this might be a useful feature for regex to have, given the popularity of this kind of machine-generated RE.
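
A minimal sketch of the two-stage idea described above, using only the standard `re` module. The helper names `build_prefilter` and `find_members` are hypothetical, and this is an illustration of the proposal, not how any regex engine actually implements it. The shrinking loop stands in for the backtracking a real engine would do when the longest candidate is not in the set.

```python
import re


def build_prefilter(names):
    """Build a cheap prefilter pattern plus the set used for the exact check."""
    members = frozenset(names)
    # Character class made from the union of all characters in all members.
    chars = "".join(sorted({ch for name in members for ch in name}))
    shortest = min(len(name) for name in members)
    longest = max(len(name) for name in members)
    prefilter = re.compile("[%s]{%d,%d}" % (re.escape(chars), shortest, longest))
    return prefilter, members, shortest


def find_members(text, names):
    """Return (position, string) pairs for set members found in text."""
    prefilter, members, shortest = build_prefilter(names)
    found = []
    pos = 0
    while pos < len(text):
        hit = None
        m = prefilter.match(text, pos)
        if m:
            candidate = m.group()
            # Try the longest candidate first, then shrink it, confirming
            # each attempt with a set membership test.
            for end in range(len(candidate), shortest - 1, -1):
                if candidate[:end] in members:
                    hit = candidate[:end]
                    break
        if hit:
            found.append((pos, hit))
            pos += len(hit)
        else:
            pos += 1
    return found


print(find_members("Anna met Bob and Annabel", ["Ann", "Anna", "Bob"]))
```

The membership test is a hash lookup, so the cost per candidate does not grow with the number of names the way a long alternation can.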

@mrabarnett
Owner Author

Original comment by Anonymous.


Thinking about this a bit more, it would be more appropriate to use something like "\L<name>" instead of "(?S<name>)".

@mrabarnett
Owner Author

Original comment by Anonymous.


Could you provide me with some test data so that I can see what's needed and how it would be used, try some experiments, and see whether it 'feels' right and whether it's the right approach?

@mrabarnett
Owner Author

Original comment by Anonymous.


Sure. Here's one I've been trying on CPython 2.6 on 64-bit Ubuntu (works), CPython 2.7 on 64-bit Windows (OverflowError), and IronPython 2.7 on 64-bit .NET (StackOverflowError).

@mrabarnett
Owner Author

Original comment by Anonymous.


Named lists have been added (provisionally).

@mrabarnett
Owner Author

Original comment by Anonymous.


I downloaded the PyPI version, built and installed it on Python 2.5.1, and tried it:

>>> import regex
>>> p = regex.compile(r"333\L<bar>444", bar=set(["one", "two", "three"]))
>>> p.match("333four444")
>>> p.match("333four444")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
SystemError: bad format char passed to Py_BuildValue

Does that seem right to you?

>>> p.match("333one444")
>>> 

And that should have matched, right?

@mrabarnett
Owner Author

Original comment by Anonymous.


It was passing the "y#" format code to Py_BuildValue for bytestrings, and that code exists only in Python 3. Fixed.

@mrabarnett
Owner Author

Original comment by Anonymous.


Ah, OK. I re-downloaded from PyPI, now it's working. But here's another issue:

>>> p = regex.compile(r"3\L<bar>4\L<bar>+5", bar=sets.ImmutableSet(["one", "two", "three"]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.5/site-packages/regex.py", line 266, in compile
    return _compile(pattern, flags, kwargs)
  File "/Library/Python/2.5/site-packages/regex.py", line 371, in _compile
    parsed = parse_pattern(source, info)
  File "/Library/Python/2.5/site-packages/_regex_core.py", line 296, in parse_pattern
    branches = [parse_sequence(source, info)]
  File "/Library/Python/2.5/site-packages/_regex_core.py", line 313, in parse_sequence
    item = parse_item(source, info)
  File "/Library/Python/2.5/site-packages/_regex_core.py", line 323, in parse_item
    element = parse_element(source, info)
  File "/Library/Python/2.5/site-packages/_regex_core.py", line 424, in parse_element
    return parse_escape(source, info, False)
  File "/Library/Python/2.5/site-packages/_regex_core.py", line 833, in parse_escape
    return parse_string_set(source, info)
  File "/Library/Python/2.5/site-packages/_regex_core.py", line 950, in parse_string_set
    return string_set(info, name)
  File "/Library/Python/2.5/site-packages/_regex_core.py", line 289, in string_set
    return StringSet(info, name)
  File "/Library/Python/2.5/site-packages/_regex_core.py", line 2637, in __init__
    index, min_len, max_len = info.string_sets[self.set_key]
ValueError: too many values to unpack
>>>

@mrabarnett
Owner Author

Original comment by Anonymous.


Fixed.

@mrabarnett
Owner Author

Original comment by Anonymous.


I've updated my test case to add some larger regular expressions.

@mrabarnett
Owner Author

Original comment by Anonymous.


I just tested this enhancement (cf. http://mail.python.org/pipermail/python-list/2011-June/1274529.html ) and would like to ask about the treatment of metacharacters in the items of the options set. I had inferred from the overview text that they would be escaped, but they appear to be discarded completely:

>>> regex.findall(r"^\L<options>", "solid QWERT", options=set(['good', 'brilliant', '+s\\ol[i}d']))
['solid']
>>> regex.findall(r"^\L<options>", "+solid QWERT", options=set(['good', 'brilliant', '+solid']))
[]
>>> 

I believed the first pattern shouldn't match if the items are escaped (and should cause an error if they are taken unchanged), while the second one should match with escaping. Or am I missing something?
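
To make the expected "escaped" behaviour concrete, here is a rough emulation with the standard `re` module, assuming each item is meant to match as a literal string. The helper `expand_named_list` is hypothetical and only stands in for `\L<options>`; it says nothing about how the regex module handles the items internally.

```python
import re


def expand_named_list(items):
    """Turn literal strings into an escaped alternation, longest item first."""
    ordered = sorted(items, key=len, reverse=True)
    return "(?:%s)" % "|".join(re.escape(item) for item in ordered)


# Metacharacters in the item are escaped, so plain "solid" must not match.
first = re.compile("^" + expand_named_list(["good", "brilliant", "+s\\ol[i}d"]))
print(first.findall("solid QWERT"))

# The leading "+" is matched literally, so "+solid" is found.
second = re.compile("^" + expand_named_list(["good", "brilliant", "+solid"]))
print(second.findall("+solid QWERT"))
```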

regards,
vbr

@mrabarnett
Owner Author

Original comment by Anonymous.


You're not missing anything. They should match as you say. But I'm seeing a different result (Ubuntu 10 with Python 2.6):

>>> regex.findall(r"^\L<options>", "solid QWERT", options=set(['good', 'brilliant', '+s\\ol[i}d']))
[]
>>> regex.findall(r"^\L<options>", "solid QWERT", options=['good', 'brilliant', '+s\\ol[i}d'])
[]
>>> regex.findall(r"^\L<options>", "+solid QWERT", options=set(['good', 'brilliant', '+solid']))
[]
>>> regex.search(r"^\L<options>", "+solid QWERT", options=set(['good', 'brilliant', '+solid']))
>>> regex.search(r"^\L<options>", "+solid QWERT", options=set(['good', 'brilliant', 'solid']))
>>> regex.search(r"^\L<options>", "solid QWERT", options=['good', 'brilliant', '+s\\ol[i}d'])
>>>

@mrabarnett
Owner Author

Original comment by Anonymous.


This is an interesting one.

If the pattern string has been seen before, the compiled regex is fetched from the cache of already-compiled regexes, even though the set of strings passed this time is different.

Should it treat the set as part of the pattern and recompile, much as it does with flags?

@mrabarnett
Owner Author

Original comment by Anonymous.


Fixed. The regex will be recompiled.
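
One way to picture the fix is a cache whose key includes the named lists as well as the pattern and flags. The sketch below is an assumption about the general shape only: `compile_with_lists` and `_cache` are hypothetical names, and the standard `re` module with an escaped alternation stands in for real `\L<name>` support.

```python
import re

_cache = {}  # hypothetical cache of compiled patterns


def compile_with_lists(pattern, flags=0, **named_lists):
    """Compile pattern, treating the named lists as part of the cache key."""
    key = (
        pattern,
        flags,
        frozenset((name, frozenset(items)) for name, items in named_lists.items()),
    )
    if key not in _cache:
        expanded = pattern
        for name, items in named_lists.items():
            ordered = sorted(items, key=len, reverse=True)
            alternation = "(?:%s)" % "|".join(re.escape(s) for s in ordered)
            # Substitute the escaped alternation for the \L<name> reference.
            expanded = expanded.replace(r"\L<%s>" % name, alternation)
        _cache[key] = re.compile(expanded, flags)
    return _cache[key]


a = compile_with_lists(r"^\L<options>", options=["good", "brilliant", "+solid"])
b = compile_with_lists(r"^\L<options>", options=["good", "brilliant", "solid"])
print(a is b)  # same pattern string, different lists, so separate cache entries
```

Because the key uses frozensets, passing the same strings in a different order reuses the cached entry, while any change to the strings forces a recompile.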

@mrabarnett
Owner Author

Original comment by Anonymous.


Yes, I think that's the right call. The named keyword argument is local to the particular compile() or search() or findall() call. Different calls may use the same keyword name for different values.

@mrabarnett
Owner Author

Original comment by Anonymous.


Sorry for the delayed reaction (I had assumed I would be notified of further comments after my post).
I'd like to confirm the fix in regex-0.1.20110616; I agree with the current solution.
thanks;
vbr
