Replace empty matches adjacent to a previous non-empty match in re.sub() #76489
Currently re.sub() replaces empty matches only when they are not adjacent to a previous match. This makes it inconsistent with re.findall() and re.finditer(), which find empty matches adjacent to a previous non-empty match, and with other RE engines. The proposed PR makes all functions that perform repeated searching (re.split(), re.sub(), re.findall(), re.finditer()) mutually consistent. The PR changes the behavior of re.split() too, but this doesn't matter, since it already differs from the 3.6 behavior. The BDFL has approved this change. This change doesn't break any stdlib code. It is not expected to break much third-party code, and any code it does break can be easily rewritten. For example, replacing re.sub('(.*)', ...) (which now also matches an empty string at the end of the string) with re.sub('(.+)', ...) is an obvious fix. |
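A minimal sketch of the behavior described above, assuming Python 3.7 or later. The input string 'asd' and the replacement 'foo' are illustrative values only.

```python
import re

# On Python 3.7+, '(.*)' first matches the whole string and then also
# matches the empty string at the end, so the replacement is applied twice.
star_result = re.sub('(.*)', 'foo', 'asd')

# '(.+)' cannot match an empty string, so only one replacement happens.
plus_result = re.sub('(.+)', 'foo', 'asd')

print(star_result)  # foofoo
print(plus_result)  # foo
```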
Could anybody please review at least the documentation part? |
This was a really bad idea in my opinion. We just found this and we have no way to know how this will impact production. It's really absurd that re.sub('(.*)', r'foo', 'asd') is "foo" in python 1 to 3.6 but 'foofoo' in python 3.7. |
Just as a comparison, sed does the 3.6 thing:
|
It's now consistent with Perl, PCRE and .Net (C#), as well as re.split(), re.sub(), re.findall() and re.finditer(). |
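As a rough illustration of that consistency, assuming Python 3.7 or later, the four functions can be run on the same pattern and input. The pattern 'x*' and the string 'abxd' are illustrative choices, not taken from the PR.

```python
import re

pattern, text = r'x*', 'abxd'  # illustrative values

# Every function walks the same sequence of matches, including the empty
# match at position 3 that directly follows the non-empty match 'x'.
spans = [m.span() for m in re.finditer(pattern, text)]
found = re.findall(pattern, text)
replaced, count = re.subn(pattern, '-', text)
pieces = re.split(pattern, text)

print(spans)
print(found)
print(replaced, count)
print(pieces)
```

The number of substitutions equals the number of matches found, and re.split() returns one more piece than there are matches.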
That might be true, but that seems like a weak argument. If anything, it means those others are broken. What is the logic behind "(.*)" returning the entire string (which is what you asked for) and exactly one empty string? Why not two empty strings? 3? 4? 5? Why not an empty string at the beginning? It makes no practical sense. We will have to spend considerable effort to work around this change and adapt our code to 3.7. The lack of a discussion about backwards compatibility in this, and the other, thread before making this change is also a problem I think. |
Consider re.findall(r'.{0,2}', 'abcde'). It finds 'ab', then continues where it left off to find 'cd', then 'e'. It can also find ''; re.match(r'.*', '') does match, after all. It could, in fact, find an infinite number of ''. And what about re.match(r'()*', '')? What should it do? Run forever? Raise an exception? At some point you have to make a decision as to what should happen, and the general consensus has been to match once. |
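The walk-through above can be checked directly. This is only a sketch of what the comment describes, using the same pattern and input.

```python
import re

# Scanning resumes where the previous match ended; at the end of the
# string the pattern still matches, but only as a single empty string.
chunks = re.findall(r'.{0,2}', 'abcde')
print(chunks)  # ['ab', 'cd', 'e', '']

# An empty string is a valid match for '.*'.
empty_match = re.match(r'.*', '')
print(empty_match is not None)  # True
```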
We were also bitten by this behavior change in google/vroom#110. I'm kinda baffled by the new behavior and assumed it had to be an accidental regression, but I guess not. If you have any other context on the BDFL conversation and reasoning for calling this behavior correct, I'd love to see additional info. |
We were also bitten by this. In fact we still run a compatibility shim in production where we log if the new and old behavior are different. We also didn't think this "bug fix" made sense or was treated with the appropriate gravity in the release notes. I understand the logic in the bug tracker, and that it matches other languages is good. But the behavior also makes no sense for the .* case unfortunately.
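The shim itself is not shown in this thread. As a hypothetical sketch only, one way to flag calls that may be affected is to look for an empty match that starts exactly where the previous match ended, since that is the case re.sub() treats differently from 3.7 on. The helper name has_adjacent_empty_match is an assumption made for illustration.

```python
import re

def has_adjacent_empty_match(pattern, string, flags=0):
    """Return True if some empty match directly follows a previous match.

    Hypothetical helper: such matches are skipped by re.sub() before
    Python 3.7 and replaced from 3.7 on, so a True result marks a call
    whose output may differ between versions.
    """
    prev_end = None
    for m in re.finditer(pattern, string, flags):
        if m.start() == m.end() == prev_end:
            return True
        prev_end = m.end()
    return False

print(has_adjacent_empty_match('(.*)', 'asd'))  # True
print(has_adjacent_empty_match('(.+)', 'asd'))  # False
```

A check like this only detects the adjacent-empty-match case. It does not reproduce the pre-3.7 output.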
|
So third-party code was knowingly broken to satisfy an aesthetic notion that substitution should be more like iteration. Would not a FutureWarning have been a kinder way to stage this implementation? A foolish consistency, indeed. |
The former implementation was wrong. See bpo-25054, which contains more obvious examples of that bug:
>>> re.sub(r"\b|:+", "-", "a::bc")
'-a-:-bc-'
Not all colons were replaced despite the fact that the pattern matches all colons. |
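For comparison, the same call can be run on Python 3.7 or later, where no colon should remain in the result. This is a sketch under that version assumption. The output shown above is the older behavior.

```python
import re

# On Python 3.7+, the empty match of \b at position 1 no longer causes
# the scan to skip over the first colon, so ':+' consumes both colons.
result = re.sub(r"\b|:+", "-", "a::bc")
print(result)
print(':' in result)  # False
```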
@serhiy.storchaka Thanks for the link to bpo-25054 to clarify this change was not done solely for aesthetics. I wish it had been done in a more staged and overt way, but that is just spitting in the wind at this point. Thanks for all your work, my gripe du jour notwithstanding. |
If the behavior is obviously wrong (like in bpo-25054), we can fix it without warnings, and even backport the fix to older versions, because we do not expect that anybody depends on such weird behavior. If we are going to change the behavior, but expect that users can depend on the current behavior, we emit a FutureWarning first (and we did it for other changes in re). But this issue is the hard one. Before 3.7 we did not know that it is related to bpo-25054. We were not going to change this behavior (at least not in the near future). But when a fix for bpo-25054 was written, we saw that it is the same issue. We did not want to keep the bug from bpo-25054 for a few more versions, so we changed the behavior in this issue without warnings. It was an exceptional case. This change was documented in the module documentation and in "What's New in Python 3.7" (section "Porting to Python 3.7"). If this is not enough, we will be happy to get help to make it better. |
I still think re.sub() behaves wrongly in certain cases like the following: |
Please read all previous discussions and explanations scattered across several issues. In your example, look how many occurrences of the pattern are found in the string:
And where they are found:
Now, if you replace characters in the ranges 0 to 17 and 17 to 17 with the replacement string "needle", what do you get? |
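The original example is not reproduced here, so the following is a sketch with stand-in values: a hypothetical 17-character input and the pattern '.*', assuming Python 3.7 or later. It shows how the two match ranges lead to two replacements.

```python
import re

text = "haystack haystack"  # hypothetical 17-character input
pattern = r'.*'             # assumed pattern, for illustration only

# Two matches: the whole string, then the empty string at the end.
spans = [m.span() for m in re.finditer(pattern, text)]
print(spans)  # [(0, 17), (17, 17)]

# Replacing both ranges yields the replacement string twice.
result = re.sub(pattern, 'needle', text)
print(result)  # needleneedle
```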
There was a behavioral change in Python 3.7 regarding empty matches. This adjusts the regex to no longer match an empty string. See python/cpython#76489
The ".*" can match empty line but the ".+" can't. This conversion is not equivalent. |
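The difference is easy to see on an empty input. The sketch below also shows one possible alternative that keeps '.*' and limits the substitution with count=1. The helper name replace_first is hypothetical, and whether this fits depends on the calling code.

```python
import re

# '.*' matches an empty string once, '.+' does not match it at all.
star_on_empty = re.sub('(.*)', 'foo', '')
plus_on_empty = re.sub('(.+)', 'foo', '')
print(repr(star_on_empty))  # 'foo'
print(repr(plus_on_empty))  # ''

def replace_first(text):
    # Hypothetical helper: keep '.*' but stop after the first match,
    # so the trailing empty match is never replaced.
    return re.sub('(.*)', 'foo', text, count=1)

print(replace_first('asd'))  # foo
print(replace_first(''))     # foo
```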