-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implicit wildcard for Prefix1 rules longer than 8 characters #3011
Comments
Assuming the word-boundary in uBO is from left to right:
Wrong assumption, rightly:
If your intention was to block strictly a javascript, not a phrase, then the rule is wrong, coz the rule is javascript specific only in your mind, coz the rule actually isn't javascript specific, because it in no way indicates a javascript, thus the rule is only a phrase specific, which highlights the first occurence of the phrase, from left to right in the logger. The correct actual javascript specific rule is |
The real explanation is that with fix to #2630, for any token longer than 7 characters, the word-boundary rule does not apply to the end of the string. Your filter qualifies as a "plain" filter, which means internally it's a simple but fast plain string comparison, no regex involved. After fixing #2630, I knew the side effect would be to no longer respect word boundary for the end of string + for token greater than 7 characters, but the efficiency gain of #2630 was too interesting so I decided the side effect was acceptable. This is what you see. I could theoretically add a second test to the plain string test to enforce the word boundary rule, but I decided for a wait-and-see approach -- at the end of the day, I want uBO to run as best as possible on low powered devices, and I decided the quirk was acceptable. So I will leave this issue opened, but I won't fix it for now. Eventually if ever a more serious side effect arise, then I will address it. |
Thanks for the full explanation. I was unaware of the 7 char token optimization from a few months ago. So are you saying that in my specific example, uBO is actually matching the "/cookieconsent2" portion of the URL with my rule because the first 7 chars match? This would mean that any plain word rule beyond 7 chars is an implicit wildcard up to a / delimiter, but without the regex overhead.
Please do add the second test for plain string rules. (Implicit wildcards are confusing and result in unwanted surprises.) There are hundreds of >7 char plain text rules in EasyList already, so there could be plenty of cases like mine out there in the wild that go largely unnoticed. |
Yes, and it just happens that this matches EasyList's assumption of no word boundary. |
But that contradicts #2902. It'd be best to have them all the same, regardless of string length, even with a tiny performance overhead. |
You don't mind a speculated "tiny performance overhead" but you mind the "no negative side effect to filtering". Just accept that I disagree with you wanting me to fix this now. I repeat:
|
I respect your decision but would like to get this contradiction resolved at some point. Is there a way to quickly check that if the plain string rule is longer than 7 chars then do a string compare of the plain rule with the delimited URL chunk? If so, that would be a very tiny performance overhead that would ensure a consistent word-boundary policy. It's tiny because you only do the second comparison in the event of a token match. (In my example, that would be a quick string compare of "cookieconsent" with "cookieconsent2" after the 7-char token match. This would fail, but the second tokenized URL chunk would match the word-boundary check.) |
While at it since I see this as related, just a thought: with the advent of WebAssembly, all these quirks and choices are likely to become obsolete if uBO were to move to a WebAssembly-based static filtering engine, because it would no longer have to deal with javascript getting in the way, and only one filter representation internally would be needed instead of many optimized for specific purpose. So maybe this is only then I will address the issue here. |
Wow, WebAssembly would be really cool! |
I decided to figure this out for myself. Here's my modified plain string test:
This is supposed to return false after finding the "2" in the "cookieconsent2" directory portion of the URL, then later return true when it finds the "cookieconsent" filename. But everything is working the same as before blocking and logger-wise, i.e. the logger is the exact same as my first post. Is there something wrong with my method? Or is it correct and the logger merely reports the first match? |
SOLVED IT So now I have a working word-boundary enforcer for the
Here's console log for my particular rule: The logger is still the same, so that's a bug there, but I can live with that knowing that word boundaries are now properly enforced for all of the rules I use. A general word-boundary fix for any other types of rules would have to be provided by gorhill. |
Some follow-up after testing this more. Now I realize Prefix1 means a 1-char prefix ('/', '_', etc.) before the tokenized word. (There's also a Prefix0 and other types, hence the need for a generalized rule, along with more appreciation and empathy from me for gorhill doing the hard, expert work of making this classification and optimizing it with things like tokens.) I recently went entirely solo for uBO in my primary browser, which is why I'm deep-diving these things. Without the massive 3rd party lists, you have to figure out these rule patterns for yourself. Thankfully I've been able to achieve great coverage with a surprisingly small amount of rules, like those of the I briefly stubbed out my length check to test any of my Prefix1, just to learn more of what's going on. The only other word-boundary violations I saw were these specific to IMDb (that I've had even before going solo) This is good confirmation of 2 things for me. 1) The word-boundary is indeed enforced, as there is only 1 match for each of these rules and the request was not blocked. 2) Need to keep my |
That was a pretty interesting read up, by both joey04 and gorhill. Especially what gorhill mentioned about WebAssembly too. |
In crafting a solution for #3037 it occurred to me that it's better to find the word boundary rather than a violation of it. So I've refined the above solution to be more robust, now able to find a non-token char at the end of the rule. This means both Here's the new version:
|
I revisited this today, since I was wondering if I could replace a group of related prefix1 rules (e.g. But it turns out that I now conclude that my prefix1 word-boundary enforcement covers all practical use cases, including the many prefix1's in EasyList. |
I'm closing this now because it's really only an issue for those (few) of us that decide to manage rules ourselves. It doesn't truly need to be fixed for the default list setup, given that EasyList terminates virtually all of its rules of this type with a boundary character. |
I thinks its best that the logger shows the real match which occurred inside uBO. |
Oops I meant to reopen the other: #3037. |
The highlight+boldface in the logger for static rule matching appears to just match the first instance of the request URL, which is not always the actual match for the rule.
At ezgif.com there's a cookie notice script blocked by adding a
/cookieconsent
rule (my own, not in a 3p list). Given the word-boundary matching of this rule, the logger should highlight the second instance in the URL, not the first:I've seen this a few other times at other sites with other word-boundary rules.
The text was updated successfully, but these errors were encountered: