-
Notifications
You must be signed in to change notification settings - Fork 266
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use ruby unicode normalize to avoid libidn C problems and heavy legacy ruby code #492
Use ruby unicode normalize to avoid libidn C problems and heavy legacy ruby code #492
Conversation
2fff371
to
30fdbf1
Compare
Ok I've fixed some encoding issues with template on ruby 2.7- (passed with some ascii strings it was failing) and most hound warnings. I've left comments to explain the changes. All tests are passing. Let me know if you have any comment. As another validation step, I'm gonna run this version in my product (https://updown.io) which converts 150k+ URl regularly to see if I spot anything. |
30fdbf1
to
3f4611d
Compare
3f4611d
to
9499a18
Compare
I've just ran some test on 137k updown.io URLs, comparing normalization with previous and new code: Using this branch of addressable:
Same thing with
Then comparing the result (obfuscating the URLs):
Only two URL impacted, the difference is the same, I looked closer it's one of the difference between NFKC and NFC. So it looks all good here, I added two more specs to make sure the path is NFC normalized but NOT NFKC. One of the tests failed in my latest push but it's unrelated (it's a ruby download issue on the windows instance), I can't restart it but if you can it'll surely pass :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, though I'm no expert at all in this area! I trust our tests and your testing @jarthod :-) Thank you for this very clear pull request.
I was thinking about a major version bump, but I guess this very much a bug fix? Unless there are some very strange edge cases we don't know about and people rely on. |
Thanks! @brasic maybe you also have an opinion or feedback on this PR?
I don't think this deserves a major bump, it's more of a bugfix indeed. |
49bfa83
to
c772114
Compare
c772114
to
1998e06
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
Sorry about being slow to respond! I was out on vacation last week and the laptop stayed home. |
@sporkmonger no problem, thank you! |
As discussed in #492 (comment), this change restores `unicode_normalize_kc` as a deprecated method (in case some people where using it). Example of the produced warning: ``` NOTE: Addressable::IDNA.unicode_normalize_kc is deprecated; use String#unicode_normalize(:nfkc) instead. It will be removed on or after 2023-04. Addressable::IDNA.unicode_normalize_kc called from benchmark/unicode_normalize.rb:17. ```
This is a fix for #408 and the beginning of the simplification and modernization of the IDNA code. It does:
libidn
unicode normalize implementation by ruby's, to avoid the null terminated string problem and support more recent unicode versions.Pure
unicode normalize by ruby's, to avoid hundreds of line of slow and legacy ruby code and support more recent unicode versions.The result in terms of performance is the following (benchmark code commited, but needs to be run on master to measure the legacy "pure" code):
The native ruby implementation is 5-6 times slower than
libidn
, but also 5 times faster than the existingPure
one.Considering this is just a small part and the rest of the ruby code is more than one order of magnitude slower (doing a full
normalize
in thesimple.rb
benchmark for example takes 8.6 seconds for the same 100k iterations on my machine), I think this is more than enough, no need to try to shave off microseconds in theunicode_normalize
code if the rest is a lot of ruby.And we benefit from the native ruby function which is going to be maintained and improved.
Specs have been added to cover the problematic
\x00
case.