explore: removing or overhauling the EncodingReader #2513

Open
flavorjones opened this issue Apr 11, 2022 · 2 comments

@flavorjones
Member

The Nokogiri::HTML4::EncodingReader class tries to detect the encoding of HTML4 documents whose encoding is ambiguous.

Recently, a ReDoS vulnerability was found in this code. There are other regular expressions that should be vetted, and we should explore replacing some of them with simpler calls like String#include?.
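To illustrate the kind of replacement I mean, here is a minimal sketch. The helper name `mentions_charset?` is hypothetical and is not taken from the existing EncodingReader code; it only shows how a plain substring check can stand in for a pattern match when all we need is a yes/no answer.

```ruby
# Hypothetical helper (not Nokogiri's actual code): decide whether a chunk
# of markup is worth inspecting for an encoding declaration, using plain
# substring checks instead of a regular expression.
def mentions_charset?(chunk)
  # Work on a binary copy so invalid byte sequences cannot raise,
  # and downcase so the check is ASCII case-insensitive.
  haystack = chunk.b.downcase
  haystack.include?("<meta") && haystack.include?("charset")
end

mentions_charset?('<META Charset="utf-8">')  # true
mentions_charset?("<p>no declaration</p>")   # false
```

String#include? does a straightforward substring search, so there is no backtracking behavior to audit.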

This class was written at a time (Ruby 1.9) when Ruby strings were encoded as ASCII-8BIT by default. That hasn't been true since (I think) Ruby 2.0, so this complexity may only serve an edge case that we no longer need to support. If so, we may be able to remove the entire class, simplifying both the CRuby and JRuby implementations.

@flavorjones
Member Author

Perhaps more specifically: let's consider unifying the encoding detection algorithm from the HTML5 parser and the HTML4 parser.

@flavorjones
Member Author

@stevecheckoway notes that the HTML5 encoding detection is incomplete with respect to the spec's prescan algorithm: https://html.spec.whatwg.org/multipage/parsing.html#prescan-a-byte-stream-to-determine-its-encoding
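For discussion, here is a rough sketch of the general shape of a prescan, under heavy simplifying assumptions. It is not the full WHATWG algorithm and not what either Nokogiri parser does today: it ignores BOMs, comments, and the http-equiv/content pairing rules, and it returns the raw lowercased label without mapping it to an encoding. The names `PRESCAN_LIMIT` and `sniff_meta_charset` are made up for this example.

```ruby
# Simplified sketch only: look for a charset label inside <meta ...> tags
# within the first 1024 bytes of the input.
PRESCAN_LIMIT = 1024

def sniff_meta_charset(bytes)
  # Binary copy of the head of the stream, downcased for ASCII
  # case-insensitive matching.
  head = bytes.b[0, PRESCAN_LIMIT].downcase
  pos = 0
  while (start = head.index("<meta", pos))
    # The tag runs to the next ">" or to the end of the scanned window.
    stop = head.index(">", start) || head.length
    tag = head[start...stop]
    if (i = tag.index("charset"))
      rest = tag[(i + "charset".length)..].lstrip
      if rest.start_with?("=")
        value = rest[1..].lstrip
        value = value.delete_prefix('"').delete_prefix("'")
        # Take everything up to whitespace, a quote, ";" or ">".
        label = value[/\A[^\s"';>]+/]
        return label if label
      end
    end
    pos = stop
  end
  nil
end

sniff_meta_charset('<html><head><meta charset="UTF-8">')  # "utf-8"
sniff_meta_charset("<p>no declaration</p>")               # nil
```

A unified detector for HTML4 and HTML5 would need to cover the parts of the spec that this sketch skips.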
