-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for additional text encodings. #398
Conversation
cc @roblourens |
OK, I've updated the PR to take care of the internal buffer and memory map issue. I also refactored the transcoder to be more robust. I'd like to punt on adding new flags for now, which I think means I just need to work on writing a few more tests and this should be good to go! |
(Build failures are unrelated. There's a bug in nightly Cargo that's being fixed.) |
I've updated the PR with tests. I also ended up adding support for all encodings supported by |
This includes, but is not limited to, UTF-16, latin-1, GBK, EUC-JP and Shift_JIS. (Courtesy of the `encoding_rs` crate.) Specifically, this feature enables ripgrep to search files that are encoded in an encoding other than UTF-8. The list of available encodings is tied directly to what the `encoding_rs` crate supports, which is in turn tied to the Encoding Standard. The full list of available encodings can be found here: https://encoding.spec.whatwg.org/#concept-encoding-get This pull request also introduces the notion that text encodings can be automatically detected on a best effort basis. Currently, the only support for this is checking for a UTF-16 bom. In all other cases, a text encoding of `auto` (the default) implies a UTF-8 or ASCII compatible source encoding. When a text encoding is otherwise specified, it is unconditionally used for all files searched. Since ripgrep's regex engine is fundamentally built on top of UTF-8, this feature works by transcoding the files to be searched from their source encoding to UTF-8. This transcoding only happens when: 1. `auto` is specified and a non-UTF-8 encoding is detected. 2. A specific encoding is given by end users (including UTF-8). When transcoding occurs, errors are handled by automatically inserting the Unicode replacement character. In this case, ripgrep's output is guaranteed to be valid UTF-8 (excluding non-UTF-8 file paths, if they are printed). In all other cases, the source text is searched directly, which implies an assumption that it is at least ASCII compatible, but where UTF-8 is most useful. In this scenario, encoding errors are not detected. In this case, ripgrep's output will match the input exactly, byte-for-byte. This design may not be optimal in all cases, but it has some advantages: 1. In the happy path ("UTF-8 everywhere") remains happy. I have not been able to witness any performance regressions. 2. In the non-UTF-8 path, implementation complexity is kept relatively low. The cost here is transcoding itself. A potentially superior implementation might build decoding of any encoding into the regex engine itself. In particular, the fundamental problem with transcoding everything first is that literal optimizations are nearly negated. Future work should entail improving the user experience. For example, we might want to auto-detect more text encodings. A more elaborate UX experience might permit end users to specify multiple text encodings, although this seems hard to pull off in an ergonomic way. Fixes #1
if label == "auto" { | ||
return Ok(None); | ||
} | ||
match Encoding::for_label(label.as_bytes()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Considering that the None
case results in error rather than continuing with some default value, it would make more sense to use the for_label_no_replacement
variant here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL about the replacement encoding: https://encoding.spec.whatwg.org/#replacement
Thanks for the tip! :-)
Add support for additional text encodings.
This includes, but is not limited to, UTF-16, latin-1, GBK, EUC-JP
and Shift_JIS. (Courtesy of the
encoding_rs
crate.)Specifically, this feature enables ripgrep to search files that are
encoded in an encoding other than UTF-8. The list of available encodings
is tied directly to what the
encoding_rs
crate supports, which is inturn tied to the Encoding Standard. The full list of available encodings
can be found here: https://encoding.spec.whatwg.org/#concept-encoding-get
This pull request also introduces the notion that text encodings can be
automatically detected on a best effort basis. Currently, the only
support for this is checking for a UTF-16 bom. In all other cases, a
text encoding of
auto
(the default) implies a UTF-8 or ASCIIcompatible source encoding. When a text encoding is otherwise specified,
it is unconditionally used for all files searched.
Since ripgrep's regex engine is fundamentally built on top of UTF-8,
this feature works by transcoding the files to be searched from their
source encoding to UTF-8. This transcoding only happens when:
auto
is specified and a non-UTF-8 encoding is detected.When transcoding occurs, errors are handled by automatically inserting
the Unicode replacement character.
In all other cases, the source text is searched directly, which implies
an assumption that it is at least ASCII compatible, but where UTF-8 is
most useful. In this scenario, encoding errors are not detected.
This design may not be optimal in all cases, but it has some advantages:
able to witness any performance regressions.
low. The cost here is transcoding itself. A potentially superior
implementation might build decoding of any encoding into the regex
engine itself. In particular, the fundamental problem with
transcoding everything first is that literal optimizations are nearly
negated.
Future work should entail improving the user experience. For example, we
might want to auto-detect more text encodings. A more elaborate UX
experience might permit end users to specify multiple text encodings,
although this seems hard to pull off in an ergonomic way.
Fixes #1
Initial PR message:
This is in-progress work on adding UTF-16 support to ripgrep.
This "works," but it needs to be cleaned up a bit before it can be merged:
io::Read
interface. (This should be done optimistically by "bailing" out of the memory map strategy if there's a BOM rather than pessimistically to avoid additional syscalls in the happy path.)DecodeReader
needs to be tweaked so that its internal buffer can be reused. (Without this, the normal UTF-8 happy path causes ripgrep to be slower than grep!)If you want to try this PR out (which would be most welcome), then because of the memory map issue mentioned above, you'll need to pass the
--no-mmap
flag to ripgrep.To a first approximation, ripgrep is approximately 5x slower searching UTF-16 when compared to UTF-8. Interestingly, this applies both to searching a single ~1GB subtitle file and searching the Linux repo with all source files converted to UTF-16. At first I thought this seemed bad, but
iconv
itself isn't much faster---certainly within the same order of magnitude it seems.Another exciting thing I learned is that GNU grep doesn't support searching UTF-16 at all. For some reason, I thought it did. So this is an even better feature than I previously thought!