
support searching compressed files using in-process decompression #225

Closed
danbst opened this issue Nov 8, 2016 · 38 comments
Labels
icebox: A feature that is recognized as possibly desirable, but is unlikely to be implemented any time soon.
question: An issue that is lacking clarity on one or more points.

Comments

@danbst

danbst commented Nov 8, 2016

I'd like to use ripgrep for grepping log files, because it's faster than grep. But my logs are gzipped, and if I zcat | rg them I lose the log filenames in the output.

Also, it would be great if bzip2 and xz decompressors were supported too, with automatic archive type detection.

@BurntSushi added the question label Nov 9, 2016
@BurntSushi
Owner

BurntSushi commented Nov 9, 2016

I'm torn on this one for both pragmatic and philosophical reasons.

Philosophically, this is making ripgrep do a lot more than just "open a file and search it." I don't actually believe that ripgrep should do everything, even though it already does a lot. Nevertheless, it's hard to argue with the usefulness of this feature.

Practically speaking, there are two primary issues as I see it:

  1. If we wanted to implement this right now, we'd need to bring in C libraries. While there's no hard requirement against doing this (to my knowledge, as long as the three main platforms are supported), I would like to actively discourage it in the interest of keeping ripgrep pure Rust.
  2. This requires some kind of UX design. How does ripgrep detect compressed files? When it finds them, does it always search them or does a user explicitly need to enable it? How are decompression options, if they exist, controlled? What is "automatic archive type detection"?
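
As a purely hypothetical illustration of what "automatic archive type detection" could mean: sniff the magic bytes of the common stream compressors instead of trusting file extensions. A minimal sketch, not a design commitment:

```rust
/// Hypothetical content sniffing: the common stream compressors all start
/// with a fixed magic number, so a few leading bytes identify the format.
fn sniff_compression(header: &[u8]) -> Option<&'static str> {
    match header {
        [0x1f, 0x8b, ..] => Some("gzip"),
        [b'B', b'Z', b'h', ..] => Some("bzip2"),
        [0xfd, b'7', b'z', b'X', b'Z', 0x00, ..] => Some("xz"),
        _ => None,
    }
}
```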

I think that if we were to do this, it is at least blocked on some initial implementation of #1, since supporting additional text encodings has a lot of overlap with decompressing files before searching in terms of making the core search routines work with it.

With that said, I don't personally see myself working on this any time soon.
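
To make the overlap concrete, here is a minimal, hypothetical sketch (not ripgrep's actual internals): if the searcher is written against any `io::Read`, then a decompressing reader and a transcoding reader plug in the same way a plain `File` does.

```rust
use std::io::{self, BufRead, BufReader, Read};

/// Hypothetical searcher that is generic over its input stream. A gzip
/// decoder or an encoding transcoder wrapping a `File` is just another `R`;
/// the search code neither knows nor cares.
fn count_matching_lines<R: Read>(source: R, needle: &str) -> io::Result<u64> {
    let mut count = 0;
    for line in BufReader::new(source).lines() {
        if line?.contains(needle) {
            count += 1;
        }
    }
    Ok(count)
}
```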

@alejandro5042
Contributor

alejandro5042 commented Nov 10, 2016

Brainstorming here...

Instead of rg constantly playing catch-up with every type of compressed file, an alternative is to have a plugin system whereby the user can decide what rg does when it encounters a file. For example, rg could look up the extension in a configuration file (or the glob in a .gitattributes-type of file), execute the preprocessor step, run the search on the results, then delete them. Alternatively, it could simply stream the output of the command in memory and skip the file writes. Seems like quite a bit of work though....

Anyway, this could also be used to search Word, binary, XML, or other complicated formats, if for example you could turn the complicated format into a plain-text format that rg can happily parse.

@BurntSushi
Owner

BurntSushi commented Nov 10, 2016

As the maintainer, I'm not particularly interested in the plugin path, sorry.

@danbst
Author

danbst commented Nov 11, 2016

Ok, I see. Also, I don't think that ripgrep -z will perform much better than zgrep, because decompression takes time and cancels out ripgrep's speed advantage (unless Rust can magically do faster decompression).

Now, none of ag, rg and zgrep can replace each other: zgrep is faster than ag on compressed files and beats rg; ag can search gzip, zip, lzma and xz (with autodetection), which neither zgrep nor rg can do; and rg is fastest when no compression is used and your regex fits within its supported expressivity.

@BurntSushi
Owner

I don't think that ripgrep -z will perform much better than zgrep

That might be true on single files (it really depends on what proportion of time is spent in decompression), but ripgrep can certainly get wins with parallelism.

Now, none of ag, rg and zgrep can replace each other

Can we not use this as motivation for adding new features please? I ask this because it will always be true. For example, if you need a POSIX-compliant search tool, then neither ripgrep nor ag will suffice. If you need backreferences or lookaround, then ripgrep won't work but grep -P and ag will. I don't think that will ever change. ripgrep will never be everything to everyone and that's OK.

With that said, it's certainly reasonable to expect some convergence of features. For example, I specifically built ripgrep so that folks could use it for both the "search a large repo of code" and "search very large files" use cases.

I think my initial comment on this issue still stands: this feature is a possibility, but has a few philosophical and practical problems with it.

@wavexx

wavexx commented Feb 2, 2017

I would expect at least basic unix stream compressors to be supported (.gz/.bz2/.xz). All editors decompress (and often recompress) those on the fly.

Debian ships share/doc/* files in compressed form when doing so results in space savings. I couldn't do a quick search there, and searching through docs is something I do often. It makes perfect sense to have documentation compressed.

Source is sometimes compressed too. Emacs compresses .el files by default on install as well. Ironically, using the Emacs ripgrep package I cannot search the installed Emacs Lisp files ;)

I expect a performance hit when searching through compressed files, so I don't think that's a problem. My main concern is that I could miss a match because one of the files has been compressed. For text data files, this happens frequently, especially in repositories that need to float over the network.

@BurntSushi
Owner

@wavexx I don't think there's any question that this is a desirable feature. Thank you for sharing your use cases though, they're helpful.

@moorereason

I'm troubleshooting a CUPS server issue (for 3 days now), and I wish I could use a feature like this. Some files within the cups installation are gzipped while others are not, and they're scattered all over the place. So far I've been doing find . -type f -name "*.gz" | xargs zgrep needle, but I'm not finding what I expect to see and feel like I may be missing something.

Regarding the issues Andrew brought up earlier: 1) I'd rather not have ripgrep linked to C libs 👎 and 2) for the UX design, I'd just offer a -z option and have ripgrep work in "zip-only" mode. That would be good enough for my use.

I'm a nobody, but my vote is to freeze this issue until pure-Rust compression libs are available and then re-evaluate this idea at that point.

@BurntSushi added the icebox label Mar 13, 2017
@rik

rik commented Mar 21, 2017

On macOS, zgrep is the 2.5.1 BSD version. BSD grep is noticeably slower than GNU grep. I haven't found a way to easily install the GNU version, so it would be nice to have a fast and convenient way to grep directories of compressed log files.

@BurntSushi
Owner

@rik I think you can install the GNU tools through brew? (I'm not a Mac user though, so don't take my word for it. I think you wind up needing to use ggrep for GNU grep, for example.)

@rik

rik commented Mar 23, 2017

Sorry, I should have mentioned that I've looked into Homebrew/homebrew-dupes. Yes, you can do that, but it only installs grep, not zgrep.

@Xenofex

Xenofex commented Apr 20, 2017

I use ag and sift for my everyday search, not rg, just because it doesn't support gzip. This UX issue matters a lot because I do a lot of log searching as a sysadmin. Log files are generally larger than code files. Right now I have 3.6GB of gzipped log files to search frequently, which is where the raw speed of a search tool really shines. As for my code, it doesn't really matter whether it's ag, sift or rg, because each of them gives me the result fast enough.

Since ripgrep is a search utility focused on speed, I think log file search should be one of its target use cases; that's where the effort you've devoted really helps.

Thanks!

@wavexx

wavexx commented Jul 4, 2017

To follow up on my previous comment, and to make a recommendation: it would be nice if ripgrep would just use readily available stream compressors directly as co-processes instead of embedding a decompression library. Although for small files the fork might incur some penalty, it's unlikely ripgrep will ever be faster than pbzip2, and the advantage is that you gain instant access to all common Unix formats at once (you only need an extension/compressor map). You could always specialize .z/.gz later on to gain an advantage for small scattered files.
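
To sketch what the co-process approach could look like (hypothetical names, not ripgrep's implementation): keep an extension-to-command map, spawn the external tool with its output on a pipe, and hand that pipe to the searcher as an ordinary reader.

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;
use std::process::{Command, Stdio};

/// Hypothetical: pick a decompressor by file extension and return its stdout
/// as a reader; fall back to opening the file directly.
fn reader_for(path: &Path) -> io::Result<Box<dyn Read>> {
    let mut decompressors: HashMap<&str, Vec<&str>> = HashMap::new();
    decompressors.insert("gz", vec!["gzip", "-d", "-c"]);
    decompressors.insert("bz2", vec!["bzip2", "-d", "-c"]);
    decompressors.insert("xz", vec!["xz", "-d", "-c"]);

    let ext = path.extension().and_then(|e| e.to_str()).unwrap_or("");
    match decompressors.get(ext) {
        Some(argv) => {
            let child = Command::new(argv[0])
                .args(&argv[1..])
                .arg(path)
                .stdout(Stdio::piped())
                .spawn()?;
            // A real implementation would also wait on the child to avoid
            // leaving zombies behind.
            Ok(Box::new(child.stdout.expect("piped stdout")))
        }
        None => Ok(Box::new(File::open(path)?)),
    }
}
```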

@lnicola

lnicola commented Jul 4, 2017

@wavexx Please don't. Not everyone is on Linux etc.

@wavexx

wavexx commented Jul 4, 2017 via email

@BurntSushi
Owner

@wavexx Thanks for the suggestion. I'm somewhat attracted to it. Since this ticket seems to be tracking general decompression support, I've created a more focused, implementation-specific ticket: #539. I'm not sure when I personally will be able to work on it, but I would be happy to mentor it. I think anyone with some Rust experience could probably do it!

@ylluminarious
Contributor

ylluminarious commented Sep 24, 2017

@BurntSushi This ticket and #539 have been stagnant for a while. Earlier in this thread you mentioned:

If we wanted to implement this right now, we'd need to bring in C libraries. While there's no hard requirement against doing this (to my knowledge, as long as the three main platforms are supported), I would like to actively discourage it in the interest of keeping ripgrep pure Rust.

Yet, given how long this feature has been stalled, should you not just bite the bullet and pull in the libraries?

This feature is a deal-breaker for me, for others I know, and for some in this very thread. It doesn't seem that keeping ripgrep "pure Rust" is worth the trouble at this point, especially considering that this could have been implemented already.

Just my 2¢.

@BurntSushi
Owner

You aren't considering the maintenance burden of bringing in C code. I'm not inclined to change course at this time.

@ylluminarious
Contributor

@BurntSushi But it would be a small amount of work, no? Does Rust not support FFI to C to ease this sort of thing?

Also, as an aside, I believe that the "burden" of C is very much overstated these days. It's not as bad as most people make it sound...

@lnicola

lnicola commented Sep 24, 2017

@BurntSushi Maybe soon: rust-lang/flate2-rs#67

@ylluminarious

Also, as an aside, I believe that the "burden" of C is very much overstated these days. It's not as bad as most people make it sound...

You're probably right, but some people oppose it on principle. Also, on Windows, it's still somewhat awkward at times (but those are bugs, of course, and could/should be fixed).

@jdanford
Contributor

@ylluminarious

But it would be a small amount, no?

You're probably right, so why don't you just submit a pull request that adds the necessary libraries and glue code?

@BurntSushi
Owner

BurntSushi commented Sep 24, 2017 via email

@BurntSushi
Owner

@jdanford

You're probably right, so why don't you just submit a pull request that adds the necessary libraries and glue code?

Let us please try to keep things friendly here. I would rather someone not put in the work to add this unless it has a good chance of getting merged (unless they are explicitly okay with doing it regardless of the result).

@BurntSushi
Owner

I'd like to note that #539 exists as an implementation path, and the amount of work required to implement that is roughly on par with bringing in C libraries IMO.

See PR #305, which has more details.

@ylluminarious
Contributor

@jdanford

why don't you just submit a pull request[...]?

As @BurntSushi mentioned, I do not want to put in time for such a feature unless it's likely to get accepted and merged.


@lnicola Thanks a lot for sharing rust-lang/flate2-rs#67 -- intriguing stuff with obvious usefulness here.

Also, yes, it is possible that things could be awkward on Windoze, but as you mention, hopefully that can be worked around / fixed.


@BurntSushi Thanks for adding that extra information on #305 and for your thoughts on #539. I had presumed that adding support via C would be easier at this point than via Rust, but that seems incorrect now.

@jdanford
Contributor

Sorry for my snarky comment! It wasn't constructive, and it seems like everyone's on the same page now anyway.

@ylluminarious
Contributor

@jdanford No offense taken -- it was a valid question.

@Dieken

Dieken commented Jan 28, 2018

You can set the environment variable GREP to "rg" and then use zgrep/bzgrep/xzgrep -Hn to decompress and search .gz/.bz2/.xz/.lzma/.lzo files with rg. I feel this is more or less enough :-)

Note that /usr/bin/bz*grep and /usr/bin/z*grep on Mac OS X don't support the GREP environment variable; BSD tools suck... You need to install gzip and xz with Homebrew.

@BurntSushi
Owner

For anyone watching this issue, the next release of ripgrep will have support for searching compressed files using the -z/--search-zip flag by shelling out to gzip/xz/lzma/bzip2 for decompression. Thanks to @balajisivaraman for implementing this! (See #751 and #767.)

I am going to keep this issue open to track in-process decompression, since I suspect we will ultimately want to move to that. However, I don't expect that to happen any time soon.

@BurntSushi changed the title from "Support decompression on the fly" to "support searching compressed files using in-process decompression" Jan 30, 2018
@Boscop

Boscop commented Jan 30, 2018

Does it work on Windows too when 7z is installed?

@ylluminarious
Contributor

@BurntSushi Thanks for the news!

@BurntSushi
Owner

@Boscop I don't know. You need the xz, gzip and bzip2 binaries. If they don't exist, then decompression doesn't happen. I tested this in a cygwin environment and it works. If a normal Windows environment requires additional binaries, then someone will need to put in the work to add that support.

@ylluminarious
Contributor

@BurntSushi

You need the xz, gzip and bzip2 binaries.

We don't need all of those programs for every compressed file, do we? I assume you mean that we need the respective decompression program for a given file type.

@BurntSushi
Owner

@ylluminarious Yes.

@Boscop

Boscop commented Jan 30, 2018

7z.exe supports all of those.

@albfan

albfan commented Feb 20, 2018

As a maintainer, making this configurable can give you a lot of headaches, but I can't resist proposing:

static ref DECOMPRESSION_COMMANDS: HashMap<

static ref SUPPORTED_COMPRESSION_FORMATS: GlobSet = {

so that you can configure new file types to decompress:

--zip-command="jar:gzip -d -c"
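
A rough sketch of how such a flag could be parsed (the flag name and "ext:command args" format are taken from the proposal above; nothing here is ripgrep's actual API):

```rust
use std::collections::HashMap;

/// Hypothetical parser for a repeatable --zip-command="EXT:CMD ARGS..."
/// flag: each value maps a file extension to the command line used to
/// decompress that file type to stdout.
fn parse_zip_commands(values: &[&str]) -> Result<HashMap<String, Vec<String>>, String> {
    let mut table = HashMap::new();
    for value in values {
        let (ext, cmdline) = value
            .split_once(':')
            .ok_or_else(|| format!("invalid --zip-command value: {}", value))?;
        let argv: Vec<String> = cmdline.split_whitespace().map(String::from).collect();
        if ext.is_empty() || argv.is_empty() {
            return Err(format!("invalid --zip-command value: {}", value));
        }
        table.insert(ext.to_string(), argv);
    }
    Ok(table)
}
```

For example, parse_zip_commands(&["jar:gzip -d -c"]) would yield {"jar": ["gzip", "-d", "-c"]}, which could then extend the built-in map at startup.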

@BurntSushi
Owner

On second thought, I'm just going to close this issue, since there isn't much value in tracking it at this point. It will likely be a long time before in-process decompression happens.

@ylluminarious
Contributor

@BurntSushi You should probably update your Anti-Pitch for ripgrep on your blog, noting that searching compressed files is now possible via external utilities.
