filter out duplicate results #357

Closed
timotheecour opened this issue Feb 11, 2017 · 10 comments

Comments

@timotheecour

rg pattern dir1/dir2 dir1
returns duplicate results:

dir1/dir2/file_containin_pattern
dir1/dir2/file_containin_pattern

could we filter out duplicate results? (at least as an option)

the use case for dir1/dir2 dir1 is to list relevant files in dir1/dir2 ahead of the other ones in dir1

@BurntSushi
Owner

Unlikely. The reason is that determining which results are duplicates is actually quite hard despite the conceptual simplicity. (The file paths alone don't tell the whole story.)

One possible work-around for you is to execute two searches:

$ rg pattern dir1/dir2
$ rg pattern -g '!dir1/dir2' dir1

The second command won't search dir1/dir2, which achieves what you want I think.
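A minimal sketch of scripting this two-step workaround, assuming Python 3.8+ for shlex.join. The function name build_priority_searches is hypothetical; the rg arguments are exactly the ones from the two commands above, nothing more.

```python
import shlex


def build_priority_searches(pattern, priority_dir, parent_dir):
    """Build the two rg command lines from the workaround above.

    The first searches only the prioritized directory; the second searches
    the parent directory with a negated glob that excludes the prioritized one.
    """
    first = ["rg", pattern, priority_dir]
    second = ["rg", pattern, "-g", "!" + priority_dir, parent_dir]
    return [first, second]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch does not
    # require ripgrep to be installed.
    for argv in build_priority_searches("pattern", "dir1/dir2", "dir1"):
        print(shlex.join(argv))
```

Passing each list to subprocess.run in order would print the dir1/dir2 results first, followed by the rest of dir1.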

@BurntSushi
Owner

BurntSushi commented Feb 11, 2017

If you're curious about what goes into determining whether two files are the same or not, you can check out the same-file crate, which ripgrep does use in some limited circumstances (unrelated to this feature request though).
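To illustrate why the paths alone don't settle the question, here is a minimal Python sketch, assuming a Unix-like system where both hard links and symlinks can be created. It compares the (device, inode) pair reported by os.stat, which is roughly the kind of identity check the same-file crate relies on for Unix as far as I know. The names file_identity, is_same_file and demo are made up for this example and are not part of ripgrep or same-file.

```python
import os
import tempfile


def file_identity(path):
    """Return the (device, inode) pair of the file that `path` resolves to."""
    st = os.stat(path)  # os.stat follows symlinks by default
    return (st.st_dev, st.st_ino)


def is_same_file(a, b):
    """True if both paths refer to the same underlying file."""
    return file_identity(a) == file_identity(b)


def demo():
    """Compare a file against a hard link, a symlink, and an unrelated copy."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "original.txt")
        with open(original, "w") as f:
            f.write("foo\n")

        # Same contents, but a distinct file on disk.
        copy = os.path.join(tmp, "copy.txt")
        with open(copy, "w") as f:
            f.write("foo\n")

        hard = os.path.join(tmp, "hard.txt")
        os.link(original, hard)  # second name for the same inode

        soft = os.path.join(tmp, "soft.txt")
        os.symlink(original, soft)  # separate entry pointing at original

        return (
            is_same_file(original, hard),
            is_same_file(original, soft),
            is_same_file(original, copy),
        )
```

All four paths are textually different, yet only the copy is a different file. Python's standard library offers the same comparison as os.path.samefile.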

@timotheecour
Author

timotheecour commented Feb 13, 2017

  • @BurntSushi thanks for the link. After looking at the implementation, I wonder why the POSIX function realpath (on Linux/macOS) or GetFullPathName (on Windows) can't be used?

e.g., using D syntax, which I'm more familiar with:

auto is_same_file(string a, string b){return realPath(a)==realPath(b);}
// realPath is a simple D wrapper around posix realpath from C
  • still though, partially working filtering would be better than nothing at all, maybe with proper caveats in the docs or by naming it --experimental-filter-dups (and maybe only on POSIX for now)
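A minimal Python sketch of the realpath-based comparison suggested above, assuming a Unix-like system. The names same_by_realpath and demo are hypothetical. It also shows one case that canonical paths cannot catch: a hard link has its own real path even though it names the same file.

```python
import os
import tempfile


def same_by_realpath(a, b):
    """Treat two paths as the same file if their canonical paths match."""
    return os.path.realpath(a) == os.path.realpath(b)


def demo():
    """Return (symlink detected as same, hard link detected as same)."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "target.txt")
        with open(target, "w") as f:
            f.write("foo\n")

        soft = os.path.join(tmp, "soft.txt")
        os.symlink(target, soft)  # resolves back to target

        hard = os.path.join(tmp, "hard.txt")
        os.link(target, hard)  # same inode, but its own canonical path

        return (same_by_realpath(target, soft), same_by_realpath(target, hard))
```

Under these assumptions the symlink is recognized as a duplicate, while the hard link is not.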

@BurntSushi
Owner

Computing the real path of every file is extremely expensive. This option is also not possible to implement in constant memory. It is a non-starter.

I'm not particularly interested in adding experimental half-baked features.

@BurntSushi
Owner

I think the answer here is to work around it. If you tell the tool to search the same directory twice, then it should show the same results twice.

@timotheecour
Author

timotheecour commented Feb 17, 2017

@BurntSushi

  • honest question: is calling realpath on a file really more expensive than opening the file and doing the regex search? It would obviously only be run on the files that pass the various file filters ripgrep already uses

"This option is also not possible to implement in constant memory"

  • how is that different from --sort-files in that respect? --sort-files has to use O(N) memory where N=number of output lines produced

EDIT: actually since we're searching a file-system tree, I guess we can do better than O(N) memory by doing breadth first listing and sorting each directory so it'd be O(D) where D=max number of entries per directory; ignore last point.
Indeed, I agree that dedup across symlinks has to be O(N), because each output file is potentially a symlink to a previous output file. Still though, I would not expect that to be a real problem for most cases: the number of output files produced by an rg search should typically fit in memory.
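The memory argument can be sketched as a streaming filter in Python. This is only an illustration of the O(N) point, not how ripgrep works; dedup_paths is a hypothetical name, and canonical paths are used as the identity key purely for simplicity.

```python
import os


def dedup_paths(paths):
    """Yield each path whose canonical form has not been seen before.

    The `seen` set gains one entry per distinct file, so memory grows with
    the number of files emitted: any later path could turn out to be a
    symlink to an earlier one, so no entry can be discarded early.
    """
    seen = set()
    for path in paths:
        key = os.path.realpath(path)
        if key not in seen:
            seen.add(key)
            yield path
```

For example, feeding it dir1/dir2/file twice (or once as dir1/./dir2/file) yields that path only once, at the cost of remembering every canonical path produced so far.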

@mqudsi

mqudsi commented Jul 9, 2019

@BurntSushi I agree with your reasoning 100% on why rg shouldn't try to deduplicate two textually different paths; however, I am wondering if there can be an option to prevent the following behavior:

> cd (mktemp -d)
> echo foo > ./foo
> rg . -l ./ ./foo
./foo
./foo

without needing to pipe its output to another command to deal with (to avoid buffering and to be able to use --color always without needing to decode then re-add the colors during the filter stage).

@BurntSushi
Owner

I don't think it's ripgrep's responsibility to deduplicate the arguments you give it. If you don't want it searching the same directory twice, then don't give the same directory twice, i.e., do something to deduplicate the arguments before handing them to ripgrep.

@mqudsi

mqudsi commented Jul 9, 2019

That's certainly fair enough; I was just checking if you would be open to having this be an option within ripgrep itself.

@mqudsi

mqudsi commented Jul 10, 2019

(I just saw your reply again by chance and it occurred to me to point out that, strictly speaking, the inputs to rg are deduplicated, but the problem is that they are nested.)
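One way to handle nested arguments before they reach rg, as a minimal Python sketch. The name prune_nested is hypothetical, and canonical paths via os.path.realpath are an assumption of this example. Two caveats: pruning discards the "list dir1/dir2 first" ordering from the original request, and it may change results, since ripgrep searches explicitly named files even when ignore rules would skip them during directory traversal.

```python
import os


def prune_nested(paths):
    """Drop arguments that repeat, or are nested inside, another argument.

    Paths are compared by their canonical form; the original spelling of
    each surviving argument is returned, in the original order.
    """
    resolved = [os.path.realpath(p) for p in paths]
    kept = []
    for i, r in enumerate(resolved):
        covered = False
        for j, other in enumerate(resolved):
            if i == j:
                continue
            if r == other:
                # Exact duplicate: keep only the first occurrence.
                covered = j < i
            else:
                # `r` lies inside `other` if `other` is their common prefix.
                covered = os.path.commonpath([r, other]) == other
            if covered:
                break
        if not covered:
            kept.append(paths[i])
    return kept
```

With the arguments from the example above, prune_nested(["./", "./foo"]) keeps only "./", so ./foo would be reported once.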


3 participants