filter out duplicate results #357

timotheecour · 2017-02-11T23:17:26Z

rg pattern dir1/dir2 dir1
returns duplicate results:

dir1/dir2/file_containin_pattern
dir1/dir2/file_containin_pattern

could we filter out duplicate results? (at least as an option)

the use case for dir1/dir2 dir1 is to list relevant files in dir1/dir2 ahead of the other ones in dir1

BurntSushi · 2017-02-11T23:25:07Z

Unlikely. The reason is that determining which results are duplicates is actually quite hard despite the conceptual simplicity. (The file paths alone don't tell the whole story.)

One possible work-around for you is to execute two searches:

$ rg pattern dir1/dir2
$ rg pattern -g '!dir1/dir2' dir1

The second command won't search dir1/dir2, which achieves what you want I think.

BurntSushi · 2017-02-11T23:26:34Z

If you're curious about what goes into determining whether two files are the same or not, you can check out the same-file crate, which ripgrep does use in some limited circumstances (unrelated to this feature request though).
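As a rough illustration of why the file paths alone don't tell the whole story, here is a minimal Python sketch that identifies a file by its device and inode numbers on a Unix-like system. This is not the same-file crate's code; `file_key`, `is_same_file`, and `demo_identity` are hypothetical names, and the sketch assumes a filesystem that supports hard links.

```python
import os
import tempfile

def file_key(path):
    """Return a (device, inode) pair for the file that `path` resolves to."""
    st = os.stat(path)  # os.stat follows symlinks
    return (st.st_dev, st.st_ino)

def is_same_file(a, b):
    """True if both paths refer to the same underlying file."""
    return file_key(a) == file_key(b)

def demo_identity():
    # Hypothetical demo: two different path strings, one underlying file.
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "original.txt")
        hard_link = os.path.join(d, "hardlink.txt")
        unrelated = os.path.join(d, "unrelated.txt")
        for p in (original, unrelated):
            with open(p, "w") as f:
                f.write("foo\n")
        os.link(original, hard_link)  # second name for the same inode
        return is_same_file(original, hard_link), is_same_file(original, unrelated)
```

Each comparison costs a stat call per path, and the two names above share no textual relationship at all.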

timotheecour · 2017-02-13T02:02:32Z

@BurntSushi thanks for the link. After looking at the implementation, I wonder why the POSIX function realpath (on Linux/OS X) or GetFullPathName (on Windows) can't be used?

e.g., using D syntax, which I'm more familiar with:

auto is_same_file(string a, string b){return realPath(a)==realPath(b);}
// realPath is a simple D wrapper around posix realpath from C

Still, partially working filtering would be better than nothing at all, maybe with proper caveats in the docs, or by naming it --experimental-filter-dups (and maybe only on POSIX for now).
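For readers weighing the realpath idea, the following Python sketch (an illustration only, not anything from ripgrep) shows one case it handles and one it misses. `same_by_realpath` and `demo_realpath` are hypothetical names, and the demo assumes a POSIX system where creating symlinks and hard links is permitted.

```python
import os
import tempfile

def same_by_realpath(a, b):
    """Compare two paths after resolving symlinks, as realpath(3) would."""
    return os.path.realpath(a) == os.path.realpath(b)

def demo_realpath():
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "target.txt")
        with open(target, "w") as f:
            f.write("foo\n")
        sym = os.path.join(d, "symlink.txt")
        hard = os.path.join(d, "hardlink.txt")
        os.symlink(target, sym)  # resolves back to target.txt
        os.link(target, hard)    # a distinct name for the same inode
        # The symlink is detected as a duplicate; the hard link is not.
        return same_by_realpath(target, sym), same_by_realpath(target, hard)
```

So a realpath comparison would give exactly the kind of partial filtering described above: it catches symlinks but treats hard links as separate files.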

BurntSushi · 2017-02-13T03:13:17Z

Computing the real path of every file is extremely expensive. This option is also not possible to implement in constant memory. It is a non-starter.

I'm not particularly interested in adding experimental half-baked features.

BurntSushi · 2017-02-13T03:14:42Z

I think the answer here is to work around it. If you tell the tool to search the same directory twice, then it should show the same results twice.

timotheecour · 2017-02-17T18:33:51Z

@BurntSushi

honest question: is calling realpath on a file really more expensive than opening the file and doing the regex search? It would obviously only be run on the files that pass the various file filters ripgrep already uses

"This option is also not possible to implement in constant memory"

how is that different from --sort-files in that respect? --sort-files has to use O(N) memory where N=number of output lines produced

EDIT: actually since we're searching a file-system tree, I guess we can do better than O(N) memory by doing breadth first listing and sorting each directory so it'd be O(D) where D=max number of entries per directory; ignore last point.
Indeed, I agree that dedup across symlinks has to be O(N), because each output file is potentially a symlink to a previous output file. Still, I would not expect that to be a real problem in most cases: the number of output files produced by an rg search should typically fit in memory.
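The memory argument can be made concrete with a small Python sketch of a streaming filter. It is an assumption-laden illustration rather than a proposed ripgrep design: `unique_files` is a hypothetical helper, and it keys files by device and inode, which only makes sense on Unix-like systems.

```python
import os

def unique_files(paths):
    """Yield each path whose underlying file has not been seen before."""
    seen = set()  # grows by one entry per distinct file: O(N) memory
    for path in paths:
        st = os.stat(path)  # one extra syscall per candidate path
        key = (st.st_dev, st.st_ino)
        if key in seen:
            continue
        seen.add(key)
        yield path
```

The `seen` set cannot be discarded until the search ends, since any later path may turn out to be another name for an earlier file. That is the non-constant memory cost under discussion.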

mqudsi · 2019-07-09T19:52:37Z

@BurntSushi I agree with your reasoning 100% on why rg shouldn't try to deduplicate two textually different paths, however I am wondering if there can be an option to prevent the following behavior:

> cd (mktemp -d)
> echo foo > ./foo
> rg . -l ./ ./foo
./foo
./foo

without needing to pipe its output to another command to deal with (to avoid buffering and to be able to use --color always without needing to decode then re-add the colors during the filter stage).

BurntSushi · 2019-07-09T19:55:25Z

I don't think it's ripgrep's responsibility to deduplicate the arguments you give it. If you don't want it searching the same directory twice, then don't give the same directory twice. i.e., Do something to deduplicate the arguments before handing them to ripgrep.

mqudsi · 2019-07-09T19:56:55Z

That's certainly fair enough; I was just checking if you would be open to having this be an option within ripgrep itself.

mqudsi · 2019-07-10T15:55:41Z

(I just saw your reply again by chance and it occurred to me to point out that, strictly speaking, the inputs to rg are deduplicated, but the problem is that they are nested.)
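One way to handle nested arguments before handing them to ripgrep is sketched below in Python. It is purely lexical (no symlink resolution), assumes POSIX-style paths, and `drop_nested` is a hypothetical helper, not a ripgrep feature.

```python
import os

def drop_nested(paths):
    """Drop arguments that equal, or lie inside, another argument."""
    normalized = [os.path.abspath(p) for p in paths]
    kept = []
    for i, p in enumerate(normalized):
        covered = False
        for j, q in enumerate(normalized):
            if i == j:
                continue
            if p == q:
                covered = j < i  # keep only the first of exact duplicates
            elif os.path.commonpath([p, q]) == q:
                covered = True   # p is inside directory q
            if covered:
                break
        if not covered:
            kept.append(paths[i])
    return kept
```

For the example above, `drop_nested(["./", "./foo"])` returns `["./"]`. This also discards the ordering that the original request wanted (results from dir1/dir2 ahead of the rest of dir1), so it addresses only the duplicate output, not that use case.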

BurntSushi closed this as completed Feb 13, 2017

timotheecour mentioned this issue Feb 14, 2018

fix issue #359 --machine-readable #802
