Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nimgrep improvements 2 #15612

Merged
merged 20 commits into from
Nov 9, 2020
Merged

Nimgrep improvements 2 #15612

merged 20 commits into from
Nov 9, 2020

Conversation

a-mr
Copy link
Contributor

@a-mr a-mr commented Oct 17, 2020

I'm using nimgrep a lot in my daily job, and I added a few useful things.

  • Speed up search by using multiple threads (options --nWorkers:N, -n:N)
  • UPDATED: implement pipe mode -, which allows for using nimgrep in Unix filters.
  • Ability to limit output by cropping lines at a fixed number of 8-bit characters (--limit:N, -m:N) or by terminal width (--fit). See example on picture below. It's useful when dealing with binary files with long lines or when printing on small screens, since line breaks mess up visual perception.
  • Better displaying of non-printable characters by substituted ASCII characters with colors (--onlyAscii, -o). It's useful for displaying binary files, especially with the limit options above.
  • Ability to detect binary files and exclude them from the search (--bin:no = --text, -t) or include only them (--bin:only)
  • Add --sortTime option to be able e.g. search starting from recently changed files — when you remember that what you are looking for is recent
  • The order in which files appear is fixed now: it's alphabetic sorting by default (unless --sortTime is specified). I want the ability to expect that files will appear in stable order any time to memorize where should I look to; while Linux directory walking yields you files and directories non-deterministically. It's especially important with threads that would mess up the order if they were given the chance. (However this increases memory consumption in multi-thread mode since search results need to be stored in memory until contiguous sequence of files is received from thread workers)
  • Add --match and --noMatch option. Useful when you want to search only specific files based on presence of string/match in them, e.g. an appropriate import statement, or copyright year, or language extensions/features/versions for such languages...
  • Proper printing of TABs — useful e.g. when searching through some C code bases like linux kernel
  • Add --count option for displaying only number of matches in files
  • add --includeDir option for limiting search only to specific directories (e.g. '.*tests' if your project have such a convention for unit tests)
  • UPDATED: (probably breaking) replacement mode now requires specifying all 3 positional arguments, since it seems too dangerous that something will be omitted and works not as expecting, especially considering that --confirm is not default. Once playing with --replace I made a mistake of forgetting specify any replacement, and nimgrep accepted it as an empty string "", basically deleting all patterns. Targets also should be specified explicitly: current dir "." is too dangerous for a default setting.
  • (possibly a breaking change?) if no directory is specified then "." is used instead of the fullpath, which was redundant IMHO
  • Peg can be used for all patterns now (including --includeFile, --match, etc)
  • fix performance regression introduced in nimgrep improvements nimgrep improvements #12779, which slowed down search by 10—15 %.
  • switch to --gc:orc
  • miscellaneous: better error reporting. added extended msg for failed library loads w/ incorrect DLL formats #13950 is fixed in another way (CC @genotrance). Fix Fix Nimgrep bug #15318 (CC @juancarlospaco).
  • UPDATED: --help is reworked and better structured.

image

A note on thread performance: the speed up with regex is not very great, it's limited to factor 2.5—3. Basically after reaching -n:3 or -n:4 threads performance stops to increase. I debugged that with gdb, it seems that the program with large number of threads (e.g. -n:20) spends almost all its time in syscalls read, that is reading the files into the process memory. I don't know whether it's gotten memory-bound or Linux is unable to deliver file contents because of some internal locks. When using (slow) Peg pattern, performance increases nearly linearly: I got speed-up 5.5—5.8 on my 6-core Ryzen 3600 with -n:6 and -n:12.

With --gc:orc speed is about the same as with --gc:markandsweep. But with orc nimgrep consumes ~2 times less memory in multi-threaded mode: 170 MB with -n:6 on Nim repository, while it is 370 MB with --gc:markandsweep. (Measure by Linux utility /usr/bin/time --verbose).

@a-mr
Copy link
Contributor Author

a-mr commented Oct 18, 2020

The compilation failure with nim cpp test on MacOS is related to using terminalWidth:
https://dev.azure.com/nim-lang/Nim/_build/results?buildId=8706&view=logs&j=a0440cd6-2060-5545-8b53-639e777de0c6&t=e8126762-32e4-52a6-97de-95cf0bedbe3d&l=2144

Undefined symbols for architecture x86_64:
  "ctermid(char*)", referenced from:
      terminalWidth__s9bdJLFVciGfQ0i2sQaMF1g() in stdlib_terminal.nim.cpp.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
Error: execution of an external program failed: 'clang++   -o /Users/runner/work/1/s/tools/nimgrep  /Users/runner/work/1/s/nimcache/tools/nimgrep_50b398663f4fa8506176c89182a8ee4d/stdlib_assertions.nim.cpp.o /Users/runner/work/1/s/nimcache/tools/nimgrep_50b398663f4fa8506176c89182a8ee4d/stdlib_locks.nim.cpp.o /Users/runner/work/1/s/nimcache/tools/nimgrep_50b398663f4fa8506176c89182a8ee4d/stdlib_sharedlist.nim.cpp.o /Users/runner/work/1/s/nimcache/tools/nimgrep_50b398663f4fa8506176c89182a8ee4d/stdlib_io.nim.cpp.o /Users/runner/work/1/s/nimcache/tools/nimgrep_50b398663f4fa8506176c89182a8ee4d/stdlib_system.nim.cpp.o /Users/runner/work/1/s/nimcache/tools/nimgrep_50b398663f4fa8506176c89182a8ee4d/stdlib_parseutils.nim.cpp.o /Users/runner/work/1/s/nimcache/tools/nimgrep_50b398663f4fa8506176c89182a8ee4d/stdlib_math.nim.cpp.o /Users/runner/work/1/s/nimcache/tools/nimgrep_50b398663f4fa8506176c89182a8ee4d/stdlib_algorithm.nim.cpp.o /Users/runner/work/1/s/nimcache/tools/nimgrep_50b398663f4fa8506176c89182a8ee4d/stdlib_unicode.nim.cpp.o /Users/runner/work/1/s/nimcache/tools/nimgrep_50b398663f4fa8506176c89182a8ee4d/stdlib_strutils.nim.cpp.o /Users/runner/work/1/s/nimcache/tools/nimgrep_50b398663f4fa8506176c89182a8ee4d/stdlib_pathnorm.nim.cpp.o /Users/runner/work/1/s/nimcache/tools/nimgrep_50b398663f4fa8506176c89182a8ee4d/stdlib_posix.nim.cpp.o /Users/runner/work/1/s/nimcache/tools/nimgrep_50b398663f4fa8506176c89182a8ee4d/stdlib_times.nim.cpp.o /Users/runner/work/1/s/nimcache/tools/nimgrep_50b398663f4fa8506176c89182a8ee4d/stdlib_os.nim.cpp.o /Users/runner/work/1/s/nimcache/tools/nimgrep_50b398663f4fa8506176c89182a8ee4d/stdlib_parseopt.nim.cpp.o /Users/runner/work/1/s/nimcache/tools/nimgrep_50b398663f4fa8506176c89182a8ee4d/stdlib_pegs.nim.cpp.o /Users/runner/work/1/s/nimcache/tools/nimgrep_50b398663f4fa8506176c89182a8ee4d/stdlib_pcre.nim.cpp.o /Users/runner/work/1/s/nimcache/tools/nimgrep_50b398663f4fa8506176c89182a8ee4d/stdlib_rtarrays.nim.cpp.o /Users/runner/work/1/s/nimcache/tools/nimgrep_50b398663f4fa8506176c89182a8ee4d/stdlib_re.nim.cpp.o /Users/runner/work/1/s/nimcache/tools/nimgrep_50b398663f4fa8506176c89182a8ee4d/stdlib_strformat.nim.cpp.o /Users/runner/work/1/s/nimcache/tools/nimgrep_50b398663f4fa8506176c89182a8ee4d/stdlib_terminal.nim.cpp.o /Users/runner/work/1/s/nimcache/tools/nimgrep_50b398663f4fa8506176c89182a8ee4d/stdlib_cpuinfo.nim.cpp.o /Users/runner/work/1/s/nimcache/tools/nimgrep_50b398663f4fa8506176c89182a8ee4d/stdlib_osproc.nim.cpp.o /Users/runner/work/1/s/nimcache/tools/nimgrep_50b398663f4fa8506176c89182a8ee4d/stdlib_tables.nim.cpp.o /Users/runner/work/1/s/nimcache/tools/nimgrep_50b398663f4fa8506176c89182a8ee4d/@mnimgrep.nim.cpp.o  -lm   -ldl'

I cannot reproduce this problem in MacOS Mojave Virtualbox VM. When I run bin/nim cpp --hints:on -d:testing --nimblePath:tests/deps --debugger:on --nimCache:nimcache/tools/nimgrep_50b398663f4fa8506176c89182a8ee4d --stdout '--hint[Path]:off' '--hint[Processing]:off' tools/nimgrep, it compiles successfully.

@a-mr
Copy link
Contributor Author

a-mr commented Oct 18, 2020

No, adding -lc did not help :-(

@a-mr
Copy link
Contributor Author

a-mr commented Oct 19, 2020

reproduced with mac OS X 10.15 Catalina.

@a-mr
Copy link
Contributor Author

a-mr commented Oct 19, 2020

I checked the generated code, stdlib_terminal.cpp, and managed to boil the failure down to just 2 lines.

It's either a bug in clang++ compiler or in the headers, the bug appears only when unistd.h is included BEFORE stdio.h (where ctermid is defined)

This compiles:

#include <stdio.h>
#include <unistd.h>  // <- may comment this line, it doesn't matter
int main() {
    ctermid(NULL);
}

This does not:

#include <unistd.h>
#include <stdio.h>
int main() {
    ctermid(NULL);
}

image

If I compile with -c and then look into the *.o file by nm, then symbol _ctermid is not present indeed, so it's not a linker problem.

tools/nimgrep.nim.cfg Outdated Show resolved Hide resolved
@Araq
Copy link
Member

Araq commented Oct 21, 2020

I only skimmed this patch and hope @timotheecour does reivew it with more care than I did.

Oh and thanks for the awesome work!

tools/nimgrep.nim Outdated Show resolved Hide resolved
tools/nimgrep.nim Outdated Show resolved Hide resolved
tools/nimgrep.nim Outdated Show resolved Hide resolved
tools/nimgrep.nim Outdated Show resolved Hide resolved
tools/nimgrep.nim Outdated Show resolved Hide resolved
tools/nimgrep.nim Outdated Show resolved Hide resolved
tools/nimgrep.nim Outdated Show resolved Hide resolved
--limit[:N], -m[:N] limit max width of lines from files by N characters (80)
--fit calculate --limit from terminal width for every line
--onlyAscii, -o use only printable ASCII Latin characters 0x20-0x7E
(substitutions: 0 -> @, 1-0x1F -> A-_, 0x7F-0xFF -> !)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

somewhat serious bug: $ nimgrep x bin/nim hangs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot reproduce :-/

What's your OS? Could you compile with --gc:markandsweep (I don't quite trust my recent change to --gc:orc)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

repro:
I'm on OSX, using iterm

XDG_CONFIG_HOME= $nim_140_X c tools/nimgrep.nim # (XDG_CONFIG_HOME to isolate from my environment)
tools/nimgrep x $nim_140_X

adding --gc:markandsweep or --gc:arc makes no difference

  • this hangs: tools/nimgrep x $nim_140_X
  • this does not hang: tools/nimgrep x $nim_140_X | wc -l
  • this does not hang: tools/nimgrep --onlyAscii $nim_140_X | wc -l

this suggests that the culprit is binary data being printed to the terminal are being interpreted by the terminal, causing arbitrary behavior including hanging.

This should not be the default behavior (printing binary data to terminal can have potentially dangerous / malicious consequences), so I suggest making --onlyAscii:on the default, and users that know what they're doing can us explicit --onlyAscii:off to disable

We can bikeshed the --onlyAscii name separately

tools/nimgrep.nim Outdated Show resolved Hide resolved
tools/nimgrep.nim Outdated Show resolved Hide resolved
tools/nimgrep.nim Outdated Show resolved Hide resolved
tools/nimgrep.nim Outdated Show resolved Hide resolved
tools/nimgrep.nim Outdated Show resolved Hide resolved
tools/nimgrep.nim Outdated Show resolved Hide resolved
tools/nimgrep.nim Outdated Show resolved Hide resolved
@a-mr
Copy link
Contributor Author

a-mr commented Nov 8, 2020

@timotheecour , thank you for great review!

I think it's finished.
There are 2 problems left unaddressed:

  • hang with MacOS iterm. I guess it's because of iterm. Unfortunately I did not have time to look into that.
  • code bloating because of iterators with multiple yields. I tried to tackle that one, but mostly failed because rewriting in one-yield style requires to obfuscate the logic to some extent. (I would rather change to 1st-class iterators, but attempt to do so caused compiler's internal errors and segfaults - I will try to make a minimal reproducer and report a bug.) For now I consider the problem as not critical: nimgrep size increases from 500KB to 800KB. However there is a more annoying problem of drastic increase in compilation time. Let us address this problem in future work.

@timotheecour
Copy link
Member

LGTM, thanks for your patience!

code bloating because of iterators with multiple yields

ya, leave it as is; hopefully nim-lang/fusion#32 gets merged soon and then you can reuse it, simplifying code and avoiding multiple yield in the process, among other benefits.

hang with MacOS iterm

ok to leave as TODO; for future reference: #15612 (comment)

I'd also like something like BurntSushi/ripgrep#1727 but that can be done in future work

@Araq Araq merged commit 5db181f into nim-lang:devel Nov 9, 2020
PMunch pushed a commit to PMunch/Nim that referenced this pull request Jan 6, 2021
* nimgrep: speed up by threads and Channels
* nimgrep: add --bin, --text, --count options
* nimgrep: add --sortTime option
* allow Peg in all matches
including --includeFile, --excludeFile, --excludeDir

* add --match and --noMatch options
* add --includeDir option
* add --limit (-m) and --onlyAscii (-o) options
* fix performance regression

introduced in nimgrep improvements nim-lang#12779

* better error handling
* add option --fit
* fix groups in --replace
* fix flushing, --replace, improve --count
* use "." as the default directory, not full path
* fix --fit for Windows
* force target to C for macosx
* validate non-negative int input for options nim-lang#15318
* switch nimgrep to using --gc:orc
* address review: implement cropping in matches,...
* implement stdin/pipe & revise --help
* address stylistic review & add limitations
mildred pushed a commit to mildred/Nim that referenced this pull request Jan 11, 2021
* nimgrep: speed up by threads and Channels
* nimgrep: add --bin, --text, --count options
* nimgrep: add --sortTime option
* allow Peg in all matches
including --includeFile, --excludeFile, --excludeDir

* add --match and --noMatch options
* add --includeDir option
* add --limit (-m) and --onlyAscii (-o) options
* fix performance regression

introduced in nimgrep improvements nim-lang#12779

* better error handling
* add option --fit
* fix groups in --replace
* fix flushing, --replace, improve --count
* use "." as the default directory, not full path
* fix --fit for Windows
* force target to C for macosx
* validate non-negative int input for options nim-lang#15318
* switch nimgrep to using --gc:orc
* address review: implement cropping in matches,...
* implement stdin/pipe & revise --help
* address stylistic review & add limitations
irdassis pushed a commit to irdassis/Nim that referenced this pull request Mar 16, 2021
* nimgrep: speed up by threads and Channels
* nimgrep: add --bin, --text, --count options
* nimgrep: add --sortTime option
* allow Peg in all matches
including --includeFile, --excludeFile, --excludeDir

* add --match and --noMatch options
* add --includeDir option
* add --limit (-m) and --onlyAscii (-o) options
* fix performance regression

introduced in nimgrep improvements nim-lang#12779

* better error handling
* add option --fit
* fix groups in --replace
* fix flushing, --replace, improve --count
* use "." as the default directory, not full path
* fix --fit for Windows
* force target to C for macosx
* validate non-negative int input for options nim-lang#15318
* switch nimgrep to using --gc:orc
* address review: implement cropping in matches,...
* implement stdin/pipe & revise --help
* address stylistic review & add limitations
ardek66 pushed a commit to ardek66/Nim that referenced this pull request Mar 26, 2021
* nimgrep: speed up by threads and Channels
* nimgrep: add --bin, --text, --count options
* nimgrep: add --sortTime option
* allow Peg in all matches
including --includeFile, --excludeFile, --excludeDir

* add --match and --noMatch options
* add --includeDir option
* add --limit (-m) and --onlyAscii (-o) options
* fix performance regression

introduced in nimgrep improvements nim-lang#12779

* better error handling
* add option --fit
* fix groups in --replace
* fix flushing, --replace, improve --count
* use "." as the default directory, not full path
* fix --fit for Windows
* force target to C for macosx
* validate non-negative int input for options nim-lang#15318
* switch nimgrep to using --gc:orc
* address review: implement cropping in matches,...
* implement stdin/pipe & revise --help
* address stylistic review & add limitations
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants