--format option for bytes/characters of context with limit #414
Comments
Yes. The output of You could use It might be useful to consider updating ugrep to include a format parameter to |
It would be highly useful for us. Our regex is exceptionally long because we put all of the PII patterns into one giant RE (we had to recompile PCRE2 on our system to raise the character limits to a 64-bit word size). It's something like 12k characters, which ugrep's RE optimization expands to something like 34k. ugrep churns through it pretty well until we hit odd binary files that contain no newline delimiters at all. In some situations we also use -F (fixed strings) with a -f input file of fixed strings containing everyone's names. Both cases would benefit from output format limiting.
We use the output context to extract the surrounding bytes or string, both to get a visual representation of what the match looks like and to process each matching context for further refinement before displaying it on the CLI. Initially I used just the byte start location with dd to extract the context, but I quickly ran into issues with scans of compressed files (gz, zip, etc.), where a byte location is a lot harder to extract from. So I went back to %Q and post-processed the lines with some bash code to truncate them to 32b{match}32b.
Then we ran into the k9s bug. A team member ran our script against a directory that happened to contain k9s, and that user also happened to search for kube in their -f fixed-strings file. The context blew up. Example command to reproduce: We scan a lot of "random" files as a QC check for some internal processes, and our script has to handle unknown text and binary files gracefully. Your %[10]o pattern would make things a lot easier for us. Another option I was considering was somehow detecting that our script is receiving too much text and killing ugrep in that event. |
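The last idea in the comment above (detect too much output and kill the search) could be sketched roughly as follows. This is only an illustration in Python, not part of ugrep or of the poster's script: the helper name `run_capped`, the byte budget, and the chunk size are all hypothetical, and `cmd` stands in for whatever ugrep command line the wrapper script builds.

```python
import subprocess
import sys


def run_capped(cmd, max_bytes, chunk_size=65536):
    """Run cmd and collect stdout, killing the process once more than
    max_bytes have been received.

    Returns (output, truncated), where output holds at most max_bytes.
    Hypothetical helper for illustration only.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    chunks = []
    total = 0
    truncated = False
    try:
        while True:
            # read1 returns as soon as some data is available (b"" at EOF)
            chunk = proc.stdout.read1(chunk_size)
            if not chunk:
                break
            chunks.append(chunk)
            total += len(chunk)
            if total > max_bytes:
                # Budget exceeded: stop the child instead of buffering more
                truncated = True
                proc.kill()
                break
    finally:
        proc.stdout.close()
        proc.wait()
    return b"".join(chunks)[:max_bytes], truncated
```

The caller can then decide what to do with a truncated result, for example report the file as "match found, context suppressed" instead of printing megabytes of binary data.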
Very interesting use case! Thanks for sharing! OK, so the large output is happening because we're matching very long lines and each match produces another long output with Note that We can implement an extension of the Would something like that work for you? And perhaps Caveat: the |
I would also like to add the ability to output a group capture in CSV, JSON or XML which is currently not possible. A group capture is output with
Now, we can also add a new field that specifies a context size to output a match. This can be combined with all of the above fields without adding unnecessary complexity to the arguments of these fields. These assume that option Note: |
How about:
%[-n]O n chars before the match
%[+n]O n chars after the match
%[-n]Q quoted n chars before
%[+n]Q quoted n chars after
%[-n]C n chars before as C/C++
%[+n]C n chars after as C/C++
%[-n]J n chars before as JSON
%[+n]J n chars after as JSON
%[-n]X n chars before as XML
%[+n]X n chars after as XML
%[-n]V n chars before as CSV
%[+n]V n chars after as CSV
Not 100% convinced yet on my end that this is a good approach. It's a bit much with all these listed doing the same thing, essentially. Also, should these be n chars or n bytes? Chars (Unicode) is probably more reasonable. Also, the number of chars before/after the match will be less than the given n when we hit the line's begin or end. We don't want to extend this beyond the line. |
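To make the clipping rule in the comment above concrete, here is a small Python sketch of "n chars before/after, never beyond the line". It is not ugrep's implementation. The function name `context_window` is made up, and it assumes the line is already decoded so that indexing counts Unicode code points rather than bytes.

```python
def context_window(line, start, end, n):
    """Return (before, match, after) for the match line[start:end].

    At most n characters of context are taken on each side; fewer are
    returned when the match sits close to the beginning or end of the line.
    Hypothetical helper for illustration only.
    """
    before = line[max(0, start - n):start]  # clipped at the line's begin
    match = line[start:end]
    after = line[end:end + n]               # slicing clips at the line's end
    return before, match, after
```

For example, with the line `"abc kube xyz"` and the match `kube` at offsets 4 to 8, asking for 2 chars of context gives `"c "` and `" x"`, while asking for 10 gives only `"abc "` and `" xyz"` because the line ends first.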
OK. A lot going on here on my end to extend the format fields for options I'm getting more comfortable with adding a couple of new fields and a new field argument
With these new |
Will this work for you? I believe this is sufficiently flexible to cover many other use cases. |
Yes, I think those options, specifically the %[-n]o width params, are exactly what I need. I will also likely make use of the group matching format support at some point as well! Thanks |
Ugrep v6.4 is released. |
Let me go check it out! Thanks |
Let me know if you have any questions. Since you're also searching binary files, you may want to exclude them with |
Hey, I tested the new functionality and it's exactly what I needed. I'm using %o and then just replacing some of the non-printables as needed, which is working so far. Mostly \n, \r, \t, \f etc. if they come up. The k9s binary uses \f internally as a field delimiter for some kind of string separator in their source, so it's pretty obvious when the replacements aren't being done. Anyhow, we want as much binary output as possible fed to users as context when it's a binary file. They can pump it to a hexdumper or something if they need to see what those bytes are in more detail. Overall I'm super happy with the new capability. Thank you |
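The post-processing described above (making \n, \r, \t, \f visible in the extracted context) might look something like this in Python. This is a guess at the general shape, not the poster's actual script: the name `make_visible` is invented, and rendering every other non-printable byte as `\xNN` and doubling backslashes are assumptions added here so the output stays unambiguous.

```python
# Bytes named in the comment, plus backslash so the output stays unambiguous
_ESCAPES = {
    0x0A: "\\n",
    0x0D: "\\r",
    0x09: "\\t",
    0x0C: "\\f",
    0x5C: "\\\\",
}


def make_visible(data: bytes) -> str:
    """Render raw match context as printable ASCII text.

    Known control bytes become their usual escape, printable ASCII is kept,
    and anything else becomes \\xNN (an assumption, not from the thread).
    """
    out = []
    for b in data:
        if b in _ESCAPES:
            out.append(_ESCAPES[b])
        elif 0x20 <= b < 0x7F:
            out.append(chr(b))
        else:
            out.append(f"\\x{b:02x}")
    return "".join(out)
```

With this, a context such as `b"kube\fctl\n"` is displayed as `kube\fctl\n` on a single terminal line.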
Interesting use case. Thanks for sharing. One possible addition I thought of that could be useful is to add fields to output text matches/lines or output hex when the match/line is binary. Like option |
I'm stupid. We can just let options |
I have a situation where I am searching larger binary files like disc images and ELF binaries for PII strings. I have a format string that looks like "%[§]$%f%s%b%s%q%s%Q%~". This works well in many situations, except that sometimes the %Q, %O, %C fields output nearly the entire binary file, probably due to the lack of newline characters.
Is there a way to do the equivalent of (%Q, %O, %C) that would do N bytes of context around the match? For example %Q32 would yield a limited C++ quoted escaped string of 32 bytes before and after a match.
Thanks