-
Notifications
You must be signed in to change notification settings - Fork 472
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use plugin analyze
function in audit functionality
#208
Use plugin analyze
function in audit functionality
#208
Conversation
detect_secrets/core/audit.py
Outdated
secret=raw_secret, | ||
) | ||
with codecs.open(filename, encoding='utf-8') as f: | ||
plugin_secrets = plugin.analyze(f, filename) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like we're re-scanning the file per secret audited. How does this perform with large files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like we can avoid this by either changing or not using _secret_generator
before calling this method
detect-secrets/detect_secrets/core/audit.py
Lines 288 to 292 in 2b40c99
def _secret_generator(baseline): | |
"""Generates secrets to audit, from the baseline""" | |
for filename, secrets in baseline['results'].items(): | |
for secret in secrets: | |
yield filename, secret |
Right now we call the method containing codecs.open
on every secret, when we only need to call it for each key in baseline['results']
.
FWIW, the audit functionality used line numbers as an optimization: there was no point re-scanning every line in the file, if you knew exactly which line to go to. Furthermore, since the baseline was ordered by increasing line number, the file would only have to be read once. After all, the audit tool is an automated tool for the manual effort of opening each file, go to the line, find the secret yourself. These changes seem to cause a performance hit, and the benefit is not super clear. |
As an aside, I don't know if the file is only read once as the code is written now? In audit Doing some Googling, it seems like there's a standard module for opening a single line but it doesn't seem to be in use here. I don't know that we actually want to include this module considering it seems to be couple with the To clarify, the benefit here is to fix #189. Specifically, the high-entropy plugin has filetype-specific logic that was lost in the audit mode, since I agree that this change is opening and scanning the whole file, and this is a performance hit versus the status quo of opening the whole file and scanning 1 line. Let me think of a way to avoid this. |
I think @KevinHock also has some reasons for why we would potentially want this in the comments of #189. |
While I agree we should check the performance of this, / improve it however we can, I think the benefits of this make it worth doing: Firstly, the statements
and Kind of contradict each other. Even if we explicitly call it out in our docs, it is beneficial to some degree to not rely on line numbers for auditing, and to be able to not have to tell the user about something somewhat unintuitive. Secondly, it is blocking the only complicated part of baseline diff minimization (#92), which is a good option to have for an internal reason, and we've had at least one external person ask us for it. Lastly and most importantly, it will improve maintainability due to preventing any bugs like #189 in the future. As an aside, people may very well scan thousands of repos in a cron like fashion with detect-secrets, however nobody audit's thousands of repos in a cron like fashion, so perf when scanning is significantly more important. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, just the one _secret_generator
comment change left I think, that @domanchi pointed out.
@@ -203,7 +203,7 @@ def _analyze_yaml_file(self, file, filename): | |||
item = to_search.pop() | |||
|
|||
try: | |||
if '__line__' in item and not item['__line__'] in ignored_lines: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😍
def get_audited_baseline(self, plugin_config, is_secret): | ||
def get_audited_baseline( | ||
self, | ||
plugins_used=[{'name': 'HexHighEntropyString'}], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ack'ing that I saw the default keyword arg value was a list, and it seems apropos in this case. (That is a great picture by the way, @kennethreitz)
tests/core/audit_test.py
Outdated
# NOTE: The first config here needs to be | ||
# the HexHighEntropyString config for this test to work. | ||
[{'name': 'HexHighEntropyString'}], # plugin w/o config | ||
[{'name': 'HexHighEntropyString', 'hex_limit': 2}], # plguin w/config |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/plguin/plugin/g
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, fixed
detect_secrets/core/audit.py
Outdated
secret=raw_secret, | ||
) | ||
with codecs.open(filename, encoding='utf-8') as f: | ||
plugin_secrets = plugin.analyze(f, filename) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like we can avoid this by either changing or not using _secret_generator
before calling this method
detect-secrets/detect_secrets/core/audit.py
Lines 288 to 292 in 2b40c99
def _secret_generator(baseline): | |
"""Generates secrets to audit, from the baseline""" | |
for filename, secrets in baseline['results'].items(): | |
for secret in secrets: | |
yield filename, secret |
Right now we call the method containing codecs.open
on every secret, when we only need to call it for each key in baseline['results']
.
db8010e
to
37721d4
Compare
This reduces the flakiness of the code coverage check. In particular, this code was covered in py37 but not previous versions. Also, this just makes it easier to pass the coverage test.
In audit mode, we were using secret_generator to find the plaintext secret. This was problematic because it led to inconsistences between what the scan functionality would find and what the audit functionality would find, leading to user errors. Using the same plugin function for both scan and audit will lead to fewer user errors.
37721d4
to
fc52867
Compare
With this in mind, you actually want to review 0d499fd (Make a line of code more pythonic) and later commits. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we're scanning the entire file per secret audited still, but at least we save on disk IO. IIRC, our plugins are generally fast enough, so WFM I guess.
Might be worth ticketing for a P3 improvement.
In c8b60c7 I added a new test for the force-show-secret feature. This was because fc52867 failing the coverage test in Python 2. But, this test didn't actually help me increase coverage. The actual fix was to add Why wasn't Python 3 failing coverage tests too? This was biting us but it only applies for Python 2. |
Old algo
The audit functionality used to use the following algorithm to get the plaintext of a secret:
(1) Take the secret type and get the right plugin
PluginA
.(2) Take the secret file
FileA
and line numberx
.(3) Get a list of secrets in
FileA
on linex
usingPluginA.secret_generator()
.(4) Return the plaintext which matches the input secret's hash, or raise if the secret isn't found.
New algo
This PR changes the algorithm. Specifically, it replaces (3) with:
(a) Scan the whole file
FileA
for secrets usingPluginA.analyze()
.Benefits
The immediate motivation for this PR was #189.
This adjustment makes the
audit
functionality less flakey. Plugins can implement more specific behaviour regarding splitting lines in itsanalyze
function, and this PR makes it so that we can use that information inaudit
as well, making it more likely we'll be able to rediscover the secret.As @KevinHock also points out, this will allow
audit
to be more flexible, in that the secret can change lines, butaudit
will still find the secret.Why not just run
scan
over the whole codebase again?It would be a waste of time to scan literally every file with every plugin when we have a list of
(file, plugin)
pairs which actually yielded secrets in the baseline.This approach was also easier to code IMO.
Why are there also random new tests?
Replacing the old algorithm resulted in less LOC in
detect_secrets/
, which reduced overall code coverage, whichmake test
complained about :)Notes for reviewing
get_raw_secret_value
was completely rewritten so theSplit
code-review view is a lot more useful.Auto close tags: Fixes #189.