Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use plugin analyze function in audit functionality #208

Merged
merged 6 commits into from
Aug 6, 2019

Conversation

OiCMudkips
Copy link
Contributor

@OiCMudkips OiCMudkips commented Jul 15, 2019

Old algo

The audit functionality used to use the following algorithm to get the plaintext of a secret:
(1) Take the secret type and get the right plugin PluginA.
(2) Take the secret file FileA and line number x.
(3) Get a list of secrets in FileA on line x using PluginA.secret_generator().
(4) Return the plaintext which matches the input secret's hash, or raise if the secret isn't found.

New algo

This PR changes the algorithm. Specifically, it replaces (3) with:

(a) Scan the whole file FileA for secrets using PluginA.analyze().

Benefits

The immediate motivation for this PR was #189.

This adjustment makes the audit functionality less flakey. Plugins can implement more specific behaviour regarding splitting lines in its analyze function, and this PR makes it so that we can use that information in audit as well, making it more likely we'll be able to rediscover the secret.

As @KevinHock also points out, this will allow audit to be more flexible, in that the secret can change lines, but audit will still find the secret.

Why not just run scan over the whole codebase again?

It would be a waste of time to scan literally every file with every plugin when we have a list of (file, plugin) pairs which actually yielded secrets in the baseline.

This approach was also easier to code IMO.

Why are there also random new tests?

Replacing the old algorithm resulted in less LOC in detect_secrets/, which reduced overall code coverage, which make test complained about :)

Notes for reviewing

get_raw_secret_value was completely rewritten so the Split code-review view is a lot more useful.

Auto close tags: Fixes #189.

@OiCMudkips OiCMudkips self-assigned this Jul 15, 2019
secret=raw_secret,
)
with codecs.open(filename, encoding='utf-8') as f:
plugin_secrets = plugin.analyze(f, filename)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like we're re-scanning the file per secret audited. How does this perform with large files?

Copy link
Collaborator

@KevinHock KevinHock Jul 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we can avoid this by either changing or not using _secret_generator before calling this method

def _secret_generator(baseline):
"""Generates secrets to audit, from the baseline"""
for filename, secrets in baseline['results'].items():
for secret in secrets:
yield filename, secret

Right now we call the method containing codecs.open on every secret, when we only need to call it for each key in baseline['results'].

@domanchi
Copy link
Contributor

FWIW, the audit functionality used line numbers as an optimization: there was no point re-scanning every line in the file, if you knew exactly which line to go to. Furthermore, since the baseline was ordered by increasing line number, the file would only have to be read once.

After all, the audit tool is an automated tool for the manual effort of opening each file, go to the line, find the secret yourself. These changes seem to cause a performance hit, and the benefit is not super clear.

@OiCMudkips
Copy link
Contributor Author

OiCMudkips commented Jul 16, 2019

As an aside, I don't know if the file is only read once as the code is written now?

In audit audit_baseline, we _print_context for every secret. _print_context calls _get_secret_with_context calls CodeSnippetHighlighter. get_code_snippet calls CodeSnippetHighlighter. _get_lines_in_file which open the whole file. I didn't see any caching in this chain.

Doing some Googling, it seems like there's a standard module for opening a single line but it doesn't seem to be in use here. I don't know that we actually want to include this module considering it seems to be couple with the traceback module and that it's only >= py35.


To clarify, the benefit here is to fix #189. Specifically, the high-entropy plugin has filetype-specific logic that was lost in the audit mode, since audit calls a different function from scan to find secrets on a line, and this PR tries to fix that by just calling the plugin's analyze function, which has all the context.

I agree that this change is opening and scanning the whole file, and this is a performance hit versus the status quo of opening the whole file and scanning 1 line. Let me think of a way to avoid this.

@OiCMudkips
Copy link
Contributor Author

I think @KevinHock also has some reasons for why we would potentially want this in the comments of #189.

@KevinHock
Copy link
Collaborator

KevinHock commented Jul 17, 2019

While I agree we should check the performance of this, / improve it however we can, I think the benefits of this make it worth doing:

Firstly, the statements

there was no point re-scanning every line in the file, if you knew exactly which line to go to.

and

The line number does not play a part in the identification of a potential secret because code is expected to move around through continuous iteration.

Kind of contradict each other. Even if we explicitly call it out in our docs, it is beneficial to some degree to not rely on line numbers for auditing, and to be able to not have to tell the user about something somewhat unintuitive.

Secondly, it is blocking the only complicated part of baseline diff minimization (#92), which is a good option to have for an internal reason, and we've had at least one external person ask us for it.

Lastly and most importantly, it will improve maintainability due to preventing any bugs like #189 in the future.

As an aside, people may very well scan thousands of repos in a cron like fashion with detect-secrets, however nobody audit's thousands of repos in a cron like fashion, so perf when scanning is significantly more important.

Copy link
Collaborator

@KevinHock KevinHock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, just the one _secret_generator comment change left I think, that @domanchi pointed out.

@@ -203,7 +203,7 @@ def _analyze_yaml_file(self, file, filename):
item = to_search.pop()

try:
if '__line__' in item and not item['__line__'] in ignored_lines:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😍

def get_audited_baseline(self, plugin_config, is_secret):
def get_audited_baseline(
self,
plugins_used=[{'name': 'HexHighEntropyString'}],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ack'ing that I saw the default keyword arg value was a list, and it seems apropos in this case. (That is a great picture by the way, @kennethreitz)

# NOTE: The first config here needs to be
# the HexHighEntropyString config for this test to work.
[{'name': 'HexHighEntropyString'}], # plugin w/o config
[{'name': 'HexHighEntropyString', 'hex_limit': 2}], # plguin w/config
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/plguin/plugin/g

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, fixed

secret=raw_secret,
)
with codecs.open(filename, encoding='utf-8') as f:
plugin_secrets = plugin.analyze(f, filename)
Copy link
Collaborator

@KevinHock KevinHock Jul 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we can avoid this by either changing or not using _secret_generator before calling this method

def _secret_generator(baseline):
"""Generates secrets to audit, from the baseline"""
for filename, secrets in baseline['results'].items():
for secret in secrets:
yield filename, secret

Right now we call the method containing codecs.open on every secret, when we only need to call it for each key in baseline['results'].

Victor Zhou added 5 commits August 1, 2019 15:58
This reduces the flakiness of the code coverage check. In particular,
this code was covered in py37 but not previous versions. Also, this
just makes it easier to pass the coverage test.
In audit mode, we were using secret_generator to find the
plaintext secret. This was problematic because it led to
inconsistences between what the scan functionality would
find and what the audit functionality would find, leading
to user errors.

Using the same plugin function for both scan and audit
will lead to fewer user errors.
@OiCMudkips OiCMudkips force-pushed the fix_unfound_highentropy_secret branch from 37721d4 to fc52867 Compare August 1, 2019 23:35
@OiCMudkips
Copy link
Contributor Author

This branch is now based off #213 and #213 should be reviewed before this.

@OiCMudkips
Copy link
Contributor Author

With this in mind, you actually want to review 0d499fd (Make a line of code more pythonic) and later commits.

Copy link
Contributor

@domanchi domanchi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we're scanning the entire file per secret audited still, but at least we save on disk IO. IIRC, our plugins are generally fast enough, so WFM I guess.

Might be worth ticketing for a P3 improvement.

@OiCMudkips
Copy link
Contributor Author

OiCMudkips commented Aug 6, 2019

In c8b60c7 I added a new test for the force-show-secret feature. This was because fc52867 failing the coverage test in Python 2. But, this test didn't actually help me increase coverage. The actual fix was to add (object) to CodeSnippet's declaration.

Why wasn't Python 3 failing coverage tests too? This was biting us but it only applies for Python 2.

@OiCMudkips OiCMudkips merged commit c3ddf38 into Yelp:master Aug 6, 2019
killuazhu pushed a commit to IBM/detect-secrets that referenced this pull request May 28, 2020
killuazhu pushed a commit to IBM/detect-secrets that referenced this pull request Jul 9, 2020
killuazhu pushed a commit to IBM/detect-secrets that referenced this pull request Sep 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Audit fails to find non-quoted base64 strings
3 participants