Use plugin `analyze` function in audit functionality #208

OiCMudkips · 2019-07-15T22:42:25Z

Old algo

The audit functionality used to use the following algorithm to get the plaintext of a secret:
(1) Take the secret type and get the right plugin PluginA.
(2) Take the secret file FileA and line number x.
(3) Get a list of secrets in FileA on line x using PluginA.secret_generator().
(4) Return the plaintext which matches the input secret's hash, or raise if the secret isn't found.

New algo

This PR changes the algorithm. Specifically, it replaces (3) with:

(a) Scan the whole file FileA for secrets using PluginA.analyze().
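As a rough illustration of the new flow (scan the whole file, then match on the hash), here is a minimal sketch. `ToyHexPlugin`, `hash_secret`, and `find_plaintext` are hypothetical stand-ins invented for this example; they are not the real detect-secrets plugin classes or the PR's `get_raw_secret_value`, and the use of a SHA-1 digest and a `{hash: plaintext}` return value are assumptions.

```python
import hashlib
import io


def hash_secret(plaintext):
    # Assumption: a secret is identified by the SHA-1 hex digest of its plaintext.
    return hashlib.sha1(plaintext.encode('utf-8')).hexdigest()


class ToyHexPlugin(object):
    """Hypothetical stand-in for PluginA; not a real detect-secrets plugin."""

    def analyze(self, f, filename):
        # Scan every line of the file and collect {hash: plaintext} for each
        # whitespace-separated token that looks like a long hex string.
        found = {}
        for line in f:
            for token in line.split():
                if len(token) >= 16 and all(c in '0123456789abcdef' for c in token):
                    found[hash_secret(token)] = token
        return found


def find_plaintext(plugin, f, filename, secret_hash):
    # Step (a): scan the whole file with the plugin's analyze().
    secrets = plugin.analyze(f, filename)
    # Step (4): return the plaintext matching the hash, or raise.
    try:
        return secrets[secret_hash]
    except KeyError:
        raise LookupError('secret not found in ' + filename)


sample = io.StringIO(u"api_key = 0123456789abcdef0123\n")
wanted = hash_secret(u"0123456789abcdef0123")
print(find_plaintext(ToyHexPlugin(), sample, "config.ini", wanted))
```

Because the lookup is keyed on the hash rather than the line number, the sketch still finds the secret if it has moved to a different line.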

Benefits

The immediate motivation for this PR was #189.

This adjustment makes the audit functionality less flaky. Plugins can implement more specific line-splitting behaviour in their analyze functions, and this PR lets audit use that information as well, making it more likely we'll be able to rediscover the secret.

As @KevinHock also points out, this will allow audit to be more flexible, in that the secret can change lines, but audit will still find the secret.

Why not just run `scan` over the whole codebase again?

It would be a waste of time to scan literally every file with every plugin when we have a list of (file, plugin) pairs which actually yielded secrets in the baseline.

This approach was also easier to code IMO.

Why are there also random new tests?

Replacing the old algorithm resulted in fewer LOC in detect_secrets/, which reduced overall code coverage, which `make test` complained about :)

Notes for reviewing

get_raw_secret_value was completely rewritten so the Split code-review view is a lot more useful.

Auto close tags: Fixes #189.

domanchi · 2019-07-16T05:14:01Z

detect_secrets/core/audit.py

-            secret=raw_secret,
-        )
+    with codecs.open(filename, encoding='utf-8') as f:
+        plugin_secrets = plugin.analyze(f, filename)


This looks like we're re-scanning the file per secret audited. How does this perform with large files?

It seems like we can avoid this by either changing or not using _secret_generator before calling this method

detect-secrets/detect_secrets/core/audit.py

Lines 288 to 292 in 2b40c99

def _secret_generator(baseline):
    """Generates secrets to audit, from the baseline"""
    for filename, secrets in baseline['results'].items():
        for secret in secrets:
            yield filename, secret

Right now we call the method containing codecs.open on every secret, when we only need to call it for each key in baseline['results'].
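One way to read this suggestion is to iterate per file rather than per secret, so the expensive scan happens once for each key in baseline['results']. The sketch below is only an illustration: `secrets_by_file`, `audit_plan`, and the `scan_file` callable are hypothetical names, and the `hashed_secret` key is assumed to be how each baseline entry stores its hash.

```python
def secrets_by_file(baseline):
    """Yield (filename, secrets) once per key in baseline['results']."""
    for filename, secrets in baseline['results'].items():
        yield filename, secrets


def audit_plan(baseline, scan_file):
    # scan_file is a hypothetical callable: filename -> {hash: plaintext}.
    for filename, secrets in secrets_by_file(baseline):
        found = scan_file(filename)  # one scan per file, not per secret
        for secret in secrets:
            # Assumption: each baseline entry stores its hash under 'hashed_secret'.
            yield filename, secret, found.get(secret['hashed_secret'])
```

With two secrets in the same file, `scan_file` is called once and both plaintexts are looked up from the same result.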

domanchi · 2019-07-16T05:19:58Z

FWIW, the audit functionality used line numbers as an optimization: there was no point re-scanning every line in the file, if you knew exactly which line to go to. Furthermore, since the baseline was ordered by increasing line number, the file would only have to be read once.

After all, the audit tool automates the manual effort of opening each file, going to the line, and finding the secret yourself. These changes seem to cause a performance hit, and the benefit is not super clear.

OiCMudkips · 2019-07-16T18:38:16Z

As an aside, I don't know if the file is only read once as the code is written now?

In audit's audit_baseline, we call _print_context for every secret. _print_context calls _get_secret_with_context, which calls CodeSnippetHighlighter.get_code_snippet, which calls CodeSnippetHighlighter._get_lines_in_file, which opens the whole file. I didn't see any caching in this chain.
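For illustration, caching in such a chain could look like the sketch below, which keeps only the most recently read file in memory. `read_lines` and `get_line` are hypothetical helpers, not functions from detect-secrets, and `functools.lru_cache` is Python 3 only, so this is not a drop-in for a codebase that still supports Python 2.

```python
import functools
import io


@functools.lru_cache(maxsize=1)
def read_lines(filename):
    # Cache only the last-opened file: consecutive lookups in the same file
    # reuse the cached tuple instead of re-reading from disk.
    with io.open(filename, encoding='utf-8') as f:
        return tuple(f.read().splitlines())


def get_line(filename, line_number):
    # line_number is 1-indexed, as in editor and baseline line numbers.
    return read_lines(filename)[line_number - 1]
```

A cache size of 1 fits an audit that walks the baseline file by file; a larger `maxsize` would trade memory for fewer re-reads if the same files are revisited out of order.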

Doing some Googling, it seems like there's a standard module for opening a single line, but it doesn't seem to be in use here. I don't know that we actually want to include this module, considering it seems to be coupled with the traceback module and that it's only >= py35.

To clarify, the benefit here is to fix #189. Specifically, the high-entropy plugin has filetype-specific logic that was lost in the audit mode, since audit calls a different function from scan to find secrets on a line, and this PR tries to fix that by just calling the plugin's analyze function, which has all the context.

I agree that this change is opening and scanning the whole file, and this is a performance hit versus the status quo of opening the whole file and scanning 1 line. Let me think of a way to avoid this.

OiCMudkips · 2019-07-16T18:45:34Z

I think @KevinHock also has some reasons for why we would potentially want this in the comments of #189.

KevinHock · 2019-07-17T23:32:27Z

While I agree we should check the performance of this and improve it however we can, I think the benefits make it worth doing:

Firstly, the statements

"there was no point re-scanning every line in the file, if you knew exactly which line to go to."

and

"The line number does not play a part in the identification of a potential secret because code is expected to move around through continuous iteration."

kind of contradict each other. Even if we explicitly call it out in our docs, it is beneficial to some degree not to rely on line numbers for auditing, and not to have to tell the user about something somewhat unintuitive.

Secondly, it is blocking the only complicated part of baseline diff minimization (#92), which is a good option to have for an internal reason, and we've had at least one external person ask us for it.

Lastly and most importantly, it will improve maintainability due to preventing any bugs like #189 in the future.

As an aside, people may very well scan thousands of repos in a cron-like fashion with detect-secrets, but nobody audits thousands of repos in a cron-like fashion, so perf when scanning is significantly more important.

KevinHock

lgtm, just the one _secret_generator comment change left I think, that @domanchi pointed out.

KevinHock · 2019-07-17T23:42:40Z

detect_secrets/plugins/high_entropy_strings.py

@@ -203,7 +203,7 @@ def _analyze_yaml_file(self, file, filename):
                item = to_search.pop()

                try:
-                    if '__line__' in item and not item['__line__'] in ignored_lines:


KevinHock · 2019-07-17T23:46:30Z

tests/core/audit_test.py

-    def get_audited_baseline(self, plugin_config, is_secret):
+    def get_audited_baseline(
+        self,
+        plugins_used=[{'name': 'HexHighEntropyString'}],


I'm ack'ing that I saw the default keyword arg value was a list, and it seems apropos in this case. (That is a great picture by the way, @kennethreitz)
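For context on why a list default is usually flagged: the default object is created once, when the function is defined, and is shared across calls. The sketch below contrasts a mutating use, which is the classic pitfall, with a read-only use like the test helper above. Both function names are made up for this example.

```python
def append_shared(item, bucket=[]):
    # The default list is created once, at definition time, so every call
    # that omits `bucket` appends to the same list.
    bucket.append(item)
    return bucket


def first_plugin_name(plugins_used=[{'name': 'HexHighEntropyString'}]):
    # Read-only use of the default: nothing is mutated, so sharing the
    # same list object between calls is harmless.
    return plugins_used[0]['name']
```

As long as the function never mutates the default, as in the second case, the shared list behaves like a constant.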

KevinHock · 2019-07-17T23:48:53Z

tests/core/audit_test.py

+            # NOTE: The first config here needs to be
+            # the HexHighEntropyString config for this test to work.
+            [{'name': 'HexHighEntropyString'}],  # plugin w/o config
+            [{'name': 'HexHighEntropyString', 'hex_limit': 2}],  # plguin w/config


s/plguin/plugin/g

Whoops, fixed

KevinHock · 2019-07-18T00:07:33Z

detect_secrets/core/audit.py

-            secret=raw_secret,
-        )
+    with codecs.open(filename, encoding='utf-8') as f:
+        plugin_secrets = plugin.analyze(f, filename)


It seems like we can avoid this by either changing or not using _secret_generator before calling this method

detect-secrets/detect_secrets/core/audit.py

Lines 288 to 292 in 2b40c99

def _secret_generator(baseline):
    """Generates secrets to audit, from the baseline"""
    for filename, secrets in baseline['results'].items():
        for secret in secrets:
            yield filename, secret

Right now we call the method containing codecs.open on every secret, when we only need to call it for each key in baseline['results'].


In audit mode, we were using secret_generator to find the plaintext secret. This was problematic because it led to inconsistencies between what the scan functionality would find and what the audit functionality would find, leading to user errors. Using the same plugin function for both scan and audit will lead to fewer user errors.

OiCMudkips · 2019-08-01T23:36:48Z

This branch is now based off #213 and #213 should be reviewed before this.

OiCMudkips · 2019-08-01T23:46:53Z

With this in mind, you actually want to review 0d499fd (Make a line of code more pythonic) and later commits.

domanchi

Looks like we're scanning the entire file per secret audited still, but at least we save on disk IO. IIRC, our plugins are generally fast enough, so WFM I guess.

Might be worth ticketing for a P3 improvement.

OiCMudkips · 2019-08-06T22:37:46Z

In c8b60c7 I added a new test for the force-show-secret feature. This was because fc52867 was failing the coverage test in Python 2. But this test didn't actually help me increase coverage. The actual fix was to add (object) to CodeSnippet's declaration.

Why wasn't Python 3 failing coverage tests too? This was biting us but it only applies for Python 2.

OiCMudkips requested a review from KevinHock July 15, 2019 22:42

OiCMudkips self-assigned this Jul 15, 2019

domanchi reviewed Jul 16, 2019


KevinHock reviewed Jul 18, 2019


OiCMudkips force-pushed the fix_unfound_highentropy_secret branch from db8010e to 37721d4 Compare July 19, 2019 20:42

OiCMudkips mentioned this pull request Jul 22, 2019

Improvement: Cache last-opened files #210

Closed

KevinHock force-pushed the master branch from 81e2a44 to 6a3f206 Compare July 23, 2019 23:51

Victor Zhou added 5 commits August 1, 2019 15:58

Reduce number of file reads in audit and scan runs

9c23c78

Ignore code coverage in bidirectional iterator

9e97624

This reduces the flakiness of the code coverage check. In particular, this code was covered in py37 but not previous versions. Also, this just makes it easier to pass the coverage test.

Make a line of code more pythonic

0d499fd

Add tests to increase code coverage

fc52867

OiCMudkips force-pushed the fix_unfound_highentropy_secret branch from 37721d4 to fc52867 Compare August 1, 2019 23:35

domanchi approved these changes Aug 2, 2019


Add test for force-printing context of a secret

c8b60c7

OiCMudkips mentioned this pull request Aug 6, 2019

Consider how to stop analyzing the whole file in audit #218

Open

OiCMudkips merged commit c3ddf38 into Yelp:master Aug 6, 2019

killuazhu pushed a commit to IBM/detect-secrets that referenced this pull request May 28, 2020

fix: extract common logic (Yelp#208)

5d4a7d2

killuazhu pushed a commit to IBM/detect-secrets that referenced this pull request Jul 9, 2020

fix: extract common logic (Yelp#208)

a5629b5

KevinHock mentioned this pull request Aug 30, 2020

Faster raw secret fetch during audit #332

Open

killuazhu pushed a commit to IBM/detect-secrets that referenced this pull request Sep 17, 2020

fix: extract common logic (Yelp#208)

996ba02
