Performance issue scanning large files #552

Closed
QSilver opened this issue May 6, 2022 · 0 comments

QSilver commented May 6, 2022

snippet=list(itertools.islice(lines, start_line_index, end_line_index)),

itertools.islice cannot jump to start_line_index; it has to step through the list from the beginning every time it slices.
Because this is called for every line in the file, the total work is O(n^2), which degrades performance badly on large files.
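
To make the quadratic behaviour concrete, here is a minimal sketch that counts how many list items are pulled through an iterator when a small window is requested around every line. CountingList, snippet_with_islice, snippet_with_slice and total_items_visited are hypothetical names made up for this illustration, and the 5-line context window is an assumption; none of this is the detect-secrets code itself.

```python
import itertools


class CountingList(list):
    """A list that counts how many items its iterator hands out (illustration only)."""

    def __init__(self, iterable=()):
        super().__init__(iterable)
        self.items_visited = 0

    def __iter__(self):
        for item in super().__iter__():
            self.items_visited += 1
            yield item


def snippet_with_islice(lines, start, end):
    # islice has to consume and discard `start` items before yielding the window.
    return list(itertools.islice(lines, start, end))


def snippet_with_slice(lines, start, end):
    # List slicing indexes directly; the cost depends only on the window size.
    return lines[start:end]


def total_items_visited(snippet_fn, line_count, context=5):
    """Request a window around every line and report how many items were iterated."""
    lines = CountingList(f"line {i}" for i in range(line_count))
    for i in range(line_count):
        start = max(0, i - context)
        end = min(line_count, i + context + 1)
        snippet_fn(lines, start, end)
    return lines.items_visited


if __name__ == "__main__":
    for n in (1_000, 2_000, 4_000):
        print(
            n,
            total_items_visited(snippet_with_islice, n),
            total_items_visited(snippet_with_slice, n),
        )
```

With islice the count should grow roughly fourfold each time the line count doubles, while the slice version never goes through the iterator at all.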

Pyinstrument profiling output for a 101 MB file with itertools.islice:
12138 seconds

12138.581 :1
[9 frames hidden] , runpy, pkgutil, <frozen zip...
12138.574 _run_code runpy.py:63
└─ 12138.574 DSTEST.py:1
└─ 12138.066 scan_file detect_secrets\core\secrets_collection.py:74
└─ 12138.040 scan_file detect_secrets\core\scan.py:140
└─ 12134.828 _process_line_based_plugins detect_secrets\core\scan.py:297
└─ 12036.388 get_code_snippet detect_secrets\util\code_snippet.py:9
└─ 12036.140 [self]

Changing that to a plain Python list slice (lines[start_line_index:end_line_index]):
817 seconds

817.007 :1
[10 frames hidden] , runpy, , pkgutil
817.004 _run_code runpy.py:63
└─ 817.004 DSTEST.py:1
└─ 816.673 scan_file detect_secrets\core\secrets_collection.py:74
└─ 816.646 scan_file detect_secrets\core\scan.py:140
└─ 814.621 _process_line_based_plugins detect_secrets\core\scan.py:297
├─ 543.436 detect_secrets\core\scan.py:322
│ ├─ 527.226 _scan_line detect_secrets\core\scan.py:337
│ │ ├─ 493.026 call_function_with_arguments detect_secrets\util\inject.py:11
│ │ │ ├─ 127.814 analyze_line detect_secrets\plugins\high_entropy_strings.py:43
│ │ │ │ ├─ 70.720 detect_secrets\plugins\high_entropy_strings.py:56
│ │ │ │ │ ├─ 38.764 calculate_shannon_entropy detect_secrets\plugins\high_entropy_strings.py:75
│ │ │ │ │ │ ├─ 23.113 [self]
│ │ │ │ │ │ └─ 10.169 str.count :0
│ │ │ │ │ │ [2 frames hidden]
│ │ │ │ │ └─ 29.680 calculate_shannon_entropy detect_secrets\plugins\high_entropy_strings.py:161
│ │ │ │ │ └─ 26.267 calculate_shannon_entropy detect_secrets\plugins\high_entropy_strings.py:75
│ │ │ │ │ └─ 15.567 [self]
│ │ │ │ └─ 52.968 analyze_line detect_secrets\plugins\base.py:44
│ │ │ │ ├─ 21.440 __init__ detect_secrets\core\potential_secret.py:24
│ │ │ │ │ └─ 17.459 set_secret detect_secrets\core\potential_secret.py:55
│ │ │ │ │ └─ 14.998 hash_secret detect_secrets\core\potential_secret.py:68
│ │ │ │ ├─ 12.699 analyze_string detect_secrets\plugins\high_entropy_strings.py:32
│ │ │ │ │ └─ 8.819 Pattern.findall :0
│ │ │ │ │ [2 frames hidden]
│ │ │ │ └─ 10.348 __hash__ detect_secrets\core\potential_secret.py:126
│ │ │ ├─ 126.483 analyze_line detect_secrets\plugins\keyword.py:292
│ │ │ │ ├─ 100.908 analyze_line detect_secrets\plugins\base.py:44
│ │ │ │ │ └─ 98.989 analyze_string detect_secrets\plugins\keyword.py:266
│ │ │ │ │ └─ 94.660 Pattern.search :0
│ │ │ │ │ [2 frames hidden]
│ │ │ │ └─ 21.223 determine_file_type detect_secrets\util\filetype.py:27
│ │ │ │ └─ 13.458 [self]
│ │ │ ├─ 90.990 analyze_line detect_secrets\plugins\base.py:44
│ │ │ │ ├─ 65.746 analyze_string detect_secrets\plugins\base.py:145
│ │ │ │ │ ├─ 44.914 Pattern.findall :0
│ │ │ │ │ │ [2 frames hidden]
│ │ │ │ │ └─ 20.832 [self]
│ │ │ │ └─ 19.914 [self]
│ │ │ ├─ 79.141 [self]
│ │ │ ├─ 22.463 make_function_self_aware detect_secrets\util\inject.py:41
│ │ │ │ └─ 13.463 [self]
│ │ │ ├─ 20.458 ismethod inspect.py:199
│ │ │ │ [4 frames hidden] inspect,
│ │ │ └─ 16.819 detect_secrets\util\inject.py:33
│ │ └─ 34.199 [self]
│ └─ 16.210 [self]
├─ 255.036 _is_filtered_out detect_secrets\core\scan.py:369
│ └─ 242.291 call_function_with_arguments detect_secrets\util\inject.py:11
│ ├─ 171.184 is_indirect_reference detect_secrets\filters\heuristic.py:158
│ │ └─ 168.933 Pattern.search :0
│ │ [2 frames hidden]
│ ├─ 47.377 is_line_allowlisted detect_secrets\filters\allowlist.py:13
│ │ ├─ 20.990 Pattern.search :0
│ │ │ [2 frames hidden]
│ │ ├─ 16.885 _get_allowlist_regexes_for_file detect_secrets\filters\allowlist.py:53
│ │ └─ 8.224 [self]
│ └─ 14.737 [self]
└─ 8.839 [self]
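
As a rough cross-check of the two profiles above, the timings reported in this issue can be compared directly. The variable and function names below are made up for this sketch; the three numbers are copied from the pyinstrument output.

```python
# Timings in seconds, copied from the two pyinstrument runs in this issue.
ISLICE_TOTAL_S = 12138.581    # whole run with itertools.islice
ISLICE_SNIPPET_S = 12036.388  # time inside get_code_snippet in that run
SLICE_TOTAL_S = 817.007       # whole run with lines[start:end]


def snippet_share(snippet_seconds, total_seconds):
    """Fraction of the run spent inside get_code_snippet."""
    return snippet_seconds / total_seconds


def speedup(before_seconds, after_seconds):
    """How many times faster the second run is than the first."""
    return before_seconds / after_seconds


if __name__ == "__main__":
    share = snippet_share(ISLICE_SNIPPET_S, ISLICE_TOTAL_S)
    print(f"get_code_snippet share of islice run: {share:.1%}")
    print(f"overall speedup from list slicing: {speedup(ISLICE_TOTAL_S, SLICE_TOTAL_S):.1f}x")
```

On these figures get_code_snippet accounts for about 99% of the islice run, and the list-slice run is roughly 15 times faster overall.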

QSilver added a commit to QSilver/detect-secrets that referenced this issue May 19, 2022
Yelp#552

Improving performance for array slice
jpdakran pushed a commit that referenced this issue May 24, 2022
* Addressing issue 552:
#552

Improving performance for array slice

* Removing unused import
@QSilver QSilver closed this as completed May 25, 2022