-
Notifications
You must be signed in to change notification settings - Fork 472
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Slow performance when scanning a non ini file with millions of lines #136
Comments
I ran into this today as well, with a file that was ~250k lines. |
I spoke with @KevinHock today about this, and decided to record conversation down, for posterity. Historical ContextInitially, the ini parser was written in order to try and catch secrets that did not need quote marks around them -- namely, config files.
The issue is that there's no easy way to identify whether a file is a config file. File extensions don't work, because config files don't have a typical set of extensions that they correspond to. And, there's no special header file that identifies that a file is a config file. e.g. It's not like you could do:
Therefore, the only way to really identify whether a file is a config file is to try and parse it, and handle errors appropriately. IssueIt seems that this approach runs into two performance hits:
Possible Solutions1. Use the first N lines to try and determine whether a file is actually a config fileCredit to @KevinHock for this idea. Essentially, if the following conditions hold true, we may be able to identify whether a file is a config file by reading the first few lines. a. The first N lines are a representative sample for the entire file, and If we're able to do this, then we would be able to optimize on both issues listed above, since you don't need to parse the entire file to determine whether a given file is suitable for ini file parsing. Our issue is that we don't have a large enough sample set of config files to test out this method. 2. Try to use a different library for config file parsingIf we use a different library, we may be able to avoid that error traceback construction, and speed things along. Or similarly, we might be able to perform a special sub-classed invocation of 3. Rethink how we approach config filesdetect-secrets/detect_secrets/plugins/high_entropy_strings.py Lines 59 to 64 in 1fabf92
Maybe, there's a better way to do this, than trying to scan the ini file twice? |
We did a short-term solution, number 2 from @domanchi's comment, in the above referenced PRs. They are live in version 0.12.2. Thanks again for making this issue, I'm gonna keep it open until we improve on it more completely. |
Closing this issue, seeing that #187 has factual evidence that the changes made have been effective for long files. We can separately track performance for files with long lines. |
When scanning a non-ini file with more than 1 million lines, it would hang at line below.
I'm able to trace back to configParse and found the following line is extremely inefficient to add all offending lines (essentially all the lines in the file) into the error message with string concatenation.
I did not have the patience to wait for the scan to finish, on my laptop it did hang for at least more than 10 minutes.
We need a more efficient way to scan large non-ini file.
The text was updated successfully, but these errors were encountered: