Slow performance when scanning a non ini file with millions of lines #136

Closed
killuazhu opened this issue Feb 28, 2019 · 4 comments · Fixed by #147

Comments

@killuazhu
Contributor

When scanning a non-ini file with more than 1 million lines, the scan hangs at the line below.

(self._analyze_ini_file(add_header=True), configparser.Error,),

I was able to trace this back to configparser and found that the following line is extremely inefficient: it appends every offending line (essentially every line in the file) to the error message via string concatenation.

self.message += '\n\t[line %2d]: %s' % (lineno, line)

I did not have the patience to wait for the scan to finish; on my laptop it hung for more than 10 minutes.

We need a more efficient way to scan large non-ini files.
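
For a rough sense of why this hurts, here is a minimal sketch that you can time locally. FakeParsingError is a made-up stand-in, not configparser's actual class; it just repeats the same concatenation pattern.

import time

class FakeParsingError:
    """Made-up stand-in that mimics ParsingError's message building."""

    def __init__(self):
        self.message = ''

    def append(self, lineno, line):
        # Every call copies the whole message built so far, so the total
        # work grows quadratically with the number of offending lines.
        self.message += '\n\t[line %2d]: %s' % (lineno, line)

for n in (10000, 50000, 100000):
    error = FakeParsingError()
    start = time.perf_counter()
    for i in range(n):
        error.append(i, 'this is not an ini line')
    print('%7d lines: %.1fs' % (n, time.perf_counter() - start))

Exact timings vary by machine, but the growth is clearly super-linear, which matches the multi-minute hang on a file with millions of lines.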

@KevinHock
Collaborator

I ran into this today as well, with a file that was ~250k lines.

@domanchi
Contributor

I spoke with @KevinHock today about this, and decided to record the conversation here for posterity.

Historical Context

Initially, the ini parser was written in order to try to catch secrets that do not need quote marks around them -- namely, secrets in config files.

$ cat config.ini
[private]
key=secret

The issue is that there's no easy way to identify whether a file is a config file. File extensions don't work, because config files don't map to a typical set of extensions. And there's no magic header that marks a file as a config file, e.g. it's not as if you could do:

$ file config.ini

Therefore, the only way to really identify whether a file is a config file is to try and parse it, and handle errors appropriately.
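
In code, that detection strategy boils down to something like the sketch below (looks_like_ini is a made-up name for illustration, not the plugin's actual method):

import configparser

def looks_like_ini(file_content):
    """Sketch: treat a file as a config file iff configparser accepts it."""
    parser = configparser.ConfigParser()
    try:
        parser.read_string(file_content)
    except configparser.Error:
        return False
    return True

Note that content without a leading [section] header raises MissingSectionHeaderError (a configparser.Error subclass), which is presumably why the plugin also retries with add_header=True.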

Issue

It seems that this approach runs into two performance hits:

  1. Needing to parse the entire file with configparser before having usable results.
  2. Error traceback construction for large files takes a long time (as @killuazhu pointed out).

Possible Solutions

1. Use the first N lines to try and determine whether a file is actually a config file

Credit to @KevinHock for this idea. Essentially, if the following conditions hold true, we may be able to identify whether a file is a config file by reading the first few lines.

a. The first N lines are a representative sample for the entire file, and
b. The first N lines are independently parseable as a config file by themselves.

If we're able to do this, then we would be able to optimize on both issues listed above, since we wouldn't need to parse the entire file to determine whether it is suitable for ini file parsing.

Our issue is that we don't have a large enough sample set of config files to test out this method.
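
A rough sketch of the heuristic (probably_config_file and num_lines=50 are made-up illustrative choices, not a tested cutoff):

import configparser
import itertools

def probably_config_file(path, num_lines=50):
    """Sketch: parse only the first num_lines to guess whether this is an ini file."""
    with open(path) as f:
        sample = ''.join(itertools.islice(f, num_lines))
    parser = configparser.ConfigParser()
    try:
        parser.read_string(sample)
    except configparser.Error:
        return False
    return True

One caveat: cutting off at an arbitrary line can split a multi-line value, which is why condition (b) above matters.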

2. Try to use a different library for config file parsing

If we use a different library, we may be able to avoid that error traceback construction and speed things along. Or similarly, we might be able to perform a special sub-classed invocation of configparser to stop ParsingError from recording every line of output.
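
As a sketch of that sub-classed approach (one possible shape, not necessarily the final fix): since the scanner only cares that parsing failed, the per-line message building can be skipped entirely.

import configparser

class QuietParsingError(configparser.ParsingError):
    """Sketch: a ParsingError that skips the per-line message concatenation."""

    def append(self, lineno, line):
        # We only need to know that parsing failed, not a per-line report,
        # so don't build the giant error message.
        pass

# One (blunt) way to make configparser raise the quiet subclass instead:
configparser.ParsingError = QuietParsingError

The monkey-patch is process-wide and would affect any other configparser users, so a safer variant would only swap it in around the read_string call.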

3. Rethink how we approach config files

file_type_analyzers = (
    (self._analyze_ini_file(), configparser.Error,),
    (self._analyze_yaml_file, yaml.YAMLError,),
    (super(HighEntropyStringsPlugin, self).analyze, Exception,),
    (self._analyze_ini_file(add_header=True), configparser.Error,),
)

Maybe there's a better way to do this than trying to scan the ini file twice?

@KevinHock
Collaborator

KevinHock commented Mar 21, 2019

We shipped a short-term solution, number 2 from @domanchi's comment, in the PRs referenced above. The changes are live in version 0.12.2.

Thanks again for making this issue; I'm going to keep it open until we improve on it more completely.

@domanchi
Contributor

Closing this issue, since #187 provides concrete evidence that the changes made have been effective for long files.

We can separately track performance for files with long lines.
