Slow performance when scanning a non ini file with millions of lines #136

Closed
killuazhu opened this issue Feb 28, 2019 · 4 comments · Fixed by #147

Comments

@killuazhu
Contributor

When scanning a non-ini file with more than 1 million lines, the scan hangs at the line below.

(self._analyze_ini_file(add_header=True), configparser.Error,),

I was able to trace this back to configparser and found that the following line is extremely inefficient: it appends every offending line (essentially every line in the file) to the error message via string concatenation.

self.message += '\n\t[line %2d]: %s' % (lineno, line)

I did not have the patience to wait for the scan to finish; on my laptop it hung for more than 10 minutes.

We need a more efficient way to scan large non-ini files.
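
For a rough sense of why this hurts, here is a minimal sketch that you can time locally. FakeParsingError is a made-up stand-in, not configparser's actual class; it just repeats the same concatenation pattern.

import time

class FakeParsingError:
    """Made-up stand-in that mimics ParsingError's message building."""

    def __init__(self):
        self.message = ''

    def append(self, lineno, line):
        # Every call copies the whole message built so far, so the total
        # work grows quadratically with the number of offending lines.
        self.message += '\n\t[line %2d]: %s' % (lineno, line)

for n in (10000, 50000, 100000):
    error = FakeParsingError()
    start = time.perf_counter()
    for i in range(n):
        error.append(i, 'this is not an ini line')
    print('%7d lines: %.1fs' % (n, time.perf_counter() - start))

Exact timings vary by machine, but the growth is clearly super-linear, which matches the multi-minute hang on a file with millions of lines.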

@KevinHock
Collaborator

I ran into this today as well, with a file that was ~250k lines.

@domanchi
Contributor

I spoke with @KevinHock today about this, and decided to record the conversation here for posterity.

Historical Context

Initially, the ini parser was written in order to try to catch secrets that do not need quote marks around them -- namely, secrets in config files.

$ cat config.ini
[private]
key=secret

The issue is that there's no easy way to identify whether a file is a config file. File extensions don't work, because config files don't map to a typical set of extensions. And there's no magic header that marks a file as a config file, e.g. it's not as if you could do:

$ file config.ini

Therefore, the only way to really identify whether a file is a config file is to try and parse it, and handle errors appropriately.
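
In code, that detection strategy boils down to something like the sketch below (looks_like_ini is a made-up name for illustration, not the plugin's actual method):

import configparser

def looks_like_ini(file_content):
    """Sketch: treat a file as a config file iff configparser accepts it."""
    parser = configparser.ConfigParser()
    try:
        parser.read_string(file_content)
    except configparser.Error:
        return False
    return True

Note that content without a leading [section] header raises MissingSectionHeaderError (a configparser.Error subclass), which is presumably why the plugin also retries with add_header=True.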

Issue

It seems that this approach runs into two performance hits:

  1. Needing to parse the entire file with configparser before having usable results.
  2. Error traceback construction for large files takes a long time (as @killuazhu pointed out).

Possible Solutions

1. Use the first N lines to try and determine whether a file is actually a config file

Credit to @KevinHock for this idea. Essentially, if the following conditions hold true, we may be able to identify whether a file is a config file by reading the first few lines.

a. The first N lines are a representative sample for the entire file, and
b. The first N lines are independently parseable as a config file by themselves.

If we're able to do this, then we would be able to optimize on both issues listed above, since we wouldn't need to parse the entire file to determine whether it is suitable for ini file parsing.

Our issue is that we don't have a large enough sample set of config files to test out this method.
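
A rough sketch of the heuristic (probably_config_file and num_lines=50 are made-up illustrative choices, not a tested cutoff):

import configparser
import itertools

def probably_config_file(path, num_lines=50):
    """Sketch: parse only the first num_lines to guess whether this is an ini file."""
    with open(path) as f:
        sample = ''.join(itertools.islice(f, num_lines))
    parser = configparser.ConfigParser()
    try:
        parser.read_string(sample)
    except configparser.Error:
        return False
    return True

One caveat: cutting off at an arbitrary line can split a multi-line value, which is why condition (b) above matters.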

2. Try to use a different library for config file parsing

If we use a different library, we may be able to avoid that error traceback construction and speed things along. Or similarly, we might be able to perform a special sub-classed invocation of configparser to stop ParsingError from recording every line of output.
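
As a sketch of that sub-classed approach (one possible shape, not necessarily the final fix): since the scanner only cares that parsing failed, the per-line message building can be skipped entirely.

import configparser

class QuietParsingError(configparser.ParsingError):
    """Sketch: a ParsingError that skips the per-line message concatenation."""

    def append(self, lineno, line):
        # We only need to know that parsing failed, not a per-line report,
        # so don't build the giant error message.
        pass

# One (blunt) way to make configparser raise the quiet subclass instead:
configparser.ParsingError = QuietParsingError

The monkey-patch is process-wide and would affect any other configparser users, so a safer variant would only swap it in around the read_string call.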

3. Rethink how we approach config files

file_type_analyzers = (
    (self._analyze_ini_file(), configparser.Error,),
    (self._analyze_yaml_file, yaml.YAMLError,),
    (super(HighEntropyStringsPlugin, self).analyze, Exception,),
    (self._analyze_ini_file(add_header=True), configparser.Error,),
)

Maybe there's a better way to do this than trying to scan the ini file twice?

@KevinHock
Collaborator

KevinHock commented Mar 21, 2019

We shipped a short-term solution, number 2 from @domanchi's comment, in the PRs referenced above. The changes are live in version 0.12.2.

Thanks again for making this issue; I'm going to keep it open until we improve on it more completely.

@domanchi
Contributor

Closing this issue, since #187 provides concrete evidence that the changes made have been effective for long files.

We can separately track performance for files with long lines.
