Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

yaml tranformer fails on a certain case where the file ends with a blank line #374

Closed
gpflaum opened this issue Nov 25, 2020 · 11 comments · Fixed by #501
Closed

yaml tranformer fails on a certain case where the file ends with a blank line #374

gpflaum opened this issue Nov 25, 2020 · 11 comments · Fixed by #501

Comments

@gpflaum
Copy link
Contributor

gpflaum commented Nov 25, 2020

This happens with the pre-v1-launch branch, not reproducible with the master branch.

When a YAML file ends with a block style list of flow style maps followed by a blank line, scanning fails with "list index out of range" at line=lines[item['__line__'] - 1] because __line__ is n but the lines list has only n-1 items, no element for the final blank line.

Example:

foo:
  - { bar: 'baz' }

When it fails, __line__ is 3, but lines has only two items: ['foo:', ' - { bar: "baz" }']:

list index out of range
  File "/Users/gpflaum/ws/detect-secrets-pre-v1-launch/detect_secrets/transformers/yaml.py", line 193, in __iter__
    line=lines[item['__line__'] - 1],
  File "/Users/gpflaum/ws/detect-secrets-pre-v1-launch/detect_secrets/transformers/yaml.py", line 33, in parse_file
    items = sorted(YAMLFileParser(file), key=lambda x: x.line_number)
  File "/Users/gpflaum/ws/detect-secrets-pre-v1-launch/detect_secrets/core/scan.py", line 130, in _get_transformed_file
    return transformer.parse_file(file)
  File "/Users/gpflaum/ws/detect-secrets-pre-v1-launch/detect_secrets/core/scan.py", line 50, in scan_file
    lines = _get_transformed_file(f)
  File "/Users/gpflaum/ws/detect-secrets-pre-v1-launch/detect_secrets/core/secrets_collection.py", line 42, in scan_file
    for secret in scan.scan_file(filename):
  File "/Users/gpflaum/ws/detect-secrets-pre-v1-launch/detect_secrets/core/baseline.py", line 30, in create
    secrets.scan_file(filename)
  File "/Users/gpflaum/ws/detect-secrets-pre-v1-launch/detect_secrets/main.py", line 47, in handle_scan_action
    secrets = baseline.create(*args.path, should_scan_all_files=args.all_files)
  File "/Users/gpflaum/ws/detect-secrets-pre-v1-launch/detect_secrets/main.py", line 26, in main
    handle_scan_action(args)
  File "/Users/gpflaum/ws/detect-secrets-pre-v1-launch/detect_secrets/__main__.py", line 7, in <module>
    sys.exit(main())

Parsing is successful when the file doesn't end with a blank line:

foo:
  - { bar: 'baz' }]

Here are several cases that don't fail when the file ends with a blank line.

Block style map of block style list:

foo: 
  - 
    bar: baz

Flow style map that isn't in a list:

foo: { bar: 'baz' }

Block style list of flow style lists:

foo:
  - [ "bar", "baz" ]

Flow style list of flow style maps:

foo: [ { bar: 'baz' } ]

@domanchi
Copy link
Contributor

domanchi commented Nov 30, 2020

oof. Nice find. Thanks for providing great test cases to check if implemented logic is accurate.

Let me see what I can do.

EDIT: Actually, I was unable to reproduce this.

$ cat blah.yaml
foo:
  - { bar: 'baz' }

$ python -i
>>> from detect_secrets.transformers.yaml import YAMLFileParser
>>> with open('blah.yaml') as f:
...    results = YAMLFileParser(f).json()
>>> import json
>>> print(json.dumps(results, indent=2))
{
  "foo": [
    {
      "bar": {
        "__value__": "baz",
        "__line__": 3,
        "__original_key__": "bar"
      }
    }
  ]
}

I'm aware that the line number is off (known issue, on my backlog), but at least it doesn't throw a parsing error. I have also verified I don't have any staged changes that may be affecting this reproducibility.

@gpflaum
Copy link
Contributor Author

gpflaum commented Nov 30, 2020

Can reproduce it with this:

$ python -i
Python 3.8.0 (default, May  7 2020, 02:49:39) 
[GCC 8.3.1 20191121 (Red Hat 8.3.1-5)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from detect_secrets.transformers.yaml import YAMLFileParser
>>> with open('blah.yaml') as f:
...     items = sorted(YAMLFileParser(f), key=lambda x: x.line_number)
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/data/ws/detect-secrets/detect_secrets/transformers/yaml.py", line 193, in __iter__
    line=lines[item['__line__'] - 1],
IndexError: list index out of range

@domanchi
Copy link
Contributor

Oh, of course 🤦 . The __iter__ method isn't called until we sort it.

@domanchi
Copy link
Contributor

I took a quick peek at this -- I think this is going to be a lot harder to address than initially expected.

Why it Fails on Master

I was initially very surprised to see it failing on master, given that the core YAML parsing logic is mostly unchanged. I tested this side by side with a json dump, and confirmed my hypothesis -- they are near identical (at least, for the points that matter).

  • master branch
{
  "foo": [
    {
      "bar": {
        "__value__": "baz",
        "__line__": 3,
        "__is_binary__": false,
        "__original_key__": "bar"
      }
    }
  ]
}
  • v1 branch
{
  "foo": [
    {
      "bar": {
        "__value__": "baz",
        "__line__": 3,
        "__original_key__": "bar"
      }
    }
  ]
}
  • file content
foo:
    - { bar: 'baz' }
# this is a new line

It turns out that the v1 branch attempts to output the line itself, so that it can be handled in the filters logic (rather than the master branch's approach to let the YAML parser indicate which lines should be excluded), and as a result, it raises the IndexError since the newline is stripped from YAMLFileParser.content.

But the problem lies even deeper than this.

Incorrect Line Numbers

As I alluded to before, my local testing seem to suggest that YAML file parsing does not provide accurate line numbers, as previously designed. This bug is an artifact of this.

We can see this by extending the number of new lines within the file content itself, and checking the JSON output of YAMLFileParser. Apparently, it looks like within the YAMLFileParser._compose_node_shim function, the nested dictionary is tagged with the line number corresponding to the number of lines in the file itself!

Bizarre. But we must go deeper.

YAML Versioning Issue?

This YAMLFileParser operates on the assumption that the yaml loader knows the proper line number when parsing the file. It's debatable whether it has worked in the past (for this edge case specifically), but it certainly doesn't work now. The incorrect line numbers are obtained from the upstream's data source, so if we wanted to fix this, we need to read more core YAML parsing code, and figure out what has changed.

I'm currently able to reproduce this on pyyaml==5.1.2

Proposed Solution

I'm going to wait for my pending v1 branches to be merged in, before taking another look at this. My current thoughts are:

  • pin the YAML version (if an older version works better)
  • modify the parsing code to extract the proper line numbers based on upstream library

@nathankooij
Copy link

nathankooij commented Mar 29, 2021

@domanchi we've also hit several .yaml files that cause the secrets scanner to crash using v1.0.3 (see below for the traceback). Although it does not seem to be related to ending with a blank line (adding or removing it makes no difference), it does seem related to your analysis above. I was wondering would it be helpful if I provided some of these files (after I strip it of some company specific details) for debugging purposes? Also happy to open another issue for it instead.

Traceback (most recent call last):
  File "/usr/local/bin/detect-secrets", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/detect_secrets/main.py", line 30, in main
    handle_scan_action(args)
  File "/usr/local/lib/python3.9/site-packages/detect_secrets/main.py", line 66, in handle_scan_action
    secrets = baseline.create(*args.path, should_scan_all_files=args.all_files)
  File "/usr/local/lib/python3.9/site-packages/detect_secrets/core/baseline.py", line 26, in create
    secrets.scan_file(filename)
  File "/usr/local/lib/python3.9/site-packages/detect_secrets/core/secrets_collection.py", line 42, in scan_file
    for secret in scan.scan_file(filename):
  File "/usr/local/lib/python3.9/site-packages/detect_secrets/core/scan.py", line 141, in scan_file
    for lines in _get_lines_from_file(filename):
  File "/usr/local/lib/python3.9/site-packages/detect_secrets/core/scan.py", line 245, in _get_lines_from_file
    lines = get_transformed_file(cast(NamedIO, f))
  File "/usr/local/lib/python3.9/site-packages/detect_secrets/transformers/__init__.py", line 31, in get_transformed_file
    return transformer.parse_file(file)
  File "/usr/local/lib/python3.9/site-packages/detect_secrets/transformers/yaml.py", line 34, in parse_file
    items = sorted(YAMLFileParser(file), key=lambda x: x.line_number)
  File "/usr/local/lib/python3.9/site-packages/detect_secrets/transformers/yaml.py", line 194, in __iter__
    line=lines[item['__line__'] - 1],
IndexError: list index out of range

@domanchi
Copy link
Contributor

domanchi commented Apr 8, 2021

Yes, please do. I've hit some of these myself, but have been unable to reproduce them on the master branch (despite no relevant changes to this processing logic). My hypothesis is that it may have something to do with the PyYAML version. What version are you running with?

@nathankooij
Copy link

nathankooij commented Apr 15, 2021

Yes, please do. I've hit some of these myself, but have been unable to reproduce them on the master branch (despite no relevant changes to this processing logic). My hypothesis is that it may have something to do with the PyYAML version. What version are you running with?

Sorry for the late reply. I'm using PyYAML 5.4.1. The following test cases are made using 1.1.0.

We used a slightly modified version of a default Verdaccio configuration file here, but the default one as presented there also fails on my machine with:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/picnic/.venv/lib/python3.9/site-packages/detect_secrets/core/secrets_collection.py", line 294, in _scan_file_and_serialize
    return list(scan.scan_file(filename))
  File "/home/picnic/.venv/lib/python3.9/site-packages/detect_secrets/core/scan.py", line 150, in scan_file
    for lines in _get_lines_from_file(filename):
  File "/home/picnic/.venv/lib/python3.9/site-packages/detect_secrets/core/scan.py", line 254, in _get_lines_from_file
    lines = get_transformed_file(cast(NamedIO, f))
  File "/home/picnic/.venv/lib/python3.9/site-packages/detect_secrets/transformers/__init__.py", line 31, in get_transformed_file
    return transformer.parse_file(file)
  File "/home/picnic/.venv/lib/python3.9/site-packages/detect_secrets/transformers/yaml.py", line 34, in parse_file
    items = sorted(YAMLFileParser(file), key=lambda x: x.line_number)
  File "/home/picnic/.venv/lib/python3.9/site-packages/detect_secrets/transformers/yaml.py", line 201, in __iter__
    line=lines[item['__line__'] - 1],
IndexError: list index out of range

Another one that fails similarly is (note I replaced values to hide some details, but it still fails):

postgresql_restarted_state: state
postgresql_databases:
  - name: some_database
postgresql_global_config_options:
  - option: listen_addresses
    value: "addresses"
postgresql_hba_entries:
  - { type: type, database: database, user: user, address: "127.0.0.1", auth_method: method }

I had another test case but that seems to be fixed as of 1.1.0.

@dbslayer
Copy link

@domanchi , i also face the same issue. Is there any ETA when this fix will be released?

@ZobairQ
Copy link

ZobairQ commented May 20, 2021

I am having the same problem.. any fix on the way?

jimmyhlee94 pushed a commit to jimmyhlee94/detect-secrets that referenced this issue Aug 19, 2021
* Clarify setting PYTHONPATH in dev environment

Updates CONTRIBUTING.md with details about why PYTHONPATH needs to be set in one's dev environment. Explains the consequences of not doing so.

* Remove random bullet point from top of README
@tiagocmelo
Copy link

Same problem here, guys.

@domanchi
Copy link
Contributor

Hi all,

Realized I left this thread in an unfinished state. As of mid-April, I lost write-access to this repository which inadvertently halted all development for this project.

We're currently trying to figure out what's the future of this project to migrate it into a platform where development can continue again, but until then, I'd suggest that if you are having the same issue, you can patch your library yourself with the PR I referenced above.

@jpdakran jpdakran linked a pull request Jan 11, 2022 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants