Provide audit hook events during package installation #8938

Open
RootLUG opened this issue Sep 29, 2020 · 12 comments
Labels
state: awaiting PR Feature discussed, PR is needed type: feature request Request for a new feature

Comments

@RootLUG

RootLUG commented Sep 29, 2020

What's the problem this feature will solve?
This feature will enable third party tools to intercept package installations that could provide features such as:

  • audit installed code for security purposes before any code execution takes place (such as running setup.py which could contain malware)
  • verify installed packages, for example, code signing wheels or individual packages (PGP or other mechanisms)
  • check for typosquatting or against blacklist/whitelist of packages

This is just an example list of features that the third party implementations can provide by listening to the installation audit hooks.

Describe the solution you'd like
pip can leverage the audit hook functionality described in PEP 578, which is native in Python 3.8+ and aims to provide visibility into the Python process for external monitoring systems, to fire (custom) sys audit events before installation takes place.

The audit event (for example "pip.install") should also pass metadata about the installation in the same manner (a tuple of arguments) as the builtin Python events already do.
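To illustrate the mechanism, here is a minimal sketch of firing and receiving such an event with the standard sys.audit() and sys.addaudithook() calls (Python 3.8+). The event name "pip.install" and the three example arguments are assumptions taken from this proposal; pip does not emit this event today.

```python
import sys

received = []  # argument tuples seen by the hook, kept for inspection


def monitor(event, args):
    # Audit hooks see every event raised in the process, so filter by name.
    if event == "pip.install":
        received.append(args)


sys.addaudithook(monitor)

# Hypothetical event name and arguments, shown only to illustrate the shape.
sys.audit("pip.install", "simplewheel", "1.0", "wheel")
```

The hook receives the positional arguments of sys.audit() as a single tuple. Note that a hook added with sys.addaudithook() cannot be removed again for the lifetime of the process.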

Proposed example of audit arguments:

  • audit hook or pip version number as first argument so the format can be changed or extended in future pip versions if needed
  • type of the installed package which would allow differentiating between wheels, sdists, local installs from a file/directory, git repositories etc...
  • name of the package
  • URL of the package
  • package specifiers such as "==2.0.1" parsed from the requirements file
  • dependency chain (x, y, z, ...) where x is the package being installed (name) and y is the parent package which has x as the dependency, z has a dependency y and so on up to the root level/package
  • hash (from requirements file)
  • filename such as simplewheel-1.0-py2.py3-none-any.whl
  • local path to the downloaded file (preferable), or the directory that contains the package files that are going to be installed (e.g. the directory that contains setup.py, pyproject.toml, etc.)
  • additional flags (int combined using bitwise or) to denote attributes such as editable install, is_pinned, update (package is being updated)

Since not all installation/invocation methods provide the necessary attributes, those that are missing (for example hash of the package) should be replaced with None if not available. All of the proposed example metadata arguments are already available in the InstallRequirement class from which they can be extracted into the tuple and passed to the audit hook.
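The argument list above could be sketched as follows. All names here (InstallFlags, InstallEvent, the field names, the field order and the format version 1) are hypothetical and only mirror the bullets in this proposal; nothing below is an existing pip API.

```python
import sys
from enum import IntFlag
from typing import NamedTuple, Optional, Tuple


class InstallFlags(IntFlag):
    """Hypothetical attribute flags, combined with bitwise or."""
    EDITABLE = 1
    PINNED = 2
    UPDATE = 4


class InstallEvent(NamedTuple):
    """Hypothetical argument layout following the proposed list."""
    format_version: int
    package_type: str
    name: str
    url: Optional[str]
    specifier: Optional[str]
    dependency_chain: Tuple[str, ...]
    expected_hash: Optional[str]
    filename: Optional[str]
    local_path: Optional[str]
    flags: int


captured = []


def capture(event, args):
    if event == "pip.install":
        captured.append(args)


sys.addaudithook(capture)

event = InstallEvent(
    format_version=1,
    package_type="wheel",
    name="simplewheel",
    url=None,            # not available for this invocation
    specifier="==1.0",
    dependency_chain=("simplewheel", "parentpkg"),
    expected_hash=None,  # no hash given in the requirements file
    filename="simplewheel-1.0-py2.py3-none-any.whl",
    local_path=None,
    flags=int(InstallFlags.PINNED | InstallFlags.UPDATE),
)
sys.audit("pip.install", *event)  # the hook receives a plain tuple
```

Keeping the format version in the first position lets a consumer check it before unpacking the rest of the tuple.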

These arguments should be sufficient for an external monitoring tool or listening audit hook to make an informed decision about the package and, by raising an exception inside the hook handler, to prevent its installation and any code execution (e.g. running setup.py). As described in the PEP, an exception raised inside the audit hook should not be caught by pip but propagated further, resulting in an unhandled exception, maybe after cleanup of temporary data created by pip?
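A rough sketch of that behaviour, under the assumption that the event carries the package name as its first argument: the hook vetoes the installation by raising, the installer does not catch the exception, and a try/finally block still removes temporary data. The install() function, the BLOCKED set and the InstallBlocked exception are stand-ins invented for this example, not pip code.

```python
import shutil
import sys
import tempfile
from pathlib import Path

BLOCKED = {"reqeusts"}  # hypothetical denylist of typosquatted names
workdirs = []  # temporary directories created by the stand-in installer


class InstallBlocked(Exception):
    """Raised by the policy hook to veto an installation."""


def policy_hook(event, args):
    # "pip.install" and its (name,) argument are assumptions for this sketch.
    if event == "pip.install" and args[0] in BLOCKED:
        raise InstallBlocked(f"{args[0]} is on the denylist")


sys.addaudithook(policy_hook)


def install(name):
    """Stand-in for pip's install step; not pip's real code."""
    workdir = Path(tempfile.mkdtemp(prefix="pip-sketch-"))
    workdirs.append(workdir)
    try:
        sys.audit("pip.install", name)  # a hook exception propagates from here
        return f"installed {name}"  # reached only if no hook objected
    finally:
        shutil.rmtree(workdir, ignore_errors=True)  # cleanup runs either way
```

Calling install("reqeusts") raises InstallBlocked to the caller, while install("requests") completes; in both cases the temporary directory is gone afterwards.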

There could also be other audit hooks fired by pip, such as on uninstall of a package or on pip invocation itself (e.g. pip.invoked with sys.argv as the audit tuple arguments).

Alternative Solutions
Entry points would allow almost the same functionality, but there might be a few additional problems. The first is speed: the entry points would need to be imported during pip invocation, which could add delay. Exceptions thrown during that time (entry point import) could also cause pip to crash if not handled properly. I believe the system audit hooks are superior because they fit nicely into the Python ecosystem: this is the reason audit hooks were designed in the first place, and using them avoids reinventing the wheel. Also, the cited PEP would provide a better reference for decisions such as the above-mentioned exception throwing.

Additional context
There have already been similar tickets and discussions about providing "plugin" or "hook" functionality that extends the installation process or gives visibility into (auditing of) packages that are going to be installed, the closest feature probably being #1035.

There are a few distinctions between the previous discussion/feature request and firing an audit hook. The discussion in that ticket got steered toward how signature verification should be implemented correctly, since getting the cryptography right is difficult. The same could be argued about audit hooks; however, they are not designed to provide security mechanisms or sandboxing but merely visibility into the black box that Python is. The same principle can be applied to pip installing packages, as pip is the de facto default tool in all modern Python installations.

I understand the reluctance to provide a public API, as that brings maintainability problems. This could be mitigated by carefully selecting the attributes that are passed as metadata to the audit hook and, in combination with version numbers, future-proofing against any changes that might occur. Alternatively, the maintainability problem could be resolved almost completely by just passing (<pip_version>, <pip._internal.req.InstallRequirement object instance>) as the hook arguments and leaving the extraction of the necessary information to the monitoring system itself.
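The second alternative might look like the sketch below, where the consumer pulls out what it needs with getattr() and tolerates missing attributes. A SimpleNamespace stands in for the InstallRequirement instance so the example runs without importing pip internals; the attribute names used and the "20.2" version string are illustrative assumptions.

```python
import sys
from types import SimpleNamespace

extracted = []


def hook(event, args):
    if event != "pip.install":
        return
    pip_version, req = args
    # Read attributes defensively: pip's internal object may change shape
    # between versions, so a missing attribute becomes None.
    extracted.append({
        "pip_version": pip_version,
        "name": getattr(req, "name", None),
        "editable": getattr(req, "editable", None),
        "expected_hash": getattr(req, "expected_hash", None),
    })


sys.addaudithook(hook)

# Stand-in for an InstallRequirement instance (not the real class).
req = SimpleNamespace(name="simplewheel", editable=False)
sys.audit("pip.install", "20.2", req)
```

The trade-off is that the stability burden moves to the consumer, which now depends on the layout of a private object.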

@uranusjr
Member

For anyone wanting to ask in the future, the API pip would need to use is sys.audit()

@uranusjr uranusjr added the type: feature request Request for a new feature label Sep 29, 2020
@dstufft
Member

dstufft commented Sep 30, 2020

I wonder if it makes sense to make this a pip feature, or should we have a PEP? I imagine a number of these audit hooks would make sense not just for pip to provide them, but for other tools too. However I've never really used the Python audit hooks stuff to know if that's a crazy idea or not.

@pfmoore
Member

pfmoore commented Sep 30, 2020

This feature will enable third party tools to intercept package installations

That line says to me that it should be a standard, as the relevant tools would quite probably want to be able to catch conda install, for example, as well as pip installs.

If it were to be a PEP, I'd propose something like "Standard Audit Events for Package Installations", which would define what information tools should audit (and specify that the auditing should be via the core Python APIs), leaving it to individual tools like pip to decide how to implement it.

Beyond that, though, I don't really have much of an opinion, as I'm not that familiar with how or why auditing tools handle this sort of thing.

@RootLUG
Author

RootLUG commented Oct 1, 2020

Agreed that it would be very beneficial to have this standardized as a PEP, since alternatives such as poetry could use the same mechanism. I personally don't have any experience with the process of creating and proposing a PEP, so I probably can't do it alone, but I am more than happy to help, for example by researching the internals of other package managers so we can define the attribute set as universally as possible. Is there anything I can do to help or implement from my side?

@pradyunsg
Member

@RootLUG A good place to start would be figuring out good points for the hook, i.e. at which point in the process of handling a package pip would call the audit hook.

@RootLUG
Author

RootLUG commented Oct 6, 2020

Alright, so I took a look at pip internals and found that, in my opinion, the ideal location for inserting the audit hook would be:

scheme = get_scheme(

There were also other locations that I found, but they did not pass testing against the following conditions:

  • hook is fired before any code execution from the package takes place (e.g. executing setup.py as it can contain malware or code of interest)
  • it is called every time regardless of the type of installation (package upgrade, new package install, installation of local files/packages)
  • hook is called each time even if package installation is aborted

The last point caused a bit of trouble, as I found some potential entry points where the hook could be inserted but which failed that condition. To explain a little more: pip is run to install a package (pip install something) but the installation is aborted (^C by the user, or an exception raised inside the audit hook). When pip install something is then executed again, the audit hook should fire again. Other locations I found for inserting the hook were skipped due to caching mechanisms, such as a file from PyPI being already downloaded and/or unpacked, and the location I posted was the only one I was able to find that passed all these conditions. Ideally, the hook should be as close to the actual installation phase as possible, since some properties of the InstallRequirement object are filled in later, and the more information that can be passed to the auditing system the better. That part of the code is the last central point before the installation steps branch off to the different install types (wheels, sdists, etc.).

Tested using tox and the following matrix of environments: py{27,38}-pip{10,18,19,20,20.2}
I included the previous pip versions just to check whether there were any frequent changes in pip internals that could have broken the behavior, which would hint at a maintainability problem in the future.

Are there any other conditions that I should check for at that location?

@pradyunsg
Member

This hook would be fired after we run setup.py egg_info.

@pradyunsg
Member

Consider looking at #6607 (comment), for context on how pip handles packages. Basically everything marked "install", "legacy", "modern" or "develop-install" in that graph results in code execution (except for wheel installs).

@zooba
Contributor

zooba commented Nov 29, 2021

FWIW, I wouldn't worry too much about standardising audit events across tools. They're low-level enough that you should only worry about pip's.

Name them pip.<whatever>, list them in your own docs, and really avoid changing the arguments (since consumers will likely write in C and when the arguments change it can be very hard to track down the cause of the exceptions). Anyone trying to make use of them will gladly handle multiple events from different sources - if you were to standardise them, you'd need to provide a "source" argument on every event anyway to know where it was coming from, so may as well just leave it in the event name.
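As a small sketch of that point, a consumer can tell sources apart from the event name alone, without a separate "source" argument. Both event names below are hypothetical; neither pip nor conda is known to emit them.

```python
import sys
from collections import Counter

by_source = Counter()  # number of install events seen per tool


def hook(event, args):
    # Split "pip.install" into the tool prefix and the action.
    source, _, action = event.partition(".")
    if action == "install" and source in {"pip", "conda"}:
        by_source[source] += 1


sys.addaudithook(hook)

sys.audit("pip.install", "simplewheel")
sys.audit("conda.install", "simplewheel")  # hypothetical event from another tool
```

Supporting a further tool is then a matter of adding its prefix to the set, with tool-specific argument handling if the layouts differ.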

@pradyunsg
Member

Let's make these audit hooks pip-specific. If someone wants to pick this up, please say so here and let us know how you're thinking of implementing this! :)

@pradyunsg pradyunsg added the state: awaiting PR Feature discussed, PR is needed label Mar 26, 2022
@steve-s

steve-s commented Mar 28, 2022

It would be great if this hook also got an argument with a local path to the directory containing the sources that are (verbatim) going to be used for the subsequent installation process. For example, if the installation requires extracting some archive and then installing the package from that, the argument would be the path to these extracted sources.

The reasons I am suggesting this:

  • the audit can go further and inspect that the archive was extracted to "non malicious" bits. I think that, in general, the closer the audit gets to the point of "these exact bits will be installed" the better.
  • the audit would not have to worry about archive formats if all it wants is to, for example, scan all the sources for some malicious code pattern
  • it could also be used to patch the sources before installation. In GraalPython we maintain patches for some Python packages to deal with some incompatibilities, and unfortunately we have to patch pip itself to apply them before the installation.
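The scanning use case from the list above could be sketched like this, assuming the event passes the extracted source directory as its first argument. The event name, the argument position, the SuspiciousSource exception and the toy pattern list are all assumptions made for illustration.

```python
import sys
import tempfile
from pathlib import Path

SUSPICIOUS = ("os.system(", "eval(")  # toy patterns, not a real scanner


class SuspiciousSource(Exception):
    """Raised when an extracted source file matches a toy pattern."""


def scan_hook(event, args):
    if event != "pip.install":
        return  # also skips the "open" events caused by reading files below
    source_dir = Path(args[0])
    for path in source_dir.rglob("*.py"):
        text = path.read_text(encoding="utf-8", errors="replace")
        if any(pattern in text for pattern in SUSPICIOUS):
            raise SuspiciousSource(str(path))


sys.addaudithook(scan_hook)

# Demo with a throwaway directory standing in for extracted sources.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "setup.py").write_text("import os\nos.system('echo hi')\n")
    try:
        sys.audit("pip.install", tmp)
        flagged = False
    except SuspiciousSource:
        flagged = True
```

Because the hook only walks a directory, it does not need to know whether the package arrived as a wheel, an sdist or a git checkout.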

@zooba
Copy link
Contributor

zooba commented Mar 28, 2022

Events are cheap, no reason not to raise them before/after each major operation. For anyone watching logs, it'll also help correlate the rush of filesystem events that get logged in between, so they can tell that they're related to a specific pip operation.
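One way to get such paired events is a small context manager that wraps each operation. The ".begin"/".end" naming and the operation names are hypothetical choices for this sketch, not something agreed in this thread.

```python
import sys
from contextlib import contextmanager

log = []  # pip-prefixed event names in the order they were raised


def hook(event, args):
    if event.startswith("pip."):
        log.append(event)


sys.addaudithook(hook)


@contextmanager
def audited(operation, *args):
    """Raise an event before and after the wrapped operation."""
    sys.audit(f"pip.{operation}.begin", *args)
    yield
    # Reached only if the wrapped block did not raise.
    sys.audit(f"pip.{operation}.end", *args)


with audited("download", "simplewheel"):
    pass  # stand-in for the real download step

with audited("unpack", "simplewheel"):
    pass  # stand-in for the real unpack step
```

Anything logged between a ".begin" and its matching ".end" can then be attributed to that pip operation.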
