This repository has been archived by the owner on Sep 9, 2020. It is now read-only.

add "unused-files" pruning mode #1614

Closed
MarkusTeufelberger opened this issue Feb 1, 2018 · 1 comment
@MarkusTeufelberger

https://golang.github.io/dep/docs/Gopkg.toml.html#prune currently defines 3 pruning modes:

non-go just removes files that are not relevant to Go anyway (e.g. .travis.yml files, READMEs and so on)

go-tests also removes test files

unused-packages even removes some Go code that is definitely not being used, since it is not referenced by the package that "owns" the vendor folder

I'd like to propose a pruning mode that is even more rigorous than unused-packages:

unused-files would remove every file that does not influence the (hash of the) binary built from the package that "owns" the vendor folder. This means that a package built against a vendor folder containing only unused-files-pruned dependencies has to produce the same binary as the same package built against an unpruned vendor folder. Conversely, it must not be possible to remove any further set of files/directories/symlinks from an unused-files-pruned dependency without changing the compilation result. In other words, every file/directory/symlink whose removal does not change the resulting binary has been removed, even if it sits in the same folder as an imported package. For example, a Go file that contains only comments, or only declarations that are not used and thus stripped away later, would not be vendored in the first place with this strategy.

A naive approach would be to take an initial measurement with an unpruned vendor folder, get a list of the files/folders/symlinks in the folder to be pruned, and run a ddmin (delta debugging minimization) algorithm (e.g. https://github.com/dgryski/go-ddmin) over that list, with the criterion that the binary hash must still equal the initial one. Unfortunately, the remaining list of files/folders/symlinks is not guaranteed to be a global minimum, but it would be at least 1-minimal (removing any single entry from that list would change the outcome). This can be sped up by various heuristics (e.g. it is VERY likely that pruning with the existing strategies first would already shrink the list of candidate files/folders/symlinks considerably, while pure ddmin would struggle for a while).

The advantages compared to the existing pruning strategies:

  • No need to decide if/how/which tests, testdata or symlinks are relevant, there is an objective measurement to decide if they are necessary
  • Minimizes the amount of data that needs to be checked in while keeping files unmodified (only rewriting code - e.g. stripping comments - would get any smaller than this)
  • Guarantees the same behavior as the unpruned version (the current strategies just assume this property I guess?)
  • The same strategy could be used to identify dead code/unused files in the actual code base too.
  • Fewer issues with build systems like Bazel that try to build every Go file they find (and then fail due to optional dependencies of vendored code; take a look at bazelbuild/bazel-gazelle#93, "In vendor and external repos, only generate rules needed to resolve dependencies", for an example of the kind of pain this causes)

Downsides:

  • Computationally expensive if done by (re)compiling and comparing hashes
  • If more features of a vendored dependency are being used, it might become necessary to first get an unpruned version from upstream and then re-prune (this is already a potential issue with unused-packages too)
  • You are only guaranteed to have the code you currently need available in your vendor folder, not a full "insurance" against vanishing upstreams (this is a general issue with pruning)
@sdboyer
Member

sdboyer commented Jul 23, 2018

hi! Sorry i didn't respond to this earlier - after releasing v0.4.1, my attention turned entirely to rebutting vgo.

This is an interesting approach, but the computational costs would likely undermine its utility for the big wins with verification, as now realized in #1912.

i'll keep this in mind for the successor tool, though!

sdboyer closed this as completed Jul 23, 2018