Pipenv git repository too large due to stored wheels #3680

Closed
con-f-use opened this issue Apr 5, 2019 · 5 comments
Labels
Category: Development Issue affects development workflow. Category: Tests Relates to tests. Priority: Medium This item is medium priority and will be resolved whenever possible. Type: Discussion This issue is open for discussion. Type: Enhancement 💡 This is a feature or enhancement request.

Comments

@con-f-use

con-f-use commented Apr 5, 2019

I noticed the pipenv git repository takes a long time to clone because of its size. This is because a number of wheels (binary data) are stored directly in the repo rather than using git LFS or other means.

The problem will only grow with time, when different versions of the wheels get committed, because the old ones will still be part of the repo and git cannot make smart diffs with binary data as it can with text.

Please find another solution to storing wheels for tests.

$ git clone git@github.com:pypa/pipenv.git .
Cloning into 'pipenv'...
remote: Enumerating objects: 30627, done.
remote: Total 30627 (delta 0), reused 0 (delta 0), pack-reused 30627
Receiving objects: 100% (30627/30627), 224.03 MiB | 1.20 MiB/s, done.
Resolving deltas: 100% (20898/20898), done.
$ du -sh .
462M	.
$ du -sh tests/pypi
214M	tests/pypi
@con-f-use con-f-use changed the title from "Pipenv git repository too large" to "Pipenv git repository too large due to stored wheels" Apr 5, 2019
@frostming frostming added Type: Enhancement 💡 This is a feature or enhancement request. Category: Tests Relates to tests. labels Apr 8, 2019
@jayvdb
Contributor

jayvdb commented Apr 9, 2019

Creating a separate repo for them would be very useful for other projects too, as it becomes a canonical set of test data for all the strange packages, etc. that are possible. cf. sarugaku/requirementslib#145

@techalchemy techalchemy added Category: Development Issue affects development workflow. Priority: Medium This item is medium priority and will be resolved whenever possible. Type: Discussion This issue is open for discussion. labels Apr 10, 2019
@techalchemy
Member

I'd be interested in feedback about how people think this should be handled. Will LFS actually help with wheels? I don't believe it will do much for tarballs.

@con-f-use
Author

con-f-use commented Apr 10, 2019

Short answer: LFS will help, especially in the long run.

The problem with wheels (and any non-text data) directly in a git repo is this:

Git cannot track the changes. Whenever a binary file changes by even a single bit, git will think it's a completely different file and store both versions, old and new, in their entirety, not just the changes. So if you have a 25 MB wheel in your repo and you commit a new version of the same wheel that is 26 MB, the whole repo will now be 51 MB, even though little actually changed between the two versions of the wheel. That's why the pipenv repo is currently 462 MB in size, even though all wheels combined in the latest commit are just 214 MB. The difference is older or deleted wheels in historic commits.
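To check how much of the size comes from historic blobs, one option is to list the largest blobs reachable from any ref. The following is a rough sketch, not something taken from the pipenv tooling; `largest_blobs` is a hypothetical helper name, and the sizes it reports are uncompressed object sizes, not on-disk pack sizes.

```shell
# Hypothetical helper: print "<size-in-bytes> <path>" for the largest
# blobs reachable from any ref, including blobs deleted in later commits.
# Usage: largest_blobs [repo-dir] [how-many]
largest_blobs() {
  repo="${1:-.}"
  count="${2:-10}"
  # rev-list emits "<sha> <path>"; cat-file looks up each object and
  # passes the path through via %(rest).
  git -C "$repo" rev-list --objects --all |
    git -C "$repo" cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '$1 == "blob"' |
    cut -d' ' -f2- |
    sort -rn |
    head -n "$count"
}
```

Running `largest_blobs . 20` inside a clone would show whether wheels under `tests/pypi` dominate the history.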

Git with LFS stores just links to the files and fetches them as necessary. The links are tiny. The problem is that you're now stuck with the historic commits, because you can't (or rather shouldn't) rewrite commit history. LFS will just prevent things from getting worse and make it possible to "delete" historic files that are no longer needed, without them cluttering the repo's history and making it larger and larger as time progresses. NEVER commit binary data to git repositories, kids!
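For reference, turning on LFS for wheels could look roughly like the sketch below. It assumes the `git-lfs` extension is installed; `enable_lfs_for_wheels` is a hypothetical helper name and the `*.whl` pattern is only an example. It affects wheels committed from now on and does not touch blobs already in history.

```shell
# Hypothetical helper: start tracking wheels with Git LFS in one repo.
# Usage: enable_lfs_for_wheels [repo-dir]
enable_lfs_for_wheels() {
  repo="${1:-.}"
  # Install the LFS filters and hooks for this repository only.
  git -C "$repo" lfs install --local
  # Record the pattern in .gitattributes so new wheels become pointers.
  git -C "$repo" lfs track "*.whl"
  # .gitattributes must be committed for other clones to pick it up.
  git -C "$repo" add .gitattributes
}
```

After this, newly added `*.whl` files are stored as small pointer files in git, with the actual content kept in LFS storage.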

@techalchemy
Member

@con-f-use thanks for the info, that's actually super useful. I have had limited success wiping artifacts from the tree in the past and I am always loath to do that kind of sweeping history modification (although it is admittedly necessary).

So let's make a path forward here -- let's say we create a separate repo, and let's say we turn on LFS properly etc. What are our options for scrubbing / shrinking the size of this repository? Again, I have some experience doing that but with limited success, and I'd be hesitant to push that back up to the remote after destroying the reflog / history.

@frostming
Contributor

We are currently using submodules to store PyPI artifacts, so I think this issue can be closed now.
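A minimal sketch of the submodule approach, for anyone doing the same in another project. The helper names and the default `tests/pypi` path are illustrative assumptions, and the artifacts URL is passed in by the caller; none of this is copied from pipenv's actual setup.

```shell
# Hypothetical helper: attach an artifacts repository as a submodule.
# Usage: add_artifacts_submodule <repo-dir> <artifacts-url> [path]
add_artifacts_submodule() {
  repo="$1"
  url="$2"
  path="${3:-tests/pypi}"
  # Records the URL in .gitmodules and stages a pointer to one commit
  # of the artifacts repo; the wheels themselves stay out of this repo.
  git -C "$repo" submodule add "$url" "$path"
}

# Hypothetical helper: fetch submodule contents in an existing clone.
# Usage: fetch_artifacts [repo-dir]
fetch_artifacts() {
  git -C "${1:-.}" submodule update --init --recursive
}
```

With this layout a plain `git clone` stays small, and only people who run the tests need to call `fetch_artifacts` (or clone with `--recurse-submodules`).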
