Rewrite Github History #404

Closed
yvovandoorn opened this issue May 19, 2016 · 7 comments

@yvovandoorn
Contributor

  • Removes large PNG files
  • Removes large PDF files

Rewriting history is usually left to politicians. In this case it's good to trim down the repo history so that those getting started get a repo with a smaller download size.

@bridgetkromhout was interested in owning once the dust had settled.

@yvovandoorn
Contributor Author

yvovandoorn commented May 21, 2016

So I wanted to learn how to do this and ran it as an experiment on a forked version of devopsdays-web. I'm sure there are more practical ways, but it was a good exercise nonetheless.

References of inspiration:

Step 1 was installing bfg:
brew install bfg

I started by cloning a regular copy of the repo (my fork):
git clone git@github.com:yvovandoorn/devopsdays-web.git

Then I needed a list of every SHA in the history and a list of big objects:
git rev-list --objects --all | sort -k 2 > allfileshas.txt

git gc && git verify-pack -v .git/objects/pack/pack-*.idx | egrep "^\w+ blob\W+[0-9]+ [0-9]+ [0-9]+$" | sort -k 3 -n -r > bigobjects.txt
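As a rough alternative sketch (not what was run in this thread), the SHA, size, and path can be collected in one pipeline with git cat-file --batch-check. The throwaway repo and the file names small.txt and big.bin are made up for illustration.

```shell
# Build a throwaway repo so the pipeline has something to list.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
git init -q demo && cd demo || exit 1

printf 'small\n' > small.txt
head -c 4096 /dev/zero > big.bin   # 4096 bytes of filler
git add small.txt big.bin
git -c user.name=demo -c user.email=demo@example.com commit -q -m 'add files'

# rev-list prints "sha path"; cat-file echoes the path back through %(rest).
# Keep blobs only and sort by size, largest first.
# Note: awk's $4 holds only the first word of a path that contains spaces.
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  awk '$1 == "blob" { print $3, $2, $4 }' |
  sort -k 1 -n -r > bigobjects_named.txt

head -n 1 bigobjects_named.txt
```

This reports the uncompressed object size, whereas verify-pack also shows the size inside the packfile, so the two orderings can differ slightly.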

So I have a list of all the files and a list of bigobjects. Time to match 'em up:

for SHA in $(cut -f 1 -d' ' < bigobjects.txt); do echo $(grep $SHA bigobjects.txt) $(grep $SHA allfileshas.txt) | awk '{print $1,$3,$7}' >> bigtosmall.txt; done
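The per-SHA grep loop rescans both files for every object. A minimal sketch of the same match done in a single awk pass is below; the SHAs, sizes, and paths in the sample files are invented, and paths containing spaces would need extra handling.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Same shape as the filtered verify-pack output: sha type size packed-size offset
cat > bigobjects.txt <<'EOF'
aaaa111 blob 900000 850000 12
bbbb222 blob 500 400 99
EOF

# Same shape as `git rev-list --objects --all`: sha path
cat > allfileshas.txt <<'EOF'
aaaa111 static/big.png
bbbb222 README.md
cccc333 static/other.pdf
EOF

# While reading the first file, remember the path for each SHA.
# While reading the second, print sha, size, and the remembered path.
awk 'NR == FNR { path[$1] = $2; next }
     ($1 in path) { print $1, $3, path[$1] }' allfileshas.txt bigobjects.txt > bigtosmall.txt

cat bigtosmall.txt
```

Because bigobjects.txt is already sorted by size, the output keeps the big-to-small order.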

My mission was just to target PDFs and PNGs:
egrep "png|pdf" bigtosmall.txt > allpdf_and_png_files.txt

So now I have a list of PDFs and PNGs, but some of them still exist in HEAD (like a lot of logo.pngs). This is where I got a little creative and captured the negative output, i.e. the errors for files that are no longer there.

for i in $(awk '{ print $3 }' allpdf_and_png_files.txt ); do file $i; done | grep "No such file" > removed_files.txt

Then I ran it through awk, sort and uniq to weed out any potential duplicates (sometimes the SHA + blob matching results in duplicates).

awk -F":" '{ print $1 }' removed_files.txt | sort | uniq > removed_files_list.txt
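A sketch of the same "is it gone from the working tree?" check using a plain existence test instead of parsing the error text from file. It assumes the command is run from the root of the checkout; the sample list and the file names in it are made up.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Pretend working tree: logo.png still exists, old-speaker.png does not.
mkdir -p static
: > static/logo.png

# Same shape as allpdf_and_png_files.txt: sha size path
cat > allpdf_and_png_files.txt <<'EOF'
aaaa111 900000 static/old-speaker.png
bbbb222 500 static/logo.png
cccc333 700 static/old-speaker.png
EOF

# Take the path column, de-duplicate, and keep only paths that no longer exist.
awk '{ print $3 }' allpdf_and_png_files.txt | sort -u |
while IFS= read -r path; do
    [ -e "$path" ] || printf '%s\n' "$path"
done > removed_files_list.txt

cat removed_files_list.txt
```

This folds the filter, sort, and uniq steps into one pass and does not depend on the exact wording of the "No such file" message.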

This got me a list of files that were truly gone from HEAD and shouldn't be part of the history anymore. Having used git filter-branch in the past, I knew it was a PITA because it takes forever. I remembered BFG, but BFG is path-independent: it matches on file names, not full paths (ref: http://stackoverflow.com/questions/21142986/remove-filenames-from-specific-path). I needed assurance that I was removing only the files that needed to be removed. BFG does have --strip-blobs-with-ids <blob-ids-file> as an option, so now I needed the blob IDs of all the files I wanted to remove.

for file in $(cat removed_files_list.txt); do git log --format=%H -- $file | xargs -IcommitId git rev-parse commitId:$file; done > blobs.txt

This produced some fatal errors, but only because some of the files we're trying to remove were no longer there in subsequent commits, so that's not a bad thing. The resulting file listed all the blobs that needed to be removed. I cleaned it up (some entries for the initial commit of a file needed to be reduced to "blob ID only").

awk -F":" '{ print $1 }' blobs.txt > blobs-with-ids
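A possible shortcut, sketched under the assumption that allfileshas.txt from the earlier rev-list step is still around: since that file already pairs object SHAs with paths, the blob IDs can be picked out by path without walking the log. The sample SHAs and paths are invented. One caveat: rev-list lists each object once, under a single path, so a blob that also lived under another path is only matched by the path it was listed under.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# "sha path" pairs; commit lines carry a SHA only.
cat > allfileshas.txt <<'EOF'
1111aaa
2222bbb static/events/old.png
3333ccc static/events/old.png
4444ddd static/img/logo.png
EOF

printf '%s\n' static/events/old.png > removed_files_list.txt

# First file: the set of wanted paths.
# Second file: strip the leading SHA, compare the remainder (the full path,
# spaces included) against the set, and print the SHA on a match.
awk 'NR == FNR { wanted[$0] = 1; next }
     { sha = $1; sub(/^[^ ]+ /, ""); if ($0 in wanted) print sha }' \
    removed_files_list.txt allfileshas.txt | sort -u > blobs-with-ids

cat blobs-with-ids
```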

Now I had the blobs!

The BFG instructions call for a mirrored (bare) clone, so off I went to clone my repo again:

git clone --mirror git@github.com:yvovandoorn/devopsdays-web.git devopsdays-web-mirror

I copied my blobs-with-ids file to ~/Development (directory that hosts all my git repos) and ran:
bfg -bi blobs-with-ids devopsdays-web-mirror

Which produced a result indicating all deleted files.

I changed into the mirrored directory and ran (according to BFG instructions):
git reflog expire --expire=now --all && git gc --prune=now --aggressive

Now for a final git push to my fork.

I did a mkdir size-diff && cd size-diff && git clone git@github.com:yvovandoorn/devopsdays-web.git devopsdays-web-cleaned && git clone git@github.com:devopsdays/devopsdays-web.git

Then ran du -sh *(/), which produced:

Development/size-diff » du -sh *(/)
321M    devopsdays-web
247M    devopsdays-web-cleaned

Result: 74MB trimmed!
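The *(/) in that du command is a zsh glob qualifier (directories only); in other shells, du -sh -- */ does roughly the same. A small sketch of computing the difference rather than subtracting by eye follows. The directory names original and cleaned and the random filler files are stand-ins for the two clones.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-ins for the two clones: 2 MiB versus 1 MiB of random data.
mkdir original cleaned
dd if=/dev/urandom of=original/data.bin bs=1024 count=2048 2>/dev/null
dd if=/dev/urandom of=cleaned/data.bin bs=1024 count=1024 2>/dev/null

# du -sk reports kilobytes, which keeps the arithmetic in integers.
before_kb=$(du -sk original | awk '{ print $1 }')
after_kb=$(du -sk cleaned | awk '{ print $1 }')
saved_kb=$((before_kb - after_kb))

echo "trimmed ${saved_kb} KB"
```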

Sources of inspiration

@mattstratton
Member

Very cool and this is a useful write up!

@bridgetkromhout
Collaborator

Very cool. Go ahead and do it for real and write a blog post from these notes!

@yvovandoorn
Contributor Author

And done!

323M    devopsdays-web
246M    devopsdays-web-new

@bridgetkromhout
Collaborator

bridgetkromhout commented May 22, 2016

Gah. I had a branch I created off of mattstratton/redesign, and didn't check to make sure it was up to date. When I pushed it up, it brought all the giant files back.

Falling asleep here soon. Will try to reverse in the morning. (I deleted the branch - bridget-program-updates - but the files it brought in are still in history. I suspect this is going to happen to more people than just me.)

@yvovandoorn
Contributor Author

yvovandoorn commented May 22, 2016

OK, I redid this, this time with git filter-branch.

Generating SHAs & objects
git rev-list --objects --all | sort -k 2 > allfileshas.txt
git gc && git verify-pack -v .git/objects/pack/pack-*.idx | egrep "^\w+ blob\W+[0-9]+ [0-9]+ [0-9]+$" | sort -k 3 -n -r > bigobjects.txt

Combining SHAs & objects with names, ordered by size from big to small
for SHA in $(cut -f 1 -d' ' < bigobjects.txt); do echo $(grep $SHA bigobjects.txt) $(grep $SHA allfileshas.txt) | awk '{print $1,$3,$7}' >> bigtosmall.txt; done

Filtering out just PDFs and PNGs
egrep "png|pdf" bigtosmall.txt > allpdf_and_png_files.txt

Grabbing no longer existing PDFs and PNGs
for i in $(awk '{ print $3 }' allpdf_and_png_files.txt ); do file $i; done | grep "No such file" > removed_files.txt

Get rid of all the other things, I just care about files.
awk -F":" '{ print $1 }' removed_files.txt | sort | uniq > removed_files_list.txt

sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/ /g' removed_files_list.txt
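The sed one-liner above joins every line of the list into a single space-separated line. A shorter sketch of the same join uses paste; the three file names here are placeholders.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' static/a.png static/b.pdf static/c.png > removed_files_list.txt

# -s joins all lines of the file into one; -d ' ' uses a space as the separator.
joined=$(paste -s -d ' ' removed_files_list.txt)

printf '%s\n' "$joined"
```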

Then I got lazy and just posted all the files at once (yes I could've just done a cat removed_files_list_lines but meh).
git filter-branch -f --index-filter "git rm -rf --cached --ignore-unmatch static/events/2016-amsterdam/devopsdays_2016_-_sponsor_prospectus.pdf static/events/2016-amsterdam/speakers/adam-jacob.png static/events/2016-amsterdam/speakers/andrew-farley.png static/events/2016-amsterdam/speakers/avishai-ish-shalom.png static/events/2016-amsterdam/speakers/bernd-erk.png static/events/2016-amsterdam/speakers/christoph-andreas-torlinsky.png static/events/2016-amsterdam/speakers/daniel-van-gils.png static/events/2016-amsterdam/speakers/desmond-delissen.png static/events/2016-amsterdam/speakers/erica-baker.png static/events/2016-amsterdam/speakers/gopal-ramachandran.png static/events/2016-amsterdam/speakers/hannah-foxwell.png static/events/2016-amsterdam/speakers/jason-yee.png static/events/2016-amsterdam/speakers/jody-wolfborn.png static/events/2016-amsterdam/speakers/ken-mugrage.png static/events/2016-amsterdam/speakers/marco-ceppi.png static/events/2016-amsterdam/speakers/melanie-rieback.png static/events/2016-amsterdam/speakers/michael-friedrich.png static/events/2016-amsterdam/speakers/pavel-chunyayev.png static/events/2016-amsterdam/speakers/robert-den-broeder.png static/events/2016-amsterdam/speakers/simon-fisher.png static/events/2016-amsterdam/speakers/stefan-stolzle.png static/events/2016-amsterdam/speakers/takahiko-ito.png static/events/2016-amsterdam/speakers/victoria-jeffrey.png static/events/2016-amsterdam/speakers/warner-moore.png static/events/2016-amsterdam/speakers/will-button.png static/events/2016-london/presenters/images/Philippe-Guenet.png static/events/2016-london/speakers/andi-mann.png static/events/2016-london/speakers/benjamin-wootton.png static/events/2016-london/speakers/casey-west.png static/events/2016-london/speakers/claire-agutter.png static/events/2016-london/speakers/gareth-rushgrove.png static/events/2016-london/speakers/gene-kim.png static/events/2016-london/speakers/jeromy-carriere.png static/events/2016-london/speakers/joanne-molesky.png 
static/events/2016-london/speakers/john-clapham.png static/events/2016-london/speakers/justin-cormack.png static/events/2016-london/speakers/kris-saxton.png static/events/2016-london/speakers/oliver-wood.png static/events/2016-london/speakers/thiago-almeida.png static/events/2016-minneapolis/speakers/allan-espinosa.png static/events/2016-minneapolis/speakers/ben-zvan.png static/events/2016-minneapolis/speakers/charity-majors.png static/events/2016-minneapolis/speakers/jeff-smith.png static/events/2016-minneapolis/speakers/megan-carney.png static/events/2016-minneapolis/speakers/nicole-forsgren.png static/events/2016-minneapolis/speakers/sarah-goff-dupont.png static/img/events/2016-atlanta/logo.png static/img/events/2016-chicago/logo.png static/img/events/2016-minneapolis/logo.png static/img/sponsors/2016-chef.png static/img/sponsors/2016-thoughtworks.png static/img/sponsors/thoughtworks.png themes/devopsdays-responsive/static/images/devopsdays-brain.png" -- --all
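Instead of pasting every path by hand, the index filter can be built from the list file. The sketch below does that in a throwaway repo; the repo, keep.txt, and old.png are made up, and it assumes none of the paths contain spaces or shell metacharacters.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
git init -q demo && cd demo || exit 1
git config user.name demo
git config user.email demo@example.com

# History: old.png is added in one commit and deleted in the next.
printf 'keep\n' > keep.txt
printf 'pretend this is large\n' > old.png
git add keep.txt old.png && git commit -q -m 'add files'
git rm -q old.png && git commit -q -m 'remove old.png'

# Keep the list outside the work tree so the tree stays clean.
printf '%s\n' old.png > ../removed_files_list.txt
files=$(paste -s -d ' ' ../removed_files_list.txt)

# Rewrite every ref, dropping the listed paths from each commit's index.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f \
  --index-filter "git rm -rf --cached --ignore-unmatch $files" -- --all >/dev/null 2>&1

# Count commits on the rewritten branch that still touch old.png.
git rev-list HEAD -- old.png | wc -l
```

The original commits stay reachable through refs/original until the cleanup step removes them, which is why the check looks at HEAD rather than at all refs.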

Clean up local repo
rm -rf .git/refs/original/
git reflog expire --expire=now --all
git gc --prune=now
git gc --aggressive --prune=now

And off it goes
git push --all --force

@yvovandoorn
Contributor Author

Closing this issue as the history has been re-written and @devopsdays/web group is aware of the re-write.
