Git large-file storage #82

Open
jruhym opened this issue Feb 14, 2017 · 7 comments

Comments

@jruhym

jruhym commented Feb 14, 2017

It might be useful to store the files currently downloaded by 1.download.ipynb in Git Large File Storage (Git LFS). That way we can eliminate 1.download.ipynb and have the data files under version control.
https://git-lfs.github.com/

We still need to investigate whether git-lfs can be installed automatically as part of the conda environment.

@KT12
Contributor

KT12 commented Feb 15, 2017

Semi-related to this topic, would it make sense to pickle the X and Y matrices? It takes about 2 minutes for my machine to load each one every time I start a Notebook. I would think the un-pickling would be faster than this.

@dhimmel
Member

dhimmel commented Feb 15, 2017

> Semi-related to this topic, would it make sense to pickle the X and Y matrices?

@KT12 yes, great point. cognoml currently uses pickle for the reading speed-up.

I think it makes sense to use Git LFS to store these pickles in the machine-learning repo. And the cancer-data repo can store the compressed TSVs using Git LFS.
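
The caching idea discussed above can be sketched as follows. This is a minimal illustration using only the standard library, not the code cognoml uses; the file names `X.tsv` and `X.pkl`, the helper `load_matrix`, and the tiny sample contents are all hypothetical. With pandas, the same pattern would presumably rely on `pandas.read_csv`, `DataFrame.to_pickle`, and `pandas.read_pickle` instead.

```python
import csv
import os
import pickle
import tempfile


def load_matrix(tsv_path, cache_path):
    """Return the rows of a TSV file, using a pickle cache when one exists."""
    if os.path.exists(cache_path):
        # Fast path: deserialize the previously parsed rows.
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    # Slow path: parse the TSV, then write the cache for next time.
    with open(tsv_path, newline="") as handle:
        rows = list(csv.reader(handle, delimiter="\t"))
    with open(cache_path, "wb") as handle:
        pickle.dump(rows, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return rows


# Tiny stand-in for a real matrix file (hypothetical contents).
workdir = tempfile.mkdtemp()
tsv_path = os.path.join(workdir, "X.tsv")
cache_path = os.path.join(workdir, "X.pkl")
with open(tsv_path, "w", newline="") as handle:
    handle.write("sample_id\tgene_a\tgene_b\ns1\t0\t1\ns2\t1\t0\n")

first = load_matrix(tsv_path, cache_path)   # parses the TSV and writes the cache
second = load_matrix(tsv_path, cache_path)  # reads the pickle instead
```

The cache is only as fresh as the TSV it was built from, so a real version would probably also need some way to invalidate it when the source data changes.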

I reached out to GitHub support to see if we can get some LFS capacity.

@jruhym
Author

jruhym commented Feb 16, 2017

One can incorporate git-lfs via conda by adding the following lines to environment.yml:

    channels:
      - defaults
      - conda-forge
    dependencies:
      .
      .
      .
      - git-lfs=1.5.5

This will incorporate the dependency as found here.
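
After creating the environment, one way to confirm that git-lfs is actually available is to look for the `git-lfs` executable on the PATH. The sketch below assumes the conda package installs an executable with that name; the helper `git_lfs_version` is hypothetical and not part of the project.

```python
import shutil
import subprocess


def git_lfs_version():
    """Return the `git-lfs version` output, or None if git-lfs is not on PATH."""
    executable = shutil.which("git-lfs")
    if executable is None:
        return None
    result = subprocess.run(
        [executable, "version"], capture_output=True, text=True, check=False
    )
    return result.stdout.strip()


version = git_lfs_version()
print(version if version is not None else "git-lfs not found on PATH")
```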

I have tested this on my machine and it works. I know that @dhimmel is waiting to hear back from GitHub, but I tried Git LFS on a fork of machine-learning anyway and ran into a limitation: one cannot use Git LFS on a fork of a repo unless it is already used on the main project, as mentioned by technoweenie here. When I tried to push, I got the following error message:

    batch response: http: @jruhym can not upload new objects to public fork jruhym/machine-learning
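
Since a fork can only push LFS objects once the main project uses LFS, it can help to check whether a checkout of the main repository already tracks any paths with LFS. `git lfs track` records tracked patterns in `.gitattributes` with a `filter=lfs` attribute, so a rough check is to scan that file. The sketch below is a simplified parser (it does not handle quoted patterns containing spaces), the `.gitattributes` contents are made up for illustration, and a matching entry does not by itself guarantee that the server will accept uploads.

```python
def lfs_patterns(gitattributes_text):
    """Return the path patterns whose attributes include filter=lfs."""
    patterns = []
    for line in gitattributes_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        pattern, *attributes = line.split()
        if "filter=lfs" in attributes:
            patterns.append(pattern)
    return patterns


# Hypothetical .gitattributes contents, in the form `git lfs track` writes.
example = """\
# data files
*.pkl filter=lfs diff=lfs merge=lfs -text
*.tsv.bz2 filter=lfs diff=lfs merge=lfs -text
*.py text
"""
tracked = lfs_patterns(example)
print(tracked)
```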

@dhimmel
Member

dhimmel commented Feb 16, 2017

@jruhym nice... git-lfs as part of the conda environment will add convenience. Regarding channels, I think we'll want defaults to precede conda-forge.

@jruhym
Author

jruhym commented Feb 16, 2017

@dhimmel I updated my comment to address your suggestion.

@dhimmel
Member

dhimmel commented Feb 17, 2017

Okay, we now have LFS capacity on GitHub through their education program! Thanks @github for the generosity.

I will submit a pull request on cancer-data to add git-lfs. Then we can upload pickled versions here.

@dhimmel
Member

dhimmel commented Feb 28, 2017

See pandas-dev/pandas#13317 (comment)

Pickled files load almost instantly but are over 1 GB uncompressed.
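
If the uncompressed size becomes a concern for LFS storage, one option is to write the pickle through a compressor. The sketch below uses the standard-library `gzip` module on made-up, highly repetitive data, so the size reduction it shows says nothing about the real matrices, and compression generally costs some of the loading speed that motivates pickling in the first place. The file names are hypothetical.

```python
import gzip
import os
import pickle
import tempfile

# Stand-in for a large, repetitive matrix (hypothetical data).
matrix = [[0] * 1000 for _ in range(200)]

workdir = tempfile.mkdtemp()
raw_path = os.path.join(workdir, "X.pkl")
gz_path = os.path.join(workdir, "X.pkl.gz")

# Write the same object uncompressed and gzip-compressed.
with open(raw_path, "wb") as handle:
    pickle.dump(matrix, handle, protocol=pickle.HIGHEST_PROTOCOL)
with gzip.open(gz_path, "wb") as handle:
    pickle.dump(matrix, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Reading back goes through gzip.open as well.
with gzip.open(gz_path, "rb") as handle:
    restored = pickle.load(handle)

raw_size = os.path.getsize(raw_path)
gz_size = os.path.getsize(gz_path)
print(raw_size, gz_size)
```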

3 participants