Split Data Release Bundle in to multiple gzips #42

Open
ghost opened this issue Jul 29, 2022 · 3 comments

Comments

ghost commented Jul 29, 2022

Considering the growth of the data and the network access required to download the data release bundles, we want to split the data release output file into multiple batches.

Maybe restrict bundle sizes to 500 MB. The compressed bundle is currently ~2 GB for 350,000 samples (10 GB of actual data).
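For a rough sense of scale, the figures above can be turned into a quick estimate. This is only a sketch: it assumes decimal units (2 GB = 2000 MB) and that samples compress to roughly equal sizes, and `plan_bundles` is a hypothetical helper, not something that exists in the project.

```python
import math


def plan_bundles(compressed_mb, total_samples, max_bundle_mb):
    """Estimate how many bundles are needed and roughly how many samples each holds.

    Assumes every sample contributes about the same compressed size.
    """
    bundle_count = math.ceil(compressed_mb / max_bundle_mb)
    samples_per_bundle = math.ceil(total_samples / bundle_count)
    return bundle_count, samples_per_bundle


# Figures from this issue: ~2 GB compressed, 350,000 samples, 500 MB cap.
bundle_count, samples_per_bundle = plan_bundles(2000, 350_000, 500)
print(bundle_count, samples_per_bundle)  # 4 87500
```

Under those assumptions, a 500 MB cap would give about 4 bundles of roughly 87,500 samples each.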

Author

ghost commented Aug 9, 2022

Topic for discussion with the team: do we want to provide both the entire sample set and the split samples?
@caravinci, feel free to post your views here.

Contributor

justincorrigible commented Aug 9, 2022

When I suggested we could "multipart" the zip files, it was with increasingly larger data sets in mind. The two big files (1 TSV + 1 FASTA) would be broken into smaller parts while being compressed by the library, and would be put back together into the same two original files when the user decompresses them.
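To illustrate the multipart idea, here is a minimal sketch using Python's standard `gzip` module instead of whichever archive library the project actually uses. The helper names (`split_compressed`, `join_and_decompress`) and the tiny part size are assumptions for illustration only; the point is that the parts are byte slices of a single compressed stream, and only make sense once concatenated back together.

```python
import gzip


def split_compressed(data: bytes, part_size: int) -> list[bytes]:
    """Compress the data once, then cut the compressed stream into fixed-size parts."""
    compressed = gzip.compress(data)
    return [compressed[i:i + part_size] for i in range(0, len(compressed), part_size)]


def join_and_decompress(parts: list[bytes]) -> bytes:
    """Concatenate the parts in order and decompress back to the original bytes."""
    return gzip.decompress(b"".join(parts))


# Hypothetical TSV content standing in for the real data release file.
original = b"sample_id\tlineage\n" + b"S1\tB.1\n" * 1000
parts = split_compressed(original, part_size=16)
restored = join_and_decompress(parts)
print(restored == original)  # True
```

Note that with this approach no individual part can be decompressed on its own, so users still need to download every part to get any data.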

However, the use case for the whole dataset seems limited, and this seems to me like a solution for a very specific set of users.
Furthermore, @joneubank's suggestion to generate Delta archives would remove the need to offer multipart archives entirely, which seems more desirable to me, both from infrastructural and operational perspectives, as well as in terms of usefulness.

Author

ghost commented Aug 9, 2022

Actually, Delta archives are the solution for a very specific set of users, so I would put that on the back burner for now. I think splitting the data sets will be our best way forward from here. Thanks.
