Split Data Release Bundle into multiple gzips #42

ghost · 2022-07-29T16:32:10Z

Given the growth of the data and the network bandwidth required to download the data release bundles, we want to split the data release output file into multiple batches.

Maybe restrict bundle sizes to 500 MB. The compressed bundle size is ~2 GB for 350,000 samples (10 GB of actual data).
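
As a rough sizing sketch for the numbers above: assuming compressed size scales linearly with sample count and treating 1 GB as 1000 MB (both assumptions, not measured), a 500 MB cap works out to about four bundles. The function name `estimate_bundles` is hypothetical and only for illustration.

```python
import math


def estimate_bundles(total_compressed_mb: float, total_samples: int, max_bundle_mb: float):
    """Return (bundle_count, samples_per_bundle).

    Assumes compressed size grows linearly with the number of samples,
    which is an approximation; real compression ratios vary per sample.
    """
    bundle_count = math.ceil(total_compressed_mb / max_bundle_mb)
    samples_per_bundle = math.ceil(total_samples / bundle_count)
    return bundle_count, samples_per_bundle


# ~2 GB compressed, 350,000 samples, 500 MB cap per bundle
bundles, per_bundle = estimate_bundles(2000, 350_000, 500)
print(bundles, per_bundle)  # 4 87500
```

In practice the cap would likely need some headroom, since individual samples will not compress uniformly.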

ghost · 2022-08-09T16:51:00Z

Topic for discussion with the team: do we want to provide both the entire sample set and the split samples?
@caravinci feel free to post your views here.

justincorrigible · 2022-08-09T17:32:38Z

When I suggested we could "multipart" the zip files, it was with increasingly large data sets in mind. The two big files (1 TSV + 1 FASTA) would be broken into smaller parts while being compressed by the library, and put back together into the same two original big files when the user decompresses them.
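
A minimal in-memory sketch of that multipart idea, assuming plain gzip from the Python standard library rather than whichever archive library the service actually uses. The helper names `split_compressed` and `join_and_decompress` are hypothetical.

```python
import gzip
from typing import List


def split_compressed(data: bytes, part_size: int) -> List[bytes]:
    """Compress data once, then cut the compressed stream into fixed-size parts."""
    compressed = gzip.compress(data)
    return [compressed[i:i + part_size] for i in range(0, len(compressed), part_size)]


def join_and_decompress(parts: List[bytes]) -> bytes:
    """Concatenate the parts in order and decompress back to the original bytes."""
    return gzip.decompress(b"".join(parts))


# Stand-in for one of the big files (e.g. the TSV); contents are made up.
original = b"sample_id\tlineage\n" * 10_000
parts = split_compressed(original, part_size=64)

assert all(len(p) <= 64 for p in parts)
assert join_and_decompress(parts) == original
```

Note that with this approach the parts are not independently usable: a user needs every part, in order, before anything can be decompressed. That is the trade-off against splitting the samples themselves into separate, self-contained gzips.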

However, the use for the whole dataset seems limited, and this seems to me like a solution for a very specific set of users.
Furthermore, @joneubank's suggestion to generate Delta archives would remove the need to offer multipart archives entirely, which seems more desirable to me, from both infrastructural and operational perspectives as well as in terms of usefulness.

ghost · 2022-08-09T17:44:07Z

Actually, Delta archives are the solution for a very specific set of users, so I would put that on the back burner for now. I think splitting the data sets will be our best way forward from here. Thanks.
