
Add gzip decompression from file #48

Open
wants to merge 1 commit into base: main
Conversation

DerAndereJohannes

Hi.
I generally prefer to store my AcqKnowledge files in gzip format to save quite a bit of storage space. With this package, I usually have to add a bit of boilerplate to my code to read the gzipped file into an io.BytesIO object and then pass that to bioread.

This works fine, but I was wondering whether you would be open to letting bioread.read accept .gz files directly, which would handle this boilerplate automatically.
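For reference, the boilerplate in question might look roughly like this (a sketch; `open_maybe_gzipped` is a hypothetical helper name, not part of bioread):

```python
import gzip
import io
import tempfile

def open_maybe_gzipped(path):
    """Return a seekable file-like object for path.

    If the path ends in .gz, decompress the whole file into an
    io.BytesIO first, since downstream readers may need to seek.
    """
    if str(path).endswith(".gz"):
        with gzip.open(path, "rb") as f:
            return io.BytesIO(f.read())
    return open(path, "rb")

# Round-trip demo with a throwaway gzipped file standing in for an .acq.gz.
with tempfile.NamedTemporaryFile(suffix=".acq.gz", delete=False) as tmp:
    gz_path = tmp.name
with gzip.open(gz_path, "wb") as gz:
    gz.write(b"fake acq bytes")
data = open_maybe_gzipped(gz_path).read()
```

The resulting object could then be handed to bioread, assuming it accepts a file-like object as the workflow above suggests.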

Thank you in advance for your time.

@njvack
Member

njvack commented Jul 15, 2024

Wait, why not use the compressed .acq format? That compresses a whole heck of a lot better than gzipping interleaved data and doesn't require any extra steps to read.

I'm not super keen on adding .acq.gz as a natively-supported format, I think

@DerAndereJohannes
Author

Oof, this is quite a way to find out that there are already ways to compress these files natively. Funnily enough, at least for the files I receive, gzipping them manually gets me pretty good results. Here is an example file from a recent recording:

gzip.exe --list .\2024-07-10-005.acq.gz
 compressed        uncompressed  ratio uncompressed_name
   62798861           228915926  72.6% .\2024-07-10-005.acq

I'm guessing that since you use zlib to decompress, the compression ratios must be somewhat similar to these. Or does AcqKnowledge not store the data in an interleaved fashion when it compresses?

Thanks for the heads-up and the clarification; I will go look into the native AcqKnowledge options. I understand that this pull request may be less relevant than I had thought.

@njvack
Member

njvack commented Jul 15, 2024

That is a pretty good ratio!

But yes, when you're using the built-in compression, each channel's data is chunked together before compression; since physio data tend to be quite autocorrelated, this does better than just running the normal interleaved file through gzip.
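The effect described here can be illustrated with a toy experiment (made-up synthetic data, not actual .acq internals): compress the same two-channel payload once with each channel's bytes chunked together and once interleaved, and compare the compressed sizes.

```python
import hashlib
import zlib

def pseudo_random_bytes(n, seed=b"physio"):
    """Deterministic, essentially incompressible bytes from chained SHA-256."""
    out = bytearray()
    block = seed
    while len(out) < n:
        block = hashlib.sha256(block).digest()
        out.extend(block)
    return bytes(out[:n])

n = 100_000
chan_a = pseudo_random_bytes(n)  # stand-in for a noisy channel
chan_b = bytes(n)                # stand-in for a flat, highly redundant channel

grouped = chan_a + chan_b        # each channel's data chunked together
interleaved = bytes(b for pair in zip(chan_a, chan_b) for b in pair)

size_grouped = len(zlib.compress(grouped, 9))
size_interleaved = len(zlib.compress(interleaved, 9))
```

Grouping lets the compressor exploit each channel's internal redundancy, while interleaving breaks up the runs, so the grouped layout compresses smaller in this sketch.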

I'm kinda torn; this is a super clean PR, it doesn't add much code, and it would make life easier for at least some folks. Let me think on it a little bit, and very seriously, thank you for your contribution.

@DerAndereJohannes
Author

Thanks for the info. I went to the lab today and checked out the compression options; you are right that the native compression beats gzip (not really a surprise):

Raw file: 228915926 bytes
Gzip compression: 62798874 bytes (72.6% ratio)
Acq compression: 40934100 bytes (82.1% ratio)

I should probably have tested this on more files, but I did not have enough time for that today.

There are really only two arguments for the non-native compression:

  • It does not require the AcqKnowledge software / dongle to perform
  • You retain the ability to append to the file if you reimport it into AcqKnowledge (after decompression)

I am also torn on what I should do. By automatically gzipping in my current pipeline, I don't have to worry about the files that people send me and am guaranteed pretty good results regardless of whether they are natively compressed or not. But you are right, this is probably not that orthodox.

Thank you for your time!
