Saving raw h5ad files upon running snap.pp.import_data #344

Open
joseverdezoto opened this issue Oct 4, 2024 · 2 comments

Comments

@joseverdezoto

Hey Kai,

Thanks so much for developing snapATAC2. I've been using it for a bit and really like it! I do have one question about working with files in backed mode. When we import fragment files and work in backed mode, I realize the on-disk files get updated whenever we modify the adatas (e.g., filtering low-QC cells). However, if we wanted to go back to the raw h5ad files, those would no longer be "raw", right? Ideally, I'd love to save the raw h5ad files so I don't have to re-import the data.

I wasn't sure what the best-practice workflow for that would be. I naively decided to copy the newly created h5ad files into a raw dir when I run snap.pp.import_data for new datasets, so I can re-use the raw h5ad files later if needed. However, when checking the files after the fact, it looks like the files I copied into the raw dir are the smaller ones (likely updated after QC), rather than the files I keep in the processing dir and meant to update. So I have a couple of questions:

  1. If we import data with snap.pp.import_data and make a copy of the h5ad files, would the adata object reference the original or the copied h5ad file on disk?
  2. Would you have any suggestions for saving the raw h5ad files upon importing data?

Thanks!

@kaizhang
Owner

kaizhang commented Oct 6, 2024

  1. If we import data with snap.pp.import_data and make a copy of the h5ad files, would the adata object reference the original or the copied h5ad file in disk?

It will be a full copy, not a reference. The smaller size is expected and comes from the design of the HDF5 format. Once you store something in an HDF5 file, you cannot really get rid of it. Instead, you simply mask it so that it is no longer accessible, but it still takes up space. When you save the HDF5 object to a new file, only the necessary parts are written.

  2. Would you have any suggestions for saving the raw h5ad files upon importing data?

Just use ".write()" at any point you want to save the current state.
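
As a rough sketch of what that could look like right after import: the helper below is hypothetical (the name `save_raw_snapshot` and the directory layout are assumptions, not part of snapATAC2) and relies only on the object having a `.write(filename)` method. It refuses to overwrite an existing snapshot so the raw copy cannot be clobbered by accident.

```python
from pathlib import Path


def save_raw_snapshot(adata, raw_dir, sample_name):
    """Write the current state of `adata` to <raw_dir>/<sample_name>.h5ad.

    Hypothetical helper: it only assumes `adata` has a `.write(filename)` method.
    """
    raw_dir = Path(raw_dir)
    raw_dir.mkdir(parents=True, exist_ok=True)
    target = raw_dir / f"{sample_name}.h5ad"
    if target.exists():
        # Never overwrite an existing raw snapshot.
        raise FileExistsError(f"raw snapshot already exists: {target}")
    adata.write(str(target))
    return target


# Possible usage right after import, before any filtering
# (import arguments are whatever your own pipeline already uses):
#   adata = snap.pp.import_data(...)
#   save_raw_snapshot(adata, "/path/to/raw_dir", "sample1")
```

Note that the target here is a file path inside the raw directory, not the directory itself.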


@joseverdezoto
Author

joseverdezoto commented Oct 7, 2024

Thank you for the prompt response! Just to make sure I understand: if I import N fragment files to create N h5ad objects and I want to immediately save them to a raw-state dir, I'd just have to do something like adata.write("/path/to/raw_dir") instead of copying, right? I was looking at the API and noticed there's both adata.write() and adata.copy(). What's the difference?

And then, say I wanted to load the raw files later to play around with some QC thresholds but still keep a raw version of them: should I copy those h5ad files before loading them, or load them and use adata.write() to preserve the raw state?

Sorry if this is a bit of a redundant question, but just want to make sure we're handling files at different states of the processing workflow appropriately.

Thanks!
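
For the "copy before loading" option, one possible pattern using only the Python standard library is sketched below. It is not something the maintainer confirmed in this thread, and the helper name `working_copy` and the two-directory layout are assumptions. The idea is that the raw file is marked read-only and only a copy is ever opened in backed mode, so any on-disk changes land in the copy.

```python
import shutil
import stat
from pathlib import Path


def working_copy(raw_path, work_dir):
    """Copy a raw .h5ad file into `work_dir` and return the path of the copy.

    Hypothetical helper: the raw file should be closed (not open in any session).
    """
    raw_path = Path(raw_path)
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)

    # Clear the write bits on the raw file to guard against accidental edits.
    mode = raw_path.stat().st_mode
    raw_path.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))

    target = work_dir / raw_path.name
    # copyfile copies the contents only, not the permission bits, so the
    # copy stays writable. An existing working copy is replaced.
    shutil.copyfile(raw_path, target)
    return target


# Possible usage: open only the returned path in backed mode and leave
# the file in the raw directory untouched.
#   path = working_copy("/path/to/raw_dir/sample1.h5ad", "/path/to/processing_dir")
```

Calling the helper again resets the working copy to the raw state, which may be convenient when trying different QC thresholds.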
