Saving raw h5ad files upon running snap.pp.import_data #344

joseverdezoto · 2024-10-04T15:24:23Z

Hey Kai,

Thanks so much for developing snapATAC2. I've been using it for a bit and really like it! I do have one question about working with files in backed mode. When we import fragment files and work in backed mode, I realize the on-disk files get updated whenever we modify the adatas (e.g., filtering low-QC cells). However, if we wanted to go back to the raw h5ad files, those would no longer be "raw", right? Ideally, I'd love to save the raw h5ad files so I don't have to re-import the data.

I wasn't sure what would be the best practices workflow for that. I naively decided to copy the newly created h5ad files into a raw dir when I run snap.pp.import_data for new datasets, so I can re-use the raw h5ad files later if needed. However, when checking the files after the fact, it looks like the files I copied into the raw dir are smaller (likely updated after QC) instead of the ones I keep in the processing dir and meant to update. So I have a couple of questions:

If we import data with snap.pp.import_data and make a copy of the h5ad files, would the adata object reference the original or the copied h5ad file on disk?
Would you have any suggestions for saving the raw h5ad files upon importing data?

Thanks!

kaizhang · 2024-10-06T01:48:17Z

If we import data with snap.pp.import_data and make a copy of the h5ad files, would the adata object reference the original or the copied h5ad file in disk?

It will be a full copy, not a reference. The smaller size is expected and comes from the design of the HDF5 format. Once you store something in an HDF5 file, you cannot really get rid of it. Instead, you simply mask it so that it is no longer accessible, but it still takes up space. When you save the HDF5 object to a new file, only the parts that are still needed are written.
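As a rough illustration of the HDF5 behavior described above, here is a minimal sketch using h5py directly rather than snapATAC2. The file names and the dataset names "drop" and "keep" are made up for the example, and it assumes h5py and numpy are installed. It deletes a large dataset from a file, checks that the file does not shrink, and then copies the remaining contents into a fresh file, which comes out much smaller.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as workdir:
    original = os.path.join(workdir, "original.h5")    # hypothetical file name
    compacted = os.path.join(workdir, "compacted.h5")  # hypothetical file name

    # Write a large dataset first, then a small one after it.
    with h5py.File(original, "w") as f:
        f.create_dataset("drop", data=np.zeros(1_000_000, dtype="f8"))  # ~8 MB
        f.create_dataset("keep", data=np.zeros(1_000, dtype="f8"))      # ~8 KB
    size_before = os.path.getsize(original)

    # Deleting the large dataset only unlinks it; the space it occupied
    # stays allocated inside the file.
    with h5py.File(original, "a") as f:
        del f["drop"]
    size_after_delete = os.path.getsize(original)

    # Copying what is still reachable into a new file writes only those parts.
    with h5py.File(original, "r") as src, h5py.File(compacted, "w") as dst:
        for name in src:
            src.copy(name, dst)
    size_compacted = os.path.getsize(compacted)

print("before delete:", size_before)
print("after delete: ", size_after_delete)
print("new file:     ", size_compacted)
```

The same effect would explain why a file written fresh after filtering is smaller than one that was filtered in place.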

Would you have any suggestions for saving the raw h5ad files upon importing data?

Just use ".write()" at any point you want to save the current state.

joseverdezoto · 2024-10-07T13:10:51Z

Thank you for the prompt response! Just to make sure I understand: if I import N fragment files to create N h5ad objects and I want to immediately save them to a raw-state dir, I'd just have to do something like adata.write("/path/to/raw_dir") instead of copying, right? I was looking at the API and noticed there's both adata.write() and adata.copy(). What's the difference?

And then, say I wanted to load the raw files later to play around with some QC thresholds but would still like to keep a raw version of them: should I copy those h5ad files before loading them, or load them and use adata.write() to preserve the raw state?

Sorry if this is a bit of a redundant question, but just want to make sure we're handling files at different states of the processing workflow appropriately.

Thanks!
