fix: memory efficient backed anndata read #1021

nayib-jose-gloria · 2024-09-25T13:34:16Z

Reason for Change

anndata "backed" mode reads still load the X matrix into memory. Update read_h5ad to actually read backed objects.
The feature_is_filtered validation check previously loaded the X matrix into memory if the boolean column var.feature_is_filtered had any True entry. It would 1) subset the X matrix to only the genes where feature_is_filtered is True and 2) validate (without chunking) that this subsetted matrix contained only zero values, as declared by the feature_is_filtered column.
As a result, the X matrix was loaded into memory in order to 1) subset it and 2) count nonzero values, which required higher resource provisioning for any use case that triggered the feature_is_filtered check.

In the feature_is_filtered validation check, leverage the existing matrix nonzero counting function, _count_matrix_nonzero, which implements matrix chunking.
- Pass in a new optional arg, filter_by_column, to subset each matrix chunk to the columns where feature_is_filtered is True before counting nonzeros.
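
For reviewers, the chunked approach can be sketched roughly as follows. This is an illustration only, not the validator's actual code: the function name `count_matrix_nonzero`, the `chunk_size` parameter, and the toy matrix are assumptions made for the example. Only the idea of a `filter_by_column` argument applied per chunk comes from this PR.

```python
import numpy as np
from scipy import sparse


def count_matrix_nonzero(matrix, filter_by_column=None, chunk_size=1000):
    """Count nonzero entries by walking the matrix in row chunks.

    Hypothetical stand-in for the validator's chunked counter. If
    filter_by_column (a boolean mask over columns) is given, each chunk
    is subset to the masked columns before counting, so the full matrix
    is never subset in one step.
    """
    # Convert the boolean mask to integer column indices once, up front.
    cols = None if filter_by_column is None else np.flatnonzero(filter_by_column)

    total = 0
    for start in range(0, matrix.shape[0], chunk_size):
        chunk = matrix[start:start + chunk_size]  # only this row slice is materialized
        if cols is not None:
            chunk = chunk[:, cols]
        if sparse.issparse(chunk):
            total += int(chunk.count_nonzero())
        else:
            total += int(np.count_nonzero(chunk))
    return total


# Toy data: columns 1 and 2 are flagged as filtered and hold only zeros.
X = sparse.csr_matrix(np.array([[1, 0, 0, 2],
                                [0, 0, 0, 3],
                                [4, 0, 0, 0]]))
feature_is_filtered = np.array([False, True, True, False])

filtered_nonzero = count_matrix_nonzero(X, filter_by_column=feature_is_filtered, chunk_size=2)
total_nonzero = count_matrix_nonzero(X, chunk_size=2)
```

In this sketch, a filtered count of zero corresponds to the check passing: every column flagged in feature_is_filtered contains only zeros. Peak memory is bounded by one row chunk rather than by the full subsetted matrix.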

ebezzi and others added 8 commits August 21, 2024 15:25

Fix for reading anndata files in backed mode

7debf97

fix

981c107

fix

cc91e69

fix

864c60f

Merge branch 'main' into ebezzi/anndata-backed-fix

a6fd84b

fix: use memory efficient matrix subsetting to validate feature_is_fi…

e9d7581

…ltered (#1024)

lint

4864abf

fix import

de8db74