fix: memory efficient backed anndata read #1021

Draft: wants to merge 8 commits into base: main

nayib-jose-gloria (Contributor) commented Sep 25, 2024

Reason for Change

  • anndata "backed" mode reads still load the X matrix into memory. Update read_h5ad to actually read backed objects.
  • The feature_is_filtered validation check previously loaded the X matrix into memory whenever the bool column var.feature_is_filtered had any True entry. It would 1) subset the X matrix to only the genes where feature_is_filtered is True and 2) validate (without chunking) that this subsetted matrix had only zero values, as declared by the feature_is_filtered column values.
  • As a result, the X matrix was loaded into memory in order to 1) subset it and then 2) count nonzero values, so use cases that triggered the feature_is_filtered check required higher resource provisioning.

Changes

  • In the feature_is_filtered validation check, leverage the existing matrix nonzero counting function, _count_matrix_nonzero, which already implements matrix chunking
    - pass in a new optional arg, filter_by_column, to subset each matrix chunk to the columns where feature_is_filtered is True before counting nonzeros
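
A rough sketch of the idea, for reviewers. This is not the code in this PR: `count_nonzero_chunked`, `chunk_size`, and `column_mask` are hypothetical names standing in for `_count_matrix_nonzero` and `filter_by_column`, and the sketch assumes a dense matrix that supports row slicing:

```python
import numpy as np


def count_nonzero_chunked(matrix, chunk_size=1000, column_mask=None):
    """Count nonzero entries one row chunk at a time.

    matrix: 2-D object supporting row slicing (e.g. a NumPy array or an
        HDF5-backed dataset); only one chunk is held in memory at a time.
    column_mask: optional boolean array over columns; when given, only
        columns where it is True contribute to the count.
    """
    total = 0
    for start in range(0, matrix.shape[0], chunk_size):
        # Materialize just this block of rows.
        chunk = np.asarray(matrix[start:start + chunk_size])
        if column_mask is not None:
            # Subset the chunk, not the full matrix, to the masked columns.
            chunk = chunk[:, column_mask]
        total += int(np.count_nonzero(chunk))
    return total


# Toy data: columns 0 and 2 are flagged as filtered and hold only zeros.
X = np.array([
    [0, 1, 0, 2],
    [0, 0, 0, 3],
    [0, 4, 0, 0],
    [0, 0, 0, 0],
    [0, 5, 0, 6],
])
feature_is_filtered = np.array([True, False, True, False])

all_nonzero = count_nonzero_chunked(X, chunk_size=2)
filtered_nonzero = count_nonzero_chunked(X, chunk_size=2, column_mask=feature_is_filtered)
```

In this sketch the validation passes when `filtered_nonzero` is 0. Peak memory is bounded by the chunk size rather than by the size of X, since the column subset is applied per chunk.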

2 participants