fix: use memory efficient matrix subsetting to validate feature_is_filtered #1024

Merged

nayib-jose-gloria
Contributor

Reason for Change

  • The feature_is_filtered validation check previously loaded the full X matrix into memory whenever the boolean column var.feature_is_filtered had any True entry. It would 1) subset the X matrix to only the genes where feature_is_filtered is True and 2) validate, without chunking, that this subsetted matrix contained only zero values, as the feature_is_filtered column declares.
  • As a result, any dataset that triggered the feature_is_filtered check required higher resource provisioning, since X had to be loaded into memory to 1) subset it and then 2) count its nonzero values.

Changes

  • In the feature_is_filtered validation check, use the existing matrix nonzero-counting function, _count_matrix_nonzero, which already implements matrix chunking.
    - Pass in a new optional arg, filter_by_column, to subset each matrix chunk to the feature_is_filtered True columns before counting nonzeros.
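
For illustration, the chunked approach described above can be sketched roughly as follows. This is a minimal sketch and not the code from this PR: the function name `count_matrix_nonzero`, its `chunk_size` parameter, and the use of a dense NumPy array are all assumptions made for the example. The actual `_count_matrix_nonzero` signature and its matrix handling may differ.

```python
import numpy as np


def count_matrix_nonzero(matrix, filter_by_column=None, chunk_size=2):
    """Count nonzero entries one row chunk at a time.

    Hypothetical stand-in for _count_matrix_nonzero; the signature is assumed.
    filter_by_column is an optional boolean mask over columns. When given,
    each chunk is subset to the masked columns before counting.
    """
    total = 0
    n_rows = matrix.shape[0]
    for start in range(0, n_rows, chunk_size):
        # Only this slice of rows is materialized at a time.
        chunk = matrix[start:start + chunk_size]
        if filter_by_column is not None:
            # Subset the chunk, not the full matrix.
            chunk = chunk[:, filter_by_column]
        total += int(np.count_nonzero(chunk))
    return total


# Toy data: 5 cells x 4 genes; genes 1 and 2 are flagged as filtered.
X = np.array([
    [1, 0, 0, 2],
    [0, 0, 0, 3],
    [4, 0, 0, 0],
    [0, 5, 0, 0],
    [0, 0, 0, 6],
])
feature_is_filtered = np.array([False, True, True, False])

# Validation passes only if the filtered columns contain no nonzero values.
nonzero_in_filtered = count_matrix_nonzero(X, filter_by_column=feature_is_filtered)
```

In this toy matrix one nonzero value sits in a filtered column, so the check would flag the dataset. Because the column subset is applied per chunk, peak memory scales with the chunk size rather than with the full matrix.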

@nayib-jose-gloria nayib-jose-gloria marked this pull request as ready for review September 27, 2024 19:19
@nayib-jose-gloria nayib-jose-gloria merged commit e9d7581 into ebezzi/anndata-backed-fix Sep 27, 2024
@nayib-jose-gloria nayib-jose-gloria deleted the nayib/feature-is-filtered-mem-fix branch September 27, 2024 19:20