fix: use memory efficient matrix subsetting to validate feature_is_filtered #1024

nayib-jose-gloria · 2024-09-26T21:21:55Z

Reason for Change

The feature_is_filtered validation check previously loaded the full X matrix into memory whenever the boolean column var.feature_is_filtered had any True entry. It would 1) subset the X matrix to only the genes where feature_is_filtered is True and 2) validate (without chunking) that this subsetted matrix contained only zero values, as declared by the feature_is_filtered column.
As a result, any use case that triggered the feature_is_filtered check required higher resource provisioning, because the whole X matrix had to fit in memory to be subset and have its nonzero values counted.

In the feature_is_filtered validation check, leverage the existing matrix nonzero-counting function, _count_matrix_nonzero, which already implements matrix chunking.
- Pass in a new optional arg, filter_by_column, to subset each matrix chunk to the columns where feature_is_filtered is True before counting nonzeros.
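
For illustration, here is a minimal sketch of the chunked approach described above. It is not the code in this PR: the function name count_nonzero_chunked, the chunk_size parameter, and the dense NumPy input are assumptions, and the real _count_matrix_nonzero signature may differ. Only filter_by_column is taken from the description.

```python
import numpy as np


def count_nonzero_chunked(matrix, filter_by_column=None, chunk_size=2):
    """Count nonzero entries one row chunk at a time.

    Hypothetical stand-in for a chunked nonzero counter. If
    filter_by_column (a boolean mask over columns) is given, each
    chunk is subset to the masked columns before counting, so only
    one chunk is ever materialized at a time.
    """
    total = 0
    for start in range(0, matrix.shape[0], chunk_size):
        chunk = matrix[start:start + chunk_size]
        if filter_by_column is not None:
            # Keep only the columns flagged True in the mask.
            chunk = chunk[:, filter_by_column]
        total += int(np.count_nonzero(chunk))
    return total


# Toy data: 3 cells x 4 genes, with genes 1 and 3 marked as filtered.
X = np.array([
    [1, 0, 0, 2],
    [0, 0, 3, 0],
    [4, 0, 0, 0],
])
feature_is_filtered = np.array([False, True, False, True])

# A nonzero count above 0 means a filtered feature has nonzero values.
filtered_nonzeros = count_nonzero_chunked(X, filter_by_column=feature_is_filtered)
```

In this toy case filtered_nonzeros is 1 (the value 2 in the last column), so the check would flag the dataset. Peak memory scales with chunk_size rather than with the full matrix, assuming the matrix supports lazy row slicing.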

nayib-jose-gloria added 2 commits September 26, 2024 17:12

fix: use memory efficient matrix subsetting to validate feature_is_fi…

2c17a2a

…ltered

remove debug logger

c0d712f

nayib-jose-gloria requested review from ebezzi and Bento007 September 26, 2024 21:22

fix tests

8db83c6

nayib-jose-gloria marked this pull request as ready for review September 27, 2024 19:19

nayib-jose-gloria merged commit e9d7581 into ebezzi/anndata-backed-fix Sep 27, 2024

nayib-jose-gloria deleted the nayib/feature-is-filtered-mem-fix branch September 27, 2024 19:20