census cell dup check #569
Conversation
FYI: @jahilton @brianraymor
Looks solid and very interesting! And a masterclass in Python generator usage.
It will be fun to make the reporting more substantial in the future. A few ideas:
- add dataset titles
- show total cell counts for dataset-oriented tables
- a report for the most commonly repeated cells (desc by dup counts)
It would also be interesting to see how many dup cells are false positives due to having low nnz counts, and thus have information-poor "fingerprints" and just match by chance. I suppose we could report nnz counts for dup cells, if that seems fruitful.
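The reporting ideas above (dup counts per cell, plus nnz counts to flag likely chance collisions) could be sketched roughly as below. The frame and column names (`hash`, `dataset_id`, `nnz`) are illustrative assumptions, not taken from the actual notebook:

```python
import pandas as pd

# Hypothetical input: one row per duplicated cell occurrence.
# Column names are assumptions for illustration only.
obs_dup = pd.DataFrame(
    {
        "hash": ["a", "a", "a", "b", "b", "c"],
        "dataset_id": ["d1", "d2", "d3", "d1", "d2", "d3"],
        "nnz": [120, 118, 121, 3, 4, 500],
    }
)

# Most commonly repeated cells, descending by duplicate count.
dup_counts = (
    obs_dup.groupby("hash")
    .size()
    .sort_values(ascending=False)
    .rename("dup_count")
)

# Minimum nnz per duplicated cell, to spot low-nnz "information-poor"
# fingerprints that may match by chance.
nnz_by_hash = obs_dup.groupby("hash")["nnz"].min().rename("min_nnz")

report = pd.concat([dup_counts, nnz_by_hash], axis=1)
```

Sorting by `dup_count` and eyeballing the low `min_nnz` rows would surface the suspected false positives first.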
```python
coo_reindexer = (t for t in _EagerIterator(coo_reindexer, query._threadpool))

# lazy convert Arrow table to Scipy sparse.csr_matrix
csr_reader: Generator[_RT, None, None] = (
```
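The pattern in the diff wraps an upstream iterator so items are prefetched eagerly on a worker thread while the downstream stages remain lazy generators. A minimal standalone sketch of the same idea, using only the stdlib (the real `_EagerIterator` in the census builder may differ in detail):

```python
import queue
import threading
from typing import Iterator, TypeVar

_T = TypeVar("_T")
_SENTINEL = object()


class EagerIterator(Iterator[_T]):
    """Wrap an iterator and eagerly pull its items on a background thread."""

    def __init__(self, it: Iterator[_T], maxsize: int = 2) -> None:
        # Bounded queue so the producer can run ahead, but not unboundedly.
        self._queue: queue.Queue = queue.Queue(maxsize=maxsize)
        self._thread = threading.Thread(target=self._fill, args=(it,), daemon=True)
        self._thread.start()

    def _fill(self, it: Iterator[_T]) -> None:
        for item in it:
            self._queue.put(item)
        self._queue.put(_SENTINEL)  # signal exhaustion to the consumer

    def __next__(self) -> _T:
        item = self._queue.get()
        if item is _SENTINEL:
            raise StopIteration
        return item


# Downstream stages stay lazy generators; only the upstream prefetch is eager.
result = list(x * 2 for x in EagerIterator(iter(range(5))))
```

The payoff is overlap: the producer (e.g. a SOMA read) fills the queue while the consumer (e.g. the CSR conversion) is still chewing on the previous chunk.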
the reader thanks you for the explicit type hint!
Understanding which datasets are colliding will dramatically speed up the investigation into the cause (most are either all within a Collection or have matching cell counts, but the remainder need the additional info).
One thought: groupby the obs_duplicate_primary df on hash & collapse the dataset_id into an ordered list. Then grab a value_counts of each dataset_id list.
Also, include the collection_id if readily available.
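The groupby-and-collapse suggestion above could look roughly like this. The `obs_duplicate_primary` frame is stood in for by a toy frame, and its column names (`hash`, `dataset_id`) are assumed from the comment:

```python
import pandas as pd

# Hypothetical stand-in for the obs_duplicate_primary frame from the comment.
obs_duplicate_primary = pd.DataFrame(
    {
        "hash": ["h1", "h1", "h2", "h2", "h3", "h3"],
        "dataset_id": ["dA", "dB", "dA", "dB", "dC", "dD"],
    }
)

# Collapse the dataset_ids involved in each collision into an ordered tuple
# (tuples are hashable, so value_counts can count them)...
dataset_groups = obs_duplicate_primary.groupby("hash")["dataset_id"].apply(
    lambda s: tuple(sorted(s))
)

# ...then count how often each dataset combination collides.
collision_counts = dataset_groups.value_counts()
```

Here `(dA, dB)` collides twice and `(dC, dD)` once, which is exactly the "which dataset pairs keep colliding" view the comment asks for; joining in `collection_id` would be one more merge against the datasets table.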
@jahilton - p.s. thanks to @pablo-gar for his foresight to include the datasets table!
LGTM. One typo correction left in comments.
Codecov Report
```
@@           Coverage Diff           @@
##             main     #569   +/-  ##
=======================================
  Coverage   87.79%   87.79%
=======================================
  Files          59       59
  Lines        3639     3639
=======================================
  Hits         3195     3195
  Misses        444      444
```
Initial commit of a notebook/tool which searches for potentially mislabeled copies of duplicate cells, i.e., those with potentially incorrect `is_primary_data` metadata. The PR also includes a change to the pre-commit configuration to lint items in the tools directory beyond the builder.
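The search hinges on giving each cell a content "fingerprint" (as the review comments call it) and looking for hash collisions. A hypothetical minimal version of that idea, hashing a cell's nonzero (index, value) pairs; the actual notebook's scheme may differ:

```python
import hashlib

import numpy as np


def cell_fingerprint(expr: np.ndarray) -> str:
    """Hash a cell's nonzero (index, value) pairs into a hex digest.

    Illustrative scheme only; the notebook's real fingerprint may differ.
    Note that cells with few nonzeros carry less information and are
    more likely to collide by chance, per the nnz discussion above.
    """
    nz = np.flatnonzero(expr)
    h = hashlib.sha256()
    h.update(nz.astype(np.int64).tobytes())
    h.update(expr[nz].astype(np.float32).tobytes())
    return h.hexdigest()


# Two identical expression rows collide; a distinct row does not.
X = np.array([[0, 1, 2], [0, 1, 2], [3, 0, 0]], dtype=np.float32)
fps = [cell_fingerprint(row) for row in X]
```

Cells sharing a fingerprint across datasets are then duplicate candidates, and exactly one copy should carry `is_primary_data == True`.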