Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

census cell dup check #569

Merged
merged 10 commits into from
Jun 23, 2023
Merged

census cell dup check #569

merged 10 commits into from
Jun 23, 2023

Conversation

bkmartinjr
Copy link
Contributor

@bkmartinjr bkmartinjr commented Jun 22, 2023

Initial commit of a notebook/tool which searches for potentially mislabeled copies of duplicate cells, i.e., those with potentially incorrect is_primary_data metadata.

PR also includes a change to the pre-commit configuration to lint items in the tools directory beyond the builder.

@bkmartinjr
Copy link
Contributor Author

bkmartinjr commented Jun 22, 2023

FYI: @jahilton @brianraymor

@bkmartinjr bkmartinjr marked this pull request as ready for review June 22, 2023 20:10
Copy link
Collaborator

@atolopko-czi atolopko-czi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks solid and very interesting! And a masterclass in Python generator usage.

It will be fun to make the reporting more substantial in the future. A few ideas:

  • add dataset titles
  • show total cell counts for dataset-oriented tables
  • a report for the most commonly repeated cells (desc by dup counts)

It would also be interesting to see how may dup cells are false positives due to having low nnz counts, and thus have information-poor "fingerprints" and just match by chance. I suppose we could report nnz counts for dup cells, if that seems fruitful.

tools/cell_dup_check/_csr_iter.py Outdated Show resolved Hide resolved
tools/cell_dup_check/_csr_iter.py Outdated Show resolved Hide resolved
coo_reindexer = (t for t in _EagerIterator(coo_reindexer, query._threadpool))

# lazy convert Arrow table to Scipy sparse.csr_matrix
csr_reader: Generator[_RT, None, None] = (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the reader thanks you for the explicit type hint!

tools/cell_dup_check/finddups.ipynb Outdated Show resolved Hide resolved
tools/cell_dup_check/finddups.ipynb Outdated Show resolved Hide resolved
Copy link
Collaborator

@jahilton jahilton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understanding which datasets are colliding will dramatically speed up the investigation into the cause (most are either all within a Collection or have matching cells counts, but the remainder need the additional info).
1 thought - groupby the obs_duplicate_primary df on hash & collapse the dataset_id into an ordered list. Then grab a value_counts of each dataset_id list.

Also, include the collection_id if readily available

@bkmartinjr
Copy link
Contributor Author

bkmartinjr commented Jun 22, 2023

include the collection_id if readily available

@jahilton - the Census has no knowledge of collections and I've been repeatedly waved off from including it. WRONG - I can include, and will.

ps. thanks to @pablo-gar for his foresight to include the datasets table!

Copy link
Member

@ebezzi ebezzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. One typo correction left in comments.

@codecov
Copy link

codecov bot commented Jun 23, 2023

Codecov Report

Merging #569 (5fa29b5) into main (1eb4b78) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main     #569   +/-   ##
=======================================
  Coverage   87.79%   87.79%           
=======================================
  Files          59       59           
  Lines        3639     3639           
=======================================
  Hits         3195     3195           
  Misses        444      444           
Flag Coverage Δ
unittests 87.79% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@bkmartinjr bkmartinjr merged commit d61e24f into main Jun 23, 2023
@bkmartinjr bkmartinjr deleted the bkmartinjr/primary-qc-nb branch June 23, 2023 15:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants