census cell dup check #569
Conversation
FYI: @jahilton @brianraymor
Looks solid and very interesting! And a masterclass in Python generator usage.
It will be fun to make the reporting more substantial in the future. A few ideas:
- add dataset titles
- show total cell counts for dataset-oriented tables
- a report for the most commonly repeated cells (desc by dup counts)
It would also be interesting to see how many dup cells are false positives due to having low nnz counts, and thus have information-poor "fingerprints" and just match by chance. I suppose we could report nnz counts for dup cells, if that seems fruitful.
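The reporting ideas above (dup counts per cell, plus nnz counts to flag likely chance collisions) could be sketched roughly as below. The frame and column names (`hash`, `dataset_id`, `nnz`) are illustrative assumptions, not taken from the actual notebook:

```python
import pandas as pd

# Hypothetical input: one row per duplicated cell occurrence.
# Column names are assumptions for illustration only.
obs_dup = pd.DataFrame(
    {
        "hash": ["a", "a", "a", "b", "b", "c"],
        "dataset_id": ["d1", "d2", "d3", "d1", "d2", "d3"],
        "nnz": [120, 118, 121, 3, 4, 500],
    }
)

# Most commonly repeated cells, descending by duplicate count.
dup_counts = (
    obs_dup.groupby("hash")
    .size()
    .sort_values(ascending=False)
    .rename("dup_count")
)

# Minimum nnz per duplicated cell, to spot low-nnz "information-poor"
# fingerprints that may match by chance.
nnz_by_hash = obs_dup.groupby("hash")["nnz"].min().rename("min_nnz")

report = pd.concat([dup_counts, nnz_by_hash], axis=1)
```

Sorting by `dup_count` and eyeballing the low `min_nnz` rows would surface the suspected false positives first.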
```python
coo_reindexer = (t for t in _EagerIterator(coo_reindexer, query._threadpool))

# lazy convert Arrow table to Scipy sparse.csr_matrix
csr_reader: Generator[_RT, None, None] = (
```
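The pattern in the diff wraps an upstream iterator so items are prefetched eagerly on a worker thread while the downstream stages remain lazy generators. A minimal standalone sketch of the same idea, using only the stdlib (the real `_EagerIterator` in the census builder may differ in detail):

```python
import queue
import threading
from typing import Iterator, TypeVar

_T = TypeVar("_T")
_SENTINEL = object()


class EagerIterator(Iterator[_T]):
    """Wrap an iterator and eagerly pull its items on a background thread."""

    def __init__(self, it: Iterator[_T], maxsize: int = 2) -> None:
        # Bounded queue so the producer can run ahead, but not unboundedly.
        self._queue: queue.Queue = queue.Queue(maxsize=maxsize)
        self._thread = threading.Thread(target=self._fill, args=(it,), daemon=True)
        self._thread.start()

    def _fill(self, it: Iterator[_T]) -> None:
        for item in it:
            self._queue.put(item)
        self._queue.put(_SENTINEL)  # signal exhaustion to the consumer

    def __next__(self) -> _T:
        item = self._queue.get()
        if item is _SENTINEL:
            raise StopIteration
        return item


# Downstream stages stay lazy generators; only the upstream prefetch is eager.
result = list(x * 2 for x in EagerIterator(iter(range(5))))
```

The payoff is overlap: the producer (e.g. a SOMA read) fills the queue while the consumer (e.g. the CSR conversion) is still chewing on the previous chunk.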
the reader thanks you for the explicit type hint!
Understanding which datasets are colliding will dramatically speed up the investigation into the cause (most are either all within a Collection or have matching cell counts, but the remainder need the additional info).
One thought: groupby the obs_duplicate_primary df on hash & collapse the dataset_id into an ordered list. Then grab a value_counts of each dataset_id list.
Also, include the collection_id if readily available.
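The groupby-and-collapse suggestion above could look roughly like this. The `obs_duplicate_primary` frame is stood in for by a toy frame, and its column names (`hash`, `dataset_id`) are assumed from the comment:

```python
import pandas as pd

# Hypothetical stand-in for the obs_duplicate_primary frame from the comment.
obs_duplicate_primary = pd.DataFrame(
    {
        "hash": ["h1", "h1", "h2", "h2", "h3", "h3"],
        "dataset_id": ["dA", "dB", "dA", "dB", "dC", "dD"],
    }
)

# Collapse the dataset_ids involved in each collision into an ordered tuple
# (tuples are hashable, so value_counts can count them)...
dataset_groups = obs_duplicate_primary.groupby("hash")["dataset_id"].apply(
    lambda s: tuple(sorted(s))
)

# ...then count how often each dataset combination collides.
collision_counts = dataset_groups.value_counts()
```

Here `(dA, dB)` collides twice and `(dC, dD)` once, which is exactly the "which dataset pairs keep colliding" view the comment asks for; joining in `collection_id` would be one more merge against the datasets table.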
@jahilton - p.s. thanks to @pablo-gar for his foresight to include the datasets table!
LGTM. One typo correction left in comments.
Codecov Report
```
@@           Coverage Diff           @@
##             main     #569   +/-  ##
=======================================
  Coverage   87.79%   87.79%
=======================================
  Files          59       59
  Lines        3639     3639
=======================================
  Hits         3195     3195
  Misses        444      444
```
Initial commit of a notebook/tool which searches for potentially mislabeled copies of duplicate cells, i.e., those with potentially incorrect `is_primary_data` metadata. The PR also includes a change to the pre-commit configuration to lint items in the tools directory beyond the builder.
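The search hinges on giving each cell a content "fingerprint" (as the review comments call it) and looking for hash collisions. A hypothetical minimal version of that idea, hashing a cell's nonzero (index, value) pairs; the actual notebook's scheme may differ:

```python
import hashlib

import numpy as np


def cell_fingerprint(expr: np.ndarray) -> str:
    """Hash a cell's nonzero (index, value) pairs into a hex digest.

    Illustrative scheme only; the notebook's real fingerprint may differ.
    Note that cells with few nonzeros carry less information and are
    more likely to collide by chance, per the nnz discussion above.
    """
    nz = np.flatnonzero(expr)
    h = hashlib.sha256()
    h.update(nz.astype(np.int64).tobytes())
    h.update(expr[nz].astype(np.float32).tobytes())
    return h.hexdigest()


# Two identical expression rows collide; a distinct row does not.
X = np.array([[0, 1, 2], [0, 1, 2], [3, 0, 0]], dtype=np.float32)
fps = [cell_fingerprint(row) for row in X]
```

Cells sharing a fingerprint across datasets are then duplicate candidates, and exactly one copy should carry `is_primary_data == True`.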