Expand doc-site section "Census data releases" and add data erratum #600
confirmed. It was calculated with |
To enable data stability and scientific reproducibility, [CZ CELLxGENE Discover](https://cellxgene.cziscience.com/) plans to perform regular LTS Census data releases:

* Published online every six months for public access, starting on May 15, 2023.
* Available for public access for at least 5 years upon publication.

The latest LTS Census data release is the default opened by the APIs and recognized as the `census_version = "stable"`.
You may want to caveat/footnote that "stable" will change every time a new LTS is released, and the perma-name for it is 2023-05-15
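To make the caveat concrete, here is a minimal sketch of how a moving alias differs from a dated perma-name. It does not call the Census API; `RELEASES` and `resolve_version` are hypothetical names, and the release table (including the `2023-11-15` entry used below) is made up for illustration.

```python
# Hypothetical release directory. The first two dates appear in this thread;
# the structure and the "lts" flag are assumptions made for this sketch.
RELEASES = {
    "2023-05-15": {"lts": True},
    "2023-06-28": {"lts": False},
}


def resolve_version(census_version, releases=RELEASES):
    """Map an alias ("stable" or "latest") or a dated tag to a dated tag."""
    if census_version == "stable":
        # ISO dates sort chronologically as plain strings.
        return max(tag for tag, info in releases.items() if info["lts"])
    if census_version == "latest":
        return max(releases)
    if census_version in releases:
        return census_version
    raise KeyError(f"unknown census_version: {census_version!r}")


# Simulate the directory after a later (hypothetical) LTS is published.
later = {**RELEASES, "2023-11-15": {"lts": True}}

print(resolve_version("stable"))             # alias before the new LTS
print(resolve_version("stable", later))      # same alias, different release
print(resolve_version("2023-05-15", later))  # perma-name is unchanged
```

Under these assumptions, code that needs reproducibility would pass the dated tag rather than the alias.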
### LTS 2023-05-15

Open this data release by specifying `census_version = "2023-06-28"` in future calls to `open_soma()`.
date mismatch - it does not specify the LTS: s/2023-06-28/2023-05-15/
#### 🔴 Erratum 🔴

There are 243,569 number of cells labelled as `is_primary_data = True` for which a fraction of them the label is incorrect.
hard to understand wording...
Perhaps: There are 243,569 cells labelled as `is_primary_data = True`, which have a duplicate cell labelled the same.
If we're being specific about the 243,569 observations labeled, then I would be specific about how many of those are incorrect, instead of saying "a fraction".
And just to be clear, we still have 2 cases that are being reviewed so these numbers may change. And as such, I didn't review the specific counts, etc.
@@ -1,23 +1,110 @@
# Census data releases

**Last edited**: April, 2023.
**Last edited**: July, 2023.
recommend full date in case it is edited more than once in a given month
Open this data release by specifying `census_version = "2023-06-28"` in future calls to `open_soma()`.

#### 🔴 Erratum 🔴
Suggest using plural form, i.e., Errata
seems unlikely we will only have one in due time :-)
Codecov Report
@@ Coverage Diff @@
## main #600 +/- ##
==========================================
- Coverage 88.10% 88.07% -0.03%
==========================================
Files 62 62
Lines 3740 3741 +1
==========================================
Hits 3295 3295
- Misses 445 446 +1
There are 243,569 number of cells labelled as `is_primary_data = True` for which a fraction of them the label is incorrect.

Such label indicates that a cell is the primary representation of an observation, otherwise a cell is deemed to be a duplicate representation. Based on their count vectors, these 243,569 number of cells are represented at least twice with `is_primary_data = True`.
"Such label indicates that a cell is the primary representation of an observation" - I think this will introduce confusion (what does "primary representation" mean?).
"Based on their count vectors" - I don't think it matters how we found them.
Suggest something like...
In order to prevent duplicate data in analyses, each observation should be marked `is_primary_data = True` exactly once in the Census. Since this LTS release, 243,569 observations have been identified that are represented at least twice with `is_primary_data = True`.
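For readers who want to see what "exactly once" means in practice, a minimal sketch of the check follows. It assumes, as the erratum text does, that duplicates can be recognised by identical count vectors; the record layout and the `find_duplicated_primary` helper are hypothetical and use plain Python lists rather than the Census data structures.

```python
from collections import defaultdict


def find_duplicated_primary(cells):
    """Return soma_joinids of primary cells that share a count vector."""
    groups = defaultdict(list)
    for cell in cells:
        # Only cells flagged as primary can violate the "exactly once" rule.
        if cell["is_primary_data"]:
            groups[tuple(cell["counts"])].append(cell["soma_joinid"])
    # A count vector seen more than once among primary cells is a violation.
    return sorted(jid for ids in groups.values() if len(ids) > 1 for jid in ids)


# Toy records (made up): cells 0 and 1 are the same observation, both primary.
cells = [
    {"soma_joinid": 0, "is_primary_data": True, "counts": [3, 0, 1]},
    {"soma_joinid": 1, "is_primary_data": True, "counts": [3, 0, 1]},
    {"soma_joinid": 2, "is_primary_data": False, "counts": [3, 0, 1]},
    {"soma_joinid": 3, "is_primary_data": True, "counts": [0, 2, 5]},
]

print(find_duplicated_primary(cells))
```

The sketch only flags which cells are involved; deciding which copy should keep `is_primary_data = True` still needs the dataset-level review mentioned above.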
* Available for public access for 1 week or until the next latest release is performed, whichever is the longest.

The weekly release can be opened by the APIs by specifying `census_version = "latest"`.
...The most recent weekly release...
(other weekly releases must be opened by name)
* Available for public access for 1 week or until the next latest release is performed, whichever is the longest.
they are available for a month by explicit name
**Contents**

1. [What is a Census data release?](#What-is-a-Census-data-release)
2. [List of LTS Census data releases](#List-of-LTS-Census-data-releases)

## What is a Census data release?

It is a Census build that is publicly hosted online. A Census build is
Might be worth stating: Any given Census build is named with a unique tag, normally the date of build, e.g., 2020-01-30
LGTM. Left a couple of optional nit comments - fix as you see fit.
In "Census data releases" section:
is_primary_data=True
single-cell-curation#528
It also adds a file with the soma_joinids of the incorrectly labelled cells.
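As a rough illustration of how such a file could be used downstream, the sketch below drops the listed cells before an analysis. The file format is an assumption (one integer `soma_joinid` per line), and `parse_joinids` and `keep_trusted_primary` are hypothetical helpers, not part of the Census API.

```python
def parse_joinids(text):
    """Parse one integer soma_joinid per line, skipping blank lines (assumed format)."""
    return {int(line) for line in text.splitlines() if line.strip()}


def keep_trusted_primary(obs_rows, flagged_joinids):
    """Keep primary cells whose soma_joinid is not listed in the erratum file."""
    return [
        row
        for row in obs_rows
        if row["is_primary_data"] and row["soma_joinid"] not in flagged_joinids
    ]


# Stand-in for the file contents; these joinids are made up.
erratum_text = "11\n13\n\n"
flagged = parse_joinids(erratum_text)

obs_rows = [
    {"soma_joinid": 10, "is_primary_data": True},
    {"soma_joinid": 11, "is_primary_data": True},   # listed as incorrect
    {"soma_joinid": 12, "is_primary_data": False},  # already a duplicate
    {"soma_joinid": 13, "is_primary_data": True},   # listed as incorrect
    {"soma_joinid": 14, "is_primary_data": True},
]

kept = keep_trusted_primary(obs_rows, flagged)
print([row["soma_joinid"] for row in kept])
```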
Notes for reviewers