Expand doc-site section "Census data releases" and add data erratum #600

Merged
merged 7 commits into from
Jul 7, 2023


pablo-gar
Contributor

In "Census data releases" section:

It also adds a file with the soma_joinids of the incorrectly labelled cells.
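As a rough illustration of how such a file might be used downstream, the sketch below drops flagged cells from a list of observation records. The record layout, the sample IDs, and the helper name `drop_flagged` are hypothetical; the actual file name and format are not shown in this description.

```python
# Hypothetical sketch: exclude cells whose soma_joinid appears in an errata list.
# The IDs and records below are made up for illustration.

def drop_flagged(obs_rows, flagged_joinids):
    """Return the rows whose soma_joinid is not in flagged_joinids."""
    flagged = set(flagged_joinids)  # a set gives constant-time membership checks
    return [row for row in obs_rows if row["soma_joinid"] not in flagged]


obs_rows = [
    {"soma_joinid": 0, "is_primary_data": True},
    {"soma_joinid": 1, "is_primary_data": True},
    {"soma_joinid": 2, "is_primary_data": False},
    {"soma_joinid": 3, "is_primary_data": True},
]
flagged_joinids = [1, 3]  # stand-in for the contents of the errata file

clean_rows = drop_flagged(obs_rows, flagged_joinids)
print([row["soma_joinid"] for row in clean_rows])
```

With real Census data the same filter would more likely be applied to the `obs` dataframe, but that step depends on how the file is distributed.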

Notes for reviewers

  • @bkmartinjr can you confirm that the file you sent over email is for the data release 2023-05-15?
  • @brianraymor and @jahilton can you make an editorial review of the erratum?

@pablo-gar pablo-gar marked this pull request as ready for review July 4, 2023 21:54
@bkmartinjr
Contributor

> can you confirm that the file you sent over email is for the data release 2023-05-15

Confirmed. It was calculated with `open_soma(census_version='latest')`, which resolved to 2023-05-15.


To enable data stability and scientific reproducibility, [CZ CELLxGENE Discover](https://cellxgene.cziscience.com/) plans to perform regular LTS Census data releases:

* Published online every six months for public access, starting on May 15, 2023.
* Available for public access for at least 5 years upon publication.

The latest LTS Census data release is the default opened by the APIs and recognized as the `census_version = "stable"`.
Contributor

You may want to caveat/footnote that "stable" will change every time a new LTS is released, and the perma-name for it is 2023-05-15
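A minimal sketch of the concern, assuming aliases are resolved through a simple lookup table. The function `resolve_version`, both alias tables, and the later date are invented for illustration; this is not the `cellxgene_census` implementation.

```python
# Hypothetical sketch of alias resolution; not the cellxgene_census implementation.

def resolve_version(census_version, aliases):
    """Map an alias such as "stable" to a dated build tag; dated tags pass through."""
    return aliases.get(census_version, census_version)


# Alias table as it might look after the first LTS release.
aliases_today = {"stable": "2023-05-15"}
# Alias table after a later LTS release (this date is invented).
aliases_later = {"stable": "2023-11-15"}

for aliases in (aliases_today, aliases_later):
    # The alias moves with each LTS release, the dated tag does not.
    print(resolve_version("stable", aliases), resolve_version("2023-05-15", aliases))
```

Pinning the dated tag is what keeps an analysis reproducible once the alias moves on.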

### LTS 2023-05-15

Open this data release by specifying `census_version = "2023-06-28"` in future calls to `open_soma()`.
Contributor

date mismatch - it does not specify the LTS: s/2023-06-28/2023-05-15/

#### 🔴 Erratum 🔴

There are 243,569 number of cells labelled as `is_primary_data = True` for which a fraction of them the label is incorrect.
Contributor

hard to understand wording...

Perhaps: There are 243,569 cells labelled as is_primary_data = True, which have a duplicate cell labelled the same.

Collaborator

If we're being specific about the 243,569 observations labeled, then I would be specific about how many of those are incorrect, instead of saying "a fraction".
And just to be clear, we still have 2 cases that are being reviewed so these numbers may change. And as such, I didn't review the specific counts, etc.

@@ -1,23 +1,110 @@
# Census data releases

-**Last edited**: April, 2023.
+**Last edited**: July, 2023.
Contributor

recommend full date in case it is edited more than once in a given month

Open this data release by specifying `census_version = "2023-06-28"` in future calls to `open_soma()`.

#### 🔴 Erratum 🔴
Contributor

Suggest using plural form, i.e., Errata

seems unlikely we will only have one in due time :-)

@codecov

codecov bot commented Jul 4, 2023

Codecov Report

Merging #600 (47dc864) into main (f06bceb) will decrease coverage by 0.03%.
The diff coverage is n/a.

❗ Current head 47dc864 differs from pull request most recent head 324a21a. Consider uploading reports for the commit 324a21a to get more accurate results

@@            Coverage Diff             @@
##             main     #600      +/-   ##
==========================================
- Coverage   88.10%   88.07%   -0.03%     
==========================================
  Files          62       62              
  Lines        3740     3741       +1     
==========================================
  Hits         3295     3295              
- Misses        445      446       +1     
| Flag | Coverage Δ |
|---|---|
| unittests | 88.07% <ø> (-0.03%) ⬇️ |

see 2 files with indirect coverage changes


There are 243,569 number of cells labelled as `is_primary_data = True` for which a fraction of them the label is incorrect.

Such label indicates that a cell is the primary representation of an observation, otherwise a cell is deemed to be a duplicate representation. Based on their count vectors, these 243,569 number of cells are represented at least twice with `is_primary_data = True`.
Collaborator

Such label indicates that a cell is the primary representation of an observation - I think this will introduce confusion (What does "primary representation" mean?).
Based on their count vectors - I don't think it matters how we found them

Suggest something like...
To prevent duplicate data in analyses, each observation should be marked `is_primary_data = True` exactly once in the Census. Since this LTS release, 243,569 observations have been identified that are represented at least twice with `is_primary_data = True`.
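For readers who want to see what "represented at least twice" means in practice, here is a minimal sketch, assuming duplicates are found by grouping primary cells on identical count vectors. The PR only says the cells were identified "based on their count vectors", so the grouping strategy, the record layout, and the name `find_duplicate_primaries` are assumptions, and the sample data is made up.

```python
from collections import defaultdict


def find_duplicate_primaries(cells):
    """Group primary cells by identical count vector; return groups with more than one cell."""
    groups = defaultdict(list)
    for cell in cells:
        if cell["is_primary_data"]:
            # Tuples are hashable, so the count vector can serve as a dict key.
            groups[tuple(cell["counts"])].append(cell["soma_joinid"])
    return [ids for ids in groups.values() if len(ids) > 1]


# Made-up records: cells 10 and 11 share a count vector and are both marked primary.
cells = [
    {"soma_joinid": 10, "is_primary_data": True, "counts": [0, 3, 1]},
    {"soma_joinid": 11, "is_primary_data": True, "counts": [0, 3, 1]},
    {"soma_joinid": 12, "is_primary_data": False, "counts": [0, 3, 1]},  # not primary, ignored
    {"soma_joinid": 13, "is_primary_data": True, "counts": [2, 0, 0]},
]

duplicates = find_duplicate_primaries(cells)
print(duplicates)
```

At Census scale, the count vectors would need to be hashed or compared in a sparse representation rather than held as Python lists.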

* Available for public access for 1 week or until the next latest release is performed, whichever is the longest.

The weekly release can be opened by the APIs by specifying `census_version = "latest"`.
Contributor

...The most recent weekly release...

(other weekly releases must be opened by name)

* Available for public access for 1 week or until the next latest release is performed, whichever is the longest.
Contributor

they are available for a month by explicit name

**Contents**

1. [What is a Census data release?](#What-is-a-Census-data-release)
2. [List of LTS Census data releases](#List-of-LTS-Census-data-releases)

## What is a Census data release?

It is a Census build that is publicly hosted online. A Census build is
Contributor

Might be worth stating: Any given Census build is named with a unique tag, normally the date of build, e.g., 2020-01-30

Contributor

@bkmartinjr bkmartinjr left a comment

LGTM. Left a couple of optional nit comments - fix as you see fit.

@pablo-gar pablo-gar merged commit 5ac97a0 into main Jul 7, 2023
@pablo-gar pablo-gar deleted the pablo-gar/update-data-release-page-doc-site branch July 7, 2023 18:18
3 participants