adds 'duplicate cells' notebooks #448

Merged
merged 10 commits into main from pablo-gar/add-duplicate-cell-notebooks
May 2, 2023

Conversation

pablo-gar
Contributor

Adds a notebook to increase visibility of the cell metadata variable is_primary_data.

@codecov

codecov bot commented May 2, 2023

Codecov Report

Merging #448 (a1d281b) into main (113b596) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main     #448   +/-   ##
=======================================
  Coverage   88.51%   88.51%           
=======================================
  Files          50       50           
  Lines        2770     2770           
=======================================
  Hits         2452     2452           
  Misses        318      318           
Flag Coverage Δ
unittests 88.51% <ø> (ø)

@pablo-gar pablo-gar marked this pull request as ready for review May 2, 2023 04:39
Collaborator

@atolopko-czi atolopko-czi left a comment

Minor suggestions

"\n",
"* There are superset datasets containing data from multiple datasets.\n",
"> *For example [Tabula Sapiens](https://cellxgene.cziscience.com/collections/e5f58829-1a66-40b5-a624-9046778e74f5) has one dataset with all of its cells and separate datasets with cells divided by high-level lineage (i.e. immune, epithelial, stromal, endothelial)*\n",
"* There are datasets with meta-analysis of pre-existing datasets.\n",
Collaborator

Suggested change
"* There are datasets with meta-analysis of pre-existing datasets.\n",
"* A dataset may provide a meta-analysis of pre-existing datasets.\n",

}
],
"source": [
"adata"
Collaborator

len(adata.obs)

Contributor Author

changed

"outputs": [],
"source": [
"with cellxgene_census.open_soma() as census:\n",
" nk_cells_unique = census[\"census_data\"][\"homo_sapiens\"].obs.read(\n",
Collaborator

why not nk_cells_primary instead of introducing a new term (unique)? (for adata below, too)

Contributor Author

good idea
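
For context, the kind of query this thread is about can be sketched as below, using the agreed `nk_cells_primary` name. This is a minimal sketch, not the notebook's exact code: the helper names `primary_filter` and `read_primary_obs`, the cell type label, and the column list are assumptions for illustration, and `read_primary_obs` needs the `cellxgene_census` package and network access.

```python
def primary_filter(cell_type: str) -> str:
    """Build an obs value_filter that keeps only primary (non-duplicate) cells."""
    return f"cell_type == '{cell_type}' and is_primary_data == True"


def read_primary_obs(cell_type: str):
    """Read obs metadata for primary cells of one cell type (needs network access)."""
    import cellxgene_census  # imported lazily so primary_filter works offline

    with cellxgene_census.open_soma() as census:
        obs = census["census_data"]["homo_sapiens"].obs.read(
            value_filter=primary_filter(cell_type),
            # Assumed column selection, chosen only to keep the result small
            column_names=["cell_type", "dataset_id", "is_primary_data"],
        )
        # read() yields Arrow tables lazily; concat() joins them before conversion
        return obs.concat().to_pandas()
```

Calling `nk_cells_primary = read_primary_obs("natural killer cell")` and then `len(nk_cells_primary)` would give the number of primary cells, in line with the `len(adata.obs)` suggestion made earlier in the review.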

" )\n",
"\n",
" # get iterator for X\n",
" iterator = query.X(\"raw\").tables()\n",
Collaborator

Alternatively, you could retrieve the obs data, concat it, and show the count, as done for the other examples. This would de-emphasize the "out-of-core" purpose. Or retrieve the obs data and show that the is_primary_data values for a single chunk are all True.

Contributor Author

I would still like to emphasize the out-of-core aspect here, the reason being that "repetition is our ally" for communicating to the user how to work out-of-core and exclude duplicate cells
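
The out-of-core pattern being defended here can be sketched roughly as follows. This is an illustration under assumptions, not the notebook's code: the helper names `primary_tissue_filter` and `count_raw_values_out_of_core` and the choice to count stored values per chunk are hypothetical, and running the query needs `cellxgene_census`, `tiledbsoma`, and network access.

```python
def primary_tissue_filter(tissue: str) -> str:
    """obs filter for one tissue that also drops duplicate (non-primary) cells."""
    return f"tissue_general == '{tissue}' and is_primary_data == True"


def count_raw_values_out_of_core(tissue: str) -> int:
    """Stream X["raw"] in chunks and count stored values for primary cells only."""
    import cellxgene_census  # lazy imports: only needed when querying the Census
    import tiledbsoma

    stored_values = 0
    with cellxgene_census.open_soma() as census:
        organism = census["census_data"]["homo_sapiens"]
        query = organism.axis_query(
            measurement_name="RNA",
            obs_query=tiledbsoma.AxisQuery(value_filter=primary_tissue_filter(tissue)),
        )
        try:
            # tables() yields one Arrow table per chunk, so memory use stays bounded
            for chunk in query.X("raw").tables():
                stored_values += chunk.num_rows
        finally:
            query.close()
    return stored_values
```

Because the `is_primary_data == True` condition sits in the obs filter, every chunk the iterator yields already excludes duplicate cells, so no per-chunk filtering is needed.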

pablo-gar and others added 7 commits May 2, 2023 09:48
Co-authored-by: Andrew Tolopko (6 commits)
"\n",
"## An example: duplicate cells in the Tabula Muris Senis data\n",
"\n",
"Let's take a look at an example from the Census using the Tabula Muris Senis data. Some datasets contain non-primary cell data.\n",
Contributor

@bkmartinjr bkmartinjr May 2, 2023

this introduces a new term: "non-primary cell data". Is this the same as "duplicated cell data"? If the same, I suggest sticking with the defined term ("duplicate"), or defining primary/non-primary in the intro.

Contributor Author

changed

"\n",
"Let's take a look at an example from the Census using the Tabula Muris Senis data. Some datasets contain non-primary cell data.\n",
"\n",
"We can obtain cell metadata for the **main** Tabula Muris Senis dataset: \"All - A single-cell transcriptomic atlas characterizes ageing tissues in the mouse - 10x\", which contains only primary cell data\n",
Contributor

same issue here - the term "primary" is used. If these are helpful, maybe we just need to add them to the definition above (i.e., what is a "primary" or "non-primary" cell?)

Contributor Author

changed

"1. This dataset only contains cells from liver.\n",
"2. All cells are labelled as `False` for `is_primary_data`. **This is because the cells are marked as duplicate cells of the main Tabula Muris Senis dataset.**\n",
"\n",
"## Filtering out duplicates cells\n",
Contributor

typo (extra 's'): suggest: "duplicate cells"

Contributor Author

changed
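
To make the effect of duplicate cells concrete, here is a toy sketch in plain Python. The records, dataset names, and the `count_by_tissue` helper are invented for illustration and are not Census data; they only mimic the situation described in the notebook, where a liver-only dataset repeats cells from a main dataset and marks them `is_primary_data == False`.

```python
from collections import Counter

# Toy obs records: cell "c1" appears in a main dataset and again in a
# tissue-specific dataset, where it is flagged as non-primary.
obs = [
    {"cell": "c1", "dataset": "all_tissues", "tissue": "liver", "is_primary_data": True},
    {"cell": "c2", "dataset": "all_tissues", "tissue": "lung", "is_primary_data": True},
    {"cell": "c1", "dataset": "liver_only", "tissue": "liver", "is_primary_data": False},
]


def count_by_tissue(records, primary_only):
    """Count cells per tissue, optionally keeping only primary records."""
    return Counter(
        r["tissue"] for r in records if r["is_primary_data"] or not primary_only
    )


naive = count_by_tissue(obs, primary_only=False)
deduplicated = count_by_tissue(obs, primary_only=True)
print(naive["liver"], deduplicated["liver"])  # prints "2 1"
```

Without the filter the liver cell is counted twice; keeping only records where `is_primary_data` is true counts it once.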

Contributor

@bkmartinjr bkmartinjr left a comment

Just a couple of minor suggestions

Collaborator

@atolopko-czi atolopko-czi left a comment

:shipit:

@pablo-gar pablo-gar merged commit 8962e77 into main May 2, 2023
@pablo-gar pablo-gar deleted the pablo-gar/add-duplicate-cell-notebooks branch May 2, 2023 20:54
3 participants