Updates Readme (#212)
Readme now represents a splash with enough information to onboard a new user to the Cell Census
---------

Co-authored-by: Ambrose J Carr <[email protected]>
pablo-gar and ambrosejcarr authored Feb 27, 2023
1 parent 05aa12a commit ace0551
Showing 2 changed files with 304 additions and 11 deletions.
226 changes: 215 additions & 11 deletions README.md
@@ -1,25 +1,229 @@
[![codecov](https://codecov.io/gh/chanzuckerberg/cell-census/branch/main/graph/badge.svg?token=byX1pyDlc9)](https://codecov.io/gh/chanzuckerberg/cell-census)

# CELLxGENE Cell Census
# Cell Census of CZ CELLxGENE Discover

**Status**: Unstable, under rapid development
[**CZ CELLxGENE Discover**](https://cellxgene.cziscience.com/) is a free-to-use data portal hosting a growing corpus of more than **700 single-cell datasets** comprising about **50 million cells** from the major human and mouse tissues. The portal provides a set of visual tools to download and explore the data. **All data is [standardized](https://github.com/chanzuckerberg/single-cell-curation/tree/main/schema/3.0.0)** to include raw counts and a common vocabulary for gene and cell metadata.

This repository contains documentation and example code related to the Chan Zuckerberg CELLxGENE Cell Census, and a client (API) package to simplify accessing the Cell Census data.
The **Cell Census** provides efficient computational tooling to access, query, and analyze all single-cell RNA data from CZ CELLxGENE Discover. Using a **new access paradigm of cell-based slicing and querying**, you can interact with the data through [TileDB-SOMA](https://github.com/single-cell-data/TileDB-SOMA), or get slices in [AnnData](https://anndata.readthedocs.io/) or [Seurat](https://satijalab.org/seurat/) objects.

The CZ Cell Census is an aggregation of all public single cell data available in [CELLxGENE Discover](https://cellxgene.cziscience.com/), published in API-accessible formats, including the [SOMA API](https://github.com/single-cell-data/).
Get started on using the Cell Census:

### Reuse
- [Quick start](#Quick-start).
- [Documentation](https://chanzuckerberg.github.io/cell-census/index.html). *Under development.*
- [Python tutorials](https://github.com/chanzuckerberg/cell-census/tree/main/api/python/notebooks).
- R tutorials. *Coming soon.*

The contents of this Github repository are freely available for reuse under the [MIT license](https://opensource.org/licenses/MIT). Data in the CZ Cell Census are available for re-use under the [CC-BY license](https://creativecommons.org/licenses/by/4.0/).
## Technology

### For More Help
The Cell Census is a data object publicly hosted online and a convenience API to open it. The object is built using the [SOMA](https://github.com/single-cell-data/SOMA) API and data model via its implementation [TileDB-SOMA](https://github.com/single-cell-data/TileDB-SOMA). As such, the Cell Census has all the data capabilities offered by TileDB-SOMA including:

Please file an issue on this repo, or contact us at <[email protected]>
- Cloud-based data storage and access.
- Efficient access for larger-than-memory slices of data.
- Data streaming for iterative/parallelizable methods.
- R and Python support.
- Export to AnnData and Seurat.

### Security
## Cell Census data releases

If you believe you have found a security issue, we would appreciate notification. Please send email to <[email protected]>.
Starting in mid-2023, long-term supported (LTS) Cell Census data builds will be released every 6 months and will remain publicly accessible for at least 5 years after release.

### Code of Conduct
Between long-term supported releases, weekly builds will be released without any guarantee of permanence.

## Cell Census data organization

The Cell Census follows a specific [data schema](https://github.com/chanzuckerberg/cell-census/blob/main/docs/cell_census_schema.md). Briefly, the Cell Census is a collection of **[SOMA objects](https://github.com/single-cell-data/SOMA/blob/main/abstract_specification.md#foundational-types)** organized in the following hierarchy.

<img src="./docs/cell_census_data_model.svg">


## Quick start

### Requirements

The Cell Census requires a Linux or macOS system with:

- Python 3.7 to Python 3.10. Or R, supported versions TBD.
- Recommended: >16 GB of memory.
- Recommended: >5 Mbps internet connection.
- Recommended: for increased performance, use the API from an AWS EC2 instance in the `us-west-2` region. The Cell Census data builds are hosted in an AWS S3 bucket in that region.
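
As a quick sanity check before installing, the Python version requirement above can be verified from within the interpreter. This is a minimal sketch; the helper name `is_supported_python` is hypothetical and is not part of the `cell_census` package.

```python
import sys


def is_supported_python(version_info=sys.version_info):
    """Return True if the major.minor version is within 3.7 to 3.10 (inclusive)."""
    # Hypothetical helper for illustration; the bounds come from the requirements list.
    return (3, 7) <= tuple(version_info[:2]) <= (3, 10)


if not is_supported_python():
    print("Warning: this Python version is outside the 3.7 to 3.10 range listed above.")
```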

### Documentation

The Cell Census [doc-site](https://chanzuckerberg.github.io/cell-census/index.html) (*under development*) contains the reference documentation, data description, and tutorials.

Reference documentation can also be accessed directly from Python or R.


### Python quick start

#### Installation

It is recommended to install the Cell Census and all of its dependencies in a new virtual environment via `pip`:

```
pip install -U cell-census
```

#### Usage examples

Demonstration notebooks can be found [here](https://github.com/chanzuckerberg/cell-census/tree/main/api/python/notebooks).

Below are 3 examples of common operations you can do with the Cell Census. As a reminder, the reference documentation for the API can be accessed via `help()`:

```python
import cell_census

help(cell_census)
help(cell_census.get_anndata)
# etc
```

##### Querying a slice of cell metadata

The following reads the cell metadata, filters for `female` cells of cell type `microglial cell` or `neuron`, and selects the columns `assay`, `cell_type`, `tissue`, `tissue_general`, `suspension_type`, and `disease`.

```python
import cell_census

with cell_census.open_soma() as census:

    # Reads SOMADataFrame as a slice
    cell_metadata = census["census_data"]["homo_sapiens"].obs.read(
        value_filter = "sex == 'female' and cell_type in ['microglial cell', 'neuron']",
        column_names = ["assay", "cell_type", "tissue", "tissue_general", "suspension_type", "disease"]
    )

    # Concatenates results to pyarrow.Table
    cell_metadata = cell_metadata.concat()

    # Converts to pandas.DataFrame
    cell_metadata = cell_metadata.to_pandas()

    print(cell_metadata)
```

The output is a `pandas.DataFrame` with about 300K cells that meet the query criteria. It contains the six selected columns plus `sex`, which is used in the filter.

```bash
            assay        cell_type           tissue  tissue_general  suspension_type  disease     sex
0       10x 3' v3  microglial cell              eye             eye             cell   normal  female
1       10x 3' v3  microglial cell              eye             eye             cell   normal  female
2       10x 3' v3  microglial cell              eye             eye             cell   normal  female
3       10x 3' v3  microglial cell              eye             eye             cell   normal  female
4       10x 3' v3  microglial cell              eye             eye             cell   normal  female
...           ...              ...              ...             ...              ...      ...     ...
299617  10x 3' v3           neuron  cerebral cortex           brain          nucleus   normal  female
299618  10x 3' v3           neuron  cerebral cortex           brain          nucleus   normal  female
299619  10x 3' v3           neuron  cerebral cortex           brain          nucleus   normal  female
299620  10x 3' v3           neuron  cerebral cortex           brain          nucleus   normal  female
299621  10x 3' v3           neuron  cerebral cortex           brain          nucleus   normal  female

[299622 rows x 7 columns]
```
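
The `value_filter` string used above can also be assembled programmatically when the conditions come from variables. The following is a minimal sketch: `build_value_filter` is a hypothetical helper, not part of `cell_census` or TileDB-SOMA, and it only covers the two forms shown in this README (`==` and `in [...]`, joined with `and`). It rejects values that contain a single quote (such as the assay name `10x 3' v3`) rather than guessing at an escaping rule.

```python
def _quote(value):
    """Wrap a value in single quotes, refusing values that would break the quoting."""
    text = str(value)
    if "'" in text:
        raise ValueError(f"value contains a single quote and cannot be quoted safely: {text!r}")
    return f"'{text}'"


def build_value_filter(conditions):
    """Build a filter string from a {column: value-or-list} mapping.

    A single value becomes `column == 'value'`; a list or tuple becomes
    `column in ['a', 'b']`. All clauses are joined with `and`.
    """
    clauses = []
    for column, value in conditions.items():
        if isinstance(value, (list, tuple)):
            items = ", ".join(_quote(v) for v in value)
            clauses.append(f"{column} in [{items}]")
        else:
            clauses.append(f"{column} == {_quote(value)}")
    return " and ".join(clauses)


value_filter = build_value_filter(
    {"sex": "female", "cell_type": ["microglial cell", "neuron"]}
)
print(value_filter)
# prints: sex == 'female' and cell_type in ['microglial cell', 'neuron']
```

The resulting string is identical to the one passed to `obs.read()` in the example above.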

##### Obtaining a slice as AnnData

The following creates an `anndata.AnnData` object on demand, using the same cell filtering criteria as above and selecting only the genes `ENSG00000161798` and `ENSG00000188229`.

```python
import cell_census

with cell_census.open_soma() as census:
    adata = cell_census.get_anndata(
        census = census,
        organism = "Homo sapiens",
        var_value_filter = "feature_id in ['ENSG00000161798', 'ENSG00000188229']",
        obs_value_filter = "sex == 'female' and cell_type in ['microglial cell', 'neuron']",
        column_names = {"obs": ["assay", "cell_type", "tissue", "tissue_general", "suspension_type", "disease"]},
    )

    print(adata)

```

The output, with about 300K cells and 2 genes, can now be used for downstream analysis with [scanpy](https://scanpy.readthedocs.io/en/stable/).

``` bash
AnnData object with n_obs × n_vars = 299622 × 2
    obs: 'assay', 'cell_type', 'tissue', 'tissue_general', 'suspension_type', 'disease', 'sex'
    var: 'soma_joinid', 'feature_id', 'feature_name', 'feature_length'
```

##### Memory-efficient queries

This example demonstrates how to access the data for larger-than-memory operations using **TileDB-SOMA**.

First we initiate a lazy-evaluation query to access all male brain cells from human. This query must be closed, either by calling `query.close()` or by using it as a context manager (`with ...`).

```python
import cell_census
import tiledbsoma

with cell_census.open_soma() as census:

    human = census["census_data"]["homo_sapiens"]
    query = human.axis_query(
        measurement_name = "RNA",
        obs_query = tiledbsoma.AxisQuery(
            value_filter = "tissue == 'brain' and sex == 'male'"
        )
    )

    # Continued below
```

Now we can iterate over the count matrix, as well as the cell and gene metadata. For example, to iterate over the count matrix, we can get an iterator and perform operations on each slice it returns.

```python
    # Continued from above

    iterator = query.X("raw").tables()

    # Get an iterative slice as pyarrow.Table
    raw_slice = next(iterator)
    ...
```

You can now perform operations on each slice. As with any Python iterator, this logic can be wrapped in a `for` loop.

Finally, you must close the query.

```python
    # Continued from above
    query.close()
```
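
The chunk-by-chunk pattern above can be sketched without a network connection. In the following illustration, `fake_tables` is a hypothetical stand-in for `query.X("raw").tables()`, and each chunk is a plain dictionary rather than a `pyarrow.Table`. The column names `soma_dim_0` (cell index), `soma_dim_1` (gene index), and `soma_data` (count) are an assumption about the layout of the real tables. Only one chunk is held in memory at a time while per-gene totals accumulate.

```python
from collections import defaultdict


def fake_tables():
    """Hypothetical stand-in for `query.X("raw").tables()`: yields one chunk at a time."""
    yield {"soma_dim_0": [0, 0, 1], "soma_dim_1": [0, 2, 2], "soma_data": [3.0, 1.0, 4.0]}
    yield {"soma_dim_0": [2, 3], "soma_dim_1": [0, 1], "soma_data": [2.0, 5.0]}


def sum_counts_per_gene(chunks):
    """Accumulate total counts per gene index across all chunks."""
    totals = defaultdict(float)
    for chunk in chunks:
        # Each chunk stores the sparse matrix as parallel coordinate and value columns.
        for gene, count in zip(chunk["soma_dim_1"], chunk["soma_data"]):
            totals[gene] += count
    return dict(totals)


gene_totals = sum_counts_per_gene(fake_tables())
print(sorted(gene_totals.items()))
```

With real `pyarrow.Table` chunks, the columns would first need converting to Python or NumPy values (for example with `.to_pylist()`), and the loop would run before `query.close()` is called.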

### R quick start

*Coming soon.*

## Questions, feedback and issues

- Questions: we encourage you to ask questions via [GitHub issues](https://github.com/chanzuckerberg/cell-census/issues). Alternatively, for quick support, you can join the [CZI Science Community](https://join-cellxgene-users.herokuapp.com/) on Slack and use the `#cell-census-users` channel.
- Bugs: please submit a [GitHub issue](https://github.com/chanzuckerberg/cell-census/issues).
- Security issues: if you believe you have found a security issue, please disclose it responsibly by contacting <security@chanzuckerberg.com> instead of filing an issue.
- You can send any other feedback to <soma@chanzuckerberg.com>.

## Coming soon

- R support!
- We are currently building the tooling needed for data modeling at scale, with seamless integration of the Cell Census and [PyTorch](https://pytorch.org/).
- To increase the usability of the Cell Census for research, in 2023 and 2024 we plan to explore the following areas:
  - Organism-wide normalized layers.
  - Organism-wide embeddings.
  - On-demand information-rich subsampling.

## Projects and tools using the Cell Census

If you are interested in listing a project here, please reach out to us at <soma@chanzuckerberg.com>.

## Reuse

The contents of this GitHub repository are freely available for reuse under the [MIT license](https://opensource.org/licenses/MIT). Data in the Cell Census are available for reuse under the [CC-BY license](https://creativecommons.org/licenses/by/4.0/).


## Code of Conduct

This project adheres to the Contributor Covenant [code of conduct](https://github.com/chanzuckerberg/.github/blob/master/CODE_OF_CONDUCT.md). By participating, you are expected to uphold this code. Please report unacceptable behavior to <opensource@chanzuckerberg.com>.
