Schema

Document Status: Approved

Version: 5.2.0

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14, RFC2119, and RFC8174 when, and only when, they appear in all capitals, as shown here.

Schema versioning

The CELLxGENE schema version is based on Semantic Versioning.

Major version is incremented when schema updates are incompatible with the AnnData and Seurat data encodings or CELLxGENE API(s). Examples include:

Renaming metadata fields
Deprecating metadata fields
Changing the type or format of a metadata field

Minor version is incremented when schema updates may require changes only to the cellxgene-schema CLI or the curation process. Examples include:

Adding metadata fields
Updating pinned ontologies or gene references
Changing the validation requirements for a metadata field

Patch version is incremented for editorial updates to the schema.

All changes are documented in the schema Changelog.
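
As a rough illustration of the versioning rules above, the following sketch classifies the difference between two schema versions as a major, minor, or patch increment. The helper names `parse_version` and `bump_kind` are hypothetical and are not part of the cellxgene-schema CLI.

```python
def parse_version(version: str) -> tuple:
    """Split a 'MAJOR.MINOR.PATCH' string into a tuple of three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)


def bump_kind(old: str, new: str) -> str:
    """Name the most significant version component that changed."""
    old_parts, new_parts = parse_version(old), parse_version(new)
    if new_parts <= old_parts:
        raise ValueError("new version must be greater than old version")
    if new_parts[0] != old_parts[0]:
        return "major"  # incompatible with the data encodings or API(s)
    if new_parts[1] != old_parts[1]:
        return "minor"  # may require CLI or curation process changes
    return "patch"      # editorial update only
```

Under these assumptions, `bump_kind("5.1.0", "5.2.0")` reports a minor increment, which would correspond to changes such as updated pinned ontologies.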

Background

CELLxGENE aims to support the publication, sharing, and exploration of single-cell datasets. Building on those published datasets, CELLxGENE seeks to create references of the phenotypes and composition of cells that make up human tissues.

Creating references from multiple datasets requires some harmonization of metadata and features, but if that harmonization is too onerous, it will burden the goal of rapid data sharing. CELLxGENE balances publishing and reference creation needs by requiring datasets hosted by CELLxGENE Discover to include a small set of metadata readily available from data submitters.

This document describes the schema, a type of contract, that CELLxGENE requires all datasets to adhere to so that it can enable searching, filtering, and integration of datasets it hosts.

Note that the requirements in the schema are just the minimum required information. Datasets often have additional metadata, which is preserved in datasets submitted to CELLxGENE Discover.

Overview

This schema supports multiple assay types. Each assay takes the form of one or more two-dimensional matrices whose values are quantitative measures of the phenotypes of cells.

The schema additionally describes how the dataset, genes, and cells are annotated to describe the biological and technical characteristics of the data.

This document is organized by:

General requirements
X (Matrix layers), which describe the data required for different assays
obs (Cell metadata), which describe each cell in the dataset
obsm (Embeddings), which describe each embedding in the dataset
obsp, which describes pairwise annotations of observations
var and raw.var (Gene metadata), which describe each gene in the dataset
varm, which describes multi-dimensional annotations of variables/features
varp, which describes pairwise annotations of variables/features
uns (Dataset metadata), which describe the dataset as a whole

General Requirements

AnnData. The canonical data format for CELLxGENE Discover is HDF5-backed AnnData as written by AnnData version 0.8.0 or greater. The on-disk format must conform to the AnnData specification (v0.1.0). Part of the rationale for selecting this format is to allow CELLxGENE to access both the data and metadata within a single file. The schema requirements and definitions for the AnnData X, obs, var, raw.var, obsm, and uns attributes are described below.

All data submitted to CELLxGENE Discover is automatically converted to a Seurat V5 object that can be loaded by the R package Seurat. See the Seurat encoding for further information.

Organisms. Data MUST be from a Metazoan organism or SARS-CoV-2 and defined in the NCBI organismal classification. For data that is not from Human, Mouse, or SARS-CoV-2, features MUST be translated into orthologous genes from the pinned Human and Mouse gene annotations.

Reserved Names. The names of metadata fields MUST NOT start with "__". The names of the metadata fields specified by the schema are reserved for the purposes and specifications described in the schema.

Unique Names. The names of schema and data submitter metadata fields in obs and var MUST be unique. For example, duplicate "feature_biotype" keys in AnnData var are not allowed.

Reserved Names from previous schema versions that have since been deprecated MUST NOT be present in datasets:

Reserved Name	AnnData	Deprecated in
ethnicity	obs	3.0.0
ethnicity_ontology_term_id	obs	3.0.0
X_normalization	uns	3.0.0
default_field	uns	2.0.0
layer_descriptions	uns	2.0.0
tags	uns	2.0.0
version	uns	2.0.0
contributors	uns	1.1.0
preprint_doi	uns	1.1.0
project_description	uns	1.1.0
project_links	uns	1.1.0
project_name	uns	1.1.0
publication_doi	uns	1.1.0
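
The naming rules above could be checked with a sketch along the following lines. The function name `find_name_problems` is hypothetical, and the inputs are assumed to be plain lists of `obs` column names and `uns` keys; this is not how the cellxgene-schema CLI is necessarily implemented.

```python
# Deprecated reserved names copied from the table above.
DEPRECATED_OBS = {"ethnicity", "ethnicity_ontology_term_id"}
DEPRECATED_UNS = {
    "X_normalization", "default_field", "layer_descriptions", "tags",
    "version", "contributors", "preprint_doi", "project_description",
    "project_links", "project_name", "publication_doi",
}


def find_name_problems(obs_columns, uns_keys):
    """Return a list of human-readable problems with metadata field names."""
    obs_columns, uns_keys = list(obs_columns), list(uns_keys)
    problems = []
    # Names MUST NOT start with "__".
    for name in obs_columns + uns_keys:
        if name.startswith("__"):
            problems.append(f"'{name}' starts with '__'")
    # Deprecated reserved names MUST NOT be present.
    for name in obs_columns:
        if name in DEPRECATED_OBS:
            problems.append(f"obs field '{name}' is deprecated")
    for name in uns_keys:
        if name in DEPRECATED_UNS:
            problems.append(f"uns key '{name}' is deprecated")
    # Field names in obs MUST be unique.
    duplicates = {name for name in obs_columns if obs_columns.count(name) > 1}
    for name in sorted(duplicates):
        problems.append(f"obs field '{name}' is duplicated")
    return problems
```

An empty list means that none of these particular naming rules were violated; it says nothing about the other requirements in the schema.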

Redundant Metadata. It is STRONGLY RECOMMENDED to avoid multiple metadata fields containing identical or similar information.

No Personally Identifiable Information (PII). This is not strictly enforced by validation because it is difficult for software to predict what is and is not PII; however, curators MUST agree to the data submission policies of CELLxGENE Discover on behalf of data submitters, which include this requirement:

It is my responsibility to ensure that this data is not identifiable. In particular, I commit that I will remove any direct personal identifiers in the metadata portions of the data, and that CZI may further contact me if it believes more work is needed to de-identify it.

This includes names, emails, or other PII for researchers or curators involved in the data generation and submission.

Note on types

The types below are python3 types. Note that a python3 str is a sequence of Unicode code points, which is stored null-terminated and UTF-8-encoded by AnnData.

`X` (Matrix Layers)

The data stored in the X data matrix is the data that is viewable in CELLxGENE Explorer. CELLxGENE does not impose any additional constraints on the X data matrix.

In any layer, if a matrix has 50% or more values that are zeros, it is STRONGLY RECOMMENDED that the matrix be encoded as a scipy.sparse.csr_matrix with zero values encoded as implicit zeros.

CELLxGENE's matrix layer requirements are tailored to optimize data reuse. Because each assay has different characteristics, the requirements differ by assay type. In general, CELLxGENE requires submission of "raw" data suitable for computational reuse when a standard raw matrix format exists for an assay. It is STRONGLY RECOMMENDED to also include a "normalized" matrix with processed values ready for data analysis and suitable for visualization in CELLxGENE Explorer. So that CELLxGENE's data can be provided in download formats suitable for both R and Python, the schema imposes the following requirements:

All matrix layers MUST have the same shape, and have the same cell labels and gene labels.
Because it is impractical to retain all barcodes in raw and normalized matrices, any cell filtering MUST be applied to both. By contrast, those wishing to reuse datasets require access to raw gene expression values, so genes SHOULD NOT be filtered from either dataset. Summarizing, any cell barcodes that are removed from the data MUST be filtered from both raw and normalized matrices and genes SHOULD NOT be filtered from the raw matrix.
Any genes that publishers wish to filter from the normalized matrix MAY have their values replaced by zeros and MUST be flagged in the column feature_is_filtered of var, which will mask them from exploration.
Additional layers provided at author discretion MAY be stored using author-selected keys, but MUST have the same cells and genes as other layers. It is STRONGLY RECOMMENDED that these layers have names that accurately summarize what the numbers in the layer represent (e.g. "counts_per_million", "SCTransform_normalized", or "RNA_velocity_unspliced").

The following table describes the matrix data and layers requirements that are assay-specific. If an entry in the table is empty, the schema does not have any other requirements on data in those layers beyond the ones listed above.

Assay	"raw" required?	"raw" location	"normalized" required?	"normalized" location
scRNA-seq (UMI, e.g. 10x v3, Slide-seqV2)	REQUIRED. Values MUST be de-duplicated molecule counts. Each cell MUST contain at least one non-zero value. All non-zero values MUST be positive integers stored as `numpy.float32`.	`AnnData.raw.X` unless no "normalized" is provided, then `AnnData.X`	STRONGLY RECOMMENDED	`AnnData.X`
Visium Spatial Gene Expression	REQUIRED. Values MUST be de-duplicated molecule counts. All non-zero values MUST be positive integers stored as `numpy.float32`. If `uns['spatial']['is_single']` is `False` then each cell MUST contain at least one non-zero value. If `uns['spatial']['is_single']` is `True` then the unfiltered feature-barcode matrix (`raw_feature_bc_matrix`) MUST be used. See Space Ranger Feature-Barcode Matrices. This matrix MUST contain 4992 rows. If the `obs['in_tissue']` value is `1`, then the cell MUST contain at least one non-zero value. If any `obs['in_tissue']` values are `0`, then at least one cell with an `obs['in_tissue']` value of `0` MUST contain a non-zero value.	`AnnData.raw.X` unless no "normalized" is provided, then `AnnData.X`	STRONGLY RECOMMENDED	`AnnData.X`
scRNA-seq (non-UMI, e.g. SS2)	REQUIRED. Values MUST be one of read counts (e.g. FeatureCounts) or estimated fragments (e.g. output of RSEM). Each cell MUST contain at least one non-zero value. All non-zero values MUST be positive integers stored as `numpy.float32`.	`AnnData.raw.X` unless no "normalized" is provided, then `AnnData.X`	STRONGLY RECOMMENDED	`AnnData.X`
Accessibility (e.g. ATAC-seq, mC-seq)	NOT REQUIRED		REQUIRED	`AnnData.X`
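
The "raw" requirements in the scRNA-seq rows of the table above can be sketched as a few NumPy checks. This is only an illustration under stated assumptions: the function name `check_raw_counts` is hypothetical, the matrix is assumed to be a dense `numpy.ndarray` with cells as rows, and sparse matrices or the Visium-specific rules would need additional handling.

```python
import numpy as np


def check_raw_counts(raw: np.ndarray) -> list:
    """Return a list of problems found in a dense raw count matrix."""
    errors = []
    # Values are expected to be stored as numpy.float32.
    if raw.dtype != np.float32:
        errors.append("matrix is not stored as numpy.float32")
    # All non-zero values are expected to be positive integers.
    nonzero = raw[raw != 0]
    if np.any(nonzero < 0) or np.any(nonzero != np.round(nonzero)):
        errors.append("non-zero values are not all positive integers")
    # Each cell (row) is expected to contain at least one non-zero value.
    if np.any((raw != 0).sum(axis=1) == 0):
        errors.append("at least one cell has no non-zero value")
    return errors
```

For an `AnnData` object this might be called on `adata.raw.X`, or on `adata.X` when no normalized matrix is provided, after converting to a dense array if the matrix is small enough.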

Integration Metadata

CELLxGENE requires ontology terms to enable search, comparison, and integration of data. Ontology terms for cell metadata MUST use OBO-format identifiers, meaning a CURIE (prefixed identifier) of the form Ontology:Identifier. For example, EFO:0000001 is a term in the Experimental Factor Ontology (EFO).

The most accurate ontology term MUST always be used. If an exact or approximate ontology term is not available, a new term may be requested:

For the Cell Ontology, data submitters may suggest a new term and notify the curation team of the pending term request, so that the datasets can be updated once the term is available.

To meet CELLxGENE schema requirements, the most accurate available CL term MUST be used until the new term is available. For example, if cell_type_ontology_term_id describes a relay interneuron, but the most accurate available term in the CL ontology is CL:0000099 for interneuron, then the interneuron term can be used to fulfill this requirement, which ensures that users searching for "neuron" are able to find these data. If no appropriate term can be found (e.g. the cell type is unknown), then "unknown" MUST be used. Users will still be able to access more specific cell type annotations that have been submitted with the dataset (but aren't required by the schema).
For all other ontologies, data submitters may submit a request to the curation team during the submission process.

Terms documented as obsolete in an ontology MUST NOT be used. For example, EFO:0009310 for obsolete_10x v2 was marked as obsolete in EFO version 3.31.0 and replaced by EFO:0009899 for 10x 3' v2.

Required Ontologies

The following ontology dependencies are pinned for this version of the schema.

Ontology	OBO Prefix	Release	Download
Cell Ontology	CL	2024-08-16	cl.owl
Experimental Factor Ontology	EFO	2024-08-15 EFO 3.69.0	efo.owl
Human Ancestry Ontology	HANCESTRO	3.0	hancestro-base.owl
Human Developmental Stages	HsapDv	2024-05-28	hsapdv.owl
Mondo Disease Ontology	MONDO	2024-08-06	mondo.owl
Mouse Developmental Stages	MmusDv	2024-05-28	mmusdv.owl
NCBI organismal classification	NCBITaxon	2023-06-20	ncbitaxon.owl
Phenotype And Trait Ontology	PATO	2023-05-18	pato.owl
Uberon multi-species anatomy ontology	UBERON	2024-08-07	uberon.owl
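
The CURIE form described under Integration Metadata, combined with the OBO prefixes pinned in the table above, suggests a simple syntactic check. The sketch below assumes that identifiers consist of an alphabetic prefix, a colon, and a run of digits; the name `is_pinned_curie` is hypothetical. It does not verify that a term exists in the pinned release or that it is not obsolete.

```python
import re

# OBO prefixes pinned in the table above.
PINNED_PREFIXES = {
    "CL", "EFO", "HANCESTRO", "HsapDv", "MONDO",
    "MmusDv", "NCBITaxon", "PATO", "UBERON",
}

# Assumption: a CURIE here is "<letters>:<digits>", e.g. "EFO:0000001".
CURIE_PATTERN = re.compile(r"([A-Za-z]+):(\d+)")


def is_pinned_curie(term: str) -> bool:
    """Return True if term looks like a CURIE with a pinned OBO prefix."""
    match = CURIE_PATTERN.fullmatch(term)
    return match is not None and match.group(1) in PINNED_PREFIXES
```

A full validator would additionally load the pinned ontology files to confirm that each term exists and satisfies the descendant requirements given for the individual fields.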

Required Gene Annotations

ENSEMBL identifiers are required for genes and External RNA Controls Consortium (ERCC) identifiers for RNA Spike-In Control Mixes to ensure that all datasets measure the same features and can therefore be integrated.

The following gene annotation dependencies are pinned for this version of the schema. For multi-organism experiments, cells from any Metazoan organism are allowed as long as orthologs from the following organism annotations are used.

Source	Required version	Download
GENCODE (Human)	Human reference GRCh38.p14 (GENCODE v44/Ensembl 110)	gencode.v44.primary_assembly.annotation.gtf
GENCODE (Mouse)	Mouse reference GRCm39 (GENCODE vM33/Ensembl 110)	gencode.vM33.primary_assembly.annotation.gtf
ENSEMBL (COVID-19)	SARS-CoV-2 reference (ENSEMBL assembly: ASM985889v3)	Sars_cov_2.ASM985889v3.101.gtf
ThermoFisher ERCC Spike-Ins	ThermoFisher ERCC RNA Spike-In Control Mixes (Cat # 4456740, 4456739)	cms_095047.txt

`obs` (Cell Metadata)

obs is a pandas.DataFrame.

Curators MUST annotate the following columns in the obs dataframe:

index of pandas.DataFrame

Key	index of `pandas.DataFrame`
Annotator	Curator MUST annotate.
Value	`str`. The index of the pandas.DataFrame MUST contain unique identifiers for observations.

array_col

Key	array_col
Annotator	Curator MUST annotate if `assay_ontology_term_id` is `"EFO:0010961"` for Visium Spatial Gene Expression and `uns['spatial']['is_single']` is `True`; otherwise, this key MUST NOT be present.
Value	`int`. This MUST be the value of the column coordinate for the corresponding spot from the `array_col` field in `tissue_positions_list.csv` or `tissue_positions.csv`. The value MUST be in the range between `0` and `127`. See Space Ranger Spatial Outputs.

array_row

Key	array_row
Annotator	Curator MUST annotate if `assay_ontology_term_id` is `"EFO:0010961"` for Visium Spatial Gene Expression and `uns['spatial']['is_single']` is `True`; otherwise, this key MUST NOT be present.
Value	`int`. This MUST be the value of the row coordinate for the corresponding spot from the `array_row` field in `tissue_positions_list.csv` or `tissue_positions.csv`. The value MUST be in the range between `0` and `77`. See Space Ranger Spatial Outputs.

assay_ontology_term_id

Key assay_ontology_term_id

Annotator Curator MUST annotate.

Value

categorical with str categories. This MUST be an EFO term and either:

the most accurate descendant of "EFO:0002772" for assay by molecule
the most accurate descendant of "EFO:0010183" for single cell library construction

If assay_ontology_term_id is either "EFO:0010961" for Visium Spatial Gene Expression or "EFO:0030062" for Slide-seqV2 then all observations MUST contain the same value.

An assay based on 10X Genomics products SHOULD either be "EFO:0008995" for 10x technology or preferably its most accurate descendant. An assay based on SMART (Switching Mechanism at the 5' end of the RNA Template) or SMARTer technology SHOULD either be "EFO:0010184" for Smart-like or preferably its most accurate descendant.

Recommended values for specific assays:

For	Use
10x 3' v2	`"EFO:0009899"`
10x 3' v3	`"EFO:0009922"`
10x 5' v1	`"EFO:0011025"`
10x 5' v2	`"EFO:0009900"`
Smart-seq2	`"EFO:0008931"`
Visium Spatial Gene Expression	`"EFO:0010961"`

cell_type_ontology_term_id

Key	cell_type_ontology_term_id
Annotator	Curator MUST annotate.
Value	categorical with `str` categories. This MUST be a CL term or `"unknown"`. It MUST be `"unknown"` when any of the following applies: (1) no appropriate term can be found (e.g. the cell type is unknown); (2) `assay_ontology_term_id` is `"EFO:0010961"` for Visium Spatial Gene Expression, `uns['spatial']['is_single']` is `True`, and the corresponding value of `in_tissue` is `0`. The following terms MUST NOT be used: `"CL:0000255"` for eukaryotic cell; `"CL:0000257"` for Eumycetozoan cell; `"CL:0000548"` for animal cell.

development_stage_ontology_term_id

Key development_stage_ontology_term_id

Annotator Curator MUST annotate.

Value

categorical with str categories. If unavailable, this MUST be "unknown".

If organism_ontology_term_id is "NCBITaxon:9606" for Homo sapiens, this MUST be the most accurate descendant of HsapDv:0000001 for life cycle with the following STRONGLY RECOMMENDED:

For	Use
Embryonic stage	A term from the set of Carnegie stages 1-23 (up to 8 weeks after conception; e.g. HsapDv:0000003)
Fetal development	A term from the set of 9 to 38 week post-fertilization human stages (9 weeks after conception and before birth; e.g. HsapDv:0000046)
After birth for the first 12 months	A term from the set of 1 to 12 month-old human stages (e.g. HsapDv:0000273)
After the first 12 months post-birth	A term from the set of year-old human stages (e.g. HsapDv:0000246)

If organism_ontology_term_id is "NCBITaxon:10090" for Mus musculus, this MUST be the most accurate descendant of MmusDv:0000001 for life cycle with the following STRONGLY RECOMMENDED:

For	Use
From the time of conception to 1 month after birth	A term from the set of Theiler stages (e.g. MmusDv:0000003)
From 2 months after birth	A term from the set of month-old stages (e.g. MmusDv:0000062)

Otherwise, for all other organisms this MUST be the most accurate descendant of UBERON:0000105 for life cycle stage, excluding UBERON:0000071 for death stage.
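
The organism-dependent choice of ontology for `development_stage_ontology_term_id` can be summarized in a small sketch. The function names are hypothetical, and the check only inspects the CURIE prefix; it does not confirm that a term is a descendant of the required root term, which would need the pinned ontology files.

```python
def expected_stage_prefix(organism_term_id: str) -> str:
    """Return the ontology prefix expected for the given organism."""
    if organism_term_id == "NCBITaxon:9606":   # Homo sapiens
        return "HsapDv"
    if organism_term_id == "NCBITaxon:10090":  # Mus musculus
        return "MmusDv"
    return "UBERON"                            # all other organisms


def stage_prefix_ok(organism_term_id: str, stage_term_id: str) -> bool:
    """Check that a development stage term uses the expected ontology."""
    if stage_term_id == "unknown":
        return True
    if stage_term_id == "UBERON:0000071":  # death stage is excluded
        return False
    prefix = expected_stage_prefix(organism_term_id)
    return stage_term_id.startswith(prefix + ":")
```

For example, a human dataset annotated with a MmusDv term would fail this check, while `"unknown"` is accepted for any organism.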

disease_ontology_term_id

Key	disease_ontology_term_id
Annotator	Curator MUST annotate.
Value	categorical with `str` categories. This MUST be one of: `"PATO:0000461"` for normal or healthy; the most accurate descendant of `"MONDO:0000001"` for disease; `"MONDO:0021178"` for injury or preferably its most accurate descendant.

donor_id

Key	donor_id
Annotator	Curator MUST annotate.
Value	categorical with `str` categories. This MUST be free-text that identifies a unique individual that data were derived from. It is STRONGLY RECOMMENDED that this identifier be designed so that it is unique to: (1) a given individual within the collection of datasets that includes this dataset; (2) a given individual across all collections in CELLxGENE Discover. It is STRONGLY RECOMMENDED that `"pooled"` be used for observations from a sample of multiple individuals that were not confidently assigned to a single individual through demultiplexing. It is STRONGLY RECOMMENDED that `"unknown"` ONLY be used for observations in a dataset when it is not known which observations are from the same individual.

in_tissue

Key	in_tissue
Annotator	Curator MUST annotate if `assay_ontology_term_id` is `"EFO:0010961"` for Visium Spatial Gene Expression and `uns['spatial']['is_single']` is `True`; otherwise, this key MUST NOT be present.
Value	`int`. This MUST be the value for the corresponding spot from the `in_tissue` field in `tissue_positions_list.csv` or `tissue_positions.csv` which is either `0` if the spot falls outside tissue or `1` if the spot falls inside tissue. See Space Ranger Spatial Outputs.

is_primary_data

Key	is_primary_data
Annotator	Curator MUST annotate.
Value	`bool`. This MUST be `False` if `uns['spatial']['is_single']` is `False`. This MUST be `True` if this is the canonical instance of this cellular observation and `False` if not. This is commonly `False` for meta-analyses reusing data or for secondary views of data.

organism_ontology_term_id

Key	organism_ontology_term_id
Annotator	Curator MUST annotate.
Value	categorical with `str` categories. This MUST be a descendant of NCBITaxon:33208 for Metazoa.

self_reported_ethnicity_ontology_term_id

Key	self_reported_ethnicity_ontology_term_id
Annotator	Curator MUST annotate.
Value	categorical with `str` categories. If `organism_ontology_term_id` is `"NCBITaxon:9606"` for Homo sapiens, the value MUST be formatted as one or more comma-separated (with no leading or trailing spaces) HANCESTRO terms in ascending lexical order with no duplication of terms or `"unknown"` if unavailable. For example, if the terms are `"HANCESTRO:0014"` and `"HANCESTRO:0005"` then the value of `self_reported_ethnicity_ontology_term_id` MUST be `"HANCESTRO:0005,HANCESTRO:0014"`. The following terms MUST NOT be used: `"HANCESTRO:0002"` for regions and its descendants; `"HANCESTRO:0003"` for country; `"HANCESTRO:0004"` for ancestry category; `"HANCESTRO:0018"` for uncategorised population; `"HANCESTRO:0290"` for genetically isolated population; `"HANCESTRO:0304"` for ancestry status and its descendants; `"HANCESTRO:0323"` for Finnish founder; `"HANCESTRO:0324"` for Dutch founder; `"HANCESTRO:0551"` for genetically homogenous Irish; `"HANCESTRO:0554"` for Silk Road founder; `"HANCESTRO:0555"` for Arab Israeli founder; `"HANCESTRO:0557"` for Costa Rican founder; `"HANCESTRO:0558"` for French Canadian founder; `"HANCESTRO:0559"` for Italian founder; `"HANCESTRO:0560"` for Northern Finnish founder; `"HANCESTRO:0561"` for Romanian founder; `"HANCESTRO:0564"` for Vis founder; `"HANCESTRO:0565"` for Split founder; `"HANCESTRO:0566"` for undefined ancestry population; the imported GEO term `"GEO:000000374"` for continent and its descendants: `"HANCESTRO:0029"` for Africa, `"HANCESTRO:0030"` for Asia, `"HANCESTRO:0031"` for Europe, `"HANCESTRO:0032"` for Oceania, `"HANCESTRO:0033"` for Latin America and the Caribbean, `"HANCESTRO:0034"` for Northern America. Otherwise, for all other organisms the `str` value MUST be `"na"`.
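
The formatting rule for multiple HANCESTRO terms (comma-separated, no spaces, ascending lexical order, no duplicates) can be sketched as follows. Both function names are hypothetical, and the sketch does not check the list of terms that MUST NOT be used.

```python
def format_ethnicity_terms(terms) -> str:
    """Join HANCESTRO terms into the comma-separated form described above."""
    # Remove surrounding whitespace, drop duplicates, and sort lexically.
    unique_terms = sorted({term.strip() for term in terms})
    if not unique_terms:
        return "unknown"  # no terms available
    return ",".join(unique_terms)


def is_well_formatted(value: str) -> bool:
    """Check that an existing value is already in the expected form."""
    return value == format_ethnicity_terms(value.split(","))
```

With the example from the table, `format_ethnicity_terms(["HANCESTRO:0014", "HANCESTRO:0005"])` produces `"HANCESTRO:0005,HANCESTRO:0014"`.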

sex_ontology_term_id

Key	sex_ontology_term_id
Annotator	Curator MUST annotate.
Value	categorical with `str` categories. This MUST be a descendant of PATO:0001894 for phenotypic sex or `"unknown"` if unavailable.

suspension_type

Key suspension_type

Annotator Curator MUST annotate.

Value

categorical with str categories. This MUST be "cell", "nucleus", or "na".

This MUST be the correct type for the corresponding assay:

For Assay	MUST Use
10x transcription profiling [`EFO:0030080`] and its descendants	`"cell"` or `"nucleus"`
ATAC-seq [`EFO:0007045`] and its descendants	`"nucleus"`
BD Rhapsody Targeted mRNA [`EFO:0700004`]	`"cell"`
BD Rhapsody Whole Transcriptome Analysis [`EFO:0700003`]	`"cell"`
CEL-seq2 [`EFO:0010010`]	`"cell"` or `"nucleus"`
DroNc-seq [`EFO:0008720`]	`"nucleus"`
Drop-seq [`EFO:0008722`]	`"cell"` or `"nucleus"`
GEXSCOPE technology [`EFO:0700011`]	`"cell"` or `"nucleus"`
inDrop [`EFO:0008780`]	`"cell"` or `"nucleus"`
MARS-seq [`EFO:0008796`]	`"cell"`
mCT-seq [`EFO:0030060`]	`"cell"` or `"nucleus"`
MERFISH [`EFO:0008992`]	`"na"`
methylation profiling by high throughput sequencing [`EFO:0002761`] and its descendants	`"nucleus"`
microwell-seq [`EFO:0030002`]	`"cell"`
Patch-seq [`EFO:0008853`]	`"cell"`
ScaleBio single cell RNA sequencing [`EFO:0022490`]	`"cell"` or `"nucleus"`
sci-Plex [`EFO:0030026`]	`"nucleus"`
sci-RNA-seq [`EFO:0010550`]	`"cell"` or `"nucleus"`
sci-RNA-seq3 [`EFO:0030028`]	`"cell"` or `"nucleus"`
Seq-Well [`EFO:0008919`] and its descendants	`"cell"`
Smart-like [`EFO:0010184`] and its descendants	`"cell"` or `"nucleus"`
spatial transcriptomics [`EFO:0008994`] and its descendants	`"na"`
SPLiT-seq [`EFO:0009919`]	`"cell"` or `"nucleus"`
STRT-seq [`EFO:0008953`]	`"cell"`
TruDrop [`EFO:0700010`]	`"cell"` or `"nucleus"`

If the assay does not appear in this table, the most appropriate value MUST be selected and the curation team informed during submission so that the assay can be added to the table.
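
A lookup over the table above might be sketched as shown below. Only a few rows that name a single assay term are included; rows that also cover "its descendants" would need an ontology lookup and are omitted. The names `ALLOWED_SUSPENSION_TYPES` and `suspension_type_ok` are hypothetical.

```python
# A few single-term rows copied from the table above.
ALLOWED_SUSPENSION_TYPES = {
    "EFO:0700004": {"cell"},             # BD Rhapsody Targeted mRNA
    "EFO:0008720": {"nucleus"},          # DroNc-seq
    "EFO:0008722": {"cell", "nucleus"},  # Drop-seq
    "EFO:0008992": {"na"},               # MERFISH
    "EFO:0008853": {"cell"},             # Patch-seq
    "EFO:0009919": {"cell", "nucleus"},  # SPLiT-seq
}


def suspension_type_ok(assay_term_id: str, suspension_type: str):
    """Return True or False for listed assays, or None if the assay is not listed."""
    if suspension_type not in {"cell", "nucleus", "na"}:
        return False
    allowed = ALLOWED_SUSPENSION_TYPES.get(assay_term_id)
    if allowed is None:
        return None  # not in this partial table; a curator decision is needed
    return suspension_type in allowed
```

Returning `None` for unlisted assays mirrors the instruction above: the most appropriate value is selected manually and the curation team is informed.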

tissue_type

Key	tissue_type
Annotator	Curator MUST annotate.
Value	categorical with `str` categories. This MUST be `"tissue"`, `"organoid"`, or `"cell culture"`.

tissue_ontology_term_id

Key	tissue_ontology_term_id
Annotator	Curator MUST annotate.
Value	categorical with `str` categories. If `tissue_type` is `"tissue"` or `"organoid"`, this MUST be the most accurate descendant of `UBERON:0001062` for anatomical entity. If `tissue_type` is `"cell culture"` this MUST follow the requirements for `cell_type_ontology_term_id`.

When a dataset is uploaded, CELLxGENE Discover MUST automatically add the matching human-readable name for the corresponding ontology term to the obs dataframe. Curators MUST NOT annotate the following columns.

assay

Key	assay
Annotator	CELLxGENE Discover MUST annotate.
Value	categorical with `str` categories. This MUST be the human-readable name assigned to the value of `assay_ontology_term_id`.

cell_type

Key	cell_type
Annotator	CELLxGENE Discover MUST annotate.
Value	categorical with `str` categories. This MUST be `"unknown"` if the value of `cell_type_ontology_term_id` is `"unknown"`; otherwise, this MUST be the human-readable name assigned to the value of `cell_type_ontology_term_id`.

development_stage

Key	development_stage
Annotator	CELLxGENE Discover MUST annotate.
Value	categorical with `str` categories. This MUST be `"unknown"` if the value of `development_stage_ontology_term_id` is `"unknown"`; otherwise, this MUST be the human-readable name assigned to the value of `development_stage_ontology_term_id`.

disease

Key	disease
Annotator	CELLxGENE Discover MUST annotate.
Value	categorical with `str` categories. This MUST be the human-readable name assigned to the value of `disease_ontology_term_id`.

When a dataset is uploaded, CELLxGENE Discover MUST annotate a unique observation identifier for each cell. Curators MUST NOT annotate the following column.

observation_joinid

Key	observation_joinid
Annotator	CELLxGENE Discover MUST annotate.
Value	`str`

organism

Key	organism
Annotator	CELLxGENE Discover MUST annotate.
Value	categorical with `str` categories. This MUST be the human-readable name assigned to the value of `organism_ontology_term_id`.

self_reported_ethnicity

Key	self_reported_ethnicity
Annotator	CELLxGENE Discover MUST annotate.
Value	categorical with `str` categories. This MUST be `"na"` if the value of `self_reported_ethnicity_ontology_term_id` is `"na"`. This MUST be `"unknown"` if the value of `self_reported_ethnicity_ontology_term_id` is `"unknown"`. Otherwise, this MUST be one or more comma-separated (with no leading or trailing spaces) human-readable names for the terms in `self_reported_ethnicity_ontology_term_id` in the same order. For example, if the value of `self_reported_ethnicity_ontology_term_id` is `"HANCESTRO:0005,HANCESTRO:0014"` then the value of `self_reported_ethnicity` is `"European,Hispanic or Latin American"`.

sex

Key	sex
Annotator	CELLxGENE Discover MUST annotate.
Value	categorical with `str` categories. This MUST be `"unknown"` if the value of `sex_ontology_term_id` is `"unknown"`; otherwise, this MUST be the human-readable name assigned to the value of `sex_ontology_term_id`.

tissue

Key	tissue
Annotator	CELLxGENE Discover MUST annotate.
Value	categorical with `str` categories. This MUST be the human-readable name assigned to the value of `tissue_ontology_term_id`.

`obsm` (Embeddings)

The value for each str key MUST be a numpy.ndarray of shape (n_obs, m), where n_obs is the number of rows in X and m >= 1.

To display a dataset in CELLxGENE Explorer, Curators MUST annotate one or more embeddings of at least two dimensions (e.g. tSNE, UMAP, PCA, spatial coordinates) as numpy.ndarrays in obsm.

spatial

Key	spatial
Annotator	Curator MUST annotate if `uns['spatial']['is_single']` is `True`. Curator MAY annotate if `uns['spatial']['is_single']` is `False`. Otherwise, this key MUST NOT be present.
Value	`numpy.ndarray` with the following requirements: MUST have the same number of rows as `X` and MUST include at least two columns; MUST be a `numpy.dtype.kind` of `"f"`, `"i"`, or `"u"`; MUST NOT contain any positive infinity (`numpy.inf`) or negative infinity (`numpy.NINF`) values; MUST NOT contain all Not a Number (`numpy.nan`) values. If `assay_ontology_term_id` is `"EFO:0010961"` for Visium Spatial Gene Expression and `uns['spatial']['is_single']` is `True`, the array MUST be created from the corresponding `pxl_row_in_fullres` and `pxl_col_in_fullres` fields from `tissue_positions_list.csv` or `tissue_positions.csv`. See Space Ranger Spatial Outputs.
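
The array requirements in the Value row above could be checked roughly as follows. The function name `check_embedding` is hypothetical, `n_obs` is assumed to be the number of rows in `X`, and the Visium-specific provenance of the coordinates is not checked.

```python
import numpy as np


def check_embedding(array: np.ndarray, n_obs: int) -> list:
    """Return a list of problems found in an obsm embedding array."""
    errors = []
    # Same number of rows as X and at least two columns.
    if array.ndim != 2 or array.shape[0] != n_obs or array.shape[1] < 2:
        errors.append("shape must be (n_obs, m) with m >= 2")
    # dtype kind must be float, signed integer, or unsigned integer.
    if array.dtype.kind not in ("f", "i", "u"):
        errors.append("dtype kind must be 'f', 'i', or 'u'")
        return errors  # the numeric checks below need a numeric dtype
    if np.isinf(array).any():
        errors.append("array contains infinite values")
    if array.size > 0 and np.isnan(array).all():
        errors.append("array contains only NaN values")
    return errors
```

For an `AnnData` object this might be called as `check_embedding(adata.obsm["spatial"], adata.n_obs)`.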

X_{suffix}

Key	X_{suffix} with the following requirements: {suffix} MUST be at least one character in length. The first character of {suffix} MUST be a letter of the alphabet and the remaining characters MUST be alphanumeric characters, `'_'`, `'-'`, or `'.'` (This is equivalent to the regular expression pattern `"^[a-zA-Z][a-zA-Z0-9_.-]*$"`.) {suffix} MUST NOT be `"spatial"`. {suffix} is presented as text to users in the Embedding Choice selector in CELLxGENE Explorer so it is STRONGLY RECOMMENDED that it be descriptive. See also `default_embedding` in `uns`.
Annotator	Curator MUST annotate if `assay_ontology_term_id` is neither `"EFO:0010961"` for Visium Spatial Gene Expression nor `"EFO:0030062"` for Slide-seqV2. Curator MAY annotate if `assay_ontology_term_id` is either `"EFO:0010961"` for Visium Spatial Gene Expression or `"EFO:0030062"` for Slide-seqV2.
Value	`numpy.ndarray` with the following requirements: MUST have the same number of rows as `X` and MUST include at least two columns; MUST be a `numpy.dtype.kind` of `"f"`, `"i"`, or `"u"`; MUST NOT contain any positive infinity (`numpy.inf`) or negative infinity (`numpy.NINF`) values; MUST NOT contain all Not a Number (`numpy.nan`) values.
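
The key requirements for `X_{suffix}` can be expressed with a short regular expression check. The sketch assumes, as the prose states, that the suffix is one leading letter followed by zero or more characters from the allowed set; the name `is_valid_embedding_key` is hypothetical.

```python
import re

# One leading letter, then any number of letters, digits, '_', '.', or '-'.
SUFFIX_PATTERN = re.compile(r"[a-zA-Z][a-zA-Z0-9_.-]*")


def is_valid_embedding_key(key: str) -> bool:
    """Check an obsm key of the form X_{suffix}."""
    if not key.startswith("X_"):
        return False
    suffix = key[2:]
    if suffix == "spatial":  # reserved; the spatial key is named "spatial"
        return False
    return SUFFIX_PATTERN.fullmatch(suffix) is not None
```

Using `fullmatch` avoids the need for explicit `^` and `$` anchors and rejects a trailing newline, which `$` alone would tolerate in Python.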

`obsp`

The size of the ndarray stored for a key in obsp MUST NOT be zero.

`var` and `raw.var` (Gene Metadata)

var and raw.var are both of type pandas.DataFrame.

Curators MUST annotate the following columns in the var dataframe and if present, the raw.var dataframe.

index of pandas.DataFrame

Key	index of `pandas.DataFrame`
Annotator	Curator MUST annotate.
Value	`str`. If the feature is a gene then this MUST be an ENSEMBL term. If the feature is an RNA Spike-In Control Mix then this MUST be an ERCC Spike-In identifier (e.g. `"ERCC-0003"`). The index of the `pandas.DataFrame` MUST contain unique identifiers for features. If present, the index of `raw.var` MUST be identical to the index of `var`.

Curators MUST annotate the following column only in the var dataframe. This column MUST NOT be present in raw.var:

feature_is_filtered

Key	feature_is_filtered
Annotator	Curator MUST annotate.
Value	`bool`. This MUST be `True` if the feature was filtered out in the normalized matrix (`X`) but is present in the raw matrix (`raw.X`). The value for all cells of the given feature in the normalized matrix MUST be `0`. Otherwise, this MUST be `False`.

Curators MUST NOT annotate the following columns in the var dataframe and if present, the raw.var dataframe.

When a dataset is uploaded, CELLxGENE Discover MUST automatically add the matching human-readable name for the corresponding feature biotype, identifier, and the NCBITaxon term for the reference organism to the var and raw.var dataframes. In addition, it MUST add the feature length and type.

feature_biotype

Key	feature_biotype
Annotator	CELLxGENE Discover MUST annotate.
Value	This MUST be `"gene"` or `"spike-in"`.

feature_length

Key	feature_length
Annotator	CELLxGENE Discover MUST annotate.
Value	`uint` number of base-pairs (bps). The value is the median of the lengths of isoforms, reusing the median calculation from GTFtools: a software package for analyzing various features of gene models.

feature_name

Key	feature_name
Annotator	CELLxGENE Discover MUST annotate.
Value	`str`. If the `feature_biotype` is `"gene"` then this MUST be the human-readable ENSEMBL gene name assigned to the feature identifier in `var.index`. If the `feature_biotype` is `"spike-in"` then this MUST be the ERCC Spike-In identifier appended with `" (spike-in control)"`.

feature_reference

Key feature_reference

Annotator CELLxGENE Discover MUST annotate.

Value

str. This MUST be the reference organism for a feature:

Reference Organism	MUST Use
Homo sapiens	`"NCBITaxon:9606"`
Mus musculus	`"NCBITaxon:10090"`
SARS-CoV-2	`"NCBITaxon:2697049"`
ERCC Spike-Ins	`"NCBITaxon:32630"`

feature_type

Key	feature_type
Annotator	CELLxGENE Discover MUST annotate.
Value	`str`. If the `feature_biotype` is `"gene"` then this MUST be the gene type assigned to the feature identifier in `var.index`. If the `feature_biotype` is `"spike-in"` then this MUST be `"synthetic"`. See GENCODE and Ensembl references.
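
For spike-in features, the values that CELLxGENE Discover adds are fully determined by the rules above, so they can be collected in a small sketch. The function name is hypothetical, and the check that ERCC identifiers start with `"ERCC-"` is an assumption based on the example `"ERCC-0003"`. Gene features are not covered because their names, lengths, and types come from the pinned gene annotations.

```python
def spike_in_annotations(feature_id: str) -> dict:
    """Return the var annotations described above for an ERCC spike-in."""
    # Assumption: ERCC Spike-In identifiers start with "ERCC-".
    if not feature_id.startswith("ERCC-"):
        raise ValueError(f"{feature_id!r} does not look like an ERCC identifier")
    return {
        "feature_biotype": "spike-in",
        "feature_name": feature_id + " (spike-in control)",
        "feature_reference": "NCBITaxon:32630",
        "feature_type": "synthetic",
    }
```

For `"ERCC-0003"`, the resulting `feature_name` would be `"ERCC-0003 (spike-in control)"`.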

`varm`

The size of the ndarray stored for a key in varm MUST NOT be zero.

`varp`

The size of the ndarray stored for a key in varp MUST NOT be zero.

`uns` (Dataset Metadata)

uns is an ordered dictionary with a str key. The data stored as a value for a key in uns MUST be True, False, None, or its size MUST NOT be zero.
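
The rule for values in uns can be expressed as a small predicate. This is a minimal sketch, not the validator's actual logic: it assumes array-like values expose an integer `size` attribute (as `numpy.ndarray` does), falls back to `len()` for containers and strings, and treats plain scalars without a size as non-empty. The name `uns_value_is_allowed` is hypothetical.

```python
def uns_value_is_allowed(value):
    """Return True if a value stored in uns satisfies the non-empty rule."""
    # True, False, and None are always allowed.
    if value is True or value is False or value is None:
        return True
    # Array-like objects (for example numpy.ndarray) expose an integer `size`.
    size = getattr(value, "size", None)
    if isinstance(size, int):
        return size != 0
    # Containers and strings: fall back to len().
    try:
        return len(value) != 0
    except TypeError:
        # Assumption: plain scalars such as int or float have no size
        # and are treated as non-empty.
        return True
```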

Curators MUST annotate the following keys and values in uns:

spatial

Key	spatial
Annotator	Curator MUST annotate if `assay_ontology_term_id` is `"EFO:0010961"` for Visium Spatial Gene Expression or `"EFO:0030062"` for Slide-seqV2; otherwise, this key MUST NOT be present.
Value	`dict`. The requirements for the key-value pairs are documented in the following sections: spatial['is_single'], spatial[library_id], spatial[library_id]['images'], spatial[library_id]['images']['fullres'], spatial[library_id]['images']['hires'], spatial[library_id]['scalefactors'], spatial[library_id]['scalefactors']['spot_diameter_fullres'], and spatial[library_id]['scalefactors']['tissue_hires_scalef']. Additional key-value pairs MUST NOT be present.

is_single

Key	is_single
Annotator	Curator MUST annotate if `assay_ontology_term_id` is `"EFO:0010961"` for Visium Spatial Gene Expression or `"EFO:0030062"` for Slide-seqV2; otherwise, this key MUST NOT be present.
Value	`bool`. This MUST be `True` if `assay_ontology_term_id` is `"EFO:0010961"` for Visium Spatial Gene Expression and the dataset represents one Space Ranger output for a single tissue section, or if `assay_ontology_term_id` is `"EFO:0030062"` for Slide-seqV2 and the dataset represents the output for a single array on a puck. Otherwise, this MUST be `False`.

spatial[library_id]

Key	Identifier for the Visium library
Annotator	Curator MUST annotate if `assay_ontology_term_id` is `"EFO:0010961"` for Visium Spatial Gene Expression and `uns['spatial']['is_single']` is `True`; otherwise, this key MUST NOT be present.
Value	`dict`. There MUST be only one `library_id`.

spatial[library_id]['images']

Key	images
Annotator	Curator MUST annotate if `assay_ontology_term_id` is `"EFO:0010961"` for Visium Spatial Gene Expression and `uns['spatial']['is_single']` is `True`; otherwise, this key MUST NOT be present.
Value	`dict`

spatial[library_id]['images']['fullres']

Key	fullres
Annotator	Curator MAY annotate if `assay_ontology_term_id` is `"EFO:0010961"` for Visium Spatial Gene Expression and `uns['spatial']['is_single']` is `True`; otherwise, this key MUST NOT be present.
Value	The full resolution image MUST be converted to a `numpy.ndarray` with the following requirements: the length of `numpy.ndarray.shape` MUST be `3`; the `numpy.ndarray.dtype` MUST be `numpy.uint8`; and `numpy.ndarray.shape[2]` MUST be either `3` (RGB color model, for example) or `4` (RGBA color model, for example).

spatial[library_id]['images']['hires']

Key	hires
Annotator	Curator MUST annotate if `assay_ontology_term_id` is `"EFO:0010961"` for Visium Spatial Gene Expression and `uns['spatial']['is_single']` is `True`; otherwise, this key MUST NOT be present.
Value	`tissue_hires_image.png` MUST be converted to a `numpy.ndarray` with the following requirements: the length of `numpy.ndarray.shape` MUST be `3`; the `numpy.ndarray.dtype` MUST be `numpy.uint8`; the largest dimension in `numpy.ndarray.shape[:2]` MUST be `2000` pixels (see Space Ranger Spatial Outputs); and `numpy.ndarray.shape[2]` MUST be either `3` (RGB color model, for example) or `4` (RGBA color model, for example).
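
The array requirements for the `fullres` and `hires` images differ only in the 2000-pixel constraint. The sketch below checks them from the array's shape and dtype name, so it runs without NumPy; it assumes the caller passes `array.shape` and `str(array.dtype)`. The function name `image_array_problems` and its `max_side` parameter are hypothetical.

```python
def image_array_problems(shape, dtype_name, max_side=None):
    """Return a list of violations for an image array.

    shape: tuple as returned by numpy.ndarray.shape.
    dtype_name: str(array.dtype), for example "uint8".
    max_side: required largest dimension of shape[:2], or None (fullres).
    """
    problems = []
    if len(shape) != 3:
        # Remaining checks index into shape, so stop here.
        return ["shape must have length 3"]
    if dtype_name != "uint8":
        problems.append("dtype must be numpy.uint8")
    if shape[2] not in (3, 4):
        problems.append("shape[2] must be 3 or 4")
    if max_side is not None and max(shape[:2]) != max_side:
        problems.append(f"largest of shape[:2] must be {max_side}")
    return problems
```

For a `hires` image one would call it with `max_side=2000`; for `fullres`, `max_side` is left as `None`.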

spatial[library_id]['scalefactors']

Key	scalefactors
Annotator	Curator MUST annotate if `assay_ontology_term_id` is `"EFO:0010961"` for Visium Spatial Gene Expression and `uns['spatial']['is_single']` is `True`; otherwise, this key MUST NOT be present.
Value	`dict`

spatial[library_id]['scalefactors']['spot_diameter_fullres']

Key	spot_diameter_fullres
Annotator	Curator MUST annotate if `assay_ontology_term_id` is `"EFO:0010961"` for Visium Spatial Gene Expression and `uns['spatial']['is_single']` is `True`; otherwise, this key MUST NOT be present.
Value	`float`. This MUST be the value of the `spot_diameter_fullres` field from `scalefactors_json.json`. See Space Ranger Spatial Outputs.

spatial[library_id]['scalefactors']['tissue_hires_scalef']

Key	tissue_hires_scalef
Annotator	Curator MUST annotate if `assay_ontology_term_id` is `"EFO:0010961"` for Visium Spatial Gene Expression and `uns['spatial']['is_single']` is `True`; otherwise, this key MUST NOT be present.
Value	`float`. This MUST be the value of the `tissue_hires_scalef` field from `scalefactors_json.json`. See Space Ranger Spatial Outputs.
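
Taken together, the `spatial` subsections describe a fixed key layout. The following sketch checks only which keys are present, not their values, and assumes `uns['spatial']` is a plain nested `dict`. The function name `spatial_key_problems` and the `is_visium` flag (true when `assay_ontology_term_id` is `"EFO:0010961"`) are hypothetical.

```python
def spatial_key_problems(spatial, is_visium):
    """Return a list of key-layout violations for uns['spatial']."""
    problems = []
    if not isinstance(spatial.get("is_single"), bool):
        problems.append("is_single must be a bool")
    library_ids = [key for key in spatial if key != "is_single"]
    # library_id is required only for Visium with is_single == True.
    expects_library = is_visium and spatial.get("is_single") is True
    if not expects_library:
        if library_ids:
            problems.append("library_id must not be present")
        return problems
    if len(library_ids) != 1:
        problems.append("exactly one library_id is required")
        return problems
    library = spatial[library_ids[0]]
    if set(library) != {"images", "scalefactors"}:
        problems.append("library must contain exactly images and scalefactors")
        return problems
    image_keys = set(library["images"])
    # hires is required, fullres is optional, nothing else is allowed.
    if "hires" not in image_keys or not image_keys <= {"hires", "fullres"}:
        problems.append("images must contain hires and optionally fullres")
    expected = {"spot_diameter_fullres", "tissue_hires_scalef"}
    if set(library["scalefactors"]) != expected:
        problems.append("scalefactors must contain exactly the two documented keys")
    return problems
```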

title

Key	title
Annotator	Curator MUST annotate.
Value	`str`. This text describes and differentiates the dataset from other datasets in the same collection. It is displayed on a page in CELLxGENE Discover that also has the collection name. To illustrate, the first dataset name in the Cells of the adult human heart collection is "All — Cells of the adult human heart". It is STRONGLY RECOMMENDED that each dataset `title` in a collection is unique and does not depend on other metadata such as a different `assay` to disambiguate it from other datasets in the collection.

Curators MAY also annotate the following optional keys and values in uns. If the key is present, then its value MUST NOT be empty.

batch_condition

Key	batch_condition
Annotator	Curator MAY annotate.
Value	`list[str]`. `str` values MUST refer to cell metadata keys in `obs`. Together, these keys define the batches that a normalization or integration algorithm should be aware of. For example, if `"patient"` and `"seqBatch"` are keys of vectors of cell metadata, then `["patient"]`, `["seqBatch"]`, and `["patient", "seqBatch"]` are all valid values.
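
A minimal sketch of the `batch_condition` check, assuming the `obs` column names are available as a set of strings (for example, built from the columns of the `obs` dataframe). The function name `batch_condition_problems` is hypothetical.

```python
def batch_condition_problems(batch_condition, obs_columns):
    """Return a list of violations for uns['batch_condition']."""
    if not isinstance(batch_condition, list) or not all(
        isinstance(key, str) for key in batch_condition
    ):
        return ["batch_condition must be a list of str"]
    if not batch_condition:
        # Optional keys, when present, must not have an empty value.
        return ["batch_condition must not be empty"]
    return [
        f"{key!r} is not a key in obs"
        for key in batch_condition
        if key not in obs_columns
    ]
```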

{column}_colors

Key

{column}_colors where {column} MUST be the name of a category data type column in obs that is annotated by the data submitter or curator. The following columns that are annotated by CELLxGENE Discover MUST NOT be specified as {column}:

assay
cell_type
development_stage
disease
organism
self_reported_ethnicity
sex
tissue

Instead, annotate {column}_ontology_term_id_colors for the corresponding ontology term columns, such as assay_ontology_term_id.

Annotator Curator MAY annotate.

Value

numpy.ndarray. This MUST be a 1-D array of shape (c,), where c is greater than or equal to the number of categories in the {column} as calculated by:

anndata.obs.{column}.cat.categories.size

The color code at the Nth position in the ndarray corresponds to the Nth category of anndata.obs.{column}.cat.categories.

For example, if cell_type_ontology_term_id includes two categories:

anndata.obs.cell_type_ontology_term_id.cat.categories.values

array(['CL:0000057', 'CL:0000115'], dtype='object')

then cell_type_ontology_term_id_colors MUST contain two or more colors such as:

['aqua' 'blueviolet']

where 'aqua' is the color assigned to 'CL:0000057' and 'blueviolet' is the color assigned to 'CL:0000115'.

All elements in the ndarray MUST use the same color model, limited to:

Color Model	Element Format
Named Colors	`str`. MUST be a case-insensitive CSS4 color name with no spaces such as `"aliceblue"`
Hex Triplet	`str`. MUST start with `"#"` immediately followed by six case-insensitive hexadecimal characters as in `"#08c0ff"`
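
The length and color-model rules for `{column}_colors` can be sketched as follows. This is an illustration under stated assumptions, not the validator's implementation: the colors are passed as a list of strings, and the set of valid CSS4 names is supplied by the caller as lowercase strings (one possible source, if matplotlib is installed, is the keys of `matplotlib.colors.CSS4_COLORS`). The names `colors_problems` and `css4_names` are hypothetical.

```python
import re

# "#" immediately followed by six case-insensitive hexadecimal characters.
HEX_TRIPLET = re.compile(r"#[0-9a-fA-F]{6}")


def colors_problems(colors, n_categories, css4_names):
    """Return a list of violations for a {column}_colors array.

    colors: sequence of str color codes.
    n_categories: number of categories in the matching obs column.
    css4_names: set of lowercase CSS4 color names (caller-supplied).
    """
    problems = []
    if len(colors) < n_categories:
        problems.append("fewer colors than categories")
    is_hex = [bool(HEX_TRIPLET.fullmatch(color)) for color in colors]
    is_named = [color.lower() in css4_names for color in colors]
    # All elements must use one and the same color model.
    if not (all(is_hex) or all(is_named)):
        problems.append("colors must all be hex triplets or all be CSS4 names")
    return problems
```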

default_embedding

Key	default_embedding
Annotator	Curator MAY annotate.
Value	`str`. The value MUST match a key to an embedding in `obsm` for the embedding to display by default in CELLxGENE Explorer.

X_approximate_distribution

Key	X_approximate_distribution
Annotator	Curator MAY annotate.
Value	`str`. CELLxGENE Discover runs a heuristic to detect the approximate distribution of the data in X so that it can accurately calculate statistical properties of the data. This field enables the curator to override this heuristic and specify the data distribution explicitly. The value MUST be `"count"` (for data whose distributions are best approximated by counting distributions like Poisson, Binomial, or Negative Binomial) or `"normal"` (for data whose distributions are best approximated by the Gaussian distribution).

Curators MUST NOT annotate the following keys and values in uns.

When a dataset is uploaded, CELLxGENE Discover MUST automatically add the citation key and set its value.

citation

Key	citation
Annotator	CELLxGENE Discover MUST annotate.
Value	`str`. Its format MUST use the following template:

Citation Element	Value
`"Publication: "`	Publication DOI URL for the collection. This element MUST only be present if a Publication DOI is defined for the collection; otherwise, it MUST NOT be present.
`"Dataset Version: "`	Permanent URL to this version of the dataset
`" curated and distributed by CZ CELLxGENE Discover in Collection: "`	Permanent URL to the collection

A citation for an H5AD dataset with a Publication DOI:

"Publication: https://doi.org/10.1126/science.abl4896 Dataset Version: https://datasets.cellxgene.cziscience.com/dbd8b789-3efa-4a63-9243-90cff64f2045.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/e5f58829-1a66-40b5-a624-9046778e74f5"

A citation for an RDS dataset without a Publication DOI:

"Dataset Version: https://datasets.cellxgene.cziscience.com/08ea16dc-3f4e-4c84-8692-74d70be22d12.rds curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/10bf5c50-8d85-4c5f-94b4-22c1363d9f31"

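The citation template can be assembled with simple string concatenation. The sketch below is one way to do so and is not CELLxGENE Discover's actual code; the function name `build_citation` and its parameters are hypothetical, and the single space between the DOI URL and `"Dataset Version: "` is inferred from the H5AD example above.

```python
def build_citation(dataset_url, collection_url, publication_doi_url=None):
    """Assemble a citation string following the template above."""
    parts = []
    if publication_doi_url:
        # Present only when the collection defines a Publication DOI.
        parts.append(f"Publication: {publication_doi_url} ")
    parts.append(f"Dataset Version: {dataset_url}")
    parts.append(
        " curated and distributed by CZ CELLxGENE Discover in Collection: "
        + collection_url
    )
    return "".join(parts)
```

Called with the URLs from the two examples above, this reproduces both citation strings, with and without the `"Publication: "` element.
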
When a dataset is uploaded, CELLxGENE Discover MUST automatically add the schema_reference key and set its value to the permanent URL of this document.

schema_reference

Key	schema_reference
Annotator	CELLxGENE Discover MUST annotate.
Value	This MUST be `"https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/5.2.0/schema.md"`.

When a dataset is uploaded, CELLxGENE Discover MUST automatically add the schema_version key and its value to uns.

schema_version

Key	schema_version
Annotator	CELLxGENE Discover MUST annotate.
Value	This MUST be `"5.2.0"`.

Appendix A. Changelog

schema v5.2.0

General Requirements
- Updated AnnData from version 0.8.0 to version 0.8.0 or greater
Required Ontologies
- Updated CL to the 2024-08-16 release
- Updated EFO to the 2024-08-15 EFO 3.69.0 release
- Updated HsapDv to the 2024-05-28 release
- Updated MONDO to the 2024-08-06 release
- Updated MmusDv to the 2024-05-28 release
- Updated UBERON to the 2024-08-07 release
obs (Cell metadata)
- Updated requirements for development_stage_ontology_term_id to require the most accurate descendant of life cycle.
  - If organism_ontology_term_id is "NCBITaxon:9606" for Homo sapiens, this MUST be the most accurate descendant of HsapDv:0000001 for life cycle
  - If organism_ontology_term_id is "NCBITaxon:10090" for Mus musculus, this MUST be the most accurate descendant of MmusDv:0000001 for life cycle
- Updated requirements for suspension_type
  - Added mCT-seq
  - Added MERFISH
  - Added ScaleBio single cell RNA sequencing
  - Added sci-RNA-seq3
  - Removed CITE-seq and its descendants
  - Removed smFISH and its descendants
  - Removed snmC-seq
  - Removed spatial proteomics and its descendants
  - Replaced snmC-seq2 with methylation profiling by high throughput sequencing and its descendants
uns (Dataset metadata)
- Updated schema_reference to "https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/5.2.0/schema.md"
- Updated schema_version to "5.2.0"
var and raw.var (Gene metadata)
- Updated the requirements for feature_length. All feature_biotypes are now included. The calculation of the value changed from the merged length of isoforms to the median of the lengths of isoforms.
- Added feature_type

schema v5.1.0

All references to "child" and "children" have been changed to "descendant" and "descendants" for accuracy.
Required Ontologies
- Updated CL to the 2024-04-05 release
- Updated EFO to the 2024-04-15 EFO 3.65.0 release
- Updated MONDO to the 2024-05-08 release
- Updated UBERON to the 2024-03-22 release
X (Matrix Layers)
- Added Visium Spatial Gene Expression to the table of assays
obs (Cell metadata)
- Added array_col for Visium Spatial Gene Expression when uns['spatial']['is_single'] is True
- Added array_row for Visium Spatial Gene Expression when uns['spatial']['is_single'] is True
- Updated the requirements for assay_ontology_term_id for Visium Spatial Gene Expression and Slide-seqV2. All observations must contain the same value.
- Updated the requirements for cell_type_ontology_term_id for Visium Spatial Gene Expression when uns['spatial']['is_single'] is True. The value must be "unknown" if the corresponding value of in_tissue is 0.
- Added in_tissue for Visium Spatial Gene Expression when uns['spatial']['is_single'] is True
- Updated the requirements for is_primary_data for Visium Spatial Gene Expression. The value must be False when uns['spatial']['is_single'] is False.
- Updated the requirements for self_reported_ethnicity_ontology_term_id. There must be no duplication of terms.
obsm (Embeddings)
- Restored v3.1.0 requirement allowing only numpy.ndarray values with specific shapes due to Seurat conversion failures
- Added spatial for Visium Spatial Gene Expression and Slide-seqV2
- Updated requirements for X_{suffix}. {suffix} MUST NOT be "spatial".
uns (Dataset metadata)
- Updated {column}_colors instructions
- Updated schema_reference to "https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/5.1.0/schema.md"
- Updated schema_version to "5.1.0"
- Added spatial for Visium Spatial Gene Expression and Slide-seqV2, including scale factors and underlay images for Visium Spatial Gene Expression.

schema v5.0.0

General Requirements
- Updated requirements to prohibit duplicate data submitter metadata field names in obs and var
Required Ontologies
- Updated CL to the 2024-01-04 release
- Updated EFO to the 2024-01-15 EFO 3.62.0 release
- Updated MONDO to the 2024-01-03 release
- Updated UBERON to the 2024-01-18 release
Required Gene Annotations
- Updated GENCODE (Human) to Human Reference GRCh38.p14 (GENCODE v44/Ensembl 110)
- Updated GENCODE (Mouse) to Mouse reference GRCm39 (GENCODE vM33/Ensembl 110)
obs (Cell metadata)
- Updated the requirements for assay_ontology_term_id to not allow the parent terms EFO:0002772 for assay by molecule and EFO:0010183 for single cell library construction. Their most accurate children are still valid.
- Breaking change. Updated the requirements for cell_type to annotate "unknown" as the label when the cell_type_ontology_term_id value is "unknown".
- Breaking change. Updated the requirements for cell_type_ontology_term_id to replace "CL:0000003" for native cell with "unknown" to indicate that the cell type is unknown.
- Updated the requirements for disease_ontology_term_id to restrict MONDO terms to the most accurate child of "MONDO:0000001" for disease or "MONDO:0021178" for injury or preferably its most accurate child.
obsm (Embeddings)
- Updated requirements for X_{suffix} to change the regular expression pattern from "^[a-zA-Z][a-zA-Z0-9]*$" to "^[a-zA-Z][a-zA-Z0-9_.-]*$"
uns (Dataset metadata)
- Updated requirements. The data stored as a value for a key in uns MUST be True, False, None, or its size MUST NOT be zero.
- Updated schema_reference to "https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/5.0.0/schema.md"
- Updated schema_version to "5.0.0"

schema v4.0.0

Required Ontologies
- Updated CL to the 2023-08-24 release
- Updated EFO to the 2023-08-15 EFO 3.57.0 release
- Updated HANCESTRO to the 3.0 release
- Updated MONDO to the 2023-08-02 release
- Updated UBERON to the 2023-09-05 release
obs (Cell metadata)
- Updated the requirements for cell_type_ontology_term_id
- Added index
- Added observation_joinid
- Updated the requirements for self_reported_ethnicity
- Updated the requirements for self_reported_ethnicity_ontology_term_id
- Added tissue_type
- Updated the requirements for tissue
- Updated the requirements for tissue_ontology_term_id
obsm (Embeddings)
- Prohibited ndarrays with a size of zero
- Updated requirements for X_{suffix}
obsp
- Added section and prohibited ndarrays with a size of zero
uns (Dataset metadata)
- Added citation
- Added {column}_colors
- Added schema_reference
- Updated the requirements for schema_version to be annotated by CELLxGENE Discover and not the curator
var and raw.var (Gene metadata)
- Added feature_length
varm
- Added section and prohibited ndarrays with a size of zero
varp
- Added section and prohibited ndarrays with a size of zero
X (Matrix Layers)
- Updated requirements for raw matrices

schema v3.1.0

Added section for Schema versioning
Required Ontologies
- Updated CL to the 2023-07-20 release
- Updated EFO to the 2023-07-17 EFO 3.56.0 release
- Updated MONDO to the 2023-07-03 release
- Updated NCBITaxon to the 2023-06-20 release
- Updated PATO to the 2023-05-18 release
- Updated UBERON to the 2023-06-28 release
obs (Cell metadata)
- assay_ontology_term_id
  - Added Visium Spatial Gene Expression to recommended values
  - Removed Smart-seq from recommended values
- suspension_type
  - Added MARS-seq
  - Added BD Rhapsody Whole Transcriptome Analysis
  - Added BD Rhapsody Targeted mRNA
  - Added inDrop
  - Added STRT-seq
  - Added TruDrop
  - Added GEXSCOPE technology
  - Added SPLiT-seq
  - Changed spatial transcriptomics by high-throughput sequencing [EFO:0030005] and its children to spatial transcriptomics [EFO:0008994] and its children
  - Updated Seq-Well [EFO:0008919] to Seq-Well [EFO:0008919] and its children
uns (Dataset metadata)
- schema_version
  - Must be annotated by CELLxGENE Discover and not the Curator.

schema v3.0.0

Updated AnnData version 0.7 to version 0.8.0
All references to the "final" matrix have been replaced with "normalized" for clarity.
General Requirements
- Reserved Names from previous schema versions that have since been deprecated MUST NOT be present.
- Updated pinned ontologies to require the most recent version
obs (Cell metadata)
- Removed guidance in assay_ontology_term_id that allowed clarifying text enclosed in parentheses if there was not an exact match for an assay.
- Added donor_id
- Renamed ethnicity_ontology_term_id to self_reported_ethnicity_ontology_term_id. Added "multiethnic" value.
- Renamed ethnicity to self_reported_ethnicity. Added "multiethnic" value.
- Added suspension_type
var and raw.var (Gene metadata)
- feature_biotype must be annotated by CELLxGENE Discover and not the Curator.
uns (Dataset metadata)
- Updated schema_version
- Deprecated X_normalization

schema v2.0.0

schema v2.0.0 substantially remodeled schema v1.1.0:

"must", "should", and select other words have a defined, standard meaning.
Curators are responsible for annotating ontology and gene identifiers. CELLxGENE Discover adds the assigned human-readable names for all identifiers.
Documented and pinned the required versions of ontologies and gene annotations used in schema validation.
General Requirements
- AnnData is now the canonical data format. The schema outline and descriptions are AnnData-centric.
- Metazoan multi-organism data is accepted by CELLxGENE Discover. For data that is neither Human, Mouse, nor SARS-CoV-2, features MUST be translated into orthologous genes from the Human and Mouse gene annotations.
- Policies for reserved names and redundant metadata are documented.
- #45 Updated reference to new PII content
X (matrix layers)
- Added guidance for sparse matrices
- Clarified matrix requirements by assay
obs (cell metadata)
- Empty ontology fields are no longer permitted.
- Moved organism from uns to obs
- Clarified requirements and added detailed guidance for assays, tissue, and development stages
- Added ontology for mouse development stages
- Added ontology for sex
- Added is_primary_data
var
- Replaced HGNC gene symbols as var.index with ENSEMBL or ERCC spike-in identifiers
- Added feature_name, index, and feature_reference
- Added feature_is_filtered
- Added requirements for raw.var which must be identical to var
uns
- Added batch_condition
- Added X_approximate_distribution
- Replaced layer_descriptions with X_normalization
- Replaced version which included corpora_schema_version and corpora_encoding_version with schema_version
- Deprecated tags and default_field presentation metadata
- Removed obs_column_colors
