Require CSR encoding in X (Matrix Layers) #1022

Open
brianraymor opened this issue Sep 25, 2024 · 4 comments
Labels: 5.3 (Next minor CELLxGENE schema version after 5.2), schema (CELLxGENE Discover dataset schema)

Comments

@brianraymor
Contributor

brianraymor commented Sep 25, 2024

Design

X (Matrix Layers)

The data stored in the X data matrix is the data that is viewable in CELLxGENE Explorer. It MUST be encoded as a scipy.sparse.csr_matrix with zero values encoded as implicit zeros.
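
As an illustration of the requirement above, here is a minimal sketch of how the CSR and implicit-zero conditions could be checked with SciPy. The function name `check_csr_implicit_zeros` is hypothetical and is not part of the schema or the validator; it only assumes that an "explicit zero" means a stored entry whose value is 0.

```python
import numpy as np
from scipy import sparse


def check_csr_implicit_zeros(X):
    """Return a list of problems; an empty list means X passes this sketch."""
    problems = []
    if not isinstance(X, sparse.csr_matrix):
        problems.append(f"X is {type(X).__name__}, expected csr_matrix")
        return problems
    # Explicit zeros are stored entries whose value is 0.
    n_explicit = int(np.count_nonzero(X.data == 0))
    if n_explicit:
        problems.append(f"X stores {n_explicit} explicit zero(s)")
    return problems


# A 2x3 matrix built from raw CSR arrays, with one stored zero at (0, 1).
data = np.array([1.0, 0.0, 2.0])
indices = np.array([0, 1, 2])
indptr = np.array([0, 2, 3])
X = sparse.csr_matrix((data, indices, indptr), shape=(2, 3))

print(check_csr_implicit_zeros(X))  # reports the stored zero
X.eliminate_zeros()                 # drops stored zeros in place
print(check_csr_implicit_zeros(X))  # no problems left
```

`eliminate_zeros()` modifies the matrix in place, so a submitter could call it once before writing the file.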

CELLxGENE's matrix layer requirements are tailored to optimize data reuse. Because each assay has different characteristics, the requirements differ by assay type. In general, CELLxGENE requires submission of "raw" data suitable for computational reuse when a standard raw matrix format exists for an assay. It is STRONGLY RECOMMENDED to also include a "normalized" matrix with processed values ready for data analysis and suitable for visualization in CELLxGENE Explorer. The schema imposes the following requirements:

  • All matrix layers MUST have the same shape, and have the same cell labels and gene labels.
  • ...
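
The shape requirement in the first bullet can be sketched as follows. This is only an illustration: `mismatched_layers` and the layer names in the example dict are hypothetical, and the sketch checks shapes only, not cell or gene labels.

```python
from scipy import sparse


def mismatched_layers(layers):
    """Given a dict of name -> matrix, return the names whose shape
    differs from the shape of the first layer in the dict."""
    shapes = {name: matrix.shape for name, matrix in layers.items()}
    if not shapes:
        return []
    expected = next(iter(shapes.values()))
    return [name for name, shape in shapes.items() if shape != expected]


# Hypothetical layers: two agree on (4, 3), one has a different gene count.
layers = {
    "X": sparse.csr_matrix((4, 3)),
    "normalized": sparse.csr_matrix((4, 3)),
    "bad": sparse.csr_matrix((4, 2)),
}
print(mismatched_layers(layers))
```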

Context

Per conversation on the September 25 DP Call, @nayib-jose-gloria proposed:

CSC matrices: we only found 2 in the corpus, and they are memory inefficient compared to CSR/COO. How do we feel about converting existing cases and codifying in the schema that we won’t take CSC? Or putting a size limit on CSC submissions?
“Backed” mode is prohibitively slow for validating CSC, requiring us to read into memory; we allocate processing resources relative to the worst case scenario, and we are quite close to making backed reads/writes happen in constant memory for CSR/COO matrices.

Also see continuing conversation in cell-sci-platform.

brianraymor added the schema and 5.3 labels on Sep 25, 2024
brianraymor self-assigned this on Sep 30, 2024
@jahilton
Collaborator

I would think to also extend this to .raw.X, when present.

@joyceyan
Contributor

@jahilton @brianraymor My understanding is that existing datasets with CSC should be converted (or are in the process of being converted) to CSR. For any new datasets the validator sees with CSC, the expectation is that the validator should raise a failure message rather than converting to CSR.

It also seems reasonable to me that this requirement should extend to .raw.X
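
To make the split between validation and conversion concrete, here is a rough sketch, not the actual validator code. `find_csc_errors` is a hypothetical helper that reports CSC-encoded matrices without changing them; the conversion with `tocsr()` is shown as a separate, curator-side step. The dict keys stand in for X and .raw.X.

```python
import numpy as np
from scipy import sparse


def find_csc_errors(matrices):
    """Report CSC-encoded matrices. The input is never converted here."""
    errors = []
    for name, matrix in matrices.items():
        if sparse.issparse(matrix) and matrix.format == "csc":
            errors.append(f"{name} is CSC-encoded; convert to CSR before submission")
    return errors


values = np.array([[0, 1], [2, 0]])
submitted = {
    "X": sparse.csc_matrix(values),      # would fail validation
    "raw.X": sparse.csr_matrix(values),  # already CSR
}
print(find_csc_errors(submitted))

# Curator-side fix, done outside the validator.
submitted["X"] = submitted["X"].tocsr()
print(find_csc_errors(submitted))
```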

@brianraymor
Contributor Author

Slight refinements but otherwise, agree:

My understanding is that for existing datasets with CSC, they ~~should~~ must be converted by curators ... And @jahilton is addressing that per earlier comments.

the expectation is that the validator ~~should~~ must raise a failure message rather than converting to CSR.

I will be updating the 5.3 draft soonish.

@nayib-jose-gloria
Contributor

@brianraymor I think for layers with low sparsity we'll still want to support dense matrices as well; we just want to limit the options for sparse matrices to encode. That said, I can follow-up and see if our initial review of the corpus covered how many dense X/raw.X matrices actually exist.
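
For the corpus review mentioned above, one way to measure how sparse a layer actually is, whether it is stored dense or sparse, might look like the sketch below. `zero_fraction` is a hypothetical helper, and no cutoff for "low sparsity" is implied, since the thread does not define one.

```python
import numpy as np
from scipy import sparse


def zero_fraction(matrix):
    """Fraction of entries equal to zero in a non-empty 2-D matrix."""
    n_total = matrix.shape[0] * matrix.shape[1]
    if sparse.issparse(matrix):
        # count_nonzero() ignores explicitly stored zeros.
        n_nonzero = matrix.count_nonzero()
    else:
        n_nonzero = np.count_nonzero(matrix)
    return 1.0 - n_nonzero / n_total


dense_layer = np.array([[1, 2], [3, 0]])
sparse_layer = sparse.csr_matrix(np.array([[0, 0], [0, 5]]))
print(zero_fraction(dense_layer), zero_fraction(sparse_layer))
```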
