Storing sub-obs (variable length per observation data) #609

Closed
keller-mark opened this issue Aug 23, 2021 · 8 comments


@keller-mark

keller-mark commented Aug 23, 2021

Hi,
We have a use case related to #237 but slightly different.

We would like to store a second obs array ("sub-obs", where an observation is an individual transcript in a MERFISH experiment), but related to the first obs array (where an observation is an individual cell).

Has your team thought about how to deal with this use case?

I am thinking something like this:

[Image: Untitled presentation]

where the transcript ID and cell ID columns in the sub-obs can be like foreign keys into the main obs dataframe.
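As a rough illustration of the foreign-key idea, here is a minimal pandas sketch. The table contents, column names (`cell_id`, `gene`, `x`, `y`, `cell_type`), and IDs are all made up for this example; nothing here is an existing AnnData API.

```python
import pandas as pd

# Hypothetical cell-level table (the main obs); the index is the cell ID.
obs = pd.DataFrame(
    {"cell_type": ["neuron", "astrocyte", "neuron"]},
    index=pd.Index(["cell_0", "cell_1", "cell_2"], name="cell_id"),
)

# Hypothetical transcript-level table (the sub-obs); one row per detected molecule.
# The "cell_id" column acts as a foreign key into obs.index.
sub_obs = pd.DataFrame(
    {
        "cell_id": ["cell_0", "cell_0", "cell_1", "cell_2", "cell_2", "cell_2"],
        "gene": ["Gad1", "Slc17a7", "Gfap", "Gad1", "Gad1", "Gfap"],
        "x": [1.0, 1.5, 4.2, 7.1, 7.3, 7.8],
        "y": [0.5, 0.7, 3.3, 2.0, 2.2, 2.4],
    },
    index=pd.Index([f"tx_{i}" for i in range(6)], name="transcript_id"),
)

# Referential integrity: every foreign key must resolve to a row in obs.
assert sub_obs["cell_id"].isin(obs.index).all()

# The number of transcripts varies per cell, which is why sub-obs
# cannot simply be another column of obs.
transcripts_per_cell = sub_obs.groupby("cell_id").size()

# Pull cell-level annotations onto the transcripts, then filter.
joined = sub_obs.join(obs, on="cell_id")
gad1_in_neurons = joined[(joined["gene"] == "Gad1") & (joined["cell_type"] == "neuron")]
```

Here `joined` keeps the transcript IDs as its index, so any selection made on it can be used to index the sub-obs table again.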

Is it possible to add a differently-shaped obs and obsm to the same AnnData store?

cc @ilan-gold

@ilan-gold
Contributor

Yes @keller-mark! We really want a way to store this spot-level data in AnnData but are just not sure how to go about it. I think this proposal is really solid. We would definitely also be interested in crossover with squidpy if that would be of interest.

@ivirshup
Member

Hey, I have been talking about this a bit with @hspitzer and @giovp. I was wondering if something like Awkward array would be useful here? E.g. a data structure for ragged arrays, allowing variable length arrays per observation? This should be easy-ish to support since they've already got tooling for serialization.
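To make "variable length arrays per observation" concrete, here is a small NumPy-only sketch of the offsets-plus-flat-content layout that ragged-array libraries commonly use. It does not use Awkward Array's API; the data and the `get_row` helper are hypothetical.

```python
import numpy as np

# Hypothetical per-cell lists of gene indices, one list per observation.
# The second cell has no detected transcripts.
per_cell = [[3, 7, 7], [], [1, 4]]

# Flat buffer holding all values back to back.
content = np.concatenate([np.asarray(v, dtype=np.int64) for v in per_cell])

# offsets[i]:offsets[i + 1] is the slice of `content` belonging to row i.
offsets = np.zeros(len(per_cell) + 1, dtype=np.int64)
offsets[1:] = np.cumsum([len(v) for v in per_cell])


def get_row(i):
    """Return the variable-length entry for observation i."""
    return content[offsets[i]:offsets[i + 1]]
```

Both buffers are ordinary rectangular arrays, which is what makes this kind of layout straightforward to serialize.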

@ilan-gold
Contributor

@ivirshup Do you know how this would work with zarr? Would each observation have its own zarr array (as opposed to one shared one)?

@ivirshup
Member

ivirshup commented Oct 20, 2021

Zarr allows for ragged arrays which is how I would probably try to do it, if I were doing it from scratch.

There have been implementations relying on Awkward's to_buffers and from_buffers, which I like because it looks very easy to implement. It's been a while since I looked at how these worked, so I can't comment on exactly what they do.

@ilan-gold
Contributor

> Zarr allows for ragged arrays which is how I would probably try to do it, if I were doing it from scratch. I believe there have been implementations relying on Awkward's to_buffers and from_buffers, which I like because it looks very easy to implement.

🤯 I had no idea!! This looks extremely cool!

@giovp
Member

giovp commented Oct 20, 2021

@ilan-gold thanks a lot for pointing me to ome/ngff#64 (review) as well as the hackathon, it looks like really interesting work!

From what I gathered, the effort over at ome/ngff is to use anndata for tabular representations of FISH-based spatial data. In that case, observations are measured molecules instead of cells. Besides being an interesting approach, and definitely useful for storing annotations such as coordinates of the decoded molecules (alongside the image in zarr), I think what I have in mind here is slightly different. I might be completely wrong, so I'll explain:

What we are missing in the current anndata/squidpy analysis toolkit is a way to represent the original (processed) data from FISH-based assays in the cell-level anndata representation. The original data, after processing/decoding, is essentially in the form described before (and in the PR). However, there is no direct way to index/slice/subset the decoded molecules by cell-level observation. From my understanding of the problem, the cell-level observation is the basic unit for downstream analysis, and the one most useful for analysts. However, it could be desirable for EDA to eventually go back to the original molecule-level representation (e.g. given a clustering result, visualize all molecules of gene X in cell type Y across the tissue).
-- small detour: how to go from the molecule-level to the cell-level representation actually seems to be very challenging, and a variety of approaches, very different from each other, are still being proposed.

For this, we essentially need a lookup table between cells (obs) and molecules (sub-obs). We already have something like this working, but it's really ugly (essentially we store the sub-obs info as a pandas Series of lists; see this section of the Tangram tutorial: https://squidpy.readthedocs.io/en/latest/external_tutorials/tutorial_tangram.html#Deconvolution-and-mapping). This lookup table can then be used to index/slice/subset the sub-obs annotation table (which we could actually store in adata.uns for now) and create the molecule-level anndata on the spot during plotting or any downstream analysis task. The Awkward Array proposed by @ivirshup (or just sparse matrices) is an interesting way to build such a lookup table. We could then save that lookup table in adata.obsm and access the sub-obs annotation in adata.uns on the go when needed.
Again, different use case, orthogonal to the OME/ngff effort, would love to hear your thoughts on this.
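A rough sketch of the sparse-matrix variant of such a lookup table follows. It only uses pandas, NumPy, and SciPy; the tables, the `molecules_for_cells` helper, and all names are hypothetical, and nothing is actually stored in an AnnData object here.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical cell IDs (the order of the cell-level obs).
cell_ids = ["cell_0", "cell_1", "cell_2"]

# Hypothetical molecule-level (sub-obs) annotation table.
molecules = pd.DataFrame(
    {
        "cell_id": ["cell_0", "cell_0", "cell_1", "cell_2", "cell_2", "cell_2"],
        "gene": ["Gad1", "Slc17a7", "Gfap", "Gad1", "Gad1", "Gfap"],
    }
)

# Row index = position of the owning cell, column index = position of the molecule.
# This assumes every molecule is assigned to a known cell (no missing codes).
row = pd.Categorical(molecules["cell_id"], categories=cell_ids).codes.astype(np.int64)
col = np.arange(len(molecules))

# Boolean n_cells x n_molecules lookup; row i marks the molecules in cell i.
lookup = sparse.csr_matrix(
    (np.ones(len(molecules), dtype=bool), (row, col)),
    shape=(len(cell_ids), len(molecules)),
)


def molecules_for_cells(cell_mask):
    """Subset the molecule table to molecules belonging to the selected cells."""
    selected = lookup[np.flatnonzero(cell_mask)]
    molecule_idx = np.unique(selected.indices)
    return molecules.iloc[molecule_idx]
```

Because the first dimension of `lookup` equals the number of cells, a matrix like this has the shape expected of an obsm entry, while the molecule table itself would have to live elsewhere.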

Ciao!

p.s. I'll reply to the rest of the email later on

@ilan-gold
Contributor

@giovp You're totally right about this, I got a little carried away. And there's no rush on the email - I need to start coding and stop emailing you all so much 😄 Let me collect my thoughts on this and I'll post soon!

@ivirshup
Member

ivirshup commented Feb 7, 2023

Closed by #647

ivirshup closed this as completed Feb 7, 2023

4 participants