Uploading CellPose-segmented outlines for cpg0016 #73
Comments
@ErinWeisbart @timtreis – please feel free to weigh in on the folder structure below Location of
Structure of
And regarding this:
We can decide on this closer to when you are ready to go but for now, I'll hand this over to @leoank to ponder |
I agree that I don't think we need to force the segmentations to comply with the exact same structure of the CellProfiler What do you think about this @shntnu?
|
I love it! |
I don't think we need to include the CellPose model or its training, we basically use it "off-the-shelf" since it scales the cells internally to the avg cell diameter it was trained on and then goes big again. So that'd be probably a waste of space. We'll put the (snakemake) pipeline we use and maybe some processing public of course, but it mostly just downloads whatever it needs 👍 |
Thanks @timtreis. No requirement to include anything in |
@timtreis We are all set with the folder structure. Let me know when you are ready to do a test run. Thanks a lot @ErinWeisbart ! |
Hey @ErinWeisbart and @shntnu, many thanks for already preparing everything! Our trial on optimized CellPose parameters got slightly delayed because we had to modify the pipeline (turns out building a DAG in snakemake with several million files is slightly suboptimal 🥸). I hope to have the results by early next week (will post here) and will then start with one source so that we can test the transfer workflow? Does that make sense? |
@timtreis reported: For the pilot, we’re now running the segmentation with different values for the nucleus and cytosol parameters in CellPose on a stratified sample of the wells (excluding source 9 because its data quality always stood out as poor): The first run with the new setup is currently running. Once that’s done, we’ll decide whether to use a 3x3 or a 5x5 grid for defining the final parameters. This was the initial analysis based on what we had already downloaded for our hackathon: The current idea is to use mean +/- sd (3x3) or mean +/- (1x/0.5x) sd (5x5). |
Can you remind me what the issue was here?
I am unclear about the terminology – what is this grid you are referring to? Is it in pixel space?
Can you clarify? I didn't quite get it :D |
During basically all the work I've done on integrating JUMP, source 9 always behaved fundamentally differently from the other sources, so I started to exclude it so as not to make method dev harder than it is :D During the later parts of the project(s) I'd of course try to include it, but for this benchmark I was afraid of it skewing the distribution in a weird way.
CellPose accepts the roughly expected diameter of the nucleus and cytosol as parameters. So we'll run it on the stratified sample with different combinations of these and compare the number of segmented cells passing a certain (yet to be determined) quality threshold. Only then will we process the full data. Ideally, we'd want to not just segment JUMP but segment it well :D Afterthought: If we see that there are big discrepancies, we could also run a binary search on them. Also tagging @npeschke, the fantastic student working with me who's written most of the pipeline code |
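The 3x3 and 5x5 grids described here can be sketched as a Cartesian product of candidate nucleus and cytosol diameters. This is only a minimal illustration: the mean and sd values are hypothetical placeholders (the real ones would come from the initial analysis), the helper names are made up for this sketch, and how each pair is handed to CellPose or the pipeline is not shown.

```python
from itertools import product


def diameter_grid(mean, sd, multipliers):
    """Return candidate diameters mean + k * sd for each multiplier k."""
    return [mean + k * sd for k in multipliers]


# Hypothetical summary statistics in pixels (placeholders, not measured values).
NUCLEUS_MEAN, NUCLEUS_SD = 30.0, 6.0
CYTOSOL_MEAN, CYTOSOL_SD = 60.0, 12.0

GRID_3 = (-1.0, 0.0, 1.0)             # mean +/- sd
GRID_5 = (-1.0, -0.5, 0.0, 0.5, 1.0)  # mean +/- (1x/0.5x) sd


def parameter_grid(multipliers):
    """Return every (nucleus_diameter, cytosol_diameter) pair for the grid."""
    nuclei = diameter_grid(NUCLEUS_MEAN, NUCLEUS_SD, multipliers)
    cytosols = diameter_grid(CYTOSOL_MEAN, CYTOSOL_SD, multipliers)
    return list(product(nuclei, cytosols))


grid_3x3 = parameter_grid(GRID_3)
grid_5x5 = parameter_grid(GRID_5)
print(len(grid_3x3), len(grid_5x5))
```

Each pair would then be run on the stratified sample, and the pairs compared by the number of cells passing the quality threshold once that threshold is defined.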
Thank you! Happy to discuss details |
Hey @shntnu @ErinWeisbart, had a meeting with @npeschke yesterday, and during our discussion, a few questions came up. Just throwing them here for discussion.
I guess the least-storage-intense way would be to provide only the nucleus/cytosol mask files and code to chop that into individual tiles. But that'd require every user to perform this step again and again. Cheers, |
My thoughts, @shntnu as always feel free to disagree:
1. You have the best intuition of use case and I support making everything as user-friendly as possible! While we don't want to increase data storage thoughtlessly, I think it is quite reasonable to include these crops and we have precedent of storing other image "intermediates" with other datasets.
2-3. I'm attached to folder nesting but not file structure within the folder nesting so chopping is fine. I do think retaining location information is important for preserving the ability to easily map to other profiling data. Again, reasonable expansion of storage is fine. My initial thought is that you could/should save out the per-site segmentation as well and use those integers in the cropped file name. (I suppose an alternative would be providing the x,y coordinate of the cell center in the file name, but that alone wouldn't be my preference). My preference would be that original channel naming is preserved with the masks at the end, again for easier mapping between data. So the image example you've given above would be Channel 1-5 (as they are in the raw images) and Channel 6-7 would be masks. And finally, since I love being pedantic, I think "outlines" isn't really sufficient to capture the breadth of data you're providing, so I suggest changing that to "masked_objects". So if I'm piecing everything together correctly, it would look something like this:
4. The simplest file transfer approach is if you have your files on an S3 bucket, you can make them public and then I can copy the files to the cpg and then we don't have to do any fancy credential handling. This is by far the easiest and therefore our preferred approach for getting data into cpg. |
Everything @ErinWeisbart said sounds reasonable to me. Some comments
|
@shntnu Yes, I meant for |
Thanks for clarifying @ErinWeisbart |
@shntnu Nice to meet you too! Just to get everyone on the same page regarding the output of our pipeline: The hdf5 file itself has the following group & dataset structure:
On the image id level all related data (InChI, Source, Plate, Well, etc.) from the parquet files are also saved as attributes in the hdf5 file. At the moment, we discard the full frame segmentation during our pipeline but that can be changed easily if you want to have the full masks as well. |
Agree with this.
I think Nic and I are ambivalent about this; it's just something we'll need to define, and then we'll do it that way.
Fine for us 👍
I'm not sure whether we have an S3 bucket where we can temporarily store that data, but I'll ask around.
Yeah, that's the idea 👌
I think this is something I mentioned in passing. We could theoretically run a pilot in which we store all this data in a (self-ad) https://spatialdata.scverse.org/en/latest/ object, which would, for example, allow us to map an arbitrary number of cells as well as the original image onto a shared coordinate system. But as Nic said, we're currently writing to hdf5 but are, of course, flexible on storage.
I'm not necessarily against this, and the files would be tiny, but this information would be redundant when the filenames indicate the integer from the full label image, right? |
One aside that is not terribly consequential for this data set, but might be for future ones - because Cellpose does not create overlapping objects, it's possible to create a label image where |
@timtreis @npeschke @ErinWeisbart – I've summarized our discussion and decisions so far, below. Shall we zoom in the new year to finalize? I've sent us all an invite (added Beth as optional)
Location of
Structure of
Proposed changes to Q: what is the
Structure of
Structure of
Other notes:
Include a
|
We have now started a Slack channel to discuss this project https://join.slack.com/share/enQtNjM4MTk5NTc5MzYxNy0yNDYxYTM4MThiMThiYTMzMDY0ZDI0NzkyYzJhZWVjNGI0OGU4ODA4MjZjMTVhZWJmNjRhMzgyYjI0YjQzZWQ0 |
If the full segmentation masks of all cells in an image are kept and saved, the |
Notes from meeting, 2024-01-03:
|
Additionally storing the locations would require changes to SPARCSpy (the framework we use to generate the segmentations and extract single cells) itself. Therefore, I would suggest saving the label_image together with the single_cell_data + _index in the zarr file. This would still make it possible to trace single cells back to the original image and, with additional effort, to recalculate the centroid. |
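To illustrate what recalculating the centroid from a stored label image could involve, here is a minimal pure-Python sketch. It assumes the label image is a 2D grid of integers where 0 is background and each other integer marks one cell; the function name and the toy data are illustrative only and are not part of SPARCSpy or the pipeline.

```python
from collections import defaultdict


def centroids_from_labels(label_image):
    """Map each non-zero label to its (row, col) centroid.

    `label_image` is a 2D list of ints: 0 is background and every other
    integer identifies one segmented cell.
    """
    # label -> [row_sum, col_sum, pixel_count]
    sums = defaultdict(lambda: [0, 0, 0])
    for r, row in enumerate(label_image):
        for c, label in enumerate(row):
            if label != 0:
                acc = sums[label]
                acc[0] += r
                acc[1] += c
                acc[2] += 1
    return {label: (rs / n, cs / n) for label, (rs, cs, n) in sums.items()}


# Toy 4x5 label image with two cells (labels 1 and 2); illustrative only.
toy = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 2],
    [0, 0, 0, 0, 2],
    [0, 0, 0, 0, 2],
]
centroids = centroids_from_labels(toy)
print(centroids)
```

On real array data, a library routine such as scikit-image's `regionprops` would be the more usual choice; the loop above only shows that the label image alone is enough to recover cell positions.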
I am checking against the current snapshot of https://cellpainting-gallery.s3.amazonaws.com/index.html#cpg0016-jump/source_2/workspace/segmentation/cellpose/ Not done:
Done:
Unsure if done or not:
|
@ErinWeisbart I suppose we should update https://github.com/broadinstitute/cellpainting-gallery/blob/789a5a65d8f8bf653b995b1d176e4adb90885af2/folder_structure.md#segmentation-folder-structure? If so, I'd request Tim or Nic to draft it if they are able, and we can review |
@timtreis @ErinWeisbart – looping back on this It sounds like Tim is ready to upload more data to the gallery staging bucket. Before he does
|
I updated it when I converted our docs to a Jupyterbook https://broadinstitute.github.io/cellpainting-gallery/data_structure.html#segmentation-folder-structure but it's worth having @timtreis take a look and see 1) if I've introduced any mistakes 2) if there is any other knowledge he can transfer there that would be helpful for others in the future |
Happy to add, what do we want to include? I assume the version of the CellPose model the pipeline delegates to. I could also include a release version of the pipeline repo I made with a link. Anything else?
I forgot what we wanted to use for the unique string, but I assume it's arbitrary anyway, right?
We haven't followed up on this much yet since we weren't sure about the usefulness of this data - wdyt?
Don't fully understand this, can you elaborate @shntnu ?
the pipeline only creates blacked out tiles
This information is contained in a {"0": "NucleusMask",
"1": "CellMask",
"2": "DNA",
"3": "AGP",
"4": "ER",
"5": "Mito",
"6": "RNA"} So we no longer have this filename information. Am I addressing your point? 🤔 |
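As a small sketch of how a consumer might use this mapping in place of per-channel filenames: the snippet below parses the mapping quoted above and separates mask channels from image channels. It assumes the mapping is available as JSON text and that the index refers to the channel position within each tile; `split_channels` is a hypothetical helper, not something the pipeline provides.

```python
import json

# Channel mapping as quoted above (channel index -> channel name).
CHANNELS_JSON = (
    '{"0": "NucleusMask", "1": "CellMask", "2": "DNA", '
    '"3": "AGP", "4": "ER", "5": "Mito", "6": "RNA"}'
)


def split_channels(channel_map):
    """Return (masks, images): two dicts mapping channel name -> integer index."""
    masks, images = {}, {}
    for index, name in channel_map.items():
        target = masks if name.endswith("Mask") else images
        target[name] = int(index)
    return masks, images


channel_map = json.loads(CHANNELS_JSON)
mask_idx, image_idx = split_channels(channel_map)
print(mask_idx)
print(image_idx)
```

With these indices, a reader could select for example the `DNA` channel of a tile by position rather than by parsing a filename.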
re: |
Note that I created the checklist by looking at your notes in #73 (comment), so it is not necessarily what I had in mind :D
That seems good enough
Does this mean that the crops cannot be currently used to mask the original images? If so, we should just state that in your readme https://github.com/theislab/jump-cpg0016-segmentation/blob/main/README.md and then leave the data as is (no need to figure it out right now)
No clue :D It was from your notes
Ok by blacked out tiles you are referring to the first image in this comment
Noted; as long as @ErinWeisbart is good with this, I am good with it
Yes |
@timtreis I'd recommend transferring just one source,
In fact, even one plate would be good enough for now. |
@timtreis please have a look when you get the chance |
Currently finishing this up :) The output that the pipeline now generates is as follows (simplified for a single batch):
In the README.md, I have now added the following text:
wdyt @shntnu @ErinWeisbart ? |
I have now included the release-tagged version of the pipeline, the version of SPARCSpy that we delegate to, and the cellpose build version (see comment above)
Yes, we currently cannot do this. I thought you mentioned you already had centroid coordinates from CellProfiler, so extracting a bounding box around those would be fairly comparable (although it'd be a PITA to trace which blacked-out cell belongs to which cropped cell).
Ah yes, I remember 😅 That was the question whether we scale the tiles in any way to a desired target resolution, but we chose not to.
Yes, except of course as individual images and not a 7x1 strip :)
Asked her, she's good with it 👍
Cool! |
@timtreis thank you so much for your diligence! Everything looks good to me. |
Just a reminder that a small set would be good to start with. |
Yes, going to cook dinner and then try that 👌🏻 Ankur provided me with a tutorial |
From @ErinWeisbart
import boto3

# Named profile from ~/.aws/config: a section [profile PROFILENAME] that must have key, secret key, region, output
session = boto3.Session(profile_name='CPGnew')
s3 = session.client('s3')

# List the batch prefixes under the source's images/ folder
batches = s3.list_objects_v2(Bucket='cellpainting-gallery', Prefix='cpg0016-jump/source_8/images/', Delimiter='/')
batches = [x['Prefix'] for x in batches['CommonPrefixes']]

# Map each batch prefix to the plate folder names it contains
batchdict = {}
for batch in batches:
    plates = s3.list_objects_v2(Bucket='cellpainting-gallery', Prefix=f"{batch}images/", Delimiter='/')
    plates = [x['Prefix'].rsplit('/', 2)[1] for x in plates['CommonPrefixes']]
    batchdict[batch] = plates

# Print one `aws s3 sync` command per plate (staging bucket -> gallery bucket)
for batch in batches:
    b = batch.rsplit('/', 2)[1]
    for plate in batchdict[batch]:
        print(f'aws s3 sync s3://staging-cellpainting-gallery/tim_test/{plate}.zarr/ s3://cellpainting-gallery/cpg0016-jump/source_8/workspace/segmentation/cellpose_202404/objects/{b}/{plate}/{plate}.zarr/ --profile CPGnew')
|
@timtreis asked: