Some wells in load_data are missing (but are present in wells.csv.gz)
#61
Comments
cc @shntnu |
Hey, I was wondering if there's any update about it? Thanks. |
Hey Niranj, thank you for all the support. It would be much appreciated if we could figure it out. Thanks in advance :) |
As noted in #61 (comment), if a well is empty, the images will exist but no profiles will be created, and thus there will be no entry in
I'll add this issue to our FAQ #62 |
IIUC, the problem seems to be the other way around. For this plate, there are wells and in |
Sorry, I totally missed that.
aws s3 cp s3://cellpainting-gallery/cpg0016-jump/source_10/workspace/load_data_csv/2021_08_12_U2OS_48_hr_run15/Dest210803-160702/load_data.csv.gz - | gunzip - | wc -l
# 417
aws s3 cp s3://cellpainting-gallery/cpg0016-jump/source_10/workspace/load_data_csv/2021_08_12_U2OS_48_hr_run15/Dest210803-160702/load_data_with_illum.csv.gz - | gunzip - | wc -l
# 417
That said, I confirmed that the
|
wells.csv.gz
)
You are right. |
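The `wc -l` checks above count lines (header included), not wells. A load_data file usually has several rows per well, one per imaged site, so a count of distinct wells is easier to compare against wells.csv.gz. Here is a minimal sketch of that count, assuming the file has a `Metadata_Well` column; the function name and the toy data are made up for illustration and are not taken from the real plate.

```python
import gzip
import io

import pandas as pd


def count_wells_in_load_data(gz_bytes: bytes) -> tuple[int, int]:
    """Return (n_rows, n_unique_wells) for the bytes of a gzipped load_data CSV."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt") as handle:
        df = pd.read_csv(handle, usecols=["Metadata_Well"])
    return len(df), df["Metadata_Well"].nunique()


# Tiny in-memory stand-in for a real load_data.csv.gz (hypothetical values).
sample_csv = "Metadata_Well,Metadata_Site\nA01,1\nA01,2\nA02,1\n"
sample_gz = gzip.compress(sample_csv.encode("utf-8"))

n_rows, n_wells = count_wells_in_load_data(sample_gz)
print(n_rows, n_wells)
```

For a real plate, the bytes could come from the same `aws s3 cp ... -` command, with its output redirected to a file and read back in binary mode.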
@NinoDui Thank you so much for flagging this! Can you help us report how prevalent this issue is? Here's how you'd do it:
import pandas as pd

source_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13]
df_all = pd.DataFrame()
for source_id in source_ids:
    # NOTE: `load` is not defined in this snippet (the code was autogenerated and unchecked)
    df = load(
        dataset="cpg0016-jump",
        source=f"source_{source_id}",
        component="load_data_csv",
        batch="2021_06_14_Batch6",
        plate="BR00121429",
        columns=["Metadata_Source", "Metadata_Plate", "Metadata_Well"],
    )
    # keep only distinct rows
    df = df.drop_duplicates()
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    df_all = pd.concat([df_all, df], ignore_index=True)
Now report which wells are present in
Warning: I haven't checked the code (that was autogenerated by Copilot :D) |
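The reporting step asked for above can be sketched as an anti-join: take the wells listed in one table and keep those whose (source, plate, well) key never shows up in the other. This is only an illustration on toy data. The function name, `KEYS`, and the two small DataFrames are hypothetical, and it assumes both tables carry the three `Metadata_*` columns used in the snippet above.

```python
import pandas as pd

KEYS = ["Metadata_Source", "Metadata_Plate", "Metadata_Well"]


def wells_missing_from_load_data(wells: pd.DataFrame, load_data: pd.DataFrame) -> pd.DataFrame:
    """Return the keys in `wells` that never appear in `load_data`."""
    left = wells[KEYS].drop_duplicates()
    right = load_data[KEYS].drop_duplicates()
    # indicator=True adds a `_merge` column telling which side each key came from
    merged = left.merge(right, on=KEYS, how="left", indicator=True)
    return merged.loc[merged["_merge"] == "left_only", KEYS].reset_index(drop=True)


# Toy stand-ins (hypothetical values) for wells.csv.gz and the combined load_data rows.
wells = pd.DataFrame(
    {
        "Metadata_Source": ["source_10"] * 3,
        "Metadata_Plate": ["PlateA"] * 3,
        "Metadata_Well": ["A01", "A02", "A03"],
    }
)
load_data = pd.DataFrame(
    {
        "Metadata_Source": ["source_10"] * 3,
        "Metadata_Plate": ["PlateA"] * 3,
        "Metadata_Well": ["A01", "A01", "A02"],  # several sites per well, A03 absent
    }
)

missing = wells_missing_from_load_data(wells, load_data)
print(missing)
```

Swapping the two arguments gives the opposite direction, that is, wells that have images but no entry in the wells table.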
I've just checked with my own comparison script and found 125 mismatches (at plate level) between the number of wells available on a plate and those provided by corresponding
Results
The result is attached.
Script
How the wells from parquet are counted:
import pandas as pd

S3_LOADDATA_FORMATTER = (
    "s3://cellpainting-gallery/cpg0016-jump/"
    "{Metadata_Source}/workspace/load_data_csv/"
    "{Metadata_Batch}/{Metadata_Plate}/load_data_with_illum.parquet"
)

# REMOTE_STORAGE_OPTION is defined elsewhere in the script (not shown here)
def fetch_well_from_parquet(row: pd.Series):
    # build the S3 path for this plate from the row's metadata fields
    meta_path = S3_LOADDATA_FORMATTER.format(**row)
    meta = pd.read_parquet(meta_path, storage_options=REMOTE_STORAGE_OPTION)
    wells_from_parquet = meta['Metadata_Well'].unique()
    return wells_from_parquet.shape[0]
How the number of wells from plate & well metas are calculated:
# `wells` and `plates` are the well- and plate-level metadata DataFrames, loaded elsewhere (not shown here)
wells_info = wells.merge(plates, on=['Metadata_Source', 'Metadata_Plate'])
wells_info = wells_info[["Metadata_Source", "Metadata_Batch", "Metadata_Plate", "Metadata_Well"]]
well_count = (
    wells_info.groupby(['Metadata_Source', 'Metadata_Batch', 'Metadata_Plate'])
    .agg(n_well_on_plate=('Metadata_Well', 'count'))
)
|
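The comment above shows how each count is obtained but not how the two are put side by side. Below is a minimal sketch of that last step, assuming both counts are available as flat tables keyed by source, batch, and plate. The `n_well_in_parquet` column, the function name, and the toy numbers are hypothetical and are not the values from the attached CSV.

```python
import pandas as pd

PLATE_KEYS = ["Metadata_Source", "Metadata_Batch", "Metadata_Plate"]


def flag_mismatches(well_count: pd.DataFrame, parquet_count: pd.DataFrame) -> pd.DataFrame:
    """Return the plates whose two well counts differ, with the signed difference."""
    merged = well_count.merge(parquet_count, on=PLATE_KEYS, how="outer")
    # positive: the wells table lists more wells than the load_data parquet
    # negative: the load_data parquet lists more wells than the wells table
    merged["n_diff"] = merged["n_well_on_plate"] - merged["n_well_in_parquet"]
    flagged = merged[merged["n_diff"] != 0]  # plates missing on one side (NaN) are kept too
    return flagged.sort_values(PLATE_KEYS).reset_index(drop=True)


# Toy counts (hypothetical values); a groupby result would need .reset_index() first.
well_count = pd.DataFrame(
    {
        "Metadata_Source": ["source_x"] * 3,
        "Metadata_Batch": ["batch_1"] * 3,
        "Metadata_Plate": ["P1", "P2", "P3"],
        "n_well_on_plate": [384, 384, 380],
    }
)
parquet_count = pd.DataFrame(
    {
        "Metadata_Source": ["source_x"] * 3,
        "Metadata_Batch": ["batch_1"] * 3,
        "Metadata_Plate": ["P1", "P2", "P3"],
        "n_well_in_parquet": [384, 380, 384],
    }
)

flagged = flag_mismatches(well_count, parquet_count)
print(flagged)
```

Keeping the sign of the difference separates the two situations discussed in this thread: a plate where the wells table has more wells than load_data, and plates where there are more wells with images than with profiles.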
This is so helpful! Thank you very much. Looking at your CSV file, I am relieved to note that the only issue across the entire dataset is with that one plate you reported originally. For all others, there are more wells with images than with profiles, and this can happen, as mentioned in one of my previous comments above. We will recreate the load data parquet for that plate. It might take a while until we get to it. Will that block you? Thanks again!
April 2024 Update: I took the CSV in #61 (comment) and turned it into a table below for easy searching.
|
Cool! Glad to know the issue has only a limited effect. Thanks for all the checking and the effort behind it. I am running DL model experiments based on the images, and I can exclude the affected plate manually, so this is not blocking for now. I still hope the issue gets settled so that it benefits a larger group of researchers. You're doing something that is not only meaningful but also cool. Best regards,
|
Hi there,
I happened to find that the metadata for some wells of source_10, batch 2021_08_12_U2OS_48_hr_run15, plate Dest210803-160702 may be missing. Could you help me double-check it? Feel free to correct me if I have misunderstood something.
A quick demo of how to revise it:
It seems that wells.csv.gz lists more wells than can be retrieved from load_data_with_illum.parquet. Is that a corner case that I missed? Or is the upload still in progress?
Thanks for your time and effort.
Best wishes,
Nino