Some wells in load_data are missing (but are present in `wells.csv.gz`) #61

NinoDui · 2023-04-03T07:10:19Z

Hi there,

I happened to find the metadata for wells of source_10, batch 2021_08_12_U2OS_48_hr_run15, and plate Dest210803-160702 may be missed. May I get your help in double-checking it? Or feel free to correct me if I am not taking it at the right pace.

A quick demo of how to revise it:

S3_LOADDATA_FORMATTER = (
    "s3://cellpainting-gallery/cpg0016-jump/"
    "{Metadata_Source}/workspace/load_data_csv/"
    "{Metadata_Batch}/{Metadata_Plate}/load_data_with_illum.parquet"
)

nan_index = {
    'Metadata_Source': 'source_10',
    "Metadata_Batch": '2021_08_12_U2OS_48_hr_run15',
    "Metadata_Plate": 'Dest210803-160702'
}

s3_path = S3_LOADDATA_FORMATTER.format(**nan_index)
s3_nan_meta = pd.read_parquet(s3_path, storage_options={"anon": True})
wells_from_parquet = s3_nan_meta['Metadata_Well'].unique() # <----- Here wells are enumerated from A01 to C22

wells = pd.read_csv(os.path.join(DATA_ROOT, "metadata", "well.csv.gz"))
wells_plate_info = wells.loc[(wells['Metadata_Source'] == nan_index['Metadata_Source']) & (wells['Metadata_Plate'] == nan_index['Metadata_Plate']), :]
wells_from_plate = wells_plate_info['Metadata_Well'].unique() # <----- Here wells are enumerated from A01 to P24

It seems like the well info provided by wells.csv.gz is larger in amount compared to those retrieved from load_data_with_illum.parquet. Is that a corner case that I missed? Or is that being uploaded in progress?

Thanks for your time and effort.

Best wishes,
Nino

The text was updated successfully, but these errors were encountered:

niranjchandrasekaran · 2023-04-03T13:01:43Z

cc @shntnu

NinoDui · 2023-04-06T02:33:02Z

Hey, I was wondering if there's any update about it? Thanks.

niranjchandrasekaran · 2023-04-06T12:22:42Z

Hi @NinoDui, I checked the load_data_with_illum.parquet and well.csv.gz files and found the same missing wells as you did.

@shntnu it looks like load_data.csv.gz has all 384 wells, but both load_data_with_illum.csv.gz and load_data_with_illum.parquet have only 70 wells.

NinoDui · 2023-04-06T12:56:19Z

Hi @NinoDui, I checked the load_data_with_illum.parquet and well.csv.gz files and found the same missing wells as you did.

@shntnu it looks like load_data.csv.gz has all 384 wells, but both load_data_with_illum.csv.gz and load_data_with_illum.parquet have only 70 wells.

@niranjchandrasekaran

Hey Niranj, thank you for all the support. Much appreciated it if we could figure it out. Thanks in advance :)

shntnu · 2023-04-06T13:52:29Z

As noted in #61 (comment), if a well is empty, the images will exist but no profiles will be created, and thus there will be no entry in wells.csv.gz

I'll add this issue to our FAQ #62

niranjchandrasekaran · 2023-04-06T14:01:29Z

As noted in #61 (comment), if a well is empty, the images will exist but no profiles will be created, and thus there will be no entry in wells.csv.gz

IIUC, the problem seems to be the other way around. For this plate, there are wells and in well.csv.gz and the aggregated profile parquet file also has all 384 wells. But the load_data_with_illum files seem to be missing wells.

shntnu · 2023-04-06T17:21:19Z

Sorry, I totally missed that.

it looks like load_data.csv.gz has all 384 wells, but both load_data_with_illum.csv.gz and load_data_with_illum.parquet have only 70 wells.

load_data.csv.gz seems to have the same number of rows as load_data_with_illum.csv.gz. can you verify @niranjchandrasekaran ?

aws s3 cp s3://cellpainting-gallery/cpg0016-jump/source_10/workspace/load_data_csv/2021_08_12_U2OS_48_hr_run15/Dest210803-160702/load_data.csv.gz -|gunzip - |wc -l
#     417

aws s3 cp s3://cellpainting-gallery/cpg0016-jump/source_10/workspace/load_data_csv/2021_08_12_U2OS_48_hr_run15/Dest210803-160702/load_data_with_illum.csv.gz -|gunzip - |wc -l
#     417

That said, I confirmed that the images folder does have all 384 wells

aws s3 ls s3://cellpainting-gallery/cpg0016-jump/source_10/images/2021_08_12_U2OS_48_hr_run15/images/Dest210803-160702/ |gzip > ~/Desktop/source_10_2021_08_12_U2OS_48_hr_run15_Dest210803-160702.txt.gz

gzcat ~/Desktop/source_10_2021_08_12_U2OS_48_hr_run15_Dest210803-160702.txt.gz|grep tif$|grep Dest21|tr -s " "|cut -d" " -f4|cut -d"_" -f2|sort|uniq -c|wc -l
#      384

gzcat ~/Desktop/source_10_2021_08_12_U2OS_48_hr_run15_Dest210803-160702.txt.gz|grep tif$|grep Dest21|tr -s " "|cut -d" " -f4|cut -d"_" -f2|sort|uniq -c|cut -d" " -f3|sort |uniq -c
# 384 48

niranjchandrasekaran · 2023-04-06T18:05:36Z

load_data.csv.gz seems to have the same number of rows as load_data_with_illum.csv.gz. can you verify @niranjchandrasekaran ?

You are right. load_data.csv.gz also has the same number of wells as the two illum files.

shntnu · 2023-04-06T23:34:18Z

@NinoDui Thank you so much for flagging this! Can you help us report how prevalent this issue is?

Here's how you'd do it

use https://github.com/broadinstitute/position-effect-correction/blob/aabfac2de536447ad095489be6eca73f3a5ae026/1.load/load.py

import pandas as pd

source_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13]

df_all = pd.DataFrame()

for source_id in source_ids:

    df = load(
        dataset="cpg0016-jump",
        source=f"source_{source_id}}",
        component="load_data_csv",
        batch="2021_06_14_Batch6",
        plate="BR00121429",
        columns=["Metadata_Source", "Metadata_Plate", "Metadata_Well"],
    )

    # keep only distinct rows
    df = df.drop_duplicates()

    df_all = df_all.append(df)

Now report which wells are present in well.csv.gz but absent in df_all

Warning: I haven't checked the code (that was autogenerated by Copilot :D)

NinoDui · 2023-04-14T07:49:42Z

I've just checked with my own comparison script and found 125 mismatches (at plate level) between the number of wells available on a plate and those provided by corresponding load_data_with_illum.parquet.

Results

The result is attached.

well_diff.csv

n_well_on_plate: number of wells calculated by (meta) plate.csv.gz and wells.csv.gz
n_well_from_parquet: number of wells retrieved from load_data_with_illum.parquet

Script

How the wells from parquet are counted:

S3_LOADDATA_FORMATTER = (
    "s3://cellpainting-gallery/cpg0016-jump/"
    "{Metadata_Source}/workspace/load_data_csv/"
    "{Metadata_Batch}/{Metadata_Plate}/load_data_with_illum.parquet"
)

def fetch_well_from_parquet(row: pd.Series):
    meta_path = S3_LOADDATA_FORMATTER.format(**row)
    meta = pd.read_parquet(meta_path, storage_options=REMOTE_STORAGE_OPTION)
    wells_from_parquet = meta['Metadata_Well'].unique()
    return wells_from_parquet.shape[0]

How the number of wells from plate & well metas are calculated:

wells_info = wells.merge(plates, on=['Metadata_Source', 'Metadata_Plate'])
wells_info = wells_info[["Metadata_Source", "Metadata_Batch", "Metadata_Plate", "Metadata_Well"]]
well_count = wells_info.groupby(['Metadata_Source', 'Metadata_Batch', 'Metadata_Plate'])\
        .agg(n_well_on_plate=('Metadata_Well', pd.Series.count))

@NinoDui Thank you so much for flagging this! Can you help us report how prevalent this issue is?

Here's how you'd do it

use https://github.com/broadinstitute/position-effect-correction/blob/aabfac2de536447ad095489be6eca73f3a5ae026/1.load/load.py
import pandas as pd

source_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13]

df_all = pd.DataFrame()

for source_id in source_ids:

    df = load(
        dataset="cpg0016-jump",
        source=f"source_{source_id}}",
        component="load_data_csv",
        batch="2021_06_14_Batch6",
        plate="BR00121429",
        columns=["Metadata_Source", "Metadata_Plate", "Metadata_Well"],
    )

    # keep only distinct rows
    df = df.drop_duplicates()

    df_all = df_all.append(df)
Now report which wells are present in well.csv.gz but absent in df_all

Warning: I haven't checked the code (that was autogenerated by Copilot :D)

shntnu · 2023-04-14T11:12:28Z

This is so helpful! Thank you very much.

Looking at your CSV file, I am relived to note that the only issue across the entire dataset is with that one plate you reported originally. For all others, there are more wells with images than with profiles, and this can happen, as mentioned in one of my previous comments above.

We will recreate the load data parquet for that plate.

This might take a while until we get to it. Will that block you?

Thanks again!

April 2024 Update: I took the CSV in #61 (comment) and turned it into a table below for easy searching.

weird means more_wells_with_profiles_than_with_images

Metadata_Source	Metadata_Plate	n_well_on_plate	n_well_from_parquet	weird
source_1	UL001659	1471	1472	FALSE
source_1	UL001661	1355	1472	FALSE
source_1	UL001649	1471	1472	FALSE
source_1	UL000083	1419	1472	FALSE
source_1	UL000093	1471	1472	FALSE
source_1	UL001769	1471	1472	FALSE
source_1	UL001771	1471	1472	FALSE
source_1	UL000597	1309	1472	FALSE
source_1	UL001717	1471	1472	FALSE
source_10	Dest210608-152610	383	384	FALSE
source_10	Dest210608-153057	383	384	FALSE
source_10	Dest210614-164418	382	384	FALSE
source_10	Dest210614-164906	383	384	FALSE
source_10	Dest210615-151336	383	384	FALSE
source_10	Dest210622-144809	383	384	FALSE
source_10	Dest210726-155958	383	384	FALSE
source_10	Dest210726-160957	383	384	FALSE
source_10	Dest210726-162012	383	384	FALSE
source_10	Dest210803-160702	384	70	TRUE
source_10	Dest210809-134534	383	384	FALSE
source_10	Dest210810-173723	383	384	FALSE
source_10	Dest210823-173617	383	384	FALSE
source_11	EC000004	383	384	FALSE
source_11	EC000038	250	384	FALSE
source_11	EC000045	189	384	FALSE
source_11	EC000048	126	384	FALSE
source_11	EC000068	382	383	FALSE
source_11	LM37-70_1	383	384	FALSE
source_11	EC000137	250	384	FALSE
source_11	EC000115	378	381	FALSE
source_11	EC000121	381	383	FALSE
source_11	EC000094	360	384	FALSE
source_11	LM71-102_1	383	384	FALSE
source_11	LM71-102_2	382	384	FALSE
source_2	1053601763	383	384	FALSE
source_2	1053601770	383	384	FALSE
source_2	1053601831	383	384	FALSE
source_2	1053601909	383	384	FALSE
source_2	1053600674	383	384	FALSE
source_2	1053600728	383	384	FALSE
source_2	1053600803	383	384	FALSE
source_2	1053600810	383	384	FALSE
source_2	1053600834	383	384	FALSE
source_2	1053600858	383	384	FALSE
source_2	1053600872	383	384	FALSE
source_2	1053599503	378	384	FALSE
source_2	1053599602	383	384	FALSE
source_2	1053599657	383	384	FALSE
source_2	1086293911	383	384	FALSE
source_2	1086293997	383	384	FALSE
source_2	1086293423	383	384	FALSE
source_2	1086293447	383	384	FALSE
source_2	1086293485	382	384	FALSE
source_2	1086293027	383	384	FALSE
source_2	1086293034	383	384	FALSE
source_2	1086293133	383	384	FALSE
source_2	1086293164	383	384	FALSE
source_2	1086293232	383	384	FALSE
source_2	1086292723	381	384	FALSE
source_2	1086292761	383	384	FALSE
source_2	1086292884	383	384	FALSE
source_2	1086292976	383	384	FALSE
source_2	1086292389	383	384	FALSE
source_2	1086292396	383	384	FALSE
source_2	1086292440	382	384	FALSE
source_2	1086292495	382	384	FALSE
source_2	1086291931	383	384	FALSE
source_2	1086291948	382	384	FALSE
source_2	1086291962	382	384	FALSE
source_2	1086291979	382	384	FALSE
source_2	1086291986	382	384	FALSE
source_2	1086292013	381	384	FALSE
source_2	1086292037	383	384	FALSE
source_2	1086292044	382	384	FALSE
source_2	1086292051	382	384	FALSE
source_2	1086292075	380	384	FALSE
source_2	1086292082	383	384	FALSE
source_2	1086292099	380	384	FALSE
source_2	1086292105	382	384	FALSE
source_2	1086292136	382	384	FALSE
source_2	1086292143	382	384	FALSE
source_2	1086292150	383	384	FALSE
source_2	1086289792	383	384	FALSE
source_3	BR5874c3	383	384	FALSE
source_3	SP01P04b	381	384	FALSE
source_3	B040203d	382	384	FALSE
source_3	A13407bW	383	384	FALSE
source_3	P24P27dW	319	384	FALSE
source_3	DMSOC45	383	384	FALSE
source_3	JCPQC036	383	384	FALSE
source_3	BAY5872a	383	384	FALSE
source_3	BAY5874b	382	384	FALSE
source_3	BAY5874c	383	384	FALSE
source_3	BAY5875a	383	384	FALSE
source_3	BAY5875c	383	384	FALSE
source_3	BAY5876d	383	384	FALSE
source_4	BR00121436	383	384	FALSE
source_4	BR00121428	383	384	FALSE
source_4	BR00125181	383	384	FALSE
source_4	BR00125638	382	384	FALSE
source_4	BR00121427	383	384	FALSE
source_4	BR00121426	383	384	FALSE
source_4	BR00127147	382	384	FALSE
source_5	ACPJUM032	382	384	FALSE
source_5	APTJUM230	383	384	FALSE
source_5	ATSJUM107	383	384	FALSE
source_6	1.1E+11	383	384	FALSE
source_6	1.1E+11	383	384	FALSE
source_6	1.1E+11	383	384	FALSE
source_6	1.1E+11	383	384	FALSE
source_6	1.1E+11	383	384	FALSE
source_7	CP1-SC1-08	381	384	FALSE
source_7	CP1-SC1-09	383	384	FALSE
source_8	A1170444	383	384	FALSE
source_8	A1170509	383	384	FALSE
source_8	A1170531	383	384	FALSE
source_8	A1170533	382	384	FALSE
source_8	A1170538	383	384	FALSE
source_8	A1170542	382	384	FALSE
source_8	A1170543	383	384	FALSE
source_8	A1170544	383	384	FALSE
source_8	A1166130	383	384	FALSE
source_8	A1166132	382	384	FALSE
source_8	A1166134	383	384	FALSE
source_8	A1166137	383	384	FALSE

NinoDui · 2023-04-18T01:34:42Z

Cool! Glad to know the issue led to limited effect. Thanks for all the checking and effort behind.

I am taking on DL Model experiments based on the images and I could exclude the issued one manually. That's not blocking for short.

Still hope to hear the issue be settled and benefit a larger group of researchers. You're actually doing something not only meaningful but cool.

Best regards,
NinoDui

This is so helpful! Thank you very much.

Looking at your CSV file, I am relived to note that the only issue across the entire dataset is with that one plate you reported originally. For all others, there are more wells with images than with profiles, and this can happen, as mentioned in one of my previous comments above.

We will recreate the load data parquet for that plate.

This might take a while until we get to it. Will that block you?

Thanks again!

niranjchandrasekaran added bug Something isn't working cpg0016 labels Apr 6, 2023

This comment was marked as off-topic.

Sign in to view

shntnu changed the title ~~Metadata for a well is missed. Thanks~~ Some wells in load_data are missing (but are present in wells.csv.gz) Apr 6, 2023

hanslovsky mentioned this issue Aug 8, 2023

Missing labels for some wells in COMPOUND_EMPTY plates in source_1 #75

Open

Arkkienkeli mentioned this issue Feb 14, 2024

Report corrupt TIFF files, filter load_data where images are actually missing #76

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Some wells in load_data are missing (but are present in `wells.csv.gz`) #61

Some wells in load_data are missing (but are present in `wells.csv.gz`) #61

NinoDui commented Apr 3, 2023 •

edited

Loading

niranjchandrasekaran commented Apr 3, 2023

NinoDui commented Apr 6, 2023

niranjchandrasekaran commented Apr 6, 2023

NinoDui commented Apr 6, 2023

This comment was marked as off-topic.

shntnu commented Apr 6, 2023

niranjchandrasekaran commented Apr 6, 2023

shntnu commented Apr 6, 2023 •

edited

Loading

niranjchandrasekaran commented Apr 6, 2023 •

edited

Loading

shntnu commented Apr 6, 2023

NinoDui commented Apr 14, 2023 •

edited

Loading

shntnu commented Apr 14, 2023 •

edited

Loading

NinoDui commented Apr 18, 2023

Some wells in load_data are missing (but are present in wells.csv.gz) #61

Some wells in load_data are missing (but are present in wells.csv.gz) #61

Comments

NinoDui commented Apr 3, 2023 • edited Loading

niranjchandrasekaran commented Apr 3, 2023

NinoDui commented Apr 6, 2023

niranjchandrasekaran commented Apr 6, 2023

NinoDui commented Apr 6, 2023

This comment was marked as off-topic.

shntnu commented Apr 6, 2023

niranjchandrasekaran commented Apr 6, 2023

shntnu commented Apr 6, 2023 • edited Loading

niranjchandrasekaran commented Apr 6, 2023 • edited Loading

shntnu commented Apr 6, 2023

NinoDui commented Apr 14, 2023 • edited Loading

Results

Script

shntnu commented Apr 14, 2023 • edited Loading

NinoDui commented Apr 18, 2023

Some wells in load_data are missing (but are present in `wells.csv.gz`) #61

Some wells in load_data are missing (but are present in `wells.csv.gz`) #61

NinoDui commented Apr 3, 2023 •

edited

Loading

shntnu commented Apr 6, 2023 •

edited

Loading

niranjchandrasekaran commented Apr 6, 2023 •

edited

Loading

NinoDui commented Apr 14, 2023 •

edited

Loading

shntnu commented Apr 14, 2023 •

edited

Loading