Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some wells in load_data are missing (but are present in wells.csv.gz) #61

Open
NinoDui opened this issue Apr 3, 2023 · 13 comments
Open
Labels
bug Something isn't working cpg0016

Comments

@NinoDui
Copy link

NinoDui commented Apr 3, 2023

Hi there,

I happened to find the metadata for wells of source_10, batch 2021_08_12_U2OS_48_hr_run15, and plate Dest210803-160702 may be missed. May I get your help in double-checking it? Or feel free to correct me if I am not taking it at the right pace.

A quick demo of how to revise it:

S3_LOADDATA_FORMATTER = (
    "s3://cellpainting-gallery/cpg0016-jump/"
    "{Metadata_Source}/workspace/load_data_csv/"
    "{Metadata_Batch}/{Metadata_Plate}/load_data_with_illum.parquet"
)

nan_index = {
    'Metadata_Source': 'source_10',
    "Metadata_Batch": '2021_08_12_U2OS_48_hr_run15',
    "Metadata_Plate": 'Dest210803-160702'
}

s3_path = S3_LOADDATA_FORMATTER.format(**nan_index)
s3_nan_meta = pd.read_parquet(s3_path, storage_options={"anon": True})
wells_from_parquet = s3_nan_meta['Metadata_Well'].unique() # <----- Here wells are enumerated from A01 to C22

wells = pd.read_csv(os.path.join(DATA_ROOT, "metadata", "well.csv.gz"))
wells_plate_info = wells.loc[(wells['Metadata_Source'] == nan_index['Metadata_Source']) & (wells['Metadata_Plate'] == nan_index['Metadata_Plate']), :]
wells_from_plate = wells_plate_info['Metadata_Well'].unique() # <----- Here wells are enumerated from A01 to P24

It seems like the well info provided by wells.csv.gz is larger in amount compared to those retrieved from load_data_with_illum.parquet. Is that a corner case that I missed? Or is that being uploaded in progress?

Thanks for your time and effort.

Best wishes,
Nino

@niranjchandrasekaran
Copy link
Contributor

cc @shntnu

@NinoDui
Copy link
Author

NinoDui commented Apr 6, 2023

Hey, I was wondering if there's any update about it? Thanks.

@niranjchandrasekaran
Copy link
Contributor

Hi @NinoDui, I checked the load_data_with_illum.parquet and well.csv.gz files and found the same missing wells as you did.

@shntnu it looks like load_data.csv.gz has all 384 wells, but both load_data_with_illum.csv.gz and load_data_with_illum.parquet have only 70 wells.

@niranjchandrasekaran niranjchandrasekaran added bug Something isn't working cpg0016 labels Apr 6, 2023
@NinoDui
Copy link
Author

NinoDui commented Apr 6, 2023

Hi @NinoDui, I checked the load_data_with_illum.parquet and well.csv.gz files and found the same missing wells as you did.

@shntnu it looks like load_data.csv.gz has all 384 wells, but both load_data_with_illum.csv.gz and load_data_with_illum.parquet have only 70 wells.

@niranjchandrasekaran

Hey Niranj, thank you for all the support. Much appreciated it if we could figure it out. Thanks in advance :)

@shntnu

This comment was marked as off-topic.

@shntnu
Copy link
Contributor

shntnu commented Apr 6, 2023

As noted in #61 (comment), if a well is empty, the images will exist but no profiles will be created, and thus there will be no entry in wells.csv.gz

I'll add this issue to our FAQ #62

@niranjchandrasekaran
Copy link
Contributor

As noted in #61 (comment), if a well is empty, the images will exist but no profiles will be created, and thus there will be no entry in wells.csv.gz

IIUC, the problem seems to be the other way around. For this plate, there are wells and in well.csv.gz and the aggregated profile parquet file also has all 384 wells. But the load_data_with_illum files seem to be missing wells.

@shntnu
Copy link
Contributor

shntnu commented Apr 6, 2023

Sorry, I totally missed that.

it looks like load_data.csv.gz has all 384 wells, but both load_data_with_illum.csv.gz and load_data_with_illum.parquet have only 70 wells.

load_data.csv.gz seems to have the same number of rows as load_data_with_illum.csv.gz. can you verify @niranjchandrasekaran ?

aws s3 cp s3://cellpainting-gallery/cpg0016-jump/source_10/workspace/load_data_csv/2021_08_12_U2OS_48_hr_run15/Dest210803-160702/load_data.csv.gz -|gunzip - |wc -l
#     417

aws s3 cp s3://cellpainting-gallery/cpg0016-jump/source_10/workspace/load_data_csv/2021_08_12_U2OS_48_hr_run15/Dest210803-160702/load_data_with_illum.csv.gz -|gunzip - |wc -l
#     417

That said, I confirmed that the images folder does have all 384 wells

aws s3 ls s3://cellpainting-gallery/cpg0016-jump/source_10/images/2021_08_12_U2OS_48_hr_run15/images/Dest210803-160702/ |gzip > ~/Desktop/source_10_2021_08_12_U2OS_48_hr_run15_Dest210803-160702.txt.gz

gzcat ~/Desktop/source_10_2021_08_12_U2OS_48_hr_run15_Dest210803-160702.txt.gz|grep tif$|grep Dest21|tr -s " "|cut -d" " -f4|cut -d"_" -f2|sort|uniq -c|wc -l
#      384

gzcat ~/Desktop/source_10_2021_08_12_U2OS_48_hr_run15_Dest210803-160702.txt.gz|grep tif$|grep Dest21|tr -s " "|cut -d" " -f4|cut -d"_" -f2|sort|uniq -c|cut -d" " -f3|sort |uniq -c
# 384 48

@shntnu shntnu changed the title Metadata for a well is missed. Thanks Some wells in load_data are missing (but are present in wells.csv.gz) Apr 6, 2023
@niranjchandrasekaran
Copy link
Contributor

niranjchandrasekaran commented Apr 6, 2023

load_data.csv.gz seems to have the same number of rows as load_data_with_illum.csv.gz. can you verify @niranjchandrasekaran ?

You are right. load_data.csv.gz also has the same number of wells as the two illum files.

@shntnu
Copy link
Contributor

shntnu commented Apr 6, 2023

@NinoDui Thank you so much for flagging this! Can you help us report how prevalent this issue is?

Here's how you'd do it

use https://github.com/broadinstitute/position-effect-correction/blob/aabfac2de536447ad095489be6eca73f3a5ae026/1.load/load.py

import pandas as pd

source_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13]

df_all = pd.DataFrame()

for source_id in source_ids:

    df = load(
        dataset="cpg0016-jump",
        source=f"source_{source_id}}",
        component="load_data_csv",
        batch="2021_06_14_Batch6",
        plate="BR00121429",
        columns=["Metadata_Source", "Metadata_Plate", "Metadata_Well"],
    )

    # keep only distinct rows
    df = df.drop_duplicates()

    df_all = df_all.append(df)

Now report which wells are present in well.csv.gz but absent in df_all

Warning: I haven't checked the code (that was autogenerated by Copilot :D)

@NinoDui
Copy link
Author

NinoDui commented Apr 14, 2023

I've just checked with my own comparison script and found 125 mismatches (at plate level) between the number of wells available on a plate and those provided by corresponding load_data_with_illum.parquet.

Results

The result is attached.

well_diff.csv

  • n_well_on_plate: number of wells calculated by (meta) plate.csv.gz and wells.csv.gz
  • n_well_from_parquet: number of wells retrieved from load_data_with_illum.parquet

Script

How the wells from parquet are counted:

S3_LOADDATA_FORMATTER = (
    "s3://cellpainting-gallery/cpg0016-jump/"
    "{Metadata_Source}/workspace/load_data_csv/"
    "{Metadata_Batch}/{Metadata_Plate}/load_data_with_illum.parquet"
)

def fetch_well_from_parquet(row: pd.Series):
    meta_path = S3_LOADDATA_FORMATTER.format(**row)
    meta = pd.read_parquet(meta_path, storage_options=REMOTE_STORAGE_OPTION)
    wells_from_parquet = meta['Metadata_Well'].unique()
    return wells_from_parquet.shape[0]

How the number of wells from plate & well metas are calculated:

wells_info = wells.merge(plates, on=['Metadata_Source', 'Metadata_Plate'])
wells_info = wells_info[["Metadata_Source", "Metadata_Batch", "Metadata_Plate", "Metadata_Well"]]
well_count = wells_info.groupby(['Metadata_Source', 'Metadata_Batch', 'Metadata_Plate'])\
        .agg(n_well_on_plate=('Metadata_Well', pd.Series.count))

@NinoDui Thank you so much for flagging this! Can you help us report how prevalent this issue is?

Here's how you'd do it

use https://github.com/broadinstitute/position-effect-correction/blob/aabfac2de536447ad095489be6eca73f3a5ae026/1.load/load.py

import pandas as pd

source_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13]

df_all = pd.DataFrame()

for source_id in source_ids:

    df = load(
        dataset="cpg0016-jump",
        source=f"source_{source_id}}",
        component="load_data_csv",
        batch="2021_06_14_Batch6",
        plate="BR00121429",
        columns=["Metadata_Source", "Metadata_Plate", "Metadata_Well"],
    )

    # keep only distinct rows
    df = df.drop_duplicates()

    df_all = df_all.append(df)

Now report which wells are present in well.csv.gz but absent in df_all

Warning: I haven't checked the code (that was autogenerated by Copilot :D)

@shntnu
Copy link
Contributor

shntnu commented Apr 14, 2023

This is so helpful! Thank you very much.

Looking at your CSV file, I am relived to note that the only issue across the entire dataset is with that one plate you reported originally. For all others, there are more wells with images than with profiles, and this can happen, as mentioned in one of my previous comments above.

We will recreate the load data parquet for that plate.

This might take a while until we get to it. Will that block you?

Thanks again!

April 2024 Update: I took the CSV in #61 (comment) and turned it into a table below for easy searching.

weird means more_wells_with_profiles_than_with_images

Metadata_Source Metadata_Plate n_well_on_plate n_well_from_parquet weird
source_1 UL001659 1471 1472 FALSE
source_1 UL001661 1355 1472 FALSE
source_1 UL001649 1471 1472 FALSE
source_1 UL000083 1419 1472 FALSE
source_1 UL000093 1471 1472 FALSE
source_1 UL001769 1471 1472 FALSE
source_1 UL001771 1471 1472 FALSE
source_1 UL000597 1309 1472 FALSE
source_1 UL001717 1471 1472 FALSE
source_10 Dest210608-152610 383 384 FALSE
source_10 Dest210608-153057 383 384 FALSE
source_10 Dest210614-164418 382 384 FALSE
source_10 Dest210614-164906 383 384 FALSE
source_10 Dest210615-151336 383 384 FALSE
source_10 Dest210622-144809 383 384 FALSE
source_10 Dest210726-155958 383 384 FALSE
source_10 Dest210726-160957 383 384 FALSE
source_10 Dest210726-162012 383 384 FALSE
source_10 Dest210803-160702 384 70 TRUE
source_10 Dest210809-134534 383 384 FALSE
source_10 Dest210810-173723 383 384 FALSE
source_10 Dest210823-173617 383 384 FALSE
source_11 EC000004 383 384 FALSE
source_11 EC000038 250 384 FALSE
source_11 EC000045 189 384 FALSE
source_11 EC000048 126 384 FALSE
source_11 EC000068 382 383 FALSE
source_11 LM37-70_1 383 384 FALSE
source_11 EC000137 250 384 FALSE
source_11 EC000115 378 381 FALSE
source_11 EC000121 381 383 FALSE
source_11 EC000094 360 384 FALSE
source_11 LM71-102_1 383 384 FALSE
source_11 LM71-102_2 382 384 FALSE
source_2 1053601763 383 384 FALSE
source_2 1053601770 383 384 FALSE
source_2 1053601831 383 384 FALSE
source_2 1053601909 383 384 FALSE
source_2 1053600674 383 384 FALSE
source_2 1053600728 383 384 FALSE
source_2 1053600803 383 384 FALSE
source_2 1053600810 383 384 FALSE
source_2 1053600834 383 384 FALSE
source_2 1053600858 383 384 FALSE
source_2 1053600872 383 384 FALSE
source_2 1053599503 378 384 FALSE
source_2 1053599602 383 384 FALSE
source_2 1053599657 383 384 FALSE
source_2 1086293911 383 384 FALSE
source_2 1086293997 383 384 FALSE
source_2 1086293423 383 384 FALSE
source_2 1086293447 383 384 FALSE
source_2 1086293485 382 384 FALSE
source_2 1086293027 383 384 FALSE
source_2 1086293034 383 384 FALSE
source_2 1086293133 383 384 FALSE
source_2 1086293164 383 384 FALSE
source_2 1086293232 383 384 FALSE
source_2 1086292723 381 384 FALSE
source_2 1086292761 383 384 FALSE
source_2 1086292884 383 384 FALSE
source_2 1086292976 383 384 FALSE
source_2 1086292389 383 384 FALSE
source_2 1086292396 383 384 FALSE
source_2 1086292440 382 384 FALSE
source_2 1086292495 382 384 FALSE
source_2 1086291931 383 384 FALSE
source_2 1086291948 382 384 FALSE
source_2 1086291962 382 384 FALSE
source_2 1086291979 382 384 FALSE
source_2 1086291986 382 384 FALSE
source_2 1086292013 381 384 FALSE
source_2 1086292037 383 384 FALSE
source_2 1086292044 382 384 FALSE
source_2 1086292051 382 384 FALSE
source_2 1086292075 380 384 FALSE
source_2 1086292082 383 384 FALSE
source_2 1086292099 380 384 FALSE
source_2 1086292105 382 384 FALSE
source_2 1086292136 382 384 FALSE
source_2 1086292143 382 384 FALSE
source_2 1086292150 383 384 FALSE
source_2 1086289792 383 384 FALSE
source_3 BR5874c3 383 384 FALSE
source_3 SP01P04b 381 384 FALSE
source_3 B040203d 382 384 FALSE
source_3 A13407bW 383 384 FALSE
source_3 P24P27dW 319 384 FALSE
source_3 DMSOC45 383 384 FALSE
source_3 JCPQC036 383 384 FALSE
source_3 BAY5872a 383 384 FALSE
source_3 BAY5874b 382 384 FALSE
source_3 BAY5874c 383 384 FALSE
source_3 BAY5875a 383 384 FALSE
source_3 BAY5875c 383 384 FALSE
source_3 BAY5876d 383 384 FALSE
source_4 BR00121436 383 384 FALSE
source_4 BR00121428 383 384 FALSE
source_4 BR00125181 383 384 FALSE
source_4 BR00125638 382 384 FALSE
source_4 BR00121427 383 384 FALSE
source_4 BR00121426 383 384 FALSE
source_4 BR00127147 382 384 FALSE
source_5 ACPJUM032 382 384 FALSE
source_5 APTJUM230 383 384 FALSE
source_5 ATSJUM107 383 384 FALSE
source_6 1.1E+11 383 384 FALSE
source_6 1.1E+11 383 384 FALSE
source_6 1.1E+11 383 384 FALSE
source_6 1.1E+11 383 384 FALSE
source_6 1.1E+11 383 384 FALSE
source_7 CP1-SC1-08 381 384 FALSE
source_7 CP1-SC1-09 383 384 FALSE
source_8 A1170444 383 384 FALSE
source_8 A1170509 383 384 FALSE
source_8 A1170531 383 384 FALSE
source_8 A1170533 382 384 FALSE
source_8 A1170538 383 384 FALSE
source_8 A1170542 382 384 FALSE
source_8 A1170543 383 384 FALSE
source_8 A1170544 383 384 FALSE
source_8 A1166130 383 384 FALSE
source_8 A1166132 382 384 FALSE
source_8 A1166134 383 384 FALSE
source_8 A1166137 383 384 FALSE

@NinoDui
Copy link
Author

NinoDui commented Apr 18, 2023

Cool! Glad to know the issue led to limited effect. Thanks for all the checking and effort behind.

I am taking on DL Model experiments based on the images and I could exclude the issued one manually. That's not blocking for short.

Still hope to hear the issue be settled and benefit a larger group of researchers. You're actually doing something not only meaningful but cool.

Best regards,
NinoDui

This is so helpful! Thank you very much.

Looking at your CSV file, I am relived to note that the only issue across the entire dataset is with that one plate you reported originally. For all others, there are more wells with images than with profiles, and this can happen, as mentioned in one of my previous comments above.

We will recreate the load data parquet for that plate.

This might take a while until we get to it. Will that block you?

Thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cpg0016
Projects
None yet
Development

No branches or pull requests

3 participants