This repository has been archived by the owner on Jun 16, 2023. It is now read-only.

Updated analysis: independent-samples module #89

Closed
logstar opened this issue Jul 2, 2021 · 4 comments

@logstar

logstar commented Jul 2, 2021

What analysis module should be updated and why?

The independent-samples module needs to be updated for the following purposes.

  • Create independent sample lists by cohort and also by cancer_group. Output a merged table for each (experimental_strategy, primary/primary-plus/relapse) like before.

The reason for this update is that some primary samples are missed when the lists are created from all samples at once. Some patients have primary samples in more than one cohort, and the previous lists randomly retained primary samples from only one of those cohorts for each such patient. For example,

> library(tidyverse)
> hdf <- read_tsv('data/histologies.tsv', guess_max = 100000)
Parsed with column specification:
cols(
  .default = col_character(),
  OS_days = col_double(),
  age_last_update_days = col_double(),
  normal_fraction = col_double(),
  tumor_fraction = col_double(),
  tumor_ploidy = col_double()
)
See spec(...) for full column specifications.
> hdf %>%
  filter(tumor_descriptor %in% c("Initial CNS Tumor", "Primary Tumor"),
         cancer_group == 'Neuroblastoma') %>%
  group_by(Kids_First_Participant_ID) %>%
  summarise(n_samples = n(),
            Kids_First_Biospecimen_ID = paste(Kids_First_Biospecimen_ID, collapse = '&'),
            cohort = paste(cohort, collapse = '&'),
            cancer_group = paste(cancer_group, collapse = '&'),
            age_at_diagnosis_days = paste(age_at_diagnosis_days, collapse = '&')) %>%
  arrange(desc(n_samples)) %>%
  head()
Kids_First_Participant_ID n_samples Kids_First_Biospecimen_ID cohort cancer_group age_at_diagnosis_days
PASWYR 4 BS_8XDZQKSD&BS_MPE34NYZ&TARGET-30-PASWYR-01A-01R&TARGET-30-PASWYR-01A-01D GMKF&GMKF&TARGET&TARGET Neuroblastoma&Neuroblastoma&Neuroblastoma&Neuroblastoma 958&958&958&958
PASXHE 4 BS_EZRVK9ZQ&BS_V4VGG98Y&TARGET-30-PASXHE-01A-01R&TARGET-30-PASXHE-01A-01D GMKF&GMKF&TARGET&TARGET Neuroblastoma&Neuroblastoma&Neuroblastoma&Neuroblastoma 1438&1438&1438&1438
PASXIE 4 BS_2N95EW0G&BS_9TKGBJH7&TARGET-30-PASXIE-01A-01R&TARGET-30-PASXIE-01A-01D GMKF&GMKF&TARGET&TARGET Neuroblastoma&Neuroblastoma&Neuroblastoma&Neuroblastoma 837&837&837&837
PASXRG 4 BS_9JBYGRQW&BS_P6FPBJM8&TARGET-30-PASXRG-01A-01R&TARGET-30-PASXRG-01A-01D GMKF&GMKF&TARGET&TARGET Neuroblastoma&Neuroblastoma&Neuroblastoma&Neuroblastoma 1278&1278&1278&1278
PASXRJ 4 BS_1BKHK7AY&BS_KXRFQF5N&TARGET-30-PASXRJ-01A-01R&TARGET-30-PASXRJ-01A-01D GMKF&GMKF&TARGET&TARGET Neuroblastoma&Neuroblastoma&Neuroblastoma&Neuroblastoma 583&583&583&583
PATBMM 4 BS_3DJBSNGE&BS_D7442ACV&TARGET-30-PATBMM-01A-01R&TARGET-30-PATBMM-01A-01D GMKF&GMKF&TARGET&TARGET Neuroblastoma&Neuroblastoma&Neuroblastoma&Neuroblastoma 1112&1112&1112&1112
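
To gauge how many participants are affected, one option is to count the distinct cohorts that contribute primary samples for each participant. The following is only a minimal base R sketch on a made-up toy table (the IDs and rows are invented for illustration, and `multi_cohort_ids` is a hypothetical name); it is not code from the module.

```r
# Toy stand-in for histologies.tsv; every value here is made up.
toy_hdf <- data.frame(
  Kids_First_Participant_ID = c("P1", "P1", "P2", "P2", "P3", "P3"),
  cohort = c("GMKF", "TARGET", "GMKF", "GMKF", "TARGET", "GMKF"),
  tumor_descriptor = c("Primary Tumor", "Primary Tumor", "Primary Tumor",
                       "Initial CNS Tumor", "Primary Tumor", "Recurrence"),
  stringsAsFactors = FALSE
)

# Keep primary samples only, using the descriptors from the example above.
primary <- toy_hdf[toy_hdf$tumor_descriptor %in%
                     c("Initial CNS Tumor", "Primary Tumor"), ]

# Number of distinct cohorts with a primary sample, per participant.
n_cohorts <- tapply(primary$cohort, primary$Kids_First_Participant_ID,
                    function(x) length(unique(x)))

# Participants whose primary samples span more than one cohort.
multi_cohort_ids <- names(n_cohorts)[n_cohorts > 1]
print(multi_cohort_ids)
```

In the toy table only P1 has primary samples in two cohorts; P3 has a GMKF sample as well, but it is a recurrence and so is not counted.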

(Update Fri Jul 2 15:27:11 2021 by YZ) The GMKF Neuroblastoma primary samples that were not captured might be causing the ALK mutation frequency discrepancy between PediatricOpenTargets and pedcbio, as described by @jharenza at d3b-center/OpenPedCan-analysis#45 (review).

  • Update the experimental_strategy filter to experimental_strategy %in% c("WGS", "WXS", "Targeted Sequencing") in 01-generate-independent-specimens.R and 00-repeated-samples.Rmd. In the v6 release, "Targeted Sequencing" and "Targeted-Capture" are harmonized to "Targeted Sequencing", as requested in histologies.tsv experimental_strategy update #62.

  • Update the tumor_descriptor values used in independent-samples.R and independent_rna_samples.R to the following. The changes are requested in histologies.tsv tumor_descriptor update #61.

primary_descs <- c("Initial CNS Tumor", "Primary Tumor")
relapse_descs <- c("Recurrence", "Progressive", "Progressive Disease Post Mortem")
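
As an illustration of how the updated values might be applied, the sketch below filters a made-up toy table by experimental_strategy and then by tumor_descriptor. It uses base R only; the toy rows and the names `dna_strategies`, `dna_df`, `primary_df`, and `relapse_df` are assumptions for this example, not names taken from the module scripts.

```r
primary_descs <- c("Initial CNS Tumor", "Primary Tumor")
relapse_descs <- c("Recurrence", "Progressive", "Progressive Disease Post Mortem")
# Hypothetical name for the updated experimental_strategy values.
dna_strategies <- c("WGS", "WXS", "Targeted Sequencing")

# Toy stand-in for histologies.tsv; every value here is made up.
toy_hdf <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_1", "BS_2", "BS_3", "BS_4", "BS_5"),
  experimental_strategy = c("WGS", "RNA-Seq", "Targeted Sequencing", "WXS", "WGS"),
  tumor_descriptor = c("Primary Tumor", "Primary Tumor", "Recurrence",
                       "Initial CNS Tumor", "Progressive Disease Post Mortem"),
  stringsAsFactors = FALSE
)

# Keep DNA samples, then split them into primary and relapse subsets.
dna_df <- toy_hdf[toy_hdf$experimental_strategy %in% dna_strategies, ]
primary_df <- dna_df[dna_df$tumor_descriptor %in% primary_descs, ]
relapse_df <- dna_df[dna_df$tumor_descriptor %in% relapse_descs, ]

print(primary_df$Kids_First_Biospecimen_ID)
print(relapse_df$Kids_First_Biospecimen_ID)
```

Because "Targeted-Capture" is already harmonized to "Targeted Sequencing" in v6, the filter does not need to list it separately.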

What changes need to be made? Please provide enough detail for another participant to make the update.

Run the same procedure on each cohort and cancer_group. Include cohort and cancer_group fields in the result lists. Combine all result lists into one list.
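
The procedure described above could be sketched roughly as follows. This is a simplified base R illustration on made-up data: `pick_one_per_participant` is a hypothetical helper that keeps one randomly chosen specimen per participant, which stands in for the module's actual selection logic and is an assumption of this example.

```r
# Toy stand-in for histologies.tsv; every value here is made up.
toy_hdf <- data.frame(
  Kids_First_Participant_ID = c("P1", "P1", "P1", "P2", "P2", "P3"),
  Kids_First_Biospecimen_ID = c("BS_A", "BS_B", "T_A", "BS_C", "BS_D", "T_B"),
  cohort = c("GMKF", "GMKF", "TARGET", "GMKF", "GMKF", "TARGET"),
  cancer_group = rep("Neuroblastoma", 6),
  stringsAsFactors = FALSE
)

# Hypothetical helper: shuffle the rows, then keep the first row per participant.
pick_one_per_participant <- function(df) {
  shuffled <- df[sample(nrow(df)), , drop = FALSE]
  shuffled[!duplicated(shuffled$Kids_First_Participant_ID), , drop = FALSE]
}

set.seed(2021)

# Run the same selection within each (cohort, cancer_group), then combine.
groups <- split(toy_hdf, list(toy_hdf$cohort, toy_hdf$cancer_group), drop = TRUE)
each_cohort_list <- do.call(rbind, lapply(groups, pick_one_per_participant))
rownames(each_cohort_list) <- NULL

# For comparison: selecting across all samples keeps one cohort per participant.
all_cohorts_list <- pick_one_per_participant(toy_hdf)

print(each_cohort_list)
print(all_cohorts_list)
```

In the toy data, participant P1 has samples in both GMKF and TARGET. The each-cohort list keeps one P1 sample per cohort, while the all-cohorts list keeps only one P1 sample overall. The cohort and cancer_group columns are carried through to the combined list.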

Update experimental_strategy and tumor_descriptor where they are used.

What input data should be used? Which data were used in the version being updated?

histologies.tsv, which is updated in v6.

When do you expect the revised analysis will be completed?

1-3 days.

Who will complete the updated analysis?

@runjin326

cc: @jharenza

@logstar
Author

logstar commented Jul 8, 2021

The code and results are updated in d3b-center/OpenPedCan-analysis#48.

The README.md needs to be updated for the each-cohort procedure. We would probably need to clarify that in each-cohort function calls, we actually find independent samples from each cohort and each cancer_group.

@jharenza
Member

@runjin326 can you work on this?

@runjin326

@jharenza, absolutely! I will do that :)

@logstar
Author

logstar commented Jul 13, 2021

The README.md is updated in d3b-center/OpenPedCan-analysis#51.

@logstar logstar closed this as completed Jul 13, 2021