d3b-center · logstar · Jul 13, 2021
@@ -6,23 +6,50 @@ Many analyses that involve mutation frequencies or co-occurence require that all
 However, the PBTA+GMKF data set includes many cases where multiple specimens were taken from a single individual.
 This analysis creates lists of samples such that there are no cases where more than one specimen is included from each individual.
 
-As different analyses may require different sets of data, we actually generate a few different sets, stored in the `results` subdirectory:
+As different analyses may require different sets of data, we actually generate a few different sets, stored in the `results` subdirectory. We also run the analysis with one of two values of 'independent_level': 'each-cohort' or 'all-cohorts'. With 'each-cohort', independent samples are called (based on Kids_First_Participant_ID) separately for each cohort + cancer_type combination, so specimens from the same participant that appear in different cohorts are each treated as independent. With 'all-cohorts', independent samples are called (based on Kids_First_Participant_ID) regardless of cohort or cancer_type, so specimens from the same participant in different cohorts are treated as coming from the same individual.
+
+The following outputs are generated when we run with 'all-cohorts':
 * Primary specimens only with whole genome sequence (WGS):  
 `independent-specimens.wgs.primary.tsv`
-* Secondary specimens with WGS:  
-`independent-specimens.wgs.secondary.tsv`
-* Primary and secondary specimens with WGS:  
+* Relapse specimens with WGS:  
+`independent-specimens.wgs.relapse.tsv`
+* Primary and relapse specimens with WGS:  
 `independent-specimens.wgs.primary-plus.tsv`
 * Primary specimens only with either WGS or whole exome sequence (WXS) or Panel:  
 `independent-specimens.wgswxspanel.primary.tsv`
-* Secondary specimens only with either WGS or whole exome sequence (WXS) or Panel:  
-`independent-specimens.wgswxspanel.secondary.tsv`
-* Primary and secondary specimens with WGS or WXS or Panel:  
+* Relapse specimens only with either WGS or whole exome sequence (WXS) or Panel:  
+`independent-specimens.wgswxspanel.relapse.tsv`
+* Primary and relapse specimens with WGS or WXS or Panel:  
 `independent-specimens.wgswxspanel.primary-plus.tsv`
 
-* Primary and secondary RNA-Seq specimens matching WGS/WXS/Panel independent sample_ids plus only-RNA-Seq 
+The following outputs are generated when we run with 'each-cohort':
+* Primary specimens only with whole genome sequence (WGS):  
+`independent-specimens.wgs.primary.eachcohort.tsv`
+* Relapse specimens with WGS:  
+`independent-specimens.wgs.relapse.eachcohort.tsv`
+* Primary and relapse specimens with WGS:  
+`independent-specimens.wgs.primary-plus.eachcohort.tsv`
+* Primary specimens only with either WGS or whole exome sequence (WXS) or Panel:  
+`independent-specimens.wgswxspanel.primary.eachcohort.tsv`
+* Relapse specimens only with either WGS or whole exome sequence (WXS) or Panel:  
+`independent-specimens.wgswxspanel.relapse.eachcohort.tsv`
+* Primary and relapse specimens with WGS or WXS or Panel:  
+`independent-specimens.wgswxspanel.primary-plus.eachcohort.tsv`
+
+Similarly, for independent RNA samples, we run with either 'all-cohorts' or 'each-cohort'.
+When run with 'each-cohort', the independent DNA samples generated with 'each-cohort' are used as the starting point (see code for details); when run with 'all-cohorts', the independent DNA samples generated with 'all-cohorts' are used as the starting point.
+
+The following RNA-Seq outputs are generated when we run with 'all-cohorts':
+* Primary and relapse RNA-Seq specimens matching WGS/WXS/Panel independent sample_ids plus only-RNA-Seq 
 `independent-specimens.rnaseq.primary-plus.tsv`
+`independent-specimens.rnaseq.primary.tsv`
+`independent-specimens.rnaseq.relapse.tsv`
 
+The following RNA-Seq outputs are generated when we run with 'each-cohort':
+* Primary and relapse RNA-Seq specimens matching WGS/WXS/Panel independent sample_ids plus only-RNA-Seq 
+`independent-specimens.rnaseq.primary-plus.eachcohort.tsv`
+`independent-specimens.rnaseq.primary.eachcohort.tsv`
+`independent-specimens.rnaseq.relapse.eachcohort.tsv`
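
The file names listed above follow a regular pattern, so the expected outputs of a run can be enumerated and checked. The Python sketch below is illustrative only and is not part of the analysis code (which is written in R). The function names `output_name` and `expected_outputs` are hypothetical, and the sketch assumes every sequencing set is combined with every tumor set, as in the lists above.

```python
from itertools import product

# Components of the file names, taken from the lists above.
SEQ_SETS = ("wgs", "wgswxspanel", "rnaseq")
TUMOR_SETS = ("primary", "relapse", "primary-plus")


def output_name(seq_set, tumor_set, independent_level):
    """Build one output file name (hypothetical helper)."""
    if seq_set not in SEQ_SETS:
        raise ValueError(f"unknown sequencing set: {seq_set}")
    if tumor_set not in TUMOR_SETS:
        raise ValueError(f"unknown tumor set: {tumor_set}")
    if independent_level == "all-cohorts":
        suffix = ""  # 'all-cohorts' files carry no extra suffix
    elif independent_level == "each-cohort":
        suffix = ".eachcohort"
    else:
        raise ValueError(f"unknown independent_level: {independent_level}")
    return f"independent-specimens.{seq_set}.{tumor_set}{suffix}.tsv"


def expected_outputs(independent_level):
    """List every file name expected for one independent_level."""
    return [
        output_name(seq_set, tumor_set, independent_level)
        for seq_set, tumor_set in product(SEQ_SETS, TUMOR_SETS)
    ]


if __name__ == "__main__":
    for name in expected_outputs("each-cohort"):
        print(name)
```

A list like this could be compared against the contents of the `results` subdirectory after running the analysis to spot a missing file.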
 
 ## Generating sample lists
 
@@ -39,11 +66,10 @@ bash analyses/independent-samples/run-independent-samples.sh
 ```
 
 ## Methods
-
 When presented with more than one specimen from a given individual with a specific cancer group and cohort, the script randomly selects one specimen to include, with preference for primary tumors and whole genome sequences where available.
 There is also a preference for the earliest collected samples, but as this data is not currently available, that code is currently deleted.
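
The selection rule described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the repository's R implementation. The function name `pick_independent` and the column names `cohort`, `cancer_group`, `experimental_strategy`, and `Kids_First_Biospecimen_ID` are assumptions; `Kids_First_Participant_ID` and `tumor_descriptor` come from the text. The sketch also assumes that the primary-tumor preference is applied before the WGS preference, which the text does not specify.

```python
import random
from collections import defaultdict

# Descriptors treated as primary tumors, per the `tumor_descriptor` note.
PRIMARY_DESCRIPTORS = {"Initial CNS Tumor", "Primary Tumor"}


def pick_independent(specimens, independent_level="all-cohorts", seed=None):
    """Pick one specimen per group (hypothetical sketch).

    specimens: list of dicts, one per specimen.
    independent_level: 'all-cohorts' groups by participant only;
        'each-cohort' groups by participant, cohort, and cancer group.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for spec in specimens:
        key = (spec["Kids_First_Participant_ID"],)
        if independent_level == "each-cohort":
            key += (spec["cohort"], spec["cancer_group"])
        groups[key].append(spec)

    chosen = []
    for key in sorted(groups):
        candidates = groups[key]
        # Assumed order: prefer primary tumors first, falling back to all.
        primary = [s for s in candidates
                   if s["tumor_descriptor"] in PRIMARY_DESCRIPTORS]
        candidates = primary or candidates
        # Then prefer WGS among what remains, falling back again.
        wgs = [s for s in candidates if s["experimental_strategy"] == "WGS"]
        candidates = wgs or candidates
        # Random choice among the remaining, equally preferred specimens.
        chosen.append(rng.choice(candidates))
    return chosen
```

Passing a fixed `seed` makes the random choice reproducible, which mirrors the optional seed parameter documented for the R function.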
 
-When multiple RNA-Seq samples exist per participant, the script matches the independent whole genome or whole exome sample_ids to gather matched RNA-Seq sample. If participant has onle RNA-Seq sample then a primary (and secondary if applicable) sample is randomly selected per participant per cancer group per cohort. 
+When multiple RNA-Seq samples exist per participant, the script matches the independent whole genome or whole exome sample_ids to gather the matched RNA-Seq sample. If a participant has only RNA-Seq samples, then a primary (and relapse, if applicable) sample is randomly selected per participant per cancer group per cohort.
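
The RNA-Seq matching step can be illustrated with a simplified Python sketch. It is not the repository's R code. The function name `pick_rnaseq` is hypothetical, grouping is by participant only (the per cancer group, per cohort, and primary/relapse handling are omitted for brevity), and falling back to a random RNA-Seq specimen when no `sample_id` matches is an assumption of this sketch.

```python
import random


def pick_rnaseq(rna_specimens, independent_dna_sample_ids, seed=None):
    """Pick one RNA-Seq specimen per participant (hypothetical sketch).

    rna_specimens: list of dicts with 'Kids_First_Participant_ID'
        and 'sample_id' keys.
    independent_dna_sample_ids: set of sample_ids already selected
        in the independent DNA lists.
    """
    rng = random.Random(seed)
    by_participant = {}
    for spec in rna_specimens:
        by_participant.setdefault(
            spec["Kids_First_Participant_ID"], []
        ).append(spec)

    chosen = []
    for participant in sorted(by_participant):
        candidates = by_participant[participant]
        # Prefer RNA-Seq specimens whose sample_id matches a selected DNA sample.
        matched = [s for s in candidates
                   if s["sample_id"] in independent_dna_sample_ids]
        # Otherwise (e.g. RNA-Seq only participants) choose at random.
        chosen.append(rng.choice(matched or candidates))
    return chosen
```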
 
 ## Relevant links
 The methods are described in the manuscript here:

@@ -16,8 +16,8 @@
 #' @param tumor_types Designates which types of tumors will be included. Options
 #'   are "primary" to include only primary tumors, "prefer_primary" to include
 #'   primary tumors when available, but fall back to other types, or "any" to
-#'   randomly select among all available specimens. As of v5, primary tumors
-#'   are defined as those designated "Initial CNS Tumor" in the
+#'   randomly select among all available specimens. As of v6, primary tumors
+#'   are defined as those designated "Initial CNS Tumor" or "Primary Tumor" in the
 #'   `tumor_descriptor` field.
 #' @param seed An optional random number seed. 
 #'