README updated for independent samples module #51

Merged
merged 2 commits into from
Jul 13, 2021
46 changes: 36 additions & 10 deletions analyses/independent-samples/README.md
@@ -6,23 +6,50 @@ Many analyses that involve mutation frequencies or co-occurrence require that all
However, the PBTA+GMKF data set includes many cases where multiple specimens were taken from a single individual.
This analysis creates lists of samples such that there are no cases where more than one specimen is included from each individual.

As different analyses may require different sets of data, we actually generate a few different sets, stored in the `results` subdirectory:
As different analyses may require different sets of data, we generate a few different sets, stored in the `results` subdirectory. We also run the analysis at one of two 'independent_level' settings, 'each-cohort' or 'all-cohorts'. With 'each-cohort', independent samples are selected (based on Kids_First_Participant_ID) separately within each cohort and cancer type, so specimens from the same participant that appear in different cohorts are each treated as independent. With 'all-cohorts', independent samples are selected (based on Kids_First_Participant_ID) regardless of cohort or cancer type, so specimens from the same participant in different cohorts are treated as coming from the same individual.
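
The difference between the two levels comes down to which columns define a group from which one specimen is kept. The following is a minimal Python sketch of that idea, not the project's R implementation. The field names `Kids_First_Biospecimen_ID`, `cohort`, and `cancer_group`, the function names, and all values are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical specimen records (made-up IDs and values).
specimens = [
    {"Kids_First_Biospecimen_ID": "BS_1", "Kids_First_Participant_ID": "PT_A",
     "cohort": "PBTA", "cancer_group": "LGG"},
    {"Kids_First_Biospecimen_ID": "BS_2", "Kids_First_Participant_ID": "PT_A",
     "cohort": "GMKF", "cancer_group": "LGG"},
    {"Kids_First_Biospecimen_ID": "BS_3", "Kids_First_Participant_ID": "PT_B",
     "cohort": "PBTA", "cancer_group": "HGG"},
]


def group_key(row, independent_level):
    """Return the key that defines one 'individual' at the given level."""
    if independent_level == "each-cohort":
        # Same participant in a different cohort or cancer group is a new group.
        return (row["Kids_First_Participant_ID"], row["cohort"], row["cancer_group"])
    if independent_level == "all-cohorts":
        # Only the participant matters; cohort and cancer group are ignored.
        return (row["Kids_First_Participant_ID"],)
    raise ValueError(f"unknown independent_level: {independent_level!r}")


def group_specimens(rows, independent_level):
    """Map each group key to the specimen IDs that compete for selection."""
    groups = defaultdict(list)
    for row in rows:
        groups[group_key(row, independent_level)].append(row["Kids_First_Biospecimen_ID"])
    return dict(groups)


each_cohort_groups = group_specimens(specimens, "each-cohort")
all_cohorts_groups = group_specimens(specimens, "all-cohorts")
```

In this toy input, participant `PT_A` forms two groups under 'each-cohort' (one per cohort) but a single group under 'all-cohorts', so only one of its two specimens could be kept in the latter case.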

The following outputs are generated when we run with 'all-cohorts':
* Primary specimens only with whole genome sequence (WGS):
`independent-specimens.wgs.primary.tsv`
* Secondary specimens with WGS:
`independent-specimens.wgs.secondary.tsv`
* Primary and secondary specimens with WGS:
* Relapse specimens with WGS:
`independent-specimens.wgs.relapse.tsv`
* Primary and relapse specimens with WGS:
`independent-specimens.wgs.primary-plus.tsv`
* Primary specimens only with either WGS or whole exome sequence (WXS) or Panel:
`independent-specimens.wgswxspanel.primary.tsv`
* Secondary specimens only with either WGS or whole exome sequence (WXS) or Panel:
`independent-specimens.wgswxspanel.secondary.tsv`
* Primary and secondary specimens with WGS or WXS or Panel:
* Relapse specimens only with either WGS or whole exome sequence (WXS) or Panel:
`independent-specimens.wgswxspanel.relapse.tsv`
* Primary and relapse specimens with WGS or WXS or Panel:
`independent-specimens.wgswxspanel.primary-plus.tsv`

* Primary and secondary RNA-Seq specimens matching WGS/WXS/Panel independent sample_ids plus only-RNA-Seq
The following outputs are generated when we run with 'each-cohort':
* Primary specimens only with whole genome sequence (WGS):
`independent-specimens.wgs.primary.eachcohort.tsv`
* Relapse specimens with WGS:
`independent-specimens.wgs.relapse.eachcohort.tsv`
* Primary and relapse specimens with WGS:
`independent-specimens.wgs.primary-plus.eachcohort.tsv`
* Primary specimens only with either WGS or whole exome sequence (WXS) or Panel:
`independent-specimens.wgswxspanel.primary.eachcohort.tsv`
* Relapse specimens only with either WGS or whole exome sequence (WXS) or Panel:
`independent-specimens.wgswxspanel.relapse.eachcohort.tsv`
* Primary and relapse specimens with WGS or WXS or Panel:
`independent-specimens.wgswxspanel.primary-plus.eachcohort.tsv`

Similarly, for independent RNA samples, we also run with either 'all-cohorts' or 'each-cohort'.
When run with 'each-cohort', the independent DNA samples generated with 'each-cohort' are used as the starting point (see code for details); when run with 'all-cohorts', the independent DNA samples generated with 'all-cohorts' are used as the starting point.

The following outputs are generated when we run with 'all-cohorts':
* Primary and relapse RNA-Seq specimens matching WGS/WXS/Panel independent sample_ids plus only-RNA-Seq
`independent-specimens.rnaseq.primary-plus.tsv`
`independent-specimens.rnaseq.primary.tsv`
`independent-specimens.rnaseq.relapse.tsv`

The following outputs are generated when we run with 'each-cohort':
* Primary and relapse RNA-Seq specimens matching WGS/WXS/Panel independent sample_ids plus only-RNA-Seq
`independent-specimens.rnaseq.primary-plus.eachcohort.tsv`
`independent-specimens.rnaseq.primary.eachcohort.tsv`
`independent-specimens.rnaseq.relapse.eachcohort.tsv`
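
The file names added in this change follow one pattern: a sequencing-strategy label, a tumor-set label, and an `.eachcohort` suffix for the 'each-cohort' level. A short Python sketch that enumerates them is given below; the function name `expected_outputs` is hypothetical, and the sketch only restates the names listed in the updated README rather than describing how the R script builds them.

```python
from itertools import product

# Labels taken from the file names listed in the updated README.
STRATEGIES = ["wgs", "wgswxspanel", "rnaseq"]
TUMOR_SETS = ["primary", "relapse", "primary-plus"]


def expected_outputs(independent_level):
    """List the result file names expected for one independent_level."""
    if independent_level == "each-cohort":
        suffix = ".eachcohort"
    elif independent_level == "all-cohorts":
        suffix = ""
    else:
        raise ValueError(f"unknown independent_level: {independent_level!r}")
    return [
        f"independent-specimens.{strategy}.{tumor_set}{suffix}.tsv"
        for strategy, tumor_set in product(STRATEGIES, TUMOR_SETS)
    ]


all_cohorts_files = expected_outputs("all-cohorts")
each_cohort_files = expected_outputs("each-cohort")
```

A list like this could be used to check that the `results` subdirectory contains every expected file after a run.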

## Generating sample lists

@@ -39,11 +66,10 @@ bash analyses/independent-samples/run-independent-samples.sh
```

## Methods

When presented with more than one specimen from a given individual with a specific cancer group and cohort, the script randomly selects one specimen to include, with preference for primary tumors and whole genome sequences where available.
There is also a preference for the earliest collected samples, but as collection-date data are not currently available, that code has been removed.
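
The selection rule described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's R code: the field names `experimental_strategy`, `cohort`, and `cancer_group`, the `"Recurrence"` descriptor value, and the choice to apply the primary-tumor preference before the WGS preference are all assumptions. The primary-tumor descriptors are the two named in the `tumor_types` documentation of `independent-samples.R`.

```python
import random
from collections import defaultdict

# Descriptors treated as primary tumors, per the tumor_types documentation.
PRIMARY_DESCRIPTORS = {"Initial CNS Tumor", "Primary Tumor"}


def select_independent(rows, seed=None):
    """Pick one specimen per participant, cohort, and cancer group."""
    rng = random.Random(seed)  # local generator so the seed does not leak globally
    groups = defaultdict(list)
    for row in rows:
        key = (row["Kids_First_Participant_ID"], row["cohort"], row["cancer_group"])
        groups[key].append(row)

    selected = []
    for key in sorted(groups):  # sorted for a reproducible iteration order
        candidates = groups[key]
        # Prefer primary tumors when any are available (assumed to come first).
        primary = [r for r in candidates if r["tumor_descriptor"] in PRIMARY_DESCRIPTORS]
        if primary:
            candidates = primary
        # Then prefer whole genome sequence when any is available.
        wgs = [r for r in candidates if r["experimental_strategy"] == "WGS"]
        if wgs:
            candidates = wgs
        # Random choice among whatever remains.
        selected.append(rng.choice(candidates))
    return selected


# Hypothetical input: PT_A has three specimens, PT_B has one.
rows = [
    {"Kids_First_Biospecimen_ID": "BS_1", "Kids_First_Participant_ID": "PT_A", "cohort": "PBTA",
     "cancer_group": "LGG", "tumor_descriptor": "Initial CNS Tumor", "experimental_strategy": "WXS"},
    {"Kids_First_Biospecimen_ID": "BS_2", "Kids_First_Participant_ID": "PT_A", "cohort": "PBTA",
     "cancer_group": "LGG", "tumor_descriptor": "Recurrence", "experimental_strategy": "WGS"},
    {"Kids_First_Biospecimen_ID": "BS_3", "Kids_First_Participant_ID": "PT_A", "cohort": "PBTA",
     "cancer_group": "LGG", "tumor_descriptor": "Initial CNS Tumor", "experimental_strategy": "WGS"},
    {"Kids_First_Biospecimen_ID": "BS_4", "Kids_First_Participant_ID": "PT_B", "cohort": "PBTA",
     "cancer_group": "HGG", "tumor_descriptor": "Recurrence", "experimental_strategy": "WXS"},
]
chosen_ids = [r["Kids_First_Biospecimen_ID"] for r in select_independent(rows, seed=2021)]
```

Filtering in stages keeps the fallback behavior explicit: a participant with no primary tumor or no WGS specimen still contributes one specimen, chosen at random from what is available.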

When multiple RNA-Seq samples exist per participant, the script matches the independent whole genome or whole exome sample_ids to gather the matched RNA-Seq sample. If a participant has only RNA-Seq samples, then a primary (and secondary if applicable) sample is randomly selected per participant per cancer group per cohort.
When multiple RNA-Seq samples exist per participant, the script matches the independent whole genome or whole exome sample_ids to gather the matched RNA-Seq sample. If a participant has only RNA-Seq samples, then a primary (and relapse if applicable) sample is randomly selected per participant per cancer group per cohort.
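
The RNA-Seq matching step can be sketched in Python as below. This is a simplified illustration, not the project's R code: it keeps one RNA-Seq specimen per group and ignores the primary/relapse distinction. The field names `cohort` and `cancer_group`, the function name, and the fallback to a random pick when no RNA-Seq specimen shares a `sample_id` with the independent DNA list are assumptions.

```python
import random
from collections import defaultdict


def select_rnaseq(rna_rows, dna_sample_ids, seed=None):
    """Pick one RNA-Seq specimen per participant, cohort, and cancer group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rna_rows:
        key = (row["Kids_First_Participant_ID"], row["cohort"], row["cancer_group"])
        groups[key].append(row)

    selected = []
    for key in sorted(groups):
        candidates = groups[key]
        # Prefer RNA-Seq specimens whose sample_id is in the independent DNA list.
        matched = [r for r in candidates if r["sample_id"] in dna_sample_ids]
        # Assumed fallback: with no match (e.g. RNA-Seq only), choose at random.
        pool = matched if matched else candidates
        selected.append(rng.choice(pool))
    return selected


# Hypothetical input: PT_A has a DNA-matched RNA-Seq sample, PT_B is RNA-Seq only.
rna_rows = [
    {"Kids_First_Biospecimen_ID": "BS_R1", "Kids_First_Participant_ID": "PT_A",
     "cohort": "PBTA", "cancer_group": "LGG", "sample_id": "S1"},
    {"Kids_First_Biospecimen_ID": "BS_R2", "Kids_First_Participant_ID": "PT_A",
     "cohort": "PBTA", "cancer_group": "LGG", "sample_id": "S2"},
    {"Kids_First_Biospecimen_ID": "BS_R3", "Kids_First_Participant_ID": "PT_B",
     "cohort": "PBTA", "cancer_group": "HGG", "sample_id": "S9"},
]
rna_ids = [r["Kids_First_Biospecimen_ID"] for r in select_rnaseq(rna_rows, {"S2"}, seed=2021)]
```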

## Relevant links
The methods are described in the manuscript here:
4 changes: 2 additions & 2 deletions analyses/independent-samples/independent-samples.R
@@ -16,8 +16,8 @@
#' @param tumor_types Designates which types of tumors will be included. Options
#' are "primary" to include only primary tumors, "prefer_primary" to include
#' primary tumors when available, but fall back to other types, or "any" to
#' randomly select among all available specimens. As of v5, primary tumors
#' are defined as those designated "Initial CNS Tumor" in the
#' randomly select among all available specimens. As of v6, primary tumors
#' are defined as those designated "Initial CNS Tumor" or "Primary Tumor" in the
#' `tumor_descriptor` field.
#' @param seed An optional random number seed.
#'