-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Annotate SNV table with mutation frequencies #45
Conversation
Squashed commit of the following: commit d87986b7ce1a517f4807430ce6beaac5950b50ca Author: logstar <[email protected]> Date: Wed Jun 30 17:21:59 2021 -0400 Rename mutation-frequencies to snv-frequencies Rename module. commit b2d2fd5c391b43214825e7b458d0edcb5ac22f1a Author: logstar <[email protected]> Date: Wed Jun 30 17:13:18 2021 -0400 Annotate SNV table with mutation frequencies Issues addressed: - <d3b-center/ticket-tracker-OPC#64> - <d3b-center/ticket-tracker-OPC#8>. This issue is no longer compatible with the purpose of this module. This module intends to compute mutation frequencies for each variant, but this issue intents to compute the mutation frequencies for each gene. This issue is listed here for future reference. commit 84cacf28927121037f4b9ba895e5baa5d12c7b31 Author: logstar <[email protected]> Date: Wed Jun 30 16:23:20 2021 -0400 [WIP] Update run-mutation-frequencies.sh commit 29ae8ef19f2339ae08f78c26ab42e6cf75d3556e Author: logstar <[email protected]> Date: Wed Jun 30 16:14:50 2021 -0400 [WIP] Generate annotated SNV frequency table commit 2cb06741ca192f77a3043d03574649a184459b11 Author: logstar <[email protected]> Date: Wed Jun 30 14:54:39 2021 -0400 [WIP] Replace NA with blank string Also replaced HotSpot value 1 with Y and 0 with N. commit 57776f61e576a5e3e2672370370fd1090f3aa478 Author: logstar <[email protected]> Date: Wed Jun 30 14:02:13 2021 -0400 [WIP] Use mygene.info to query gene IDs mygene.info seems to be actively maintained. The query results are more comprehensive than [biomaRt](https://bioconductor.org/packages/release/bioc/html/biomaRt.html). Relevant URLs: - <http://mygene.info/about> - <https://bioconductor.org/packages/release/bioc/html/mygene.html> mygene.info is suggested by @taylordm and @jharenza. commit 76bb0f5236378648adce429e45d3827009735b58 Author: logstar <[email protected]> Date: Wed Jun 30 10:55:30 2021 -0400 [WIP] Generate SNV frequency tables Issue addressed: d3b-center/ticket-tracker-OPC#64
Add snv-frequencies module in the table.
When there is no primary or relapse samples, mutation frequencies are 0/0 = NaN in R. Replace 'NaN%' in the output with '', in order to be consistent with missing values in other columns.
Update to this PR: Replace When there is no primary or relapse sample in a In the result table, replace |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @logstar! Thanks for the quick PR for this table!
Now that #34 is in, will you update this with the files in v6? This will change your CBTN dataset/cohort to PBTA and will condense a few cancer groups. In addition, per our call on Wednesday, we want to add the EFO and MONDO columns to this file, and those can be added with a merge of efo-mondo-map.tsv
. Note, this is not yet complete - we only have about half of the cancer groups in there so far, but it is a start.
For CBTN&GMKF&TARGET
in Dataset, can you recode this to PedOT
- we will use this general designation to mean inclusive of all datasets.
I also noticed that you have some NAs in SIFT, PolyPhen, and some other columns when I read this into R using read_tsv()
, but it looks like if I do:
freq <- read_tsv("~/Documents/GitHub/ped-ot/OpenPedCan-analysis/analyses/snv-frequencies/results/snv-consensus-annotated-mut-freq.tsv",
na=character(0))
the blanks are retained, so I wanted to note that here because when we do the JSON conversions, which may be in staggered PRs, we will have to add that argument.
Overall, the format looks good, however, I am worried about the results - for instance in neuroblastoma, the expected number of ALK mutations in primary tumors should be ~10%, but we are only seeing ~5.85% and this doesn't align with the current GMKF pedcbio study which has a frequency of 9%. There are some mutations in pedcbio we are not capturing here, but I have a suspicion that the study loaded in pedcbio is only using Strelka2 data and not a consensus of 4 callers, so, I have to check into this a bit. Regardless, even with consensus, this number should be higher. cc @migbro @yuankunzhu
Thank you for the review @jharenza !
Yes. I will update this with the v6 data release files and add EFO and MONDO columns.
Sure. Will do.
Good catch! I will add it in the README.md. It will be helpful for anyone who wants to read the tsv table. For JSON conversion, I was wondering if I could add
I see a similar ~6.10% ALK overall SNV frequency in my updated |
Add note on how to read snv-consensus-annotated-mut-freq.tsv without converting blank strings to NAs.
Need the Dockerfile update to install bioc mygene.
Updates to this PR:
|
gzip JSON to save space.
Update to this PR:
Output jsonlite::write_json(
m_mut_freq_tbl,
file.path(tables_dir, 'snv-consensus-annotated-mut-freq.json'))
# --no-name option stops the filename and timestamp from being stored in the
# output file. So rerun will have the same file.
gzip --no-name results/snv-consensus-annotated-mut-freq.json |
There is one duplicate of ENSG-HUGO mapping. Also added additional assertions.
Move RMTL column after Gene_symbol. Rename Variant_ID to Variant_ID_hg38.
Use `%>% mutate_all(function(x) replace_na(x, replace = ''))` to replace NA/NaNs in all columns with ''. Remove the added column that is not used in the script.
In run-snv-frequencies.sh, remove existing results dir before running 01-snv-frequencies.R, so gzip will not ask for confirmation to overwrite.
Hi @logstar just wanted to ask if you envision 1 module for each data type like |
Thank you for asking. I think 1 module for each data type might be more efficient to implement, document, maintain, and review. If we need to combine the tables into one in the future, we could add another module for combining If there is any reason that we should have one module for all data types, we could also proceed with that. |
Removed the `rm -r results` part in run-rna-seq-expression-summary-stats.sh, as suggested by @jharenza at d3b-center#27 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code overall is well documented and implements all the checks 👍
Do you think we could divide the 01
script to have all your functions in 1 or multiple utils
scripts in this module to gather the counts given a Kids_First_Biospecimen_ID and a Variant_ID/Alt_ID and another 01
script to read mutation/annotation files, call this function and left_join the counts?
Just wanted to discuss this since I think this would a good way to have reusable function for other data types ( I did play around your code and seem to have one for fusion. which I can add as PR for your review) . Thoughts?
Update README.md for recent patches.
Thank you for the review @kgaonkar6 ! I agree that we could reuse the workflow design of this module. I will also extract a function to generate the frequency related columns and put it in utils, in order to reuse it to generate the gene-level tables that we discussed this morning. If you need the function immediately, you could adapt the linked code for your own purposes, https://github.com/logstar/OpenPedCan-analysis/blob/f70645b6c7e4eb15ea29e45e9ebf0adeb5798b9b/analyses/snv-frequencies/01-snv-frequencies.R#L146-L227. However, I woud prefer to use the code from other modules by adapting them rather than directly resuing via |
test_num_to_pct_chr.R tests the num_to_pct_chr function in 01-snv-frequencies.R.
Test get_cohort_set_value in tests/test_get_cohort_set_value.R.
get_pcb_pot_csi is tested in tests/test_get_pcb_pot_csi.R.
Test get_pcb_pot_plot_url in tests/test_get_pcb_pot_plot_url.R.
Put the imported function in the same environment as the import_function being called, so the imported functions can call other imported functions.
Require Kids_First_Biospecimen_ID and Kids_First_Participant_ID columns exist and have no NA in overall_histology_df, primary_histology_df, and relapse_histology_df.
get_opr_mut_freq_tbl is the core function to compute mutation frequencies. get_opr_mut_freq_tbl is called by get_cg_ch_var_level_mut_freq_tbl and get_cg_ch_gene_level_mut_freq_tbl to compute mutation frequencies. get_cg_ch_var_level_mut_freq_tbl and get_cg_ch_gene_level_mut_freq_tbl are not tested, because they only format the mutation frequency tables generated by get_opr_mut_freq_tbl.
@jharenza @kgaonkar6 I have added unit testing for this PR and removed test case comments in All tests passed in the Docker image. See the commits from c58c8d5 for more details.
I will work on d3b-center/ticket-tracker-OPC#122 before incorporating the |
In import_function, envir = parent.frame() makes the environment(imported_function) to be the same as the import function being called. Even though the eval documentation says the default envir parameter is parent.frame(), leaving envir = parent.frame() in the eval call will surprisingly make the environment(imported_function) to be the same as the environment of the import_function call. The tests also fail without specifying envir = parent.frame(). Maybe this is caused by the place where parent.frame() is evaluated? If specified, parent.frame() is evaluated in the calling function; if not specified, parent.frame() is evaluated in the eval call environment? Examples executed in terminal R, in order to avoid RStudio customizations. > print(environment()) <environment: R_GlobalEnv> > > foo <- function() { + print(parent.frame()) + print(environment()) + return(eval(quote(function() { return(2) }))) + } > > bar <- foo() <environment: R_GlobalEnv> <environment: 0x5645ead67670> > print(environment(bar)) <environment: 0x5645ead67670> > > baz <- function() { + print(parent.frame()) + print(environment()) + return(eval(quote(function() { return(2) }), + envir = parent.frame())) + } > > qux <- baz() <environment: R_GlobalEnv> <environment: 0x5645ead5fec0> > print(environment(qux)) <environment: R_GlobalEnv>
Added WIP label to this PR. I will use the I will keep using v6 data and independent sample lists in this PR, and I will create a ticket for rerunning the |
@jharenza @kgaonkar6 Sorry for changing the plan. This PR will not be further changed for integrating the I will create a new PR for integrating the Following is the clarification for the forced push: The forced push is to completely revert merging with the The forced push completely reverted this PR to the state of this commit 861be07, which is the latest one before merging with the I assumed no one has pulled the A merge can also be reverted with
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really nice job on this - two minor comments about removing redundant json files and renaming the variant level file, and once addressed, this can be merged.
Remove results/gene-level-snv-consensus-annotated-mut-freq.json.gz and results/var-level-snv-consensus-annotated-mut-freq.json.gz.
As suggested by @jharenza at <d3b-center#45 (comment)>, use variant-level, rather than var-level, in the variant-level filenames.
Purpose/implementation Section
What scientific question is your analysis addressing?
Annotate each non-synonymous variant in
snv-consensus-plus-hotspots.maf.tsv.gz
with mutation frequencies per(cohort, cancer_group, primary/relapse)
.What was your approach?
Subset
snv-consensus-plus-hotspots.maf.tsv.gz
to keep only samples withsample_type == 'Tumor'
.Subset
snv-consensus-plus-hotspots.maf.tsv.gz
to keep only non-synonymous variants with the following code.Create a
Variant_ID
for each variant by concatenatingChromosome
,Start_Position
,Reference_Allele
, andTumor_Seq_Allele2
with'_'
.Add
Gene_full_name
andProtein_RefSeq_ID
columns to each variant with annotations obtained from mygene.info.For each
cancer_group
, get each cohort and all cohorts. Call eachcancer_group
andcohort
(s) combination as acancer_group_cohort
. For example,For each
cancer_group_cohort
withn_samples
>= 5, computeFrequency_in_overall_dataset
,Frequency_in_primary_tumors
, andFrequency_in_relapse_tumors
as following:Frequency_in_overall_dataset
:Kids_First_Participant_ID
) that have the variant, and call this numberTotal_mutations
.cancer_group_cohort
, and call this numberPatients_in_dataset
.Frequency_in_overall_dataset = Total_mutations / Patients_in_dataset
.Frequency_in_primary_tumors
:Kids_First_Biospecimen_ID
) that are in the../independent-samples/results/independent-specimens.wgs.primary.tsv
, and call this numberTotal_primary_tumors_mutated
.cancer_group_cohort
that are also in the../independent-samples/results/independent-specimens.wgs.primary.tsv
, and call this numberPrimary_tumors_in_dataset
.Frequency_in_primary_tumors = Total_primary_tumors_mutated / Primary_tumors_in_dataset
.Frequency_in_relapse_tumors
:Kids_First_Biospecimen_ID
) that are in the../independent-samples/results/independent-specimens.wgs.relapse.tsv
, and call this numberTotal_relapse_tumors_mutated
.cancer_group_cohort
that are also in the../independent-samples/results/independent-specimens.wgs.relapse.tsv
, and call this numberRelapse_tumors_in_dataset
.Frequency_in_relapse_tumors = Total_relapse_tumors_mutated / Relapse_tumors_in_dataset
.Format the SNV mutation frequency table according to the latest spreadsheet that is attached in d3b-center/ticket-tracker-OPC#64.
Merge the SNV mutation frequency tables of all
cancer_group_cohort
s.What GitHub issue does your pull request address?
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
The
get_cg_ch_mut_freq_tbl
function for computing frequencies.The
get_cg_cs_tbl
function for gettingcacner_group_cohort
s.Is there anything that you want to discuss further?
Is it OK to retain the
chr
prefix in theChromosome
when creatingIdentifer hg38 (See ClinVar ex)
(Variant_ID
in the result table)? The example value is10_102599545_G_A
in the spreadsheet, but it will bechr10_102599545_G_A
in the output table.I will add RMTL and COSMIC annotations when
CosmicMutantExportCensus.tsv
andRMTL_Expanded_3_RMTL_for_OT.csv
are available in future data releases.I will work on d3b-center/ticket-tracker-OPC#82 to add oncoKB annotations to the result table.
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
Yes.
Results
What types of results are included (e.g., table, figure)?
Table
What is your summary of the results?
Reproducibility Checklist
Documentation Checklist
README
and it is up to date.analyses/README.md
and the entry is up to date.