-
Notifications
You must be signed in to change notification settings - Fork 705
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use of salmon counts for DESeq2 #499
Comments
Thank you for reporting this @tamuanand :) Yes, I included the Salmon PCA plots as a bonus really because I got carried away with the DSL2 modularisation of the pipeline. I am guilty of not using Salmon very much. I think @olgabot @lpantano added this functionality to the pipeline. How would you propose we fix this? |
we can use a parameter to the pipeline and this passed to the R script? that would be the countsFromAbundance option? |
I would propose that the pipeline can give all possible outputs from Salmon+tximport, but only
This is how I have implemented within my code - In my case, I have quant.sf within foldernames #!/usr/bin/env Rscript
args = commandArgs(trailingOnly=TRUE)
# test if there is at least one argument: if not, return an error
if (length(args) < 1) {
stop("This script requires at least 2 arguments\nArg1=path to directory that has sample info which then have quant.sf file in
their subfolders\nArg2=Folder path to a directory that has a 2 column file that ends in tx2gene.tsv - column1 is transcript, column2 is gene\n",
call.=FALSE)
}
get_quants <- function(path1, ...) {
additionalPath = list(...)
suppressMessages(library(tximport))
suppressMessages(library(readr))
salmon_filepaths=file.path(path=path1,list.files(path1,recursive=TRUE, pattern="quant.sf"))
samples = data.frame(samples = gsub(".*?quant/salmon_(.*?)/quant.sf", "\\1", salmon_filepaths) )
row.names(samples)=samples[,1]
names(salmon_filepaths)=samples$samples
# no tx2Gene available
salmon_tx_data_1 = tximport(salmon_filepaths, type="salmon",
txOut = TRUE)
salmon_tx_data_2 = tximport(salmon_filepaths, type="salmon",
txOut = TRUE, countsFromAbundance = "lengthScaledTPM")
salmon_tx_data_3 = tximport(salmon_filepaths, type="salmon",
txOut = TRUE, countsFromAbundance = "scaledTPM")
## Get transcript count summarization
write.csv(as.data.frame(salmon_tx_data_1$counts),
file = "tx_NumReads.csv", quote=FALSE)
write.csv(as.data.frame(salmon_tx_data_2$counts),
file = "lengthScaledTPM_tx_NumReads.csv", quote=FALSE)
write.csv(as.data.frame(salmon_tx_data_3$counts),
file = "scaledTPM_tx_NumReads.csv", quote=FALSE)
## Get TPM
write.csv(as.data.frame(salmon_tx_data_1$abundance),
file = "tx_TPM_Abundance.csv", quote=FALSE)
write.csv(as.data.frame(salmon_tx_data_2$abundance),
file = "lengthScaledTPM_tx_TPM_Abundance.csv", quote=FALSE)
write.csv(as.data.frame(salmon_tx_data_3$abundance),
file = "scaledTPM_tx_TPM_Abundance.csv", quote=FALSE)
rm(salmon_tx_data_1)
rm(salmon_tx_data_2)
rm(salmon_tx_data_3)
#tx2gene available
if(length(additionalPath > 0)) {
tx2geneFile = additionalPath[[1]]
my_tx2gene=read.csv(tx2geneFile,sep = "\t",stringsAsFactors = F, header=F)
salmon_tx2gene_data_1 = tximport(salmon_filepaths, type="salmon",
txOut = FALSE, tx2gene=my_tx2gene)
salmon_tx2gene_data_2 = tximport(salmon_filepaths, type="salmon",
txOut = FALSE, tx2gene=my_tx2gene,
countsFromAbundance = "lengthScaledTPM")
salmon_tx2gene_data_3 = tximport(salmon_filepaths, type="salmon",
txOut = FALSE, tx2gene=my_tx2gene,
countsFromAbundance = "scaledTPM")
salmon_tx2gene_data_4 = tximport(salmon_filepaths, type="salmon",
txOut = TRUE, tx2gene=my_tx2gene,
countsFromAbundance = "dtuScaledTPM")
## Get Gene count summarization
write.csv(as.data.frame(salmon_tx2gene_data_1$counts),
file = "tx2gene_NumReads.csv", quote=FALSE)
write.csv(as.data.frame(salmon_tx2gene_data_2$counts),
file = "lengthScaledTPM_tx2gene_NumReads.csv", quote=FALSE)
write.csv(as.data.frame(salmon_tx2gene_data_3$counts),
file = "scaledTPM_tx2gene_NumReads.csv", quote=FALSE)
write.csv(as.data.frame(salmon_tx2gene_data_4$counts),
file = "dtuScaledTPM_NumReads.csv", quote=FALSE)
## Get TPM
write.csv(as.data.frame(salmon_tx2gene_data_1$abundance),
file = "tx2gene_TPM_Abundance.csv", quote=FALSE)
write.csv(as.data.frame(salmon_tx2gene_data_2$abundance),
file = "lengthScaledTPM_tx2gene_TPM_Abundance.csv", quote=FALSE)
write.csv(as.data.frame(salmon_tx2gene_data_3$abundance),
file = "scaledTPM_tx2gene_TPM_Abundance.csv", quote=FALSE)
write.csv(as.data.frame(salmon_tx2gene_data_4$abundance),
file = "dtuScaledTPM_TPM_Abundance.csv", quote=FALSE)
rm(salmon_tx2gene_data_1)
rm(salmon_tx2gene_data_2)
rm(salmon_tx2gene_data_3)
rm(salmon_tx2gene_data_4)
}
}
if(length(args) == 1){
quant_dirpath = gsub("/$", '', args[1])
get_quants(quant_dirpath)
}else if(length(args) == 2){
quant_dirpath = gsub("/$", '', args[1])
tx2gene_File = gsub("/$", '', args[2])
suppressWarnings(get_quants(quant_dirpath, tx2gene_File))
}else{
stop("This script requires at least 2 arguments\nArg1=path to directory that has sample info which then have quant.sf file in
their subfolders\nArg2=Folder path to a directory that has a 2 column file - column1 is transcript, column2 is gene\n",
call.=FALSE)
} |
this is fine with me, if this version works, you could just do the PR with this update? not sure if the output section in the pipeline needs to be updated to store all these new files into the final folder. I think is good to have all and explained in the output documentation. |
I am not really comfortable with me doing the PR part as there are so many intricacies with this workflow and I am afraid I would inadvertently break some key functionality. I can exchange messages on this thread about my code above -- I can vouch that I have pressure tested all facets on the code I shared. Code aside - the key thing to be noted in the nf-core/rnaseq documentation is that Using the code example above, I would then use |
I am planning on doing a minor release soon so be good if we can fix this. If someone is able to submit a PR to fix the salmon_tximport.r R script that generates the counts then I am happy to do the rest. Does that sound reasonable? |
I can try this week, I am planning on Wednesday, if that is ok? |
Absolutely 👌 |
I believe this has now been addressed @tamuanand. Mostly by the work from @lpantano in #519. I updated the main script to use these counts for DESeq2 here. This will use the length scaled counts generated by Salmon as initially created here. How does this look now? Are we ok to close this issue? |
@drpatelh I closed this - probably you should remove the 'bug' label on this issue? |
Thanks @tamuanand 👍🏽 We can leave the label as it is for posterity. |
I got pointed here by Rob, thanks everyone here for the effort to include Salmon and tximport in the nf-core. Just to say, I typically use:
Happy to answer any remaining questions as they come up. |
Thanks @mikelove ! I should be releasing the shiny new version of the pipeline tomorrow! I have pointed users to this issue in the documentation so hopefully, that will help us refine the docs within the pipeline once we reach a consensus on exactly what is missing. |
@drpatelh Often people just use the default kmer size of 31 from salmon Here is a post from @rob-p on k-mer selection - COMBINE-lab/salmon#482 (comment) Suggestion: If ok with you, probably the bolded/italicized note below can be added to the nf-core documentation - where this section is listed
|
Done 🙂 ccd21ea |
Hi @drpatelh
Question/Suggestion on this - https://github.com/nf-core/rnaseq/blob/master/bin/salmon_tximport.r
I believe that if one has to use Salmon
quant.sf
counts for downstream DESeq2, the counts have to be "non-normalized".See below (different links and my thread with Mike Love on tximport with salmon counts for DESeq2)
My discussions with Mike Love
The text was updated successfully, but these errors were encountered: