compleasm output also cds nucleotide sequences for the final set of genes #19
Comments
I just noticed another detail which makes it a bit difficult to extract the single-copy gene CDS from the GFF files. Out of >5300 genes in the Hymenoptera ODB10 dataset, 11 were found twice, i.e. with the same BUSCO-ID_X (see below). The other cases have incremental numbers and are unique, and one of those is apparently picked for the final result. For the identical IDs in the GFF this is also the case (only one of them is picked), but I would need to check each case to know which one is the correct one. Not sure if this is intended behavior. Double entry for 6888at7399_5 |
Hi @estolle, The output file |
Thanks for the pointer. Indeed, the full table has some info I can use, and I think I can get the info I need now. Still, if compleasm internally knows the CDS and could output a multi-FASTA (matching the gene marker FASTA), that would be very helpful.
One issue I still see is the duplication of a BUSCO gene entry which is used for two different gene IDs in 2 different places, both marked as "single". Yet the BUSCO gene ID is identical. The difference is that one has "Rank=1" and the other "Rank=2", and the Rank=1 version is the final one in the gene marker FASTA and the full table. So at the moment I am filtering out entries not matching Rank=1. Do you have an idea how this duplication can be explained?
cat miniprot_output.single.frag.gff | grep "6888at7399_5"
WINE01001833.1 miniprot mRNA 26038 29230 2075 + . ID=MP049008;Rank=2;Identity=0.6583;Positive=0.7722;Target=6888at7399_5 1 622 |
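For anyone applying the same workaround, the Rank=1 filter described above could be sketched in Python roughly as follows. This is only a sketch, not compleasm's own logic: it assumes the GFF is tab-separated with the attributes in column 9 in the `ID=...;Rank=...;Target=...` layout of the line quoted above, and the function names `parse_attributes` and `rank1_mrna_by_target` are made up for illustration.

```python
def parse_attributes(field):
    """Turn 'ID=MP049008;Rank=2;Target=6888at7399_5 1 622' into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = value
    return attrs


def rank1_mrna_by_target(gff_lines):
    """Map each BUSCO target ID to the miniprot mRNA IDs that have Rank=1.

    Assumption: columns are tab-separated and column 9 holds the attributes.
    """
    kept = {}
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "mRNA":
            continue
        attrs = parse_attributes(cols[8])
        if attrs.get("Rank") != "1":
            continue  # drop Rank=2 and lower-ranked hits
        target_field = attrs.get("Target", "")
        if not target_field:
            continue
        # Target looks like '6888at7399_5 1 622'; the first token is the ID.
        target = target_field.split()[0]
        kept.setdefault(target, []).append(attrs.get("ID"))
    return kept
```

Calling `rank1_mrna_by_target(open("miniprot_output.single.frag.gff"))` would then give one list of mRNA IDs per BUSCO ID. Any target that still maps to more than one ID after this filter would need a manual look.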
Hi
Thanks for this great software.
Not a bug, but a question/enhancement request:
Is there a way to also output the coding sequence for each final BUSCO gene? This would be really useful in certain phylogenetic analyses. Currently I manage to get it from the miniprot GFF output file, but I first have to cross-check the gene marker FASTA output for the variant of the respective BUSCO gene that was used and then extract the CDS from there.
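Until such an option exists, the manual extraction described above could look something like the sketch below. It makes several assumptions that are not confirmed in this thread: that the GFF is tab-separated, that CDS rows carry a `Parent=` attribute pointing at the mRNA ID, and that the genome is already loaded as a plain dict of contig name to sequence. `cds_by_parent` and `reverse_complement` are illustrative names, not part of compleasm or miniprot.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a nucleotide string (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def cds_by_parent(gff_lines, genome):
    """Concatenate the CDS segments of each parent mRNA into one sequence.

    genome: dict mapping contig name -> sequence string (assumed to be
    loaded elsewhere). GFF coordinates are 1-based and inclusive.
    """
    segments = defaultdict(list)
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue  # mRNA and stop_codon rows are ignored here
        attrs = dict(
            item.split("=", 1) for item in cols[8].split(";") if "=" in item
        )
        parent = attrs.get("Parent")
        if parent is None:
            continue
        segments[parent].append((cols[0], int(cols[3]), int(cols[4]), cols[6]))

    sequences = {}
    for parent, segs in segments.items():
        segs.sort(key=lambda s: s[1])  # ascending genomic start
        contig, strand = segs[0][0], segs[0][3]
        seq = "".join(genome[contig][start - 1:end] for _, start, end, _ in segs)
        # Minus-strand genes: join in genomic order, then reverse-complement.
        sequences[parent] = reverse_complement(seq) if strand == "-" else seq
    return sequences
```

The returned dict is keyed by mRNA ID, so it would still have to be matched against the variant that ended up in the gene marker FASTA. Whether the stop codon is included in the CDS rows is something to check against the actual miniprot output.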