Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Depleting kmers shared with another reference #3180

Closed
Amanda-Biocortex opened this issue May 29, 2024 · 8 comments
Closed

Depleting kmers shared with another reference #3180

Amanda-Biocortex opened this issue May 29, 2024 · 8 comments

Comments

@Amanda-Biocortex
Copy link

Hi,

I am looking to remove kmers from a reference genome which are shared with another genome.

For example, bacteria strain A and B share many regions of DNA however also have unique regions. I would like to create a signature for strain A reference which is depleted for kmers shared with strain B (ie the ignature hold kmers unique to strain A).

Can Sourmash search be used for this?

Many thanks,
Amanda

@ctb
Copy link
Contributor

ctb commented May 29, 2024

hi @Amanda-Biocortex,

this is what gather does automatically - if you run sourmash gather <query.sig.zip> <A.sig.zip> <B.sig.zip> it will give you the match to the largest overlap between the query and either A or B, then subtract the overlap, and match against the remaining sketch (B or A). Happy to explain more if interested!

If you want to do this manually, you can use sourmash sig subtract A.sig.zip B.sig.zip -o AminusB.sig.zip. (You'll want to check the docs to be sure I got that right 😆 )

You can also use sourmash sig overlap A.sig.zip B.sig.zip to examine details of the overlap.

The venn diagram plugin will also happily plot the overlap for you. It will return the same numbers as sourmash search --containment A.sig.zip B.sig.zip.

Ask as you have questions!

@Amanda-Biocortex
Copy link
Author

Amanda-Biocortex commented May 29, 2024

Great, I will use sourmash sig subtract to create a new reference of unique kmers.

I can see from --help that I need to use --flatten to remove abundances because the original signatures were created with -p abund. Will this mean that I wont be able to use the track_abundance feature when I run sourmash gather using the reference created using sourmash sig subtract ?

Does --abundances-from FILE help reinstate the abundances so that track_abundance can be used later..?

Many thanks again for your very clear explanations!

@ctb
Copy link
Contributor

ctb commented May 29, 2024

Great, I will use sourmash sig subtract to create a new reference of unique kmers.

Great! You could also use sig intersect to generate the intersection, and subtract that from both, if you were interested in the k-mers that are distinct to each :).

I can see from --help that I need to use --flatten to remove abundances because the original signatures were created with -p abund. Will this mean that I wont be able to use the track_abundance feature when I run sourmash gather using the reference created using sourmash sig subtract ?

Does --abundances-from FILE help reinstate the abundances so that track_abundance can be used later..?

Yes! It may seem a little redundant but it's just a (slightly inefficient) way of making sure you know exactly where the abundances in the output sketch are coming from.

And... the reason I didn't mention abundance before is that (at least for bacterial and archaeal genomes) I tend to recommend not using the abundance on the genome side, as the genomes are mostly single copy. Instead, in my bac/arc-focused metagenomics work, abundances are mostly used with the metagenomes, where abundance information can be really valuable.

Note that for sourmash gather, the metagenome abundance is the only thing that's used. It's assumed that the genome sketches are flat (and ignores their abundances) because the gather algorithm doesn't apply to non-flat references.

Many thanks again for your very clear explanations!

Welcome!

@Amanda-Biocortex
Copy link
Author

Amanda-Biocortex commented May 29, 2024

Great! You could also use sig intersect to generate the intersection, and subtract that from both, if you were interested in the k-mers that are distinct to each :).

And this produce the same distinct kmers as sig subtract? I feel like every time I read your docs a new function jumps out at me! Super cool stuff

And... the reason I didn't mention abundance before is that (at least for bacterial and archaeal genomes) I tend to recommend not using the abundance on the genome side, as the genomes are mostly single copy. Instead, in my bac/arc-focused metagenomics work, abundances are mostly used with the metagenomes, where abundance information can be really valuable.

Note that for sourmash gather, the metagenome abundance is the only thing that's used. It's assumed that the genome sketches are flat (and ignores their abundances) because the gather algorithm doesn't apply to non-flat references.

Ah so for metagenomic analysis is it only the query needs to be sketched with -p abund (and not the reference genomes)?
It is a problem if I have -p abund when sketching the reference genomes?

My end goal is to use this 'subtracted' genome as part of the reference database for metagenomic analysis.

@ctb
Copy link
Contributor

ctb commented May 29, 2024

Great! You could also use sig intersect to generate the intersection, and subtract that from both, if you were interested in the k-mers that are distinct to each :).

And this produce the same distinct kmers as sig subtract? I feel like every time I read your docs a new function jumps out at me! Super cool stuff

:) but it becomes a problem to find the darn commands... the issue tracker is a great place to search as well.

Hmm, it might make the most sense to think about it as sets of k-mers - if you intersect A and B to get C, then A - C == A - B, and B - C == B - A, because k-mers that are not present in the subtractee are simply ignored.

And... the reason I didn't mention abundance before is that (at least for bacterial and archaeal genomes) I tend to recommend not using the abundance on the genome side, as the genomes are mostly single copy. Instead, in my bac/arc-focused metagenomics work, abundances are mostly used with the metagenomes, where abundance information can be really valuable.
Note that for sourmash gather, the metagenome abundance is the only thing that's used. It's assumed that the genome sketches are flat (and ignores their abundances) because the gather algorithm doesn't apply to non-flat references.

Ah so for metagenomic analysis is it only the query needs to be sketched with -p abund (and not the reference genomes)?

Correct!

It is a problem if I have -p abund when sketching the reference genomes?

No, it isn't! The abundances are just ignored. (This is an undocumented feature, so I've created an issue to document it! #3181)

My end goal is to use this 'subtracted' genome as part of the reference database for metagenomic analysis.

Yep, that will work fine! I'll be curious if you find that this approach works better than gather on the unmodified sketches (but no worries if you don't look into that comparison ;).

@Amanda-Biocortex
Copy link
Author

Hi @ctb

Ive created the custom signature using sourmash signature subtract, but when I run sourmash gather, the name field is empty (so I have no idea what bacteria is there). This is because sourmash signature subtract isnt pulling name from the first signature.

sourmash merge has the -name option, do you have that for sourmash signature subtract? In the same way that abundances can be pulled from one of the signatures using sourmash signature subtract --abundances-from, can you do that with name?

Many thanks,
Amanda

@ctb
Copy link
Contributor

ctb commented Jun 4, 2024

They're being added in #3162, but are not yet in a release, I'm afraid!

For now you'll need to use sourmash sig rename <sketches> <name>. Sorry ;(.

@ctb
Copy link
Contributor

ctb commented Jun 17, 2024

hi @Amanda-Biocortex --set-name for subtract is now available in sourmash v4.8.9, released last week!

@ctb ctb closed this as completed Jun 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants