Add ability to adjust --num-hashes, --scaled on-the-fly in 'compare' and 'categorize' #560
Have you tried using --scaled with search and compare?
Thanks for the quick response! My issue is that droplet-based methods (~10k reads/cell) are much shallower than full-transcript methods (~500k reads/cell) and I'd like to use the same number of k-mers/hashes across both, not just the same scaling factor.
On Fri, Oct 26, 2018 at 04:35:17PM +0000, Olga Botvinnik wrote:

> Thanks for the quick response! My issue is that droplet-based methods (~10k reads/cell) are much shallower than full-transcript methods (~500k reads/cell) and I'd like to use the same number of k-mers/hashes across both, not just the same scaling factor.
My intuition is that neither --num nor --scaled will solve your concern here, because you simply have more k-mers to look at either way, and neither method is robust to that. I think you probably want to downsample your data by number of reads instead. But that's off the cuff and could well be wrong :)
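The read-level downsampling suggested above can be sketched with reservoir sampling; `sample_reads` is a hypothetical helper for illustration, not part of sourmash:

```python
import random

def sample_reads(records, n, seed=42):
    """Reservoir-sample n records from an iterable of reads,
    without loading the whole file into memory."""
    rng = random.Random(seed)
    sample = []
    for i, rec in enumerate(records):
        if i < n:
            sample.append(rec)
        else:
            # replace an existing pick with decreasing probability n/(i+1)
            j = rng.randrange(i + 1)
            if j < n:
                sample[j] = rec
    return sample

# e.g. downsample a deep full-transcript cell to droplet-like depth
deep = [f"read{i}" for i in range(500_000)]
shallow = sample_reads(deep, 10_000)
print(len(shallow))
```

After sampling both datasets to the same read depth, sketches built from them are directly comparable regardless of --num or --scaled.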
Would something like #538 help?
@luizirber Yes, it would be very helpful to be able to switch between
this can be done programmatically quite easily (construct a new MinHash object,
shove hashes in with add_many). I'm more skeptical that this needs to be
supported by sourmash at the command-line level, or at least I feel like it
would add significantly to the complexity. What do you think about us
providing an example script that does the right thing, without integrating
it into the command line?
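The programmatic route described above uses sourmash's MinHash constructor and add_many(); the selection rules those objects apply can be sketched in plain Python. Function names here are illustrative, not sourmash's API:

```python
def downsample_num(hashes, num):
    """Keep the `num` smallest hash values -- the bottom-sketch rule
    a MinHash built with a smaller --num would have retained."""
    return sorted(hashes)[:num]

def downsample_scaled(hashes, scaled, hash_space=2**64):
    """Keep hashes below hash_space/scaled -- the scaled-MinHash rule."""
    max_hash = hash_space // scaled
    return [h for h in sorted(hashes) if h < max_hash]

# toy 64-bit-ish hash values standing in for a "deep" signature
hashes = [h * 2**32 for h in range(1, 2001)]
print(len(downsample_num(hashes, 500)))
print(len(downsample_scaled(hashes, 10**7)))
```

Either result can then be fed to a fresh MinHash via add_many() to get a smaller, comparable signature without re-sketching the reads.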
That would be very helpful! And it would definitely eliminate a lot of the redundant computation I'm doing.
Sorry for dropping the ball on this... but, looking at #587, I wonder if this functionality fits under downsample? I could imagine adding --num and/or --scaled flags to that command, where --num and --scaled are (for the moment) incompatible, and the command fails when the downsampling cannot be done properly. Whaddya think?
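For concreteness, the proposal might look like the following invocations; the subcommand and flag spellings here are assumptions for illustration, not the shipped CLI:

```shell
# hypothetical: downsample an existing signature to a fixed hash count
sourmash signature downsample --num 500 deep.sig -o num500.sig
# and/or: downsample to a coarser scaled value
sourmash signature downsample --scaled 2000 deep.sig -o scaled2000.sig
```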
Also, in #436 I added some example Python code :)
#1072 would resolve this, I think.
The core functionality is available. Closing until the specific CLI functionality is requested again :)
I'd like to be able to calculate ONE "deep" signature of e.g. 10k k-mers for each sample, then adjust it down to 5k, 1k, 500, or 100 k-mers and see how that affects the similarity and nearest neighbors.
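The sweep described above can be simulated in plain Python using the standard bottom-k MinHash Jaccard estimate (k smallest hashes of the union, counting how many occur in both sketches); sourmash's exact estimator may differ in details, and the data here is synthetic:

```python
import random

def jaccard_bottom_k(A, B, k):
    """Bottom-k MinHash Jaccard estimate: take the k smallest hashes
    of the union and count how many occur in both sketches."""
    union_bottom = sorted(A | B)[:k]
    common = sum(1 for h in union_bottom if h in A and h in B)
    return common / len(union_bottom)

rng = random.Random(0)
pool = rng.sample(range(2**40), 12_000)
sig_a = set(pool[:10_000])      # one "deep" 10k-hash signature
sig_b = set(pool[2_000:])       # a second signature sharing 8k hashes
for k in (5_000, 1_000, 500, 100):
    print(k, round(jaccard_bottom_k(sig_a, sig_b, k), 3))
```

With a true Jaccard of 8000/12000 here, the estimates stay close at large k and get noisier as k shrinks, which is exactly the effect the sweep would expose.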