Add ability to adjust --num-hashes, --scaled on-the-fly in 'compare' and 'categorize' #560
Have you tried using --scaled with search and compare?
Thanks for the quick response! My issue is that droplet-based methods (~10k reads/cell) are much shallower than full-transcript methods (~500k reads/cell) and I'd like to use the same number of k-mers/hashes across both, not just the same scaling factor.
On Fri, Oct 26, 2018 at 04:35:17PM +0000, Olga Botvinnik wrote:

> Thanks for the quick response! My issue is that droplet-based methods (~10k reads/cell) are much shallower than full-transcript methods (~500k reads/cell) and I'd like to use the same number of k-mers/hashes across both, not just the same scaling factor.
My intuition is that neither --num nor --scaled will solve your concern here, because you simply have more k-mers to look at either way, and neither method is robust to that. I think you probably want to downsample your data by number of reads instead. But that's off the cuff and could well be wrong :)
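The read-level downsampling suggested above can be sketched with reservoir sampling; `sample_reads` is a hypothetical helper for illustration, not part of sourmash:

```python
import random

def sample_reads(records, n, seed=42):
    """Reservoir-sample n records from an iterable of reads,
    without loading the whole file into memory."""
    rng = random.Random(seed)
    sample = []
    for i, rec in enumerate(records):
        if i < n:
            sample.append(rec)
        else:
            # replace an existing pick with decreasing probability n/(i+1)
            j = rng.randrange(i + 1)
            if j < n:
                sample[j] = rec
    return sample

# e.g. downsample a deep full-transcript cell to droplet-like depth
deep = [f"read{i}" for i in range(500_000)]
shallow = sample_reads(deep, 10_000)
print(len(shallow))
```

After sampling both datasets to the same read depth, sketches built from them are directly comparable regardless of --num or --scaled.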
Would something like #538 help?
@luizirber Yes, it would be very helpful to be able to switch between
this can be done programmatically quite easily (construct a new MinHash object,
shove hashes in with add_many). I'm more skeptical that this needs to be
supported by sourmash at the command-line level, or at least I feel like it
would add significantly to the complexity. What do you think about us
providing an example script that does the right thing, without integrating
it into the command line?
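The programmatic route described above uses sourmash's MinHash constructor and add_many(); the selection rules those objects apply can be sketched in plain Python. Function names here are illustrative, not sourmash's API:

```python
def downsample_num(hashes, num):
    """Keep the `num` smallest hash values -- the bottom-sketch rule
    a MinHash built with a smaller --num would have retained."""
    return sorted(hashes)[:num]

def downsample_scaled(hashes, scaled, hash_space=2**64):
    """Keep hashes below hash_space/scaled -- the scaled-MinHash rule."""
    max_hash = hash_space // scaled
    return [h for h in sorted(hashes) if h < max_hash]

# toy 64-bit-ish hash values standing in for a "deep" signature
hashes = [h * 2**32 for h in range(1, 2001)]
print(len(downsample_num(hashes, 500)))
print(len(downsample_scaled(hashes, 10**7)))
```

Either result can then be fed to a fresh MinHash via add_many() to get a smaller, comparable signature without re-sketching the reads.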
That would be very helpful! And it would definitely eliminate a lot of the redundant computation I'm doing.
Sorry for dropping the ball on this... but, looking at #587, I wonder if this functionality fits under downsample? I could imagine adding --num and/or --scaled flags to that command, where --num and --scaled are (for the moment) incompatible, and the command fails when the downsampling cannot be done properly. Whaddya think?
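For concreteness, the proposal might look like the following invocations; the subcommand and flag spellings here are assumptions for illustration, not the shipped CLI:

```shell
# hypothetical: downsample an existing signature to a fixed hash count
sourmash signature downsample --num 500 deep.sig -o num500.sig
# and/or: downsample to a coarser scaled value
sourmash signature downsample --scaled 2000 deep.sig -o scaled2000.sig
```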
Also, in #436 I added some example Python code :)
#1072 would resolve this, I think.
The core functionality is available. Closing until the specific CLI functionality is requested again :)
I'd like to be able to calculate ONE "deep" signature of e.g. 10k k-mers for each sample, then adjust it down to 5k, 1k, 500, or 100 k-mers and see how that affects the similarity and nearest neighbors.
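The sweep described above can be simulated in plain Python using the standard bottom-k MinHash Jaccard estimate (k smallest hashes of the union, counting how many occur in both sketches); sourmash's exact estimator may differ in details, and the data here is synthetic:

```python
import random

def jaccard_bottom_k(A, B, k):
    """Bottom-k MinHash Jaccard estimate: take the k smallest hashes
    of the union and count how many occur in both sketches."""
    union_bottom = sorted(A | B)[:k]
    common = sum(1 for h in union_bottom if h in A and h in B)
    return common / len(union_bottom)

rng = random.Random(0)
pool = rng.sample(range(2**40), 12_000)
sig_a = set(pool[:10_000])      # one "deep" 10k-hash signature
sig_b = set(pool[2_000:])       # a second signature sharing 8k hashes
for k in (5_000, 1_000, 500, 100):
    print(k, round(jaccard_bottom_k(sig_a, sig_b, k), 3))
```

With a true Jaccard of 8000/12000 here, the estimates stay close at large k and get noisier as k shrinks, which is exactly the effect the sweep would expose.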