MinHash.remove_many continues to be very slow when removing many hashes #1617
The .sig file can be downloaded here, https://osf.io/egks5/, or is in |
I think the only way to improve this is to parallelize the removals, or to convert the hashes vector and the abundance vector into a single hashmap<hash, abundance>. In the provided example, there will be 8,577,001 search operations on the vector, each of which is, as far as I know, O(n) just to find the item's index before removing it. |
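To make the cost difference in the comment above concrete, here is a minimal Python sketch. It is not sourmash's implementation: `remove_many_vectors` and `remove_many_map` are hypothetical helper names, and the two parallel lists only model the hashes and abundance vectors that the comment describes.

```python
def remove_many_vectors(hashes, abunds, to_remove):
    """Remove hashes from two parallel lists, one hash at a time.

    Each removal does a linear search for the index and then shifts
    every later element, so the total work grows roughly as O(n * m).
    """
    for h in to_remove:
        try:
            idx = hashes.index(h)  # linear search for the position
        except ValueError:
            continue  # hash not present, nothing to remove
        del hashes[idx]  # shifts the tail of the list
        del abunds[idx]  # keep the abundance list aligned


def remove_many_map(abund_by_hash, to_remove):
    """Remove hashes from a single hash -> abundance mapping.

    Each removal is an average O(1) dictionary operation.
    """
    for h in to_remove:
        abund_by_hash.pop(h, None)  # silently ignore missing hashes


if __name__ == "__main__":
    hashes, abunds = [1, 3, 5, 7], [10, 30, 50, 70]
    remove_many_vectors(hashes, abunds, [3, 7, 9])
    print(hashes, abunds)

    table = {1: 10, 3: 30, 5: 50, 7: 70}
    remove_many_map(table, [3, 7, 9])
    print(table)
```

The mapping version trades the ordering guarantee of the parallel vectors for cheap removals, so any caller that needs sorted hashes would have to sort on demand.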
This seems to work pretty fast. It's interesting how Python is so much faster than Rust, don't you think? 😜
|
Hahaha, using a set will be way faster... Here's what is happening in Rust:
import sourmash
print('loading...')
big_sig = sourmash.load_one_signature('SRR7186547.k31.sig')
print(f'...done. loaded {len(big_sig.minhash)} hashes.')
print('Converting to list...')
hashes_list = list(big_sig.minhash.hashes)
abundance_list = hashes_list.copy() # Simulate abundance vector
to_be_removed = hashes_list.copy()
print('subtracting...')
for hash in to_be_removed:
    idx = hashes_list.index(hash)  # linear search for the position
    del hashes_list[idx]           # shifts every later element
    del abundance_list[idx]
print("Done")
Vectors are used to hold the hash and abundance values so that they are kept in order. Using a set instead of a vector would not preserve the insertion order. |
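One way to get set-speed lookups without giving up the order, sketched here in plain Python: use the set only for membership tests and rebuild the two lists in a single pass. `subtract_preserving_order` is a hypothetical helper, not part of the sourmash API, and the lists are stand-ins for the hashes and abundance vectors.

```python
def subtract_preserving_order(hashes, abunds, to_remove):
    """Return new (hashes, abunds) lists without the hashes in to_remove.

    The set is used only for O(1) average membership tests; the
    surviving entries keep their original relative order.
    """
    drop = set(to_remove)
    kept = [(h, a) for h, a in zip(hashes, abunds) if h not in drop]
    new_hashes = [h for h, _ in kept]
    new_abunds = [a for _, a in kept]
    return new_hashes, new_abunds


if __name__ == "__main__":
    hashes = [2, 4, 6, 8, 10]
    abunds = [1, 1, 3, 2, 5]
    print(subtract_preserving_order(hashes, abunds, [4, 10, 11]))
```

This does one pass over the sketch plus one pass over the hashes to remove, at the price of allocating new lists instead of editing in place.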
Let's try to disentangle a bit the many threads going on in this conversation =]
This code takes shortcuts (is not doing
My point here: the Rust code is trying to avoid allocating more memory than needed, and this is DISASTROUS with the current implementation when removing many hashes. Since it is an ordered vector, for each removal it needs to reallocate large chunks of the vector (as Mo pointed out in his explanation of what Rust is doing). This is easy to see with
It literally spends all the time moving memory around. So, what to do?
If you can cheat, I can cheat too =] There is some wonkiness in the API, but I used the released crate (so no optimizations in
use sourmash::signature::{Signature, SigsTrait};
use sourmash::sketch::minhash::KmerMinHashBTree;
use sourmash::sketch::Sketch;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    println!("loading...");
    let (reader, _) = niffler::from_path("SRR7186547.k31.sig")?;
    let mut sig: Vec<Signature> = serde_json::from_reader(reader)?;
    if let Sketch::MinHash(big_sig) = sig.swap_remove(0).sketches().swap_remove(0) {
        println!("...done. loaded {} hashes.", big_sig.size());
        println!("converting to mutable...");
        let mut mh: KmerMinHashBTree = big_sig.into();
        println!("...done");
        println!("subtracting...");
        mh.remove_many(&mh.mins())?;
        println!("...done");
    }
    Ok(())
}
Timings
Rust:
Python (with sets):
|
😆 My intention was to point out that there must be options. Thank you for falling into my trap^W^W^W^Wexploring them! |
and let me just say how adorable the 🦀 and 🐍 are! |
I know, I know. Just pointing out that the options were there all along, but... not implemented all the way across FFI =P
Right? gonna start using it all the time when talking about Rust and Python types 😸 |
Coming from #1771 sourmash/src/core/src/sketch/minhash.rs Lines 399 to 409 in 401ba48
Performing a binary search on every delete is expected to slow down the process of removing many elements. Would replacing the |
Ok, that was already discussed. |
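For readers joining from the linked issue, here is a rough Python model of the concern about searching on every delete. It assumes a sorted list of unique hashes and is not a translation of the Rust code referenced above. `remove_with_bisect` and `remove_sorted_merge` are hypothetical names used only for illustration.

```python
from bisect import bisect_left


def remove_with_bisect(mins, to_remove):
    """Delete each hash via binary search on a sorted list, in place.

    The search is O(log n), but every successful delete still shifts
    the tail of the list, so many removals remain expensive.
    """
    for h in to_remove:
        idx = bisect_left(mins, h)
        if idx < len(mins) and mins[idx] == h:
            del mins[idx]


def remove_sorted_merge(mins, to_remove_sorted):
    """Remove a sorted batch from a sorted list in one linear pass.

    Both inputs must be sorted ascending; mins is assumed to hold
    unique values. Returns a new list of the surviving hashes.
    """
    kept = []
    j = 0
    for h in mins:
        # Skip removal candidates smaller than the current hash.
        while j < len(to_remove_sorted) and to_remove_sorted[j] < h:
            j += 1
        if j < len(to_remove_sorted) and to_remove_sorted[j] == h:
            continue  # this hash is being removed
        kept.append(h)
    return kept


if __name__ == "__main__":
    mins = [1, 2, 4, 8, 16]
    print(remove_sorted_merge(mins, [2, 8, 9]))
    remove_with_bisect(mins, [2, 8, 9])
    print(mins)
```

The merge version only helps when the hashes to remove can be sorted first, which is an assumption about the caller rather than something the thread confirms.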
we should revisit the code in #2123 if/when we speed up |
I ran the scaled=1000 version of the benchmark in #1747:
and saw the following: so, still some work to do here :) It took < 2 hours to run, but not by much. |
oh, and when we zoom in on the block on the right, we see the other big chunk of time is in this generator expression in
|
Despite #1571, the problems in #1552 continued after using the new remove_many implementation until I refactored the enclosing script in #1613. The following script reproduces the problem: