-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace mx by scaled #1139
Replace mx by scaled #1139
Conversation
Codecov Report
@@ Coverage Diff @@
## latest #1139 +/- ##
==========================================
+ Coverage 84.13% 93.24% +9.10%
==========================================
Files 99 78 -21
Lines 9218 5922 -3296
==========================================
- Hits 7756 5522 -2234
+ Misses 1462 400 -1062
Flags with carried forward coverage won't be shown. Click here to find out more. Continue to review full report at Codecov.
|
@ctb @luizirber it is still raising some errors. I will work on that again today. |
on a quick skim, this looks like the right direction to me! |
thanks @ctb! I will try to have it ready for review today! |
On Tue, Aug 04, 2020 at 07:56:43AM -0700, Ivan Ogasawara wrote:
thanks @ctb! I will try to have it ready for review today!
no hurry, take your time :)
|
hi @xmnlab note that we just changed the default main branch in this repository to |
Nice PR, @xmnlab! But I think we need to be more explicit in the consequences of changing We can use this opportunity to bring it more in line with the changes happening on the Python side in #1128. pub fn new(
scaled: u64,
ksize: u32,
hash_function: HashFunctions,
track_abundance: bool,
seed: u64,
num: Option<u32>,
) -> KmerMinHash and similarly in the FFI #[no_mangle]
pub unsafe extern "C" fn kmerminhash_new(
scaled: u64,
k: u32,
prot: bool,
dayhoff: bool,
hp: bool,
track_abundance: bool,
seed: u64,
num: u32
) -> *mut SourmashKmerMinHash { (let @ctb chime in before implementing these changes, @xmnlab =]) |
thanks @luizirber! I updated my branch on top of the "latest" branch. about your suggestions, sounds good for me, I am waiting for a green light from @ctb :) |
On Wed, Aug 05, 2020 at 05:42:04PM -0700, Ivan Ogasawara wrote:
thanks @luizirber! I updated my branch on top of the "latest" branch. about your suggestions, sounds good for me, I am waiting for a green light from @ctb :)
oh, don't listen to me, listen to @luizirber :).
|
@ctb I will work on that! thanks! :) |
not sure if I am doing anything wrong, or if it is relevant, but rust
if I change the python operation to use
|
Good catch! That is definitely a bug, but we never triggered it because setting
Instead of fixing the Python code, I think we should fix the Rust code (to guarantee it has the same behavior as any existing code using sourmash). |
this is the current rust implementation: pub fn scaled_for_max_hash(max_hash: u64) -> u64 {
match max_hash {
0 => 0,
_ => u64::max_value() / max_hash,
}
} maybe I am wrong, but as far as I understood, the rust result is the u64 max value and the python result is equal to the rust max value +1, so the rust call will break (integer overflow). does It make sense? |
You're correct, it was a trick question! (not really, I didn't notice that, and you analyzed correctly =]) In this case, let's keep |
Sounds pretty good! Do you want I change it here in this PR? Or should I
fix that in a new one?
El sáb., 8 de agosto de 2020 12:38, Luiz Irber <[email protected]>
escribió:
… maybe I am wrong, but as far as I understood, the rust result is the u64
max value and the python result is equal to the rust max value +1, so the
rust call will break (integer overflow). does It make sense?
You're correct, it was a trick question! (not really, I didn't notice
that, and you analyzed correctly =])
In this case, let's keep scaled_for_max_hash in Rust as is.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1139 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABHX5HLQECRTSUARPQQGSQTR7V5PHANCNFSM4PSX456A>
.
|
Here is good |
so, I just hit this in another PR, and.. I think the Rust function is broken =] (it wasn't used in any place, and there are no tests, so...) I tried checking the scaled for |
do you have any idea how to fix that? is there any scientific paper that describes this function? |
Before thinking on how to fix it, this is the perfect example for using property-based testing (because we have two functions that applied in sequence should return the initial value). In Python we have hypothesis tests, and in Rust there is proptest. In @given(st.integers(min_value=1, max_value=2**64 - 1))
def test_scaled_max_hash_conversion(scaled):
max_hash = _get_max_hash_for_scaled(scaled)
new_scaled = _get_scaled_for_max_hash(max_hash)
assert scaled == new_scaled in proptest! {
#[test]
fn scaled_and_max_hash(scaled in u64::ANY) {
let max_hash = max_hash_for_scaled(scaled).unwrap();
let new_scaled = scaled_for_max_hash(max_hash);
assert_eq!(scaled, new_scaled);
}
}
Ouch. Not yet =] The main issue is that the definition of the function doesn't worry enough about numerical precision... But, that division is not well defined (but rounding floats is definitely not the solution! Probably proper floor and ceil operations). I'll think a bit about it and comment again later. |
OMG what a gigantic rabbit hole.
The main design constrains:
From that:
So I guess we should be bug-compatible in Rust too. Thoughs, @ctb? |
My hot takes --
|
thanks @ctb and @luizirber for all the explanations and suggestions. but I am not too sure how to move forward now ... |
That's what the property tests are doing (it's implemented as you described in the first item, and hypothesis/proptest generate input data like the second one). In the Hypothesis case we can be more explicit and provide some examples for the values we tend to use: @given(st.integers(min_value=1, max_value=100000))
@example(1)
@example(10)
@example(100)
@example(1000)
@example(2000)
@example(10000)
def test_scaled_max_hash_conversion(scaled):
max_hash = _get_max_hash_for_scaled(scaled)
new_scaled = _get_scaled_for_max_hash(max_hash)
assert scaled == new_scaled
How about:
|
great! I have a good picture now to start! thanks! |
it seems it loses precision >>> println!("using f64: {}", u64::max_value() as f64);
>>> println!("using u64: {}", u64::max_value());
using f64: 18446744073709552000
using u64: 18446744073709551615 and casting to u64, it will truncate to the max of u64, so it will never be the same as python (for this example the result would be >>> println!("using ceil: {}", ((u64::max_value() as f64) / 1.0f64).ceil() as u64);
>>> println!("using u64: {}", u64::max_value());
using ceil: 18446744073709551615
using u64: 18446744073709551615 any idea about how we should handle this problem? |
yikes! (is thix evcxr? Cool!)
the max value for 64 bits is 2**64 -1, which is |
yes! that is very helpful!
but the value returned by the current python implementation is still different not sure if I am missing something |
Since it is out of the realistic scaled values range (I mean, a Scaled MinHash with (And even tho we said "bug-compatible", that is a bug we want to fix, because >>> "{:X}".format(18446744073709551616)
'10000000000000000'
>>> "{:X}".format(18446744073709551615)
'FFFFFFFFFFFFFFFF' ) |
sorry .. I didn't have time too much last week. my vacations start on Monday so I will be back here tomorrow!! thanks for the patience :) |
src/core/src/sketch/minhash.rs
Outdated
@@ -622,12 +624,12 @@ impl KmerMinHash { | |||
self.check_compatible(other)?; | |||
|
|||
let mut combined_mh = KmerMinHash::new( | |||
self.num, | |||
scaled_for_max_hash(self.max_hash), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about using
scaled_for_max_hash(self.max_hash), | |
self.scaled, |
here? self.scaled
is still using max_hash
in its impl, but we can fix that later (and avoid a mention to self.max_hash
here =])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@luizirber sounds good. could you add more details about that pls? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currently we have thousands of warnings in CI (example) triggered because we are still using .max_hash
internally. Eventually (for 4.0) we want to avoid using/documenting/suggesting .max_hash
and focus on .scaled
instead, including constructors (and hence why we deprecated .max_hash
in 3.5 and generate all the warnings)
But that's all in the Python side, and this PR is for changing the Rust side. So, you don't need to worry about it if you don't want to, but I thought it was relevant to keep in mind =]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see .. thanks for the explanation :)
Yes, please! |
hey everyone, sorry for the delay here. I will try to finish this as soon as possible. I have some issue with 2 tests, but not sure yet why ...
any tip about this? |
it seems the problem is the new function In: pub fn max_hash_for_scaled(scaled: u64) -> u64 {
match scaled {
0 => 0,
1 => u64::max_value(),
_ => (u64::max_value() as f64 / scaled as f64) as u64,
}
}
pub fn scaled_for_max_hash(max_hash: u64) -> u64 {
match max_hash {
0 => 0,
_ => u64::max_value() / max_hash,
}
}
let scaled = 10;
let max_hash = max_hash_for_scaled(scaled);
println!("max_hash (from scaled): {}", max_hash);
println!("scaled: {}", scaled);
println!("scaled (from max_hash): {}", scaled_for_max_hash(max_hash)); Output:
@luizirber any recommendation about this? |
I used the same approach for I tested locally the conversion between scaled_for_max_hash and max_hash_for_scaled and it seems it won't work in some cases (as the conversion to integer lost information). not sure if it is relevant for this PR. let me know if I should take another direction :) thanks! |
@ctb @luizirber finally it is ready for review. thanks for the patience! |
hi @xmnlab thank you! I am not competent to review the rust code changes in detail, but everything looks good to me on the Python side (which you basically didn't change - yay :). @luizirber is working to a deadline so may take a few more days before he gets to review this, but let me be the first to say "good job!" |
Thank you so much for the feedback @ctb <3 |
hey everyone! just a friendly reminder about this PR :) let me know if I should address anything else here :) thanks! |
lib.HASH_FUNCTIONS_MURMUR64_DAYHOFF if dayhoff else | ||
lib.HASH_FUNCTIONS_MURMUR64_HP if hp else | ||
lib.HASH_FUNCTIONS_MURMUR64_PROTEIN if is_protein else | ||
lib.HASH_FUNCTIONS_MURMUR64_DNA |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it makes sense that this works, but I've never seen it being used this way =]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! Thanks @xmnlab!
thanks @ctb and @luizirber for all the patience and attention :) |
thank YOU for all your work! I guess no PR is as simple as I initially
think :laugh:
|
make test
Did it pass the tests?make coverage
Is the new code covered?without a major version increment. Changing file formats also requires a
major version number increment.
changes were made?
Fixes #1134