Replace mx by scaled #1139

xmnlab · 2020-08-02T21:33:48Z

Is it mergeable?
make test Did it pass the tests?
make coverage Is the new code covered?
Did it change the command-line interface? Only additions are allowed
without a major version increment. Changing file formats also requires a
major version number increment.
Was a spellchecker run on the source code and documentation after
changes were made?

codecov · 2020-08-02T22:08:37Z

Codecov Report

Merging #1139 into latest will increase coverage by 9.10%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           latest    #1139      +/-   ##
==========================================
+ Coverage   84.13%   93.24%   +9.10%     
==========================================
  Files          99       78      -21     
  Lines        9218     5922    -3296     
==========================================
- Hits         7756     5522    -2234     
+ Misses       1462      400    -1062

Flag	Coverage Δ
#rusttests	`?`

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
sourmash/minhash.py	`96.98% <100.00%> (-0.30%)`	⬇️
src/core/src/sketch/nodegraph.rs
src/core/src/index/bigsi.rs
src/core/src/ffi/mod.rs
src/core/src/lib.rs
src/core/src/index/sbt/mod.rs
src/core/src/signature.rs
src/core/src/index/mod.rs
src/core/src/ffi/cmd/compute.rs
src/core/src/sketch/ukhs.rs
... and 12 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7a8e5ac...0a858fa. Read the comment docs.

xmnlab · 2020-08-04T14:40:22Z

@ctb @luizirber it is still raising some errors. I will work on that again today.
is this change what you had in mind? Am I going in the right direction? thanks!

ctb · 2020-08-04T14:44:54Z

on a quick skim, this looks like the right direction to me!

xmnlab · 2020-08-04T14:56:27Z

thanks @ctb! I will try to have it ready for review today!

ctb · 2020-08-04T14:58:19Z

On Tue, Aug 04, 2020 at 07:56:43AM -0700, Ivan Ogasawara wrote: thanks @ctb! I will try to have it ready for review today!

no hurry, take your time :)

ctb · 2020-08-04T16:30:35Z

hi @xmnlab note that we just changed the default main branch in this repository to latest - at some point before merge, please update your base to match (click Edit, change base to latest). Thanks!

luizirber · 2020-08-05T17:10:05Z

Nice PR, @xmnlab!

But I think we need to be more explicit in the consequences of changing max_hash to scaled in the functions. My main concern is that current code using max_hash will continue "working" with scaled (since they are the same type), but the meaning of the parameter changed completely!

We can use this opportunity to bring it more in line with the changes happening on the Python side in #1128.
To make it break explicitly and force other crates using sourmash to do the changes, maybe define new as

pub fn new(
  scaled: u64,
  ksize: u32,
  hash_function: HashFunctions,
  track_abundance: bool,
  seed: u64,
  num: Option<u32>,
) -> KmerMinHash

and similarly in the FFI

#[no_mangle]
pub unsafe extern "C" fn kmerminhash_new(
    scaled: u64,
    k: u32,
    prot: bool,
    dayhoff: bool,
    hp: bool,
    track_abundance: bool,
    seed: u64,
    num: u32
) -> *mut SourmashKmerMinHash {

(let @ctb chime in before implementing these changes, @xmnlab =])

xmnlab · 2020-08-06T00:41:52Z

thanks @luizirber! I updated my branch on top of the "latest" branch. about your suggestions, sounds good for me, I am waiting for a green light from @ctb :)

ctb · 2020-08-06T03:09:47Z

On Wed, Aug 05, 2020 at 05:42:04PM -0700, Ivan Ogasawara wrote: thanks @luizirber! I updated my branch on top of the "latest" branch. about your suggestions, sounds good for me, I am waiting for a green light from @ctb :)

oh, don't listen to me, listen to @luizirber :).

xmnlab · 2020-08-06T15:09:14Z

@ctb I will work on that! thanks! :)

xmnlab · 2020-08-08T03:07:12Z

not sure if I am doing anything wrong, or if it is relevant, but rust scaled_for_max_hash and python _get_scaled_for_max_hash return a different value for number 1

python result -> 18446744073709551616
rust result   -> 18446744073709551615

if I change the python operation to use // instead of / it seems it returns the same rust result:

int(round(get_minhash_max_hash() // 1, 0))
18446744073709551615

luizirber · 2020-08-08T14:55:26Z

not sure if I am doing anything wrong, or if it is relevant, but rust scaled_for_max_hash and python _get_scaled_for_max_hash return a different value for number 1
python result -> 18446744073709551616
rust result   -> 18446744073709551615 

Good catch! That is definitely a bug, but we never triggered it because setting max_hash=1 is not a viable use case.

if I change the python operation to use // instead of / it seems it returns the same rust result:
int(round(get_minhash_max_hash() // 1, 0))
18446744073709551615 

Instead of fixing the Python code, I think we should fix the Rust code (to guarantee it has the same behavior as any existing code using sourmash).

xmnlab · 2020-08-08T16:18:08Z

Instead of fixing the Python code, I think we should fix the Rust code (to guarantee it has the same behavior as any existing code using sourmash).

this is the current rust implementation:

pub fn scaled_for_max_hash(max_hash: u64) -> u64 {
    match max_hash {
        0 => 0,
        _ => u64::max_value() / max_hash,
    }
}

maybe I am wrong, but as far as I understood, the rust result is the u64 max value and the python result is equal to the rust max value +1, so the rust call will break (integer overflow). does It make sense?

luizirber · 2020-08-08T16:38:00Z

maybe I am wrong, but as far as I understood, the rust result is the u64 max value and the python result is equal to the rust max value +1, so the rust call will break (integer overflow). does It make sense?

You're correct, it was a trick question! (not really, I didn't notice that, and you analyzed correctly =])

In this case, let's keep scaled_for_max_hash in Rust as is.

xmnlab · 2020-08-08T16:43:00Z

Sounds pretty good! Do you want I change it here in this PR? Or should I fix that in a new one? El sáb., 8 de agosto de 2020 12:38, Luiz Irber <[email protected]> escribió:

…

maybe I am wrong, but as far as I understood, the rust result is the u64 max value and the python result is equal to the rust max value +1, so the rust call will break (integer overflow). does It make sense? You're correct, it was a trick question! (not really, I didn't notice that, and you analyzed correctly =]) In this case, let's keep scaled_for_max_hash in Rust as is. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1139 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABHX5HLQECRTSUARPQQGSQTR7V5PHANCNFSM4PSX456A> .

luizirber · 2020-08-08T17:57:20Z

Sounds pretty good! Do you want I change it here in this PR? Or should I fix that in a new one?

Here is good

luizirber · 2020-08-09T00:53:10Z

so, I just hit this in another PR, and.. I think the Rust function is broken =]

(it wasn't used in any place, and there are no tests, so...)

I tried checking the scaled for 184467440737095520, and Python gives 100 while Rust says 99...

xmnlab · 2020-08-09T03:41:06Z

do you have any idea how to fix that? is there any scientific paper that describes this function?

luizirber · 2020-08-09T15:52:41Z

do you have any idea how to fix that?

Before thinking on how to fix it, this is the perfect example for using property-based testing (because we have two functions that applied in sequence should return the initial value). In Python we have hypothesis tests, and in Rust there is proptest.

In tests/test__minhash_hypothesis.py:

 @given(st.integers(min_value=1, max_value=2**64 - 1))                                                                                                                                        
 def test_scaled_max_hash_conversion(scaled):                                                                                                                                                 
     max_hash = _get_max_hash_for_scaled(scaled)                                                                                                                                              
     new_scaled = _get_scaled_for_max_hash(max_hash)                                                                                                                                          
                                                                                                                                                                                              
     assert scaled == new_scaled

in src/core/tests/minhash.rs:

proptest! {
#[test]
fn scaled_and_max_hash(scaled in u64::ANY) {
    let max_hash = max_hash_for_scaled(scaled).unwrap();
    let new_scaled = scaled_for_max_hash(max_hash);

    assert_eq!(scaled, new_scaled);
}
}

is there any scientific paper that describes this function?

Ouch. Not yet =]

The main issue is that the definition of the function doesn't worry enough about numerical precision...
max_hash and scaled are the inverse of each other
max_hash = max_value / scaled
scaled = max_value / max_hash
where max_value is (2 ** 64) - 1 for 64-bit integers.

But, that division is not well defined (but rounding floats is definitely not the solution! Probably proper floor and ceil operations). I'll think a bit about it and comment again later.

luizirber · 2020-08-09T16:57:47Z

OMG what a gigantic rabbit hole.

But, that division is not well defined (but rounding floats is definitely not the solution! Probably proper floor and ceil operations). I'll think a bit about it and comment again later.

The main design constrains:

We need to remain compatible with already existing scaled MinHash sketches
Realistic scaled values are in the range (1, 100000)

From that:

The tests I suggested point very quickly that it is hard to invert max_hash/scaled for scaled values outside the range (like 42537073439)
Fixing for values outside the range might mess values inside the range...
Hypothesis tests in Python with the realistic range don't fail with the current Python impl.

So I guess we should be bug-compatible in Rust too.

Thoughs, @ctb?

ctb · 2020-08-09T17:09:23Z

My hot takes --

bug compatibility is good, for now!
is there a way to ensure that the round trip continues to work, as in scaled == get_scaled_for_max_hash(get_max_hash_for_scaled(scaled)), and each function returns integer values?
some tests (especially ones that discover absurdities) would be most welcome

xmnlab · 2020-08-09T23:02:54Z

thanks @ctb and @luizirber for all the explanations and suggestions.

but I am not too sure how to move forward now ...

luizirber · 2020-08-09T23:42:24Z

* is there a way to ensure that the round trip continues to work, as in `scaled == get_scaled_for_max_hash(get_max_hash_for_scaled(scaled))`, and each function returns integer values?

* some tests (especially ones that discover absurdities) would be most welcome

That's what the property tests are doing (it's implemented as you described in the first item, and hypothesis/proptest generate input data like the second one).

In the Hypothesis case we can be more explicit and provide some examples for the values we tend to use:

@given(st.integers(min_value=1, max_value=100000))
@example(1)
@example(10)
@example(100)
@example(1000)
@example(2000)
@example(10000)
def test_scaled_max_hash_conversion(scaled):
    max_hash = _get_max_hash_for_scaled(scaled)
    new_scaled = _get_scaled_for_max_hash(max_hash)
    assert scaled == new_scaled

thanks @ctb and @luizirber for all the explanations and suggestions.

but I am not too sure how to move forward now ...

How about:

Add the hypothesis and proptest tests
fix the Rust scaled_for_max_hash and max_hash_to_scaled to match the Python impl (using float numbers, and rounding)
PROFIT! =P

xmnlab · 2020-08-10T00:28:25Z

great! I have a good picture now to start! thanks!

xmnlab · 2020-08-10T18:18:30Z

fix the Rust scaled_for_max_hash and max_hash_to_scaled to match the Python impl (using float numbers, and rounding)

it seems it loses precision

>>> println!("using f64: {}", u64::max_value() as f64);
>>> println!("using u64: {}", u64::max_value());

using f64: 18446744073709552000
using u64: 18446744073709551615

and casting to u64, it will truncate to the max of u64, so it will never be the same as python (for this example the result would be 18446744073709551616):

>>> println!("using ceil: {}", ((u64::max_value() as f64) / 1.0f64).ceil() as u64);
>>> println!("using u64:  {}", u64::max_value());

using ceil: 18446744073709551615
using u64:  18446744073709551615

any idea about how we should handle this problem?

luizirber · 2020-08-10T21:38:06Z

it seems it loses precision

>>> println!("using f64: {}", u64::max_value() as f64);
>>> println!("using u64: {}", u64::max_value());

using f64: 18446744073709552000
using u64: 18446744073709551615

yikes!

(is thix evcxr? Cool!)

and casting to u64, it will truncate to the max of u64, so it will never be the same as python (for this example the result would be 18446744073709551616):
>>> println!("using ceil: {}", ((u64::max_value() as f64) / 1.0f64).ceil() as u64);
>>> println!("using u64:  {}", u64::max_value());

using ceil: 18446744073709551615
using u64:  18446744073709551615

the max value for 64 bits is 2**64 -1, which is 18446744073709551615, so the first version (using ceil) is correct. I think you can use that for implementing the conversion?

xmnlab · 2020-08-10T22:03:09Z

(is thix evcxr? Cool!)

yes! that is very helpful!

I think you can use that for implementing the conversion?

but the value returned by the current python implementation is still different 18446744073709551616 ... to have the same result with the python implementation I would need to do int(round(get_minhash_max_hash() // 1, 0))

not sure if I am missing something

luizirber · 2020-08-10T22:34:16Z

but the value returned by the current python implementation is still different 18446744073709551616 ... to have the same result with the python implementation I would need to do int(round(get_minhash_max_hash() // 1, 0))

_get_max_hash_for_scaled(1) is 18446744073709551615 here, I see 18446744073709551616 when doing _get_scaled_for_max_hash(1).

Since it is out of the realistic scaled values range (I mean, a Scaled MinHash with scaled=18446744073709551616 or max_hash=1 will literally have only (0, 1) as possible values!), I don't think it is a big issue to leave as is (returning 18446744073709551615).

(And even tho we said "bug-compatible", that is a bug we want to fix, because 18446744073709551616 is out of the range of possible 64-bit integers... It is easier to see this in hexadecimal:

>>> "{:X}".format(18446744073709551616)
'10000000000000000'

>>> "{:X}".format(18446744073709551615)
'FFFFFFFFFFFFFFFF'

)

xmnlab · 2020-08-17T01:30:34Z

sorry .. I didn't have time too much last week. my vacations start on Monday so I will be back here tomorrow!! thanks for the patience :)

luizirber · 2020-08-22T15:44:14Z

src/core/src/sketch/minhash.rs

@@ -622,12 +624,12 @@ impl KmerMinHash {
        self.check_compatible(other)?;

        let mut combined_mh = KmerMinHash::new(
-            self.num,
+            scaled_for_max_hash(self.max_hash),


how about using

Suggested change

scaled_for_max_hash(self.max_hash),

self.scaled,

here? self.scaled is still using max_hash in its impl, but we can fix that later (and avoid a mention to self.max_hash here =])

@luizirber sounds good. could you add more details about that pls? :)

currently we have thousands of warnings in CI (example) triggered because we are still using .max_hash internally. Eventually (for 4.0) we want to avoid using/documenting/suggesting .max_hash and focus on .scaled instead, including constructors (and hence why we deprecated .max_hash in 3.5 and generate all the warnings)

But that's all in the Python side, and this PR is for changing the Rust side. So, you don't need to worry about it if you don't want to, but I thought it was relevant to keep in mind =]

I see .. thanks for the explanation :)

luizirber · 2020-08-22T15:49:16Z

@luizirber should I also change KmerMinHashBTree signature?

Yes, please! KmerMinHash and KmerMinHashBTree have the same API but don't have a trait to connect them, so for now we need to change in both places... #1055 will fix it, but is not a priority.

xmnlab · 2020-09-05T01:58:16Z

hey everyone, sorry for the delay here. I will try to finish this as soon as possible.

I have some issue with 2 tests, but not sure yet why ...

running 19 tests
test invalid_dna ... ok
test check_errors ... ok
test dayhoff ... ok
test hp ... ok
test load_save_minhash_sketches ... ok
test load_save_minhash_sketches_abund ... FAILED
test max_for_scaled ... ok
test merge ... ok
test merge_empty_scaled ... ok
test load_save_minhash_dna ... ok
test load_save_minhash_dayhoff ... ok
test load_save_minhash_hp ... ok
test similarity ... ok
test similarity_2 ... ok
test similarity_3 ... ok
test throws_error ... ok
test prop_merge ... FAILED
test oracle_mins_scaled ... ok
test oracle_mins ... ok

failures:

---- load_save_minhash_sketches_abund stdout ----
thread 'load_save_minhash_sketches_abund' panicked at 'called `Result::unwrap()` on an `Err` value: MismatchScaled', src/core/tests/minhash.rs:499:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- prop_merge stdout ----
proptest: FileFailurePersistence::SourceParallel set, but failed to find lib.rs or main.rs
thread 'prop_merge' panicked at 'called `Result::unwrap()` on an `Err` value: MismatchScaled', src/core/tests/minhash.rs:324:16
...
thread 'prop_merge' panicked at 'called `Result::unwrap()` on an `Err` value: MismatchScaled', src/core/tests/minhash.rs:324:16
thread 'prop_merge' panicked at 'Test failed: called `Result::unwrap()` on an `Err` value: MismatchScaled; minimal failing input: seq1 = "GGGTCC", seq2 = "CTTTGA"
        successes: 0
        local rejects: 0
        global rejects: 0
', src/core/tests/minhash.rs:299:1


failures:
    load_save_minhash_sketches_abund
    prop_merge

test result: FAILED. 17 passed; 2 failed; 0 ignored; 0 measured; 0 filtered out

error: test failed, to rerun pass '--test minhash'

any tip about this?

xmnlab · 2020-09-05T03:01:04Z

it seems the problem is the new function max_hash_for_scaled

In:

pub fn max_hash_for_scaled(scaled: u64) -> u64 {
    match scaled {
        0 => 0,
        1 => u64::max_value(),
        _ => (u64::max_value() as f64 / scaled as f64) as u64,
    }
}

pub fn scaled_for_max_hash(max_hash: u64) -> u64 {
    match max_hash {
        0 => 0,
        _ => u64::max_value() / max_hash,
    }
}

let scaled = 10;
let max_hash = max_hash_for_scaled(scaled);

println!("max_hash (from scaled): {}", max_hash);
println!("scaled: {}", scaled);
println!("scaled (from max_hash): {}", scaled_for_max_hash(max_hash));

Output:

max_hash (from scaled): 1844674407370955264
scaled: 10
scaled (from max_hash): 9

@luizirber any recommendation about this?

xmnlab · 2020-09-05T03:53:27Z

I used the same approach for scaled_for_max_hash (converting to float) and it seems it worked, at least for the tests.

I tested locally the conversion between scaled_for_max_hash and max_hash_for_scaled and it seems it won't work in some cases (as the conversion to integer lost information). not sure if it is relevant for this PR.

let me know if I should take another direction :) thanks!

xmnlab · 2020-09-07T01:23:08Z

@ctb @luizirber finally it is ready for review. thanks for the patience!

ctb · 2020-09-09T13:33:49Z

hi @xmnlab thank you! I am not competent to review the rust code changes in detail, but everything looks good to me on the Python side (which you basically didn't change - yay :). @luizirber is working to a deadline so may take a few more days before he gets to review this, but let me be the first to say "good job!"

xmnlab · 2020-09-09T15:20:55Z

Thank you so much for the feedback @ctb <3
I am still learning Rust .. but I will be very happy to help if there is any other issue for beginner :)

xmnlab · 2020-10-08T13:45:13Z

hey everyone! just a friendly reminder about this PR :) let me know if I should address anything else here :) thanks!

luizirber · 2020-10-08T19:55:19Z

sourmash/minhash.py

+            lib.HASH_FUNCTIONS_MURMUR64_DAYHOFF if dayhoff else
+            lib.HASH_FUNCTIONS_MURMUR64_HP if hp else
+            lib.HASH_FUNCTIONS_MURMUR64_PROTEIN if is_protein else
+            lib.HASH_FUNCTIONS_MURMUR64_DNA


it makes sense that this works, but I've never seen it being used this way =]

luizirber

This looks great! Thanks @xmnlab!

xmnlab · 2020-10-08T23:14:20Z

thanks @ctb and @luizirber for all the patience and attention :)

ctb · 2020-10-09T16:17:18Z

thank YOU for all your work! I guess no PR is as simple as I initially think :laugh:

xmnlab changed the base branch from master to latest August 4, 2020 18:22

luizirber reviewed Aug 22, 2020

View reviewed changes

xmnlab added 5 commits August 28, 2020 11:03

Replace mx by scaled

9ca1fe0

Change kmerminhash_new signature

afe53aa

fix small issues

5464117

Change to hash_function

ec02bc5

Fix some test issues

7ce3791

update scaled_for_max_hash

a5731ce

xmnlab added 2 commits September 6, 2020 20:35

Remove extra space

9cbaed1

Fix some wasm issues

141c503

xmnlab changed the title ~~[WIP] Replace mx by scaled~~ Replace mx by scaled Sep 7, 2020

xmnlab marked this pull request as ready for review September 7, 2020 01:22

luizirber reviewed Oct 8, 2020

View reviewed changes

luizirber approved these changes Oct 8, 2020

View reviewed changes

luizirber and others added 3 commits October 8, 2020 19:59

Merge branch 'latest' into use-scaled-instead-of-max_hash

f4ca356

fix lints

068c6e3

bump rust version

0a858fa

luizirber merged commit 5e8b66a into sourmash-bio:latest Oct 8, 2020

xmnlab deleted the use-scaled-instead-of-max_hash branch October 8, 2020 23:13

ctb mentioned this pull request Feb 7, 2021

Draft release notes for v4.0 #1309

Closed

olgabot mentioned this pull request Mar 16, 2021

Different max_hash values with Rust vs Python #1402

Open

Replace mx by scaled #1139

Replace mx by scaled #1139

Conversation

xmnlab commented Aug 2, 2020 • edited Loading

codecov bot commented Aug 2, 2020 • edited Loading

Codecov Report

xmnlab commented Aug 4, 2020

ctb commented Aug 4, 2020

xmnlab commented Aug 4, 2020

ctb commented Aug 4, 2020 via email

ctb commented Aug 4, 2020

luizirber commented Aug 5, 2020

xmnlab commented Aug 6, 2020

ctb commented Aug 6, 2020 via email

xmnlab commented Aug 6, 2020

xmnlab commented Aug 8, 2020 • edited Loading

luizirber commented Aug 8, 2020

xmnlab commented Aug 8, 2020

luizirber commented Aug 8, 2020

xmnlab commented Aug 8, 2020 via email

luizirber commented Aug 8, 2020

luizirber commented Aug 9, 2020

xmnlab commented Aug 9, 2020

luizirber commented Aug 9, 2020 • edited Loading

luizirber commented Aug 9, 2020

ctb commented Aug 9, 2020

xmnlab commented Aug 9, 2020

luizirber commented Aug 9, 2020

xmnlab commented Aug 10, 2020

xmnlab commented Aug 10, 2020

luizirber commented Aug 10, 2020 • edited Loading

xmnlab commented Aug 10, 2020 • edited Loading

luizirber commented Aug 10, 2020

xmnlab commented Aug 17, 2020

luizirber Aug 22, 2020

Choose a reason for hiding this comment

xmnlab Aug 28, 2020

Choose a reason for hiding this comment

luizirber Aug 28, 2020

Choose a reason for hiding this comment

xmnlab Aug 28, 2020

Choose a reason for hiding this comment

luizirber commented Aug 22, 2020

xmnlab commented Sep 5, 2020

xmnlab commented Sep 5, 2020 • edited Loading

xmnlab commented Sep 5, 2020

xmnlab commented Sep 7, 2020

ctb commented Sep 9, 2020

xmnlab commented Sep 9, 2020

xmnlab commented Oct 8, 2020

luizirber Oct 8, 2020

Choose a reason for hiding this comment

luizirber left a comment

Choose a reason for hiding this comment

xmnlab commented Oct 8, 2020

ctb commented Oct 9, 2020 via email

xmnlab commented Aug 2, 2020 •

edited

Loading

codecov bot commented Aug 2, 2020 •

edited

Loading

xmnlab commented Aug 8, 2020 •

edited

Loading

luizirber commented Aug 9, 2020 •

edited

Loading

luizirber commented Aug 10, 2020 •

edited

Loading

xmnlab commented Aug 10, 2020 •

edited

Loading

xmnlab commented Sep 5, 2020 •

edited

Loading