
Mega Change #28 (Open)

logannc wants to merge 5 commits into master
Conversation

logannc (Owner) commented Mar 27, 2021

fixes #24
fixes #22
fixes #20
fixes #7 (mostly - once you have extractWithoutOrder, the rest are basically just "get top N", which can be done by callers; see the sketch below)

#6 might or might not make it, depending on the wratio cleanups.
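
A minimal caller-side sketch of that "get top N" step, assuming an extract_without_order that returns a Vec of Match values whose score type is ordered (the names here are illustrative, not the PR's final API):

    // hypothetical helper: sort the unordered matches by score and keep the top `n`
    fn top_n<'a, T>(mut matches: Vec<Match<'a, T>>, n: usize) -> Vec<Match<'a, T>> {
        matches.sort_by(|a, b| b.score.cmp(&a.score)); // highest score first
        matches.truncate(n);
        matches
    }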

logannc added the "wip" (work in progress) label Mar 27, 2021
logannc self-assigned this Mar 27, 2021
logannc (Owner, author) commented Mar 27, 2021

Okay, there is a lot here. Basically, I moved a lot into fuzzywuzzy_compatible, implemented on top of various other new root-crate-level primitives (which, for the most part, have gracefully handled everything fuzzywuzzy wants to do). A lot got moved into fuzzywuzzy_compatible and re-exported into fuzz for now. If we improve the impls with non-compatible versions, we just stop exporting them. The APIs should be the same.
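
A rough sketch of that re-export arrangement; only the module names fuzzywuzzy_compatible and fuzz come from this PR, the function and its body are placeholders:

    pub mod fuzzywuzzy_compatible {
        // compatible behaviour, implemented on top of the new root-level primitives
        pub fn ratio(a: &str, b: &str) -> u8 {
            if a == b { 100 } else { 0 } // placeholder body for the sketch
        }
    }

    pub mod fuzz {
        // re-export the compatible impl for now; if a better non-compatible version
        // lands later, stop re-exporting it and the `fuzz` API stays the same
        pub use super::fuzzywuzzy_compatible::ratio;
    }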

I'll probably come back and edit this comment tomorrow with a more coherent/comprehensive summary.

Still stuff left to do:

  • wratio and the token_set methods need a little love. They don't fit with normalizers/segmenters, so we might just do minimal cleanup.
  • documentation is missing
  • the primitives module probably needs splitting because a bunch of unrelated junk is littered there
  • fuzzywuzzy_compatible should probably be python_compatible, because we're also named fuzzywuzzy?
  • need to figure out how to replace intersperse (see the sketch after this list)
  • performance benchmarks/regression tests?
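
For the intersperse item, a minimal sketch of a hand-rolled replacement for the unstable Iterator::intersperse; whether this fits the actual call sites is an open question:

    // interleave `sep` between consecutive items, e.g. [a, b, c] -> [a, sep, b, sep, c]
    fn join_with<T: Clone>(items: &[T], sep: T) -> Vec<T> {
        let mut out = Vec::with_capacity(items.len() * 2);
        for (i, item) in items.iter().enumerate() {
            if i > 0 {
                out.push(sep.clone());
            }
            out.push(item.clone());
        }
        out
    }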

logannc (Owner, author) commented Mar 27, 2021

Tests are passing locally on nightly; it's just intersperse giving us grief.

logannc (Owner, author) commented Mar 27, 2021

Let's see, what else...

Processor trait vs normalizers?
Scorer vs calcs?
Redefine misc utils in terms of primitives (i.e., full_process)
Better defaults and docs on normalizers

A couple of good normalizers plus grapheme segmentation is probably the best default, with codepoint segmentation as an occasional perf boost.

I'm still not happy that I can't easily represent the token set stuff, and I wonder whether I can unify that somehow.

Are there any other complex normalization sequences that would be useful to have default impls?
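
For reference, a sketch of the two segmentation choices mentioned above; the grapheme version assumes the unicode-segmentation crate, which is not necessarily what this PR uses:

    use unicode_segmentation::UnicodeSegmentation;

    // extended grapheme clusters: the safest default for user-visible text
    fn grapheme_segments(s: &str) -> Vec<&str> {
        s.graphemes(true).collect()
    }

    // raw codepoints: cheaper, but can split combining sequences (e.g. "कि" becomes two items)
    fn codepoint_segments(s: &str) -> Vec<char> {
        s.chars().collect()
    }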

pub fn uwratio(s1: &str, s2: &str, full_process: bool) -> u8 {
    // trivial check omitted because this is a shallow delegator to wratio, which checks.
    wratio(s1, s2, false, full_process)
// TODO: document
Collaborator:
[Documentation needed] 🤖

/// let choices = vec!["foo", "bar", "baz"];
/// assert_eq!(
///     extract_without_order_full(query, &choices, default_processor, default_scorer, 0.try_into().unwrap()),
///     vec![
///         Match { item: &"foo", score: 0.try_into().unwrap() },
///         Match { item: &"bar", score: 100.try_into().unwrap() },
///         Match { item: &"baz", score: 67.try_into().unwrap() },
///     ]
/// );
/// ```
pub fn extract_without_order_full<'a, 'b, A: 'a, B, C, P, S>(
Collaborator:
Pedantic nitpicking: is _full a common suffix for functions providing more behavior controls in the signature?

Owner (author):
I don't know! It would be good to find some prior art on this before 1.0, but it's probably fine for this PR. If you think we can do better, make an issue so we remember to check it out again.

/// // assert_eq!(replace_non_letters_non_numbers_with_whitespace("abcØØØकिमपि"), "abcØØØक मप ");
/// assert_eq!(replace_non_letters_non_numbers_with_whitespace("abcØØØकिमपि"), "abcØØØकिमपि");
/// ```
pub fn replace_non_letters_non_numbers_with_whitespace(s: &str) -> String {
Collaborator:
More nitpicking:

  • subst_wspace_for_non_alphanumeric
  • replace_non_alphanumeric_with_whitespace

/// assert_eq!(asciionly("abcØØØकिमपि"), "abc");
/// assert_eq!(asciionly("ØØØकिमपि"), "");
/// ```
pub fn asciionly(s: &str) -> String {
Collaborator:
More nitpicking:

  • only_ascii
  • ascii_only
  • keep_ascii

}
}

// TODO: document
Collaborator:
TODO docs

This normalizer splits by whitespace and unstably sorts the input.

convert::{From, TryFrom},
};

// TODO: turn primitives.rs into
Collaborator:
I can make a branch and do this, if needed. IntelliJ Rust has a nice feature for transforming a single file into a module.

}

impl TryFrom<u8> for Score {
// TODO: pick a better 'error' type...
Collaborator:
This is exactly the comment I leave whenever I'm too lazy to define an Enum and derive thiserror::Error 😆

Owner (author):
Yeah! I honestly don't know what error ecosystem, if any, we should plug into. Thoughts?

Collaborator:
The wisdom for a while has been "anyhow" for binaries, "thiserror" for libraries. The macros from thiserror are simple and non-magic enough that I feel comfortable using them normally.
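
For illustration, a minimal sketch of what that could look like here with thiserror, assuming Score stays a u8-backed newtype capped at 100 (the error name and message are made up):

    use std::convert::TryFrom;
    use thiserror::Error;

    #[derive(Debug, Error)]
    pub enum ScoreError {
        #[error("score must be in 0..=100, got {0}")]
        OutOfRange(u8),
    }

    impl TryFrom<u8> for Score {
        type Error = ScoreError;

        fn try_from(value: u8) -> Result<Self, Self::Error> {
            if value <= 100 {
                Ok(Score(value))
            } else {
                Err(ScoreError::OutOfRange(value))
            }
        }
    }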

/// baseline query.
// TODO: link wratio
/// For example, wratio might be used to compare against strings.
pub trait Scorer<A, B> {
Collaborator:
Nitpick: Scorer<Q, C>?

Owner (author):
except wratio doesn't even fit this trait. With the talk in #6, it might be best to do away with traits here and just use S: Fn(A, B) -> Score in the relevant signatures in process.rs. Then we can maybe collect simple_ratio, best_partial_ratio, and whatever else into their own scorer.rs module.

part of the issue is that the extract methods are unnecessarily generic.
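
A quick sketch of the closure-based shape being suggested; the function and return shape are illustrative, not the PR's actual API:

    // the extract helpers would just take any closure producing a Score
    fn extract_scores<'a, S>(query: &str, choices: &'a [&'a str], scorer: S) -> Vec<(&'a str, Score)>
    where
        S: Fn(&str, &str) -> Score,
    {
        choices
            .iter()
            .map(|&choice| (choice, scorer(query, choice)))
            .collect()
    }

    // callers then pass plain functions or closures, e.g.:
    // let results = extract_scores("query", &choices, simple_ratio);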

Collaborator:
> it might be best to do away with traits here and just use S: Fn(A, B) -> Score in the relevant signatures

👍 The less complexity and code, the better :)

> part of the issue is that the extract methods are unnecessarily generic.

What would you change about their signatures?

Owner (author):
I've thought about this some more and think this should get the same treatment as trait Normalizer and trait Segmenter. The stuff in fuzzywuzzy_compatible::process should be able to be adapted to the new trait.

// Basically takes the output of segmentation on two items...
trait Scorer<T: Eq> {
    fn score(&self, a: &[T], b: &[T]) -> Score;
}

// Then let's implement the most basic `Scorer`, `simple_ratio`...

// ... which happens to need a separate trait for reasons ...
trait Aligner {
    fn align<T: Eq>(&self, a: &[T], b: &[T]) -> Vec<(usize, usize, usize)>;
}

pub struct SimpleRatioScorer<A: Aligner>(A);

impl<T: Eq, A: Aligner> Scorer<T> for SimpleRatioScorer<A> {
    fn score(&self, a: &[T], b: &[T]) -> Score {
        let matches: usize = self.0.align(a, b).iter().map(|&(_, _, s)| s).sum();
        let sumlength = (a.len() + b.len()) as f32;
        if sumlength.is_normal() {
            (100.0 * (2.0 * matches as f32) / sumlength).round().try_into().unwrap()
        } else {
            100.try_into().unwrap()
        }
    }
}

pub struct BasicMatchingBlocks;

impl Aligner for BasicMatchingBlocks {
    fn align<T: Eq>(&self, a: &[T], b: &[T]) -> Vec<(usize, usize, usize)> {
        get_matching_blocks(a, b)
    }
}

pub struct SmithWaterman;

impl Aligner for SmithWaterman {
    fn align<T: Eq>(&self, a: &[T], b: &[T]) -> Vec<(usize, usize, usize)> {
        magic()
    }
}

pub struct BestPartialScorer<A: Aligner, S>(A, S);

impl<T: Eq, A: Aligner, S: Scorer<T>> Scorer<T> for BestPartialScorer<A, S> {
    fn score(&self, a: &[T], b: &[T]) -> Score {
        // pick the shorter input; the best partial match slides it along the longer one
        let (shorter, longer) = if a.len() <= b.len() { (a, b) } else { (b, a) };
        let blocks = self.0.align(shorter, longer);
        let mut max: Score = 0.try_into().unwrap();
        for (i, j, _) in blocks {
            let long_start = if j > i { j - i } else { 0 };
            let long_end = std::cmp::min(long_start + shorter.len(), longer.len());
            let long_substr = &longer[long_start..long_end];
            let r = self.1.score(shorter, long_substr);
            if r > 99.try_into().unwrap() {
                return 100.try_into().unwrap();
            } else if r > max {
                max = r;
            }
        }
        max
    }
}

Owner (author):
This also means that token_set_ratio and such can implement the trait and share similar building blocks.
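
A hypothetical sketch of that idea, reusing the Scorer<T> shape from the snippet above and assuming Score is Ord; it approximates fuzzywuzzy's token_set logic at the token level rather than on joined strings:

    use std::collections::BTreeSet;

    pub struct TokenSetScorer<S>(S);

    impl<T: Eq + Ord + Clone, S: Scorer<T>> Scorer<T> for TokenSetScorer<S> {
        fn score(&self, a: &[T], b: &[T]) -> Score {
            let set_a: BTreeSet<T> = a.iter().cloned().collect();
            let set_b: BTreeSet<T> = b.iter().cloned().collect();
            // shared tokens first, then each side's leftovers, all in sorted order
            let common: Vec<T> = set_a.intersection(&set_b).cloned().collect();
            let full_a: Vec<T> = common.iter().cloned().chain(set_a.difference(&set_b).cloned()).collect();
            let full_b: Vec<T> = common.iter().cloned().chain(set_b.difference(&set_a).cloned()).collect();
            // best of the three pairwise comparisons, as token_set_ratio does
            [
                self.0.score(&common, &full_a),
                self.0.score(&common, &full_b),
                self.0.score(&full_a, &full_b),
            ]
            .into_iter()
            .max()
            .unwrap()
        }
    }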

logannc mentioned this pull request Mar 29, 2021
logannc (Owner, author) commented Apr 12, 2021

Welp, get_matching_blocks has a bug. For the inputs "frodo baggin" and "samwise g" (after codepoint segmentation), it generates a bad item in the queue, which causes debug assertions to panic.

I think this wasn't a problem in Python because the Python find_longest_match does a weird iterative solution different from ours.
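
A hypothetical regression test for that case, assuming get_matching_blocks takes two codepoint slices (the exact signature may differ):

    #[test]
    fn get_matching_blocks_frodo_samwise_does_not_panic() {
        // these inputs used to trip the debug assertions inside find_longest_match
        let a: Vec<char> = "frodo baggin".chars().collect();
        let b: Vec<char> = "samwise g".chars().collect();
        let _ = get_matching_blocks(&a, &b);
    }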

maxbachmann commented Apr 19, 2021

> I think this wasn't a problem in Python because the Python find_longest_match does a weird iterative solution different from ours.

Not sure whether it helps, but here is my C++ implementation of get_matching_blocks.
https://github.com/maxbachmann/rapidfuzz-cpp/blob/a494e5557087203f54e76e49815c68324114b9e5/rapidfuzz/details/matching_blocks.hpp#L58
It is essentially the Python implementation without the autojunk parts, plus hash maps to improve performance.

logannc (Owner, author) commented Apr 20, 2021

I haven't looked at it in a few days, but the problem was that one of these (I can't remember which offhand)

debug_assert!(low1 <= high1);
debug_assert!(low2 <= high2);
debug_assert!(high1 <= shorter.len());
debug_assert!(high2 <= longer.len());
debug_assert!(high1 - low1 <= high2 - low2);

was failing. All of these ought to be maintained for each call to find_longest_match() under the current implementation, I think.

I think your implementation suffers the same issue with

if ((spos + length) < a_high && (dpos + length) < b_high) {
    queue.emplace_back(spos + length, a_high, dpos + length, b_high);
}

I have not looked at this since I noted it last week.

maxbachmann commented Apr 20, 2021

I will test this in my implementation. However, this is directly taken from the Python implementation (and appears to be the same in your implementation as well). I would personally assume the bug is somewhere in find_longest_match, since there you're using quite a different implementation. RapidFuzz uses pretty much the same implementation as difflib there, but since I do not use the autojunk feature, I was able to get rid of the hashmap b2j.

RapidFuzz

https://github.com/maxbachmann/rapidfuzz-cpp/blob/a494e5557087203f54e76e49815c68324114b9e5/rapidfuzz/details/matching_blocks.hpp#L120-L128

      if (length) {
        if (a_low < spos && b_low < dpos) {
          queue.emplace_back(a_low, spos, b_low, dpos);
        }
        if ((spos + length) < a_high && (dpos + length) < b_high) {
          queue.emplace_back(spos + length, a_high, dpos + length, b_high);
        }
        matching_blocks_pass1.emplace_back(spos, dpos, length);
      }

fuzzywuzzy-rs

      if k != 0 {
          matching_blocks.push((i, j, k));
          if low1 < i && low2 < j {
              queue.push((low1, i, low2, j));
          }
          if i + k < high1 && j + k < high2 {
              queue.push((i + k, high1, j + k, high2));
          }
      }

difflib

https://github.com/python/cpython/blob/fa03efda3dc6ad118788bebc61079cd481c0b24c/Lib/difflib.py#L490-L495

            if k:   # if k is 0, there was no matching block
                matching_blocks.append(x)
                if alo < i and blo < j:
                    queue.append((alo, i, blo, j))
                if i+k < ahi and j+k < bhi:
                    queue.append((i+k, ahi, j+k, bhi))

seanpianka added this to the 1.0 milestone May 15, 2021