Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

708 replace fuzzywuzzy and textdistance with rapidfuzz for plain evaluation metrics #709

Open
wants to merge 22 commits into
base: development
Choose a base branch
from

Conversation

prvenk
Copy link
Collaborator

@prvenk prvenk commented Sep 14, 2024

In this PR, we modify the following:

  • fuzz token set ratio from fuzzywuzzy with the much faster rapidfuzz version. we also make this more flexible with an option to measure other kinds of fuzz metrics, namely: ratio, partial ratio, token_sort_ratio, token_sort_partial_ratio, and token_set_partial_ratio.
  • we replace some plain metrics from textdistance with the rapidfuzz version (this is faster as documented in the issue). the metrics we replace are lccseq, hamming, jarowinkler, levenshtein. We retain textdistance variants for cosine and jaccard given rapidfuzz doesn't have a version.
  • Added rouge scores as plain string metrics
  • Renaming variables for clarity, e.g., doc1 to str1 and value1 to str1 given all inputs are supposed to be strings and these were not consistent.
  • Adding types for arguments and return.
  • Editing documentation to reflect the changes
  • Added tests

@prvenk prvenk added the enhancement New feature or request label Sep 14, 2024
@prvenk prvenk self-assigned this Sep 14, 2024
@prvenk prvenk marked this pull request as draft September 14, 2024 01:56
@prvenk prvenk changed the base branch from prerelease to development September 14, 2024 01:56
@prvenk prvenk linked an issue Sep 14, 2024 that may be closed by this pull request
@prvenk prvenk marked this pull request as ready for review September 14, 2024 01:57
@martinpeck martinpeck removed their request for review September 16, 2024 10:39
@martinpeck
Copy link
Member

Sorry...I don't think I have the background context to review this PR properly.

Copy link
Collaborator

@kcortinas kcortinas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @prvenk! Added a few comments and suggestions.

README.md Outdated Show resolved Hide resolved
config.sample.json Outdated Show resolved Hide resolved
docs/evaluation-metrics.md Outdated Show resolved Hide resolved
docs/evaluation-metrics.md Show resolved Hide resolved
docs/evaluation-metrics.md Outdated Show resolved Hide resolved
.github/workflows/config.json Show resolved Hide resolved
Copy link
Collaborator Author

@prvenk prvenk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

docs/evaluation-metrics.md Show resolved Hide resolved
Copy link
Collaborator

@kcortinas kcortinas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Replace fuzzywuzzy and some textdistance evals with rapidfuzz
5 participants