Upgrade duplicate finding logic with edit distance #11
Comments
We could use Levenshtein or difflib.SequenceMatcher. There is a Stack Overflow answer comparing the performance and tradeoffs of both.
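For reference, here is a minimal sketch of the difflib.SequenceMatcher option using only the standard library. The function name `sequence_similarity` and the case-insensitive comparison are assumptions for illustration, not something taken from this project's code.

```python
from difflib import SequenceMatcher


def sequence_similarity(a: str, b: str) -> float:
    """Return difflib's similarity ratio in [0, 1], ignoring case.

    Hypothetical helper: the ratio is 2 * M / T, where M is the number of
    matched characters and T is the combined length of both strings.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


if __name__ == "__main__":
    # Two names that differ only in the last character score close to 1.
    print(sequence_similarity("PyCon 2020", "PyCon 2021"))
```

Note that SequenceMatcher measures matching blocks rather than edit operations, so its scores are not directly interchangeable with a Levenshtein-based similarity.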
@vinayak-mehta is this issue still open?
@Sangarshanan was working on this. If he is not working on it, then you can take this up :)
@Sangarshanan can you please confirm whether you are working on this?
I am not working on this currently. You could take this up if you want @nishant-sethi
@vinayak-mehta I'll pick it up. @Sangarshanan can you please tell me which files need to be changed?
Thank you for picking this up @nishant-sethi :) You could add it as a utility function that gets picked up by the import CLI command. Also, I would suggest you use Levenshtein similarity and set the threshold a bit higher, around 0.9.
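A rough sketch of what such a utility could look like, in pure Python with no third-party dependency. All names here (`levenshtein_distance`, `levenshtein_similarity`, `is_probable_duplicate`) are hypothetical, and normalizing the distance by the longer string's length is one common choice rather than the project's confirmed approach.

```python
from typing import Iterable


def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def is_probable_duplicate(name: str, existing_names: Iterable[str],
                          threshold: float = 0.9) -> bool:
    """Flag `name` if it is at least `threshold` similar to any existing name.

    Assumption: names are compared after stripping whitespace and lowercasing.
    """
    normalized = name.strip().lower()
    return any(
        levenshtein_similarity(normalized, other.strip().lower()) >= threshold
        for other in existing_names
    )
```

With the 0.9 threshold suggested above, two 16-character names that differ by a single character (similarity 0.9375) would be flagged, while clearly different names would not.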
@Sangarshanan Thanks for the suggestion. I'll try to work on this today and raise a PR on which you can provide feedback.
Hi pps! Are you still working on this @nishant-sethi? I can take it if it's free.
@JosemyDuarte you can take this up
Hi @vinayak-mehta! I just opened a PR for this. Whenever you can, please let me know your feedback.
We could have a threshold (0.5 or 0.7) above which a conference could be flagged as a duplicate during import.