Upgrade duplicate finding logic with edit distance #11

Closed
vinayak-mehta opened this issue Oct 28, 2019 · 11 comments
Labels
enhancement New feature or request

Comments

@vinayak-mehta
Owner

We could have a similarity threshold (0.5 or 0.7) above which a conference is flagged as a duplicate during import.

@vinayak-mehta added the enhancement and hacktoberfest labels on Oct 28, 2019
@Sangarshanan
Contributor

We could use Levenshtein distance or difflib.SequenceMatcher. There is a Stack Overflow answer comparing the performance and tradeoffs of both.
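For illustration, here is a minimal sketch of the difflib.SequenceMatcher option. The helper name `similarity` and the lowercase/strip normalization are assumptions made for this example, not code from the project.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the normalized names are identical.

    Hypothetical helper: names are lowercased and stripped before comparison
    so that case and surrounding whitespace do not affect the score.
    """
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


if __name__ == "__main__":
    # Compare a candidate conference name against an existing one.
    print(similarity("PyCon India 2019", "Pycon India 2019 "))
    print(similarity("PyCon India 2019", "JSConf EU 2019"))
```

The resulting ratio could then be compared against whichever threshold is chosen for flagging duplicates.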

@nishant-sethi

@vinayak-mehta is this issue still open?

@vinayak-mehta
Owner Author

@Sangarshanan was working on this. If he is not working on it, then you can take this up :)

@nishant-sethi

@Sangarshanan can you please confirm whether you are working on this?

@Sangarshanan
Contributor

I am not working on this currently. You could take this up if you want, @nishant-sethi.

@nishant-sethi

@vinayak-mehta I'll pick it up. @Sangarshanan can you please tell me which files need to be changed?
@vinayak-mehta Can you please give me an overview of what change is required?

@Sangarshanan
Contributor

Thank you for picking this up @nishant-sethi :)

You could add it as a utility function that gets picked up by the import CLI command. Also, I would suggest you use Levenshtein similarity and set the threshold a bit higher, around 0.9.
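A rough sketch of what such a utility could look like follows. Everything here is an assumption for illustration: the function names (`levenshtein`, `levenshtein_similarity`, `is_duplicate`) are hypothetical, the distance is a plain-Python stand-in rather than a third-party Levenshtein package, and similarity is taken as one minus the distance divided by the longer name's length.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute if different
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1] (assumed definition: 1 - distance / longest)."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty names are treated as identical
    return 1.0 - levenshtein(a, b) / longest


def is_duplicate(name: str, existing: Iterable[str], threshold: float = 0.9) -> bool:
    """Flag name as a duplicate if any existing name meets the threshold."""
    return any(levenshtein_similarity(name, other) >= threshold for other in existing)
```

With the 0.9 threshold suggested above, a one-character typo in a 16-character name (similarity 0.9375) would be flagged, while a clearly different conference name would not.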

@nishant-sethi

@Sangarshanan Thanks for the suggestion. I'll try to work on this today and raise a PR so that you can provide feedback.

@JosemyDuarte
Contributor

Hi pps! Are you still working on this, @nishant-sethi? I can take it if it's free.

@nishant-sethi

@JosemyDuarte you can take this up

@JosemyDuarte
Contributor

Hi @vinayak-mehta! I just opened a PR for this. Whenever you can, please let me know your feedback.


4 participants