Describe the bug
The deduplicate function is very slow even on relatively small datasets.

Steps/Code to Reproduce
from skrub.datasets import make_deduplication_data
from skrub import deduplicate

duplicated = make_deduplication_data(
    examples=['black', 'white', 'red', 'blue', 'green'],
    entries_per_example=[500, 500, 500, 500, 500],
    prob_mistake_per_letter=0.3,
)
deduplicate(duplicated)

Expected Results
Faster results. Some of the internal for loops are not optimal and could be parallelized, for instance along the lines of the sketch below.
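A minimal sketch of the kind of parallelization meant here, assuming the bottleneck is a Python-level loop that computes pairwise string distances. The helper names and the use of difflib are illustrative only, not skrub's actual internals.

from difflib import SequenceMatcher

import numpy as np
from joblib import Parallel, delayed

def _row_distances(s, others):
    # Distance from one string to every other string (1 - similarity ratio).
    return [1.0 - SequenceMatcher(None, s, o).ratio() for o in others]

def pairwise_distances_parallel(strings, n_jobs=-1):
    # Compute each row of the distance matrix as an independent task,
    # spreading the work across all available cores.
    rows = Parallel(n_jobs=n_jobs)(
        delayed(_row_distances)(s, strings) for s in strings
    )
    return np.array(rows)

Each row is a coarse enough task that joblib's dispatch overhead stays small compared with the distance computations themselves.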
Actual Results
Slow: the example above takes ~4 minutes on my laptop.
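For reference, one way to reproduce the timing measurement (the exact figure will of course vary by machine):

import time
from skrub.datasets import make_deduplication_data
from skrub import deduplicate

duplicated = make_deduplication_data(
    examples=['black', 'white', 'red', 'blue', 'green'],
    entries_per_example=[500, 500, 500, 500, 500],
    prob_mistake_per_letter=0.3,
)
start = time.perf_counter()
deduplicate(duplicated)
print(f"deduplicate took {time.perf_counter() - start:.0f} s")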
Versions
Current unreleased version.