deduplicate is slow #617

Closed
jovan-stojanovic opened this issue Jun 23, 2023 · 0 comments · Fixed by #618
Labels
bug Something isn't working

Comments

@jovan-stojanovic
Member

Describe the bug

The deduplicate function is very slow, even on relatively small datasets (2,500 strings in the example below).

Steps/Code to Reproduce

from skrub import deduplicate
from skrub.datasets import make_deduplication_data

# 5 color names, 500 entries each (2,500 strings); each letter is altered with probability 0.3.
duplicated = make_deduplication_data(
    examples=["black", "white", "red", "blue", "green"],
    entries_per_example=[500, 500, 500, 500, 500],
    prob_mistake_per_letter=0.3,
)

# This call is the slow step.
deduplicate(duplicated)

Expected Results

Faster results, for example through parallelization. Some of the for loops in the implementation are not optimal.
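
To illustrate the kind of change meant here, the following is a minimal, generic sketch of turning a for loop over independent iterations into a parallel map. It is not skrub's implementation, and the issue does not say which loops are involved: `work_item`, `run_serial`, and `run_parallel` are hypothetical names, and `work_item` is only a stand-in for whatever one loop iteration computes.

```python
from concurrent.futures import ThreadPoolExecutor


def work_item(n):
    # Hypothetical stand-in for one independent loop iteration.
    return sum(i * i for i in range(n * 1000))


def run_serial(items):
    # Baseline: a plain for loop expressed as a list comprehension.
    return [work_item(n) for n in items]


def run_parallel(items, max_workers=4):
    # Same computation, dispatched to a pool of workers.
    # pool.map returns results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(work_item, items))


items = list(range(2, 10))
serial_results = run_serial(items)
parallel_results = run_parallel(items)
```

Threads are used here only to keep the sketch self-contained. On standard CPython they mainly help when the work releases the GIL (as many NumPy and scikit-learn routines do); for CPU-bound pure-Python loops, a process pool or a vectorized rewrite is more likely to give a real speedup.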

Actual Results

Slow: the example above takes about 4 minutes on my laptop.
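
One way to reproduce a timing like this is a small wall-clock helper. This is a sketch using only the standard library; `time_call` is a hypothetical helper name, not part of skrub.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Usage with the reproduction above (requires skrub, so left commented out):
# deduplicated, seconds = time_call(deduplicate, duplicated)
# print(f"deduplicate took {seconds:.1f} s")
```

A single run is enough to distinguish minutes from seconds; for finer comparisons before and after a fix, repeating the call and taking the minimum gives a more stable number.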

Versions

Current unreleased version