
FEA Parallelize deduplicate function #618

Merged: 16 commits from parallel_deduplicate into skrub-data:main, Jul 18, 2023

Conversation

jovan-stojanovic (Member):
Fix #617

@LilianBoulard (Member) left a comment:

Thanks for the PR, here are a few comments.

(Review threads on skrub/_deduplicate.py and skrub/datasets/_generating.py: outdated, resolved.)
Comment on lines 102 to 115:

```python
@skip_if_no_parallel
def test_parallelism():
    # Test that parallelism works
    X = make_deduplication_data(examples=['black', 'white'],
                                entries_per_example=[15, 15])
    y = deduplicate(X, n_jobs=None)
    for n_jobs in [2, 1, -1]:
        y_parallel = deduplicate(X, n_jobs=n_jobs)
        assert_array_equal(y, y_parallel)

    # Test with threading backend
    with joblib.parallel_backend("threading"):
        y_threading = deduplicate(X, n_jobs=1)
    assert_array_equal(y, y_threading)
```
@LilianBoulard (Member), Jun 26, 2023:
I feel like this test doesn't really check the parallelization: the function could ignore the n_jobs parameter entirely and the test would still pass.
I think we should actually time the calls with different n_jobs and assert that a higher number of jobs makes the call faster.
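
A direct reading of that suggestion might look like the sketch below. This is not the code that was merged: the example sizes are made up, the import locations are assumed from the files this PR touches, and, as the next reply points out, the assertion only holds when the data is big enough.

```python
import time

from skrub import deduplicate
from skrub.datasets import make_deduplication_data

# Deliberately larger than the unit-test data, so that the
# parallelization overhead can be amortized.
X = make_deduplication_data(examples=["black", "white", "red", "green"],
                            entries_per_example=[500, 500, 500, 500])


def timed_call(n_jobs):
    start = time.perf_counter()
    deduplicate(X, n_jobs=n_jobs)
    return time.perf_counter() - start


# More workers should be faster; flaky on small inputs.
assert timed_call(n_jobs=2) < timed_call(n_jobs=1)
```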

jovan-stojanovic (Member, Author):

Ok, I see what you mean, but a higher number of jobs doesn't always make the call faster.
It only does when the data is big enough, which is why we need this feature in the first place. But adding such a large example to the test suite could make our tests very slow on every run, so I don't know if this is a good solution...

jovan-stojanovic (Member, Author):

Ok, I managed to add a test that passes and does not seem too heavy.

(Review thread on CHANGES.rst: outdated, resolved.)
@LilianBoulard (Member) left a comment:

Thanks, here's another pass.

(Two review threads on skrub/tests/test_deduplicate.py: outdated, resolved.)
Comment on lines 124 to 125:

```python
with joblib.parallel_backend("threading"):
    y_threading = deduplicate(X, n_jobs=1)
```
@LilianBoulard (Member):
Why not test the time with both backends? To simplify the code, it could probably be parametrized.
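
A hypothetical shape for that parametrization (a sketch of the suggestion, not the merged test; data sizes are illustrative) could be:

```python
import time

import joblib
import pytest

from skrub import deduplicate
from skrub.datasets import make_deduplication_data


@pytest.mark.parametrize("backend", ["loky", "threading"])
def test_parallel_speedup(backend):
    X = make_deduplication_data(examples=["black", "white"],
                                entries_per_example=[200, 200])

    def timed(n_jobs):
        start = time.perf_counter()
        with joblib.parallel_backend(backend, n_jobs=n_jobs):
            deduplicate(X)
        return time.perf_counter() - start

    # As the next reply notes, this fails for "threading" because of the GIL.
    assert timed(2) < timed(1)
```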

Reply (Member):
After doing so, threading gives terrible performance (after discussing with @jeremiedbb, probably because of the GIL), so right now we don't use threading, just process-based backends (leaving the default, loky).
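
Concretely, leaving the backend choice to joblib only requires the standard Parallel/delayed pattern. A minimal sketch of that design choice, with a hypothetical stand-in for the real per-chunk work (not skrub's actual internals):

```python
from joblib import Parallel, delayed


def _dedupe_chunk(chunk):
    # Hypothetical stand-in for the real CPU-bound work
    # (string-distance computations on a chunk of the data).
    return sorted(set(chunk))


# No backend is forced here, so joblib uses its default loky backend:
# process-based workers that are not serialized by the GIL.
results = Parallel(n_jobs=2)(
    delayed(_dedupe_chunk)(chunk)
    for chunk in [["black", "blck"], ["white", "whyte"]]
)
```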

@GaelVaroquaux (Member) left a comment:

Minor suggestions, which I am going to apply.

(Review threads on CHANGES.rst and skrub/tests/test_deduplicate.py: outdated, resolved.)
```python
DEFAULT_JOBLIB_BACKEND = joblib.parallel.get_active_backend()[0].__class__


class DummyBackend(DEFAULT_JOBLIB_BACKEND):  # type: ignore
```
@GaelVaroquaux (Member):
Fun! Maybe we should add such a pattern to joblib and document it in a "how to test with joblib" guide.

cc @tomMoral

Reply:
Yes, maybe adding a way to test that the parallel_config is used by downstream libraries would be nice.

(Review thread on skrub/tests/test_deduplicate.py: resolved.)

Excerpt under discussion (truncated as shown on the page):

```python
entries_per_example=[15, 15])
deduplicate(X, n_jobs=2)

with joblib.parallel_backend("testing") as (ba, n_jobs):
```

Comment:
With recent versions of joblib, using parallel_config is advised (possible if you support Python 3.8+).
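
For reference, a minimal sketch of the newer API being recommended here (joblib >= 1.3):

```python
from joblib import parallel_config

from skrub import deduplicate
from skrub.datasets import make_deduplication_data

X = make_deduplication_data(examples=["black", "white"],
                            entries_per_example=[15, 15])

# Unlike parallel_backend, parallel_config does not yield the backend
# instance; it only sets the joblib configuration for the block.
with parallel_config(backend="loky", n_jobs=2):
    deduplicate(X)
```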


@GaelVaroquaux (Member) left a comment:
LGTM. Merging

@GaelVaroquaux merged commit a7cfc33 into skrub-data:main on Jul 18, 2023.
19 checks passed.
@jovan-stojanovic deleted the parallel_deduplicate branch on July 21, 2023.
Successfully merging this pull request may close these issues: deduplicate is slow (#617).