FEA Parallelize deduplicate function #618
Conversation
Thanks for the PR, here are a few comments
skrub/tests/test_deduplicate.py (Outdated)

    @skip_if_no_parallel
    def test_parallelism():
        # Test that parallelism works
        X = make_deduplication_data(examples=['black', 'white'],
                                    entries_per_example=[15, 15])
        y = deduplicate(X, n_jobs=None)
        for n_jobs in [2, 1, -1]:
            y_parallel = deduplicate(X, n_jobs=n_jobs)
            assert_array_equal(y, y_parallel)

        # Test with threading backend
        with joblib.parallel_backend("threading"):
            y_threading = deduplicate(X, n_jobs=1)
            assert_array_equal(y, y_threading)
I feel like this test doesn't really check the parallelization: the function could ignore the n_jobs parameter and the test would pass anyway. I think we should actually time the calls with different n_jobs and assert that a higher number of jobs makes the call faster.
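A rough sketch of the timing-based check being suggested, assuming deduplicate and make_deduplication_data are importable as below (the import path for make_deduplication_data, the dataset size, and the speed-up threshold are illustrative and would need tuning to keep the test both reliable and fast on CI):

    import time

    from skrub import deduplicate
    from skrub.datasets import make_deduplication_data  # assumed import path


    def test_parallelism_speedup():
        # Illustrative sizes: big enough that splitting the work can pay off,
        # small enough to keep the test suite reasonably fast.
        X = make_deduplication_data(examples=['black', 'white', 'red', 'blue'],
                                    entries_per_example=[400, 400, 400, 400])

        start = time.perf_counter()
        deduplicate(X, n_jobs=1)
        sequential_time = time.perf_counter() - start

        start = time.perf_counter()
        deduplicate(X, n_jobs=2)
        parallel_time = time.perf_counter() - start

        # Loose bound to avoid flakiness: the parallel call should at least
        # not be much slower than the sequential one.
        assert parallel_time < 1.5 * sequential_time

A strict "more jobs is always faster" assertion tends to be flaky on small data, hence the loose bound in this sketch.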
OK, I see what you mean, but a higher number of jobs doesn't always make the call faster. It only does when the data is big enough, which is why the option is worth adding in the first place. But adding data that large to the test suite could make our tests very slow to run every time, so I'm not sure this is a good solution.
OK, I managed to add a test that passes and does not seem too heavy.
Co-authored-by: Lilian <[email protected]>
Thanks, here's another pass
skrub/tests/test_deduplicate.py (Outdated)

    with joblib.parallel_backend("threading"):
        y_threading = deduplicate(X, n_jobs=1)
Why not test the time with both backends? To simplify the code, it could probably be parametrized.
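A sketch of what that parametrization could look like (the test name is illustrative, and the skrub import paths are assumed as in the snippet above; a timing comparison could be added in the same parametrized form):

    import joblib
    import pytest
    from numpy.testing import assert_array_equal

    from skrub import deduplicate
    from skrub.datasets import make_deduplication_data  # assumed import path


    @pytest.mark.parametrize("backend", ["loky", "threading"])
    def test_deduplicate_backend(backend):
        X = make_deduplication_data(examples=['black', 'white'],
                                    entries_per_example=[15, 15])
        y_ref = deduplicate(X, n_jobs=1)
        with joblib.parallel_backend(backend):
            y_parallel = deduplicate(X, n_jobs=2)
        # The result must not depend on the joblib backend in use.
        assert_array_equal(y_ref, y_parallel)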
After trying that, threading gives terrible performance (after discussing with @jeremiedbb, probably because of the GIL), so right now we don't use threading, only process-based backends (and leave the default, loky).
Minor suggestions which I am going to apply
    DEFAULT_JOBLIB_BACKEND = joblib.parallel.get_active_backend()[0].__class__


    class DummyBackend(DEFAULT_JOBLIB_BACKEND):  # type: ignore
Fun! Maybe we should add such a pattern to joblib and document it in a "how to test with joblib" guide, cc @tomMoral
Yes, maybe adding a way to test that the parallel_config is used by downstream libraries would be nice.
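For reference, here is a sketch of the full pattern being discussed, built around the quoted snippet (start_call and register_parallel_backend are standard joblib APIs; the test body, counter, and skrub import paths are illustrative):

    import joblib

    from skrub import deduplicate
    from skrub.datasets import make_deduplication_data  # assumed import path

    # Class of the currently active backend (loky by default).
    DEFAULT_JOBLIB_BACKEND = joblib.parallel.get_active_backend()[0].__class__


    class DummyBackend(DEFAULT_JOBLIB_BACKEND):  # type: ignore
        """A backend that counts how many times it is asked to run work."""

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.count = 0

        def start_call(self):
            # start_call is invoked by joblib.Parallel at the start of each
            # parallel call, so a non-zero count proves the backend was used.
            self.count += 1
            return super().start_call()


    joblib.register_parallel_backend("testing", DummyBackend)


    def test_deduplicate_respects_joblib_backend():
        X = make_deduplication_data(examples=['black', 'white'],
                                    entries_per_example=[15, 15])
        with joblib.parallel_backend("testing") as (ba, n_jobs):
            deduplicate(X, n_jobs=2)
        # If deduplicate ignored joblib, the dummy backend would never run.
        assert ba.count > 0

The trick is that the dummy backend inherits all the real behavior of the default backend, so the function under test still runs normally while the counter records that joblib's configuration was actually honored.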
Co-authored-by: Gael Varoquaux <[email protected]>
                                entries_per_example=[15, 15])
    deduplicate(X, n_jobs=2)

    with joblib.parallel_backend("testing") as (ba, n_jobs):
With recent versions of joblib, using parallel_config is advised (possible if you support Python 3.8+).
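A minimal sketch of what that migration could look like, assuming joblib >= 1.3 (where parallel_config is available) and the skrub import paths used above:

    from joblib import parallel_config

    from skrub import deduplicate
    from skrub.datasets import make_deduplication_data  # assumed import path

    X = make_deduplication_data(examples=['black', 'white'],
                                entries_per_example=[15, 15])

    # joblib >= 1.3 (hence Python 3.8+): parallel_config replaces the older
    # parallel_backend context manager and can also carry a default n_jobs.
    with parallel_config(backend="loky", n_jobs=2):
        # With n_jobs=None, the joblib calls inside deduplicate fall back to
        # the backend and n_jobs configured by the surrounding context.
        deduplicated = deduplicate(X, n_jobs=None)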
LGTM. Merging
Fix #617