
FEA Parallelize deduplicate function #618

Merged: 16 commits from parallel_deduplicate into skrub-data:main, Jul 18, 2023

Conversation

jovan-stojanovic (Member):
Fix #617

@LilianBoulard (Member) left a comment:

Thanks for the PR, here are a few comments.

(Review threads on skrub/_deduplicate.py and skrub/datasets/_generating.py: outdated, resolved.)
Comment on lines 102 to 115:

```python
@skip_if_no_parallel
def test_parallelism():
    # Test that parallelism works
    X = make_deduplication_data(examples=['black', 'white'],
                                entries_per_example=[15, 15])
    y = deduplicate(X, n_jobs=None)
    for n_jobs in [2, 1, -1]:
        y_parallel = deduplicate(X, n_jobs=n_jobs)
        assert_array_equal(y, y_parallel)

    # Test with threading backend
    with joblib.parallel_backend("threading"):
        y_threading = deduplicate(X, n_jobs=1)
    assert_array_equal(y, y_threading)
```
@LilianBoulard (Member), Jun 26, 2023:
I feel like this test doesn't really check the parallelization: the function could ignore the n_jobs parameter entirely and the test would still pass.
I think we should actually time the calls with different n_jobs and assert that a higher number of jobs makes the call faster.
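
A direct reading of that suggestion might look like the sketch below. This is not the code that was merged: the example sizes are made up, the import locations are assumed from the files this PR touches, and, as the next reply points out, the assertion only holds when the data is big enough.

```python
import time

from skrub import deduplicate
from skrub.datasets import make_deduplication_data

# Deliberately larger than the unit-test data, so that the
# parallelization overhead can be amortized.
X = make_deduplication_data(examples=["black", "white", "red", "green"],
                            entries_per_example=[500, 500, 500, 500])


def timed_call(n_jobs):
    start = time.perf_counter()
    deduplicate(X, n_jobs=n_jobs)
    return time.perf_counter() - start


# More workers should be faster; flaky on small inputs.
assert timed_call(n_jobs=2) < timed_call(n_jobs=1)
```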

jovan-stojanovic (Member, Author):

Ok, I see what you mean, but a higher number of jobs doesn't always make the call faster.
It only does when the data is big enough, which is why we need this feature in the first place. But adding such a large example to the test suite could make our tests very slow on every run, so I don't know if this is a good solution...

jovan-stojanovic (Member, Author):

Ok, I managed to add a test that passes and does not seem too heavy.

(Review thread on CHANGES.rst: outdated, resolved.)
@LilianBoulard (Member) left a comment:

Thanks, here's another pass.

(Two review threads on skrub/tests/test_deduplicate.py: outdated, resolved.)
Comment on lines 124 to 125:

```python
with joblib.parallel_backend("threading"):
    y_threading = deduplicate(X, n_jobs=1)
```
@LilianBoulard (Member):
Why not test the time with both backends? To simplify the code, it could probably be parametrized.
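
A hypothetical shape for that parametrization (a sketch of the suggestion, not the merged test; data sizes are illustrative) could be:

```python
import time

import joblib
import pytest

from skrub import deduplicate
from skrub.datasets import make_deduplication_data


@pytest.mark.parametrize("backend", ["loky", "threading"])
def test_parallel_speedup(backend):
    X = make_deduplication_data(examples=["black", "white"],
                                entries_per_example=[200, 200])

    def timed(n_jobs):
        start = time.perf_counter()
        with joblib.parallel_backend(backend, n_jobs=n_jobs):
            deduplicate(X)
        return time.perf_counter() - start

    # As the next reply notes, this fails for "threading" because of the GIL.
    assert timed(2) < timed(1)
```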

Reply (Member):
After doing so, threading gives terrible performance (after discussing with @jeremiedbb, probably because of the GIL), so right now we don't use threading, just process-based backends (leaving the default, loky).
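
Concretely, leaving the backend choice to joblib only requires the standard Parallel/delayed pattern. A minimal sketch of that design choice, with a hypothetical stand-in for the real per-chunk work (not skrub's actual internals):

```python
from joblib import Parallel, delayed


def _dedupe_chunk(chunk):
    # Hypothetical stand-in for the real CPU-bound work
    # (string-distance computations on a chunk of the data).
    return sorted(set(chunk))


# No backend is forced here, so joblib uses its default loky backend:
# process-based workers that are not serialized by the GIL.
results = Parallel(n_jobs=2)(
    delayed(_dedupe_chunk)(chunk)
    for chunk in [["black", "blck"], ["white", "whyte"]]
)
```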

@GaelVaroquaux (Member) left a comment:

Minor suggestions, which I am going to apply.

(Review threads on CHANGES.rst and skrub/tests/test_deduplicate.py: outdated, resolved.)
```python
DEFAULT_JOBLIB_BACKEND = joblib.parallel.get_active_backend()[0].__class__


class DummyBackend(DEFAULT_JOBLIB_BACKEND):  # type: ignore
```
@GaelVaroquaux (Member):
Fun! Maybe we should add such a pattern to joblib and document it in a "how to test with joblib" guide.

cc @tomMoral

Reply:
Yes, maybe adding a way to test that the parallel_config is used by downstream libraries would be nice.

(Review thread on skrub/tests/test_deduplicate.py: resolved.)

Excerpt under discussion (truncated as shown on the page):

```python
entries_per_example=[15, 15])
deduplicate(X, n_jobs=2)

with joblib.parallel_backend("testing") as (ba, n_jobs):
```

Comment:
With recent versions of joblib, using parallel_config is advised (possible if you support Python 3.8+).
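
For reference, a minimal sketch of the newer API being recommended here (joblib >= 1.3):

```python
from joblib import parallel_config

from skrub import deduplicate
from skrub.datasets import make_deduplication_data

X = make_deduplication_data(examples=["black", "white"],
                            entries_per_example=[15, 15])

# Unlike parallel_backend, parallel_config does not yield the backend
# instance; it only sets the joblib configuration for the block.
with parallel_config(backend="loky", n_jobs=2):
    deduplicate(X)
```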


@GaelVaroquaux (Member) left a comment:
LGTM. Merging

@GaelVaroquaux merged commit a7cfc33 into skrub-data:main on Jul 18, 2023.
19 checks passed.
@jovan-stojanovic deleted the parallel_deduplicate branch on July 21, 2023.
Successfully merging this pull request may close these issues: deduplicate is slow (#617).