
[AIR] Add batch_size arg for BatchMapper. #29193

Merged

Conversation

@clarkzinzow (Contributor) commented Oct 7, 2022

The default batch_size of 4096 at the Datasets level doesn't suffice for all use cases: it can be too large for wide tables and large images, leading to DRAM/GRAM OOMs, and it can be too small for narrow tables, leading to unnecessary batch slicing overhead and suboptimal vectorized operations in users' UDFs. We should allow users to configure the batch_size at the AIR level.

Closes #29168
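
For illustration, a minimal usage sketch of the new argument, assuming the Ray 2.x AIR API of the time (ray.data.preprocessors.BatchMapper with a pandas UDF); exact import paths and defaults may differ:

import pandas as pd
import ray
from ray.data.preprocessors import BatchMapper

def add_one(batch: pd.DataFrame) -> pd.DataFrame:
    # Runs once per batch; the batch's row count is bounded by batch_size below.
    batch["value"] = batch["value"] + 1
    return batch

ds = ray.data.from_pandas(pd.DataFrame({"value": list(range(10))}))
# Override the Datasets-level default of 4096, e.g. for wide rows that
# would otherwise OOM.
mapper = BatchMapper(fn=add_one, batch_format="pandas", batch_size=2)
out = mapper.transform(ds)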

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@jiaodong (Member) left a comment

Makes sense to me to get closer to map_batches; a few nits:

@@ -152,6 +152,28 @@ def add_and_modify_udf_numpy(data: Union[np.ndarray, Dict[str, np.ndarray]]):
    assert_frame_equal(out_df, expected_numpy_df)


def test_batch_mapper_batch_size():
Member

Just for better coverage, can we parametrize this with other combinations of data formats and a numpy UDF as well?
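
A sketch of the parametrization being requested (the test body is elided; names follow the test above and are otherwise hypothetical):

import pytest

@pytest.mark.parametrize("batch_format", ["pandas", "numpy"])
@pytest.mark.parametrize("batch_size", [1, 2, 4])
def test_batch_mapper_batch_size(batch_format, batch_size):
    # Build a BatchMapper with the given format/size and assert that each
    # UDF invocation sees at most batch_size rows (body elided here).
    ...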

 if transform_type == "pandas":
-    return dataset.map_batches(self._transform_pandas, batch_format="pandas")
+    return dataset.map_batches(
+        self._transform_pandas, batch_format="pandas", **kwargs
Member

Can we narrow this function down to explicit fields rather than an arbitrary kwargs dict?
There are only batch_size, compute, and batch_format in map_batches that likely require explicit values; ** is used for ray_remote_args, which ideally we don't want mixed with the param values above.

def map_batches(
    self,
    fn: BatchUDF,
    *,
    batch_size: Optional[int] = 4096,
    compute: Optional[Union[str, ComputeStrategy]] = None,
    batch_format: Literal["default", "pandas", "pyarrow", "numpy"] = "default",
    fn_args: Optional[Iterable[Any]] = None,
    fn_kwargs: Optional[Dict[str, Any]] = None,
    fn_constructor_args: Optional[Iterable[Any]] = None,
    fn_constructor_kwargs: Optional[Dict[str, Any]] = None,
    **ray_remote_args,
) -> "Dataset[Any]":

Contributor Author

The point of the **kwargs passthrough is to make this friendly to current and future .map_batches() arguments, letting subclasses of Preprocessor, including custom preprocessors, opt in to passing any of these args. Otherwise, we'll need to update the base Preprocessor whenever we expose a new parameter, and devs/advanced users will either be blocked from using that parameter until the next release or will have to override Preprocessor._transform(). Explicitly enumerating each potential field that may or may not exist on the preprocessor will also be more complex than this kwargs passthrough.

> ** is used for ray_remote_args, which ideally we don't want mixed with the param values above.

What if we (or a user) have a preprocessor that we (they) want to run on a GPU, or that's mostly I/O-bound so we (they) want to request a fractional CPU, or we (they) want to specify custom retry logic? We do want to expose these.

Since this kwarg passthrough isn't exposed to any users except for advanced users implementing custom preprocessors, is this really an issue? I feel like the future-proofing + simplicity advantages outweigh the disadvantages.
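
For context, a minimal sketch of the passthrough pattern being defended here; the class shapes and the _get_transform_kwargs name are illustrative, not the PR's actual code:

from typing import Any, Callable, Dict

class Preprocessor:
    # Base class: forwards whatever map_batches() kwargs a subclass opts
    # into, so newly added Dataset.map_batches() parameters (batch_size,
    # compute, num_gpus, ...) need no base-class changes.
    def _get_transform_kwargs(self) -> Dict[str, Any]:
        return {}

    def _transform(self, dataset):
        return dataset.map_batches(
            self._transform_pandas,
            batch_format="pandas",
            **self._get_transform_kwargs(),
        )

class BatchMapper(Preprocessor):
    def __init__(self, fn: Callable, batch_size: int = 4096):
        self.fn = fn
        self.batch_size = batch_size

    def _get_transform_kwargs(self) -> Dict[str, Any]:
        # Opt in to just the args this preprocessor cares about;
        # ray_remote_args like num_gpus could be forwarded the same way.
        return {"batch_size": self.batch_size}

    def _transform_pandas(self, df):
        return self.fn(df)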

Contributor Author

I'll take a stab at blowing these out into batch_size, compute, batch_format, and ray_remote_args getters that can be overridden, but I don't know if it will be a net benefit.

Contributor Author

Another disadvantage of making each of these kwargs explicit is that supporting batch_size=None is more difficult, since we need to delineate between a preprocessor that doesn't set batch_size and a preprocessor that wants to disable batching with batch_size=None. So we either need some other "not specified" indicator or start duplicating default values, which isn't great.
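
One common way to make the distinction described above is a module-level sentinel as the default; this is a sketch with illustrative names, not the PR's implementation:

from typing import Any, Dict

_UNSPECIFIED = object()  # sentinel: "caller did not set batch_size at all"

def build_map_batches_kwargs(batch_size: Any = _UNSPECIFIED) -> Dict[str, Any]:
    kwargs: Dict[str, Any] = {}
    if batch_size is not _UNSPECIFIED:
        # An explicit None is forwarded (disabling batching), while leaving
        # the argument unset defers to map_batches()'s own default.
        kwargs["batch_size"] = batch_size
    return kwargs

assert build_map_batches_kwargs() == {}
assert build_map_batches_kwargs(batch_size=None) == {"batch_size": None}
assert build_map_batches_kwargs(batch_size=256) == {"batch_size": 256}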

Member

Ah, I mean it's less error-prone for us to explicitly call with args, e.g. dataset.map_batches(self._transform_numpy, batch_format="numpy", batch_size={field}, **kwargs), since technically we can also hide the batch_format kwarg completely.

Contributor Author

Hmm, what error do you see arising there? Could you give an example?

> since technically we can also hide the batch_format kwarg completely

batch_format is hardcoded since we have generic logic in Preprocessor to determine the appropriate batch format, while the rest (including batch_size) are determined entirely by the subclasses, so I think that difference in treatment is reasonable.

I can pop batch_size out of the kwargs and pass it explicitly, but if we're still passing the rest as a **kwargs passthrough, I'm not sure what that would give us.

Member

I think I might have missed some context you had, or future plans regarding how we handle batch sizes, as always being explicit with the batch_size value when calling each map_batches is more appealing to me. But let's consolidate this in the API discussion below :)

@amogkam (Contributor) left a comment

Can we hold off on merging this until we reach consensus on the API decision here: #29229?

Just want to make sure we are very intentional about our API changes so that we don't change them around too frequently.

cc @matthewdeng

@clarkzinzow (Contributor Author)

@amogkam Sounds good!

@clarkzinzow (Contributor Author) commented Oct 31, 2022

@amogkam Any particular reason you removed the added test coverage from test_dataset_pandas.py?

@clarkzinzow (Contributor Author)

@amogkam Also, why were these changes necessary? f46e683

@clarkzinzow (Contributor Author)

@amogkam Ah got it, I forgot that I pulled out the .to_pandas() fix into another PR! Thanks for bringing this PR in line with current master.

@stephanie-wang (Contributor) left a comment

Thanks for making this change!

@c21 (Contributor) left a comment

LGTM with one question. Thanks @clarkzinzow.

@@ -68,6 +76,7 @@ def __init__(
             ],
         ],
         batch_format: Optional[str] = None,
+        batch_size: int = DEFAULT_BATCH_SIZE,
Contributor

Do we want to declare batch_size: Optional[int], so users have the flexibility to pass batch_size=None to consume a full block as one batch?

I guess it's not a big deal for now, but I want to avoid future changes to public APIs when we find more edge cases to support.
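
A sketch of the suggested widening; DEFAULT_BATCH_SIZE is assumed to be the existing 4096 constant, and the class body is abbreviated:

from typing import Callable, Optional

DEFAULT_BATCH_SIZE = 4096  # assumed value of the existing constant

class BatchMapper:
    def __init__(
        self,
        fn: Callable,
        batch_format: Optional[str] = None,
        # Optional[int] so callers can pass batch_size=None to consume a
        # full block as one batch, per the reviewer's suggestion.
        batch_size: Optional[int] = DEFAULT_BATCH_SIZE,
    ):
        self.fn = fn
        self.batch_format = batch_format
        self.batch_size = batch_size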

Contributor

yes, good point!

        return batch

    batch_mapper = BatchMapper(
        fn=check_batch_size, batch_size=batch_size, batch_format="pandas"
Member

Have one for batch_format="numpy" as well?
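
A hedged sketch of what that numpy counterpart might look like (batch_size is fixed here for illustration; in the real test it is parametrized):

import numpy as np
from ray.data.preprocessors import BatchMapper

batch_size = 2  # fixed for illustration

def check_batch_size(batch: np.ndarray) -> np.ndarray:
    # Mirror of the pandas check quoted above, for the numpy batch format.
    assert len(batch) == batch_size
    return batch

batch_mapper = BatchMapper(
    fn=check_batch_size, batch_size=batch_size, batch_format="numpy"
)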

@jiaodong (Member) left a comment

LG; only comments about enhancing test coverage of batch_size in various cases.

@amogkam amogkam merged commit 28a2959 into ray-project:master Nov 1, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
Successfully merging this pull request may close these issues.

[AIR] Dynamic block splitting does not work for BatchMapper