[Data] Use function name for progress bars in ray dataset #31526

peytondmurray · 2023-01-08T21:57:32Z

Why are these changes needed?

This PR changes map_batches to display an informative name in the progress bar that is created during calls to ray.data.dataset.Dataset.map_batches instead of a generic "map_batches" message, so that a user knows what function is being computed for a given display bar. If fn is a bound method of a Preprocessor subclass, the progress bar displays the name of the class. Otherwise the MapBatches(fn.__name__) is displayed; if the __name__ attribute is not present on the callable passed to map_batches, type(fn) is used instead; this happens if the callable is a class instance, for example.

Related issue number

Closes #31224.

Checks

I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

c21 · 2023-01-18T19:00:06Z

Hi @peytondmurray, thanks for opening up the PR! Could you rebase the PR to latest master, and fix the unit test failures? Thanks.

peytondmurray · 2023-01-18T19:18:02Z

Yep, working on the test failures now. Unfortunately it seems like stdout is being examined in many tests across the code base, so anything that calls map_batches needs to be modified. Might be something to consider looking at again in the future if we want to refactor some of the tests.

peytondmurray · 2023-01-19T09:19:30Z

Okay, I think test failures appear to be unrelated.

python/ray/air/tests/test_dataset_config.py

python/ray/data/dataset.py

python/ray/air/tests/test_dataset_config.py

python/ray/data/tests/test_dataset.py

.bazeliskrc

python/ray/air/tests/test_dataset_config.py

peytondmurray · 2023-01-21T17:32:48Z

@bveeramani Love the idea. There's a little trouble in that the function that gets passed to Dataset.map_batches when a preprocessor is specified to DataParallelTrainer (as is the case in ray/python/ray/air/tests/test_dataset_config.py::test_stream_inf_window_cache_prep) is a <bound method BatchMapper._transform_pandas of BatchMapper(fn=rand)> - that is, it's a method of a Preprocessor subclass, not the subclass itself. This means that's it's not as straightforward to check whether the mapping function is a Preprocessor method.

It might be possible to do this by examining fn.__qualname__ to get the fully qualified name; then strip the function name off the end; then systematically test each segment to see if it is in globals() to look for the class the method is attached to. Then we can call issubclass(cls, Preprocessor) to definitively identify if it is a Preprocessor and to retrieve the correct class name. Maybe there's a simpler way of doing this?

Edit Okay, so one alternative I came up with is to attach the BatchMapper class as an attribute to the function being passed to map_batches to allow me to use it in status messages there. Maybe this is acceptable?

Edit 2 Never mind, you can get the class name from a bound method with __self__. Implementing now!

peytondmurray · 2023-01-23T17:27:57Z

@bveeramani @ericl Preprocessors now correctly show their class names; here's what the progress bars currently look like:

$ pytest -s test_batch_predictor.py::test_separate_gpu_stage
Test session starts (platform: linux, Python 3.10.8, pytest 7.2.0, pytest-sugar 0.9.6)
rootdir: /home/pdmurray/Desktop/workspace/ray, configfile: pytest.ini
plugins: clarity-1.0.1, sugar-0.9.6, typeguard-2.13.3, anyio-3.6.2
collecting ... 2023-01-23 09:14:16,764	INFO worker.py:1546 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
2023-01-23 09:14:18,290	WARNING read_api.py:330 -- ⚠️  The number of blocks in this dataset (10) limits its parallelism to 10 concurrent tasks. This is much less than the number of available CPU slots in the cluster. Use `.repartition(n)` to increase the number of dataset blocks.
2023-01-23 09:14:18,298	INFO batch_predictor.py:184 -- `num_gpus_per_worker` is set for `BatchPreditor`.Automatically enabling GPU prediction for this predictor. To disable set `use_gpu` to `False` in `BatchPredictor.predict`.
Read->Dummypreprocessor: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 12.83it/s]
2023-01-23 09:14:19,096	WARNING compute.py:543 -- `batch_size` is set to 4096, which reduces parallelism from 10 to 1. If the performance is worse than expected, this may indicate that the batch size is too large or the input block size is too small. To reduce batch size, consider decreasing `batch_size` or use the default in `map_batches`. To increase input block size, consider decreasing `parallelism` in read.
Map Progress (1 actors 1 pending): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.01s/it]
Shuffle Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 109.23it/s]
Shuffle Reduce: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 195.17it/s]
2023-01-23 09:14:20,129	WARNING read_api.py:330 -- ⚠️  The number of blocks in this dataset (10) limits its parallelism to 10 concurrent tasks. This is much less than the number of available CPU slots in the cluster. Use `.repartition(n)` to increase the number of dataset blocks.
Read: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 1568.43it/s]
2023-01-23 09:14:20,148	WARNING compute.py:543 -- `batch_size` is set to 4096, which reduces parallelism from 10 to 1. If the performance is worse than expected, this may indicate that the batch size is too large or the input block size is too small. To reduce batch size, consider decreasing `batch_size` or use the default in `map_batches`. To increase input block size, consider decreasing `parallelism` in read.
Map Progress (1 actors 1 pending): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.98s/it]
Shuffle Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 155.07it/s]
Shuffle Reduce: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 275.09it/s]

 python/ray/train/tests/test_batch_predictor.py ✓                                                                                                                                                  100% ██████████

Results (10.72s):
       1 passed

ericl · 2023-01-26T20:05:45Z

Oh title() is super weird, I don't know why anyone would want it force-lowercasing the other chars. How about we use this replacement function instead?

>>> def capfirst(s):
...     return s[:1].upper() + s[1:]
... 
>>> def capitalize(s):
...    return "".join(capfirst(x) for x in s.split("_"))
... 
>>> capitalize("map_batches")
'MapBatches'
>>> capitalize(capitalize("map_batches"))
'MapBatches'

peytondmurray · 2023-01-26T20:11:32Z

Sure, I'll try this and see what the impact on other tests is.

Signed-off-by: pdmurray <[email protected]>

This PR is a quick fix to remove the non-useful comment introduced in #31526, probably during debugging. Replace the comment with a meaningful one.

…ect#32020) This PR is a quick fix to remove the non-useful comment introduced in ray-project#31526, probably during debugging. Replace the comment with a meaningful one.

…32411) This is to fix the Dataset.__repr__ issue in #32410, after we introduce function name in #31526. We should only make operator/stage name to be camel case. Signed-off-by: Cheng Su <[email protected]>

…ay-project#32411) This is to fix the Dataset.__repr__ issue in ray-project#32410, after we introduce function name in ray-project#31526. We should only make operator/stage name to be camel case. Signed-off-by: Cheng Su <[email protected]>

…32411) (#32434) This is to fix the Dataset.__repr__ issue in #32410, after we introduce function name in #31526. We should only make operator/stage name to be camel case. Signed-off-by: Cheng Su <[email protected]>

Signed-off-by: pdmurray <[email protected]> Signed-off-by: pdmurray <[email protected]> Signed-off-by: Edward Oakes <[email protected]>

…ect#32020) This PR is a quick fix to remove the non-useful comment introduced in ray-project#31526, probably during debugging. Replace the comment with a meaningful one. Signed-off-by: Edward Oakes <[email protected]>

…ay-project#32411) This is to fix the Dataset.__repr__ issue in ray-project#32410, after we introduce function name in ray-project#31526. We should only make operator/stage name to be camel case. Signed-off-by: Cheng Su <[email protected]> Signed-off-by: Edward Oakes <[email protected]>

peytondmurray force-pushed the dataset-progress-bar-names branch from 277878c to b728e5e Compare January 11, 2023 11:32

c21 assigned c21 and scottjlee Jan 18, 2023

peytondmurray force-pushed the dataset-progress-bar-names branch 3 times, most recently from b78bd4d to 81a0850 Compare January 19, 2023 07:00

peytondmurray marked this pull request as ready for review January 19, 2023 09:19

peytondmurray requested review from ericl, scv119, clarkzinzow, jjyao, jianoaix and c21 as code owners January 19, 2023 09:19

peytondmurray changed the title ~~[WIP] Use function name for progress bars in ray dataset~~ [Data] Use function name for progress bars in ray dataset Jan 19, 2023

scottjlee approved these changes Jan 19, 2023

View reviewed changes

python/ray/air/tests/test_dataset_config.py Outdated Show resolved Hide resolved

python/ray/data/dataset.py Outdated Show resolved Hide resolved

ericl requested changes Jan 19, 2023

View reviewed changes

python/ray/air/tests/test_dataset_config.py Outdated Show resolved Hide resolved

python/ray/data/tests/test_dataset.py Outdated Show resolved Hide resolved

ericl self-assigned this Jan 19, 2023

ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 19, 2023

bveeramani reviewed Jan 20, 2023

View reviewed changes

.bazeliskrc Show resolved Hide resolved

python/ray/air/tests/test_dataset_config.py Outdated Show resolved Hide resolved

python/ray/air/tests/test_dataset_config.py Outdated Show resolved Hide resolved

peytondmurray force-pushed the dataset-progress-bar-names branch 4 times, most recently from 74ac9d2 to 68bb2e1 Compare January 22, 2023 00:28

peytondmurray requested review from ericl and bveeramani January 23, 2023 17:28

peytondmurray force-pushed the dataset-progress-bar-names branch 2 times, most recently from 29a0e8a to 70d40f3 Compare January 26, 2023 20:35

Add informative progress bar names to map_batches

2bcb03e

Signed-off-by: pdmurray <[email protected]>

peytondmurray force-pushed the dataset-progress-bar-names branch from 70d40f3 to 2bcb03e Compare January 27, 2023 00:15

peytondmurray removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 27, 2023

ericl approved these changes Jan 27, 2023

View reviewed changes

ericl merged commit 3343c76 into ray-project:master Jan 27, 2023

peytondmurray deleted the dataset-progress-bar-names branch January 27, 2023 17:24

c21 mentioned this pull request Jan 28, 2023

[Datasets] Remove the non-useful comment in map_batches() #32020

Merged

7 tasks

ericl pushed a commit that referenced this pull request Jan 31, 2023

[Datasets] Remove the non-useful comment in map_batches() (#32020)

34e2cd5

This PR is a quick fix to remove the non-useful comment introduced in #31526, probably during debugging. Replace the comment with a meaningful one.

This was referenced Feb 10, 2023

[Datasets] map_batches() function name is forced to camel case in Dataset.__repr__ #32410

Closed

[Datasets] Not change map_batches() UDF name in Dataset.__repr__ #32411

Merged

c21 mentioned this pull request Feb 10, 2023

[2.3][Datasets] Not change map_batches() UDF name in Dataset.__repr__ #32434

Merged

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Data] Use function name for progress bars in ray dataset #31526

[Data] Use function name for progress bars in ray dataset #31526

peytondmurray commented Jan 8, 2023 •

edited

Loading

c21 commented Jan 18, 2023

peytondmurray commented Jan 18, 2023

peytondmurray commented Jan 19, 2023

peytondmurray commented Jan 21, 2023 •

edited

Loading

peytondmurray commented Jan 23, 2023

ericl commented Jan 26, 2023

peytondmurray commented Jan 26, 2023

[Data] Use function name for progress bars in ray dataset #31526

[Data] Use function name for progress bars in ray dataset #31526

Conversation

peytondmurray commented Jan 8, 2023 • edited Loading

Why are these changes needed?

Related issue number

Checks

c21 commented Jan 18, 2023

peytondmurray commented Jan 18, 2023

peytondmurray commented Jan 19, 2023

peytondmurray commented Jan 21, 2023 • edited Loading

peytondmurray commented Jan 23, 2023

ericl commented Jan 26, 2023

peytondmurray commented Jan 26, 2023

peytondmurray commented Jan 8, 2023 •

edited

Loading

peytondmurray commented Jan 21, 2023 •

edited

Loading