Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Data] Use function name for progress bars in ray dataset #31526

Merged
merged 1 commit into from
Jan 27, 2023

Conversation

peytondmurray
Copy link
Contributor

@peytondmurray peytondmurray commented Jan 8, 2023

Why are these changes needed?

This PR changes map_batches to display an informative name in the progress bar that is created during calls to ray.data.dataset.Dataset.map_batches instead of a generic "map_batches" message, so that a user knows what function is being computed for a given display bar. If fn is a bound method of a Preprocessor subclass, the progress bar displays the name of the class. Otherwise the MapBatches(fn.__name__) is displayed; if the __name__ attribute is not present on the callable passed to map_batches, type(fn) is used instead; this happens if the callable is a class instance, for example.

Related issue number

Closes #31224.

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@c21
Copy link
Contributor

c21 commented Jan 18, 2023

Hi @peytondmurray, thanks for opening up the PR! Could you rebase the PR to latest master, and fix the unit test failures? Thanks.

@peytondmurray
Copy link
Contributor Author

Yep, working on the test failures now. Unfortunately it seems like stdout is being examined in many tests across the code base, so anything that calls map_batches needs to be modified. Might be something to consider looking at again in the future if we want to refactor some of the tests.

@peytondmurray peytondmurray force-pushed the dataset-progress-bar-names branch 3 times, most recently from b78bd4d to 81a0850 Compare January 19, 2023 07:00
@peytondmurray
Copy link
Contributor Author

Okay, I think test failures appear to be unrelated.

@peytondmurray peytondmurray marked this pull request as ready for review January 19, 2023 09:19
@peytondmurray peytondmurray changed the title [WIP] Use function name for progress bars in ray dataset [Data] Use function name for progress bars in ray dataset Jan 19, 2023
python/ray/air/tests/test_dataset_config.py Outdated Show resolved Hide resolved
python/ray/data/dataset.py Outdated Show resolved Hide resolved
python/ray/air/tests/test_dataset_config.py Outdated Show resolved Hide resolved
python/ray/data/tests/test_dataset.py Outdated Show resolved Hide resolved
@ericl ericl self-assigned this Jan 19, 2023
@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 19, 2023
.bazeliskrc Show resolved Hide resolved
python/ray/air/tests/test_dataset_config.py Outdated Show resolved Hide resolved
python/ray/air/tests/test_dataset_config.py Outdated Show resolved Hide resolved
@peytondmurray
Copy link
Contributor Author

peytondmurray commented Jan 21, 2023

@bveeramani Love the idea. There's a little trouble in that the function that gets passed to Dataset.map_batches when a preprocessor is specified to DataParallelTrainer (as is the case in ray/python/ray/air/tests/test_dataset_config.py::test_stream_inf_window_cache_prep) is a <bound method BatchMapper._transform_pandas of BatchMapper(fn=rand)> - that is, it's a method of a Preprocessor subclass, not the subclass itself. This means that's it's not as straightforward to check whether the mapping function is a Preprocessor method.

It might be possible to do this by examining fn.__qualname__ to get the fully qualified name; then strip the function name off the end; then systematically test each segment to see if it is in globals() to look for the class the method is attached to. Then we can call issubclass(cls, Preprocessor) to definitively identify if it is a Preprocessor and to retrieve the correct class name. Maybe there's a simpler way of doing this?

Edit Okay, so one alternative I came up with is to attach the BatchMapper class as an attribute to the function being passed to map_batches to allow me to use it in status messages there. Maybe this is acceptable?

Edit 2 Never mind, you can get the class name from a bound method with __self__. Implementing now!

@peytondmurray peytondmurray force-pushed the dataset-progress-bar-names branch 4 times, most recently from 74ac9d2 to 68bb2e1 Compare January 22, 2023 00:28
@peytondmurray
Copy link
Contributor Author

@bveeramani @ericl Preprocessors now correctly show their class names; here's what the progress bars currently look like:

$ pytest -s test_batch_predictor.py::test_separate_gpu_stage
Test session starts (platform: linux, Python 3.10.8, pytest 7.2.0, pytest-sugar 0.9.6)
rootdir: /home/pdmurray/Desktop/workspace/ray, configfile: pytest.ini
plugins: clarity-1.0.1, sugar-0.9.6, typeguard-2.13.3, anyio-3.6.2
collecting ... 2023-01-23 09:14:16,764	INFO worker.py:1546 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
2023-01-23 09:14:18,290	WARNING read_api.py:330 -- ⚠️  The number of blocks in this dataset (10) limits its parallelism to 10 concurrent tasks. This is much less than the number of available CPU slots in the cluster. Use `.repartition(n)` to increase the number of dataset blocks.
2023-01-23 09:14:18,298	INFO batch_predictor.py:184 -- `num_gpus_per_worker` is set for `BatchPreditor`.Automatically enabling GPU prediction for this predictor. To disable set `use_gpu` to `False` in `BatchPredictor.predict`.
Read->Dummypreprocessor: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 12.83it/s]
2023-01-23 09:14:19,096	WARNING compute.py:543 -- `batch_size` is set to 4096, which reduces parallelism from 10 to 1. If the performance is worse than expected, this may indicate that the batch size is too large or the input block size is too small. To reduce batch size, consider decreasing `batch_size` or use the default in `map_batches`. To increase input block size, consider decreasing `parallelism` in read.
Map Progress (1 actors 1 pending): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.01s/it]
Shuffle Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 109.23it/s]
Shuffle Reduce: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 195.17it/s]
2023-01-23 09:14:20,129	WARNING read_api.py:330 -- ⚠️  The number of blocks in this dataset (10) limits its parallelism to 10 concurrent tasks. This is much less than the number of available CPU slots in the cluster. Use `.repartition(n)` to increase the number of dataset blocks.
Read: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 1568.43it/s]
2023-01-23 09:14:20,148	WARNING compute.py:543 -- `batch_size` is set to 4096, which reduces parallelism from 10 to 1. If the performance is worse than expected, this may indicate that the batch size is too large or the input block size is too small. To reduce batch size, consider decreasing `batch_size` or use the default in `map_batches`. To increase input block size, consider decreasing `parallelism` in read.
Map Progress (1 actors 1 pending): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.98s/it]
Shuffle Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 155.07it/s]
Shuffle Reduce: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 275.09it/s]

 python/ray/train/tests/test_batch_predictor.py ✓                                                                                                                                                  100% ██████████

Results (10.72s):
       1 passed

@ericl
Copy link
Contributor

ericl commented Jan 26, 2023

Oh title() is super weird, I don't know why anyone would want it force-lowercasing the other chars. How about we use this replacement function instead?

>>> def capfirst(s):
...     return s[:1].upper() + s[1:]
... 
>>> def capitalize(s):
...    return "".join(capfirst(x) for x in s.split("_"))
... 
>>> capitalize("map_batches")
'MapBatches'
>>> capitalize(capitalize("map_batches"))
'MapBatches'

@peytondmurray
Copy link
Contributor Author

Sure, I'll try this and see what the impact on other tests is.

@peytondmurray peytondmurray force-pushed the dataset-progress-bar-names branch 2 times, most recently from 29a0e8a to 70d40f3 Compare January 26, 2023 20:35
@peytondmurray peytondmurray removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 27, 2023
@ericl ericl merged commit 3343c76 into ray-project:master Jan 27, 2023
@peytondmurray peytondmurray deleted the dataset-progress-bar-names branch January 27, 2023 17:24
ericl pushed a commit that referenced this pull request Jan 31, 2023
This PR is a quick fix to remove the non-useful comment introduced in #31526, probably during debugging.

Replace the comment with a meaningful one.
clarng pushed a commit to clarng/ray that referenced this pull request Jan 31, 2023
…ect#32020)

This PR is a quick fix to remove the non-useful comment introduced in ray-project#31526, probably during debugging.

Replace the comment with a meaningful one.
clarkzinzow pushed a commit that referenced this pull request Feb 10, 2023
…32411)

This is to fix the Dataset.__repr__ issue in #32410, after we introduce function name in #31526. We should only make operator/stage name to be camel case.

Signed-off-by: Cheng Su <[email protected]>
c21 added a commit to c21/ray that referenced this pull request Feb 10, 2023
…ay-project#32411)

This is to fix the Dataset.__repr__ issue in ray-project#32410, after we introduce function name in ray-project#31526. We should only make operator/stage name to be camel case.

Signed-off-by: Cheng Su <[email protected]>
cadedaniel pushed a commit that referenced this pull request Feb 10, 2023
…32411) (#32434)

This is to fix the Dataset.__repr__ issue in #32410, after we introduce function name in #31526. We should only make operator/stage name to be camel case.

Signed-off-by: Cheng Su <[email protected]>
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
Signed-off-by: pdmurray <[email protected]>

Signed-off-by: pdmurray <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
…ect#32020)

This PR is a quick fix to remove the non-useful comment introduced in ray-project#31526, probably during debugging.

Replace the comment with a meaningful one.

Signed-off-by: Edward Oakes <[email protected]>
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
…ay-project#32411)

This is to fix the Dataset.__repr__ issue in ray-project#32410, after we introduce function name in ray-project#31526. We should only make operator/stage name to be camel case.

Signed-off-by: Cheng Su <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Datasets] Give dataset progress bars descriptive names
5 participants