[AIR] Add experimental read_images
#28256
Conversation
Overall LGTM! This API looks much simpler — it hides `ImageSource` away from the user — and +1 on saving images as an Arrow type vs. pandas `TensorArray`.
Depending on the merge sequence, you might want to double-check the expected behavior of loading images of different sizes; AFAIK it should be supported after Clark's ragged tensor PR, which adds variable-shape Arrow support.
The rest are nits.
LGTM overall, mostly nits and small changes!
@@ -23,20 +23,17 @@ def preprocess(df: pd.DataFrame) -> pd.DataFrame:
         transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
     ]
 )
-df["image"] = [preprocess(x).numpy() for x in df["image"]]
-return df
+return pd.DataFrame({"image": [preprocess(image) for image in batch]})
If we're showing off a NumPy-only UDF, we shouldn't return a pandas DataFrame; instead, we can return a single ndarray (or dict of ndarrays, if we're wanting to change to a human-readable column name), which Datasets will convert back into a tabular format. This is both better UX for the UDF developer and should be more efficient under-the-hood (Datasets will represent the imagery tensor column in an Arrow Table rather than a Pandas DataFrame, which is more reliably zero-copy and has a smaller wire footprint).
Suggested change:

-return pd.DataFrame({"image": [preprocess(image) for image in batch]})
+return np.array([preprocess(image).numpy() for image in batch])
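As a sketch of the dict-of-ndarrays variant mentioned above (the `preprocess_batch` name and the pixel-scaling transform are illustrative stand-ins, not the PR's actual UDF):

```python
import numpy as np

def preprocess_batch(batch: np.ndarray) -> dict:
    # Hypothetical NumPy-only UDF: ndarray batch in, dict of ndarrays out.
    # Datasets converts the dict back into a tabular (Arrow) format, so the
    # "image" key becomes a tensor column without a pandas round-trip.
    # Placeholder transform: scale uint8 pixel values into [0, 1].
    return {"image": batch.astype(np.float32) / 255.0}

batch = np.zeros((8, 32, 32, 3), dtype=np.uint8)
out = preprocess_batch(batch)
```

Returning a dict (rather than a bare ndarray) also lets the UDF pick a human-readable column name.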
In fact, could we match what @jiaodong is doing in their NumPy narrow waist for prediction PR, where the torchvision transform is vectorized over the input ndarray? That should be doable with the current API, just need to do the same transpose as in that PR: https://github.com/ray-project/ray/pull/28917/files#diff-e2bccb297d421f0dcff1892c4f23993064f52b17710787c41c3a2ae9dbc84159
I.e. basically this:
import numpy as np
import torch
from torchvision import transforms

def preprocess(image_batch: np.ndarray) -> np.ndarray:
    """
    User PyTorch code to transform a user image batch, with an outer batch dimension.
    """
    preprocess = transforms.Compose(
        [
            # Torchvision's ToTensor does not accept an outer batch dimension
            transforms.CenterCrop(224),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        ]
    )
    # Outer dimension is batch size, e.g. (8, 256, 256, 3) -> (8, 3, 256, 256)
    image_batch = torch.as_tensor(image_batch.transpose(0, 3, 1, 2))
    return preprocess(image_batch).numpy()
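The transpose step alone can be sanity-checked with plain NumPy (the shapes here are illustrative):

```python
import numpy as np

# NHWC batch (batch, height, width, channels) -> NCHW (batch, channels,
# height, width), which is the layout torchvision transforms like
# Normalize expect for tensor input.
nhwc = np.zeros((8, 256, 256, 3))
nchw = nhwc.transpose(0, 3, 1, 2)
```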
I can't return an ndarray, because then I get:
Traceback (most recent call last):
File "/Users/balaji/Documents/GitHub/ray/doc/source/ray-air/examples/torch_image_batch_pretrained.py", line 39, in <module>
predictor.predict(dataset)
File "/Users/balaji/Documents/GitHub/ray/python/ray/train/batch_predictor.py", line 228, in predict
prediction_results = data.map_batches(
File "/Users/balaji/Documents/GitHub/ray/python/ray/data/dataset.py", line 561, in map_batches
return Dataset(plan, self._epoch, self._lazy)
File "/Users/balaji/Documents/GitHub/ray/python/ray/data/dataset.py", line 217, in __init__
self._plan.execute(allow_clear_input_blocks=False)
File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/plan.py", line 308, in execute
blocks, stage_info = stage(
File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/plan.py", line 662, in __call__
blocks = compute._apply(
File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/compute.py", line 378, in _apply
raise e
File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/compute.py", line 366, in _apply
new_metadata = ray.get(new_metadata)
File "/Users/balaji/Documents/GitHub/ray/python/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/Users/balaji/Documents/GitHub/ray/python/ray/_private/worker.py", line 2279, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): ray::BlockWorker.map_block_nosplit() (pid=18412, ip=127.0.0.1, repr=<ray.data._internal.compute.BlockWorker object at 0x11fa37010>)
File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/compute.py", line 274, in map_block_nosplit
return _map_block_nosplit(
File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/compute.py", line 439, in _map_block_nosplit
for new_block in block_fn(block, *fn_args, **fn_kwargs):
File "/Users/balaji/Documents/GitHub/ray/python/ray/data/dataset.py", line 523, in transform
applied = batch_fn(view, *fn_args, **fn_kwargs)
File "/Users/balaji/Documents/GitHub/ray/python/ray/train/batch_predictor.py", line 202, in __call__
prediction_output = self._predictor.predict(
File "/Users/balaji/Documents/GitHub/ray/python/ray/train/torch/torch_predictor.py", line 198, in predict
return super(TorchPredictor, self).predict(data=data, dtype=dtype)
File "/Users/balaji/Documents/GitHub/ray/python/ray/train/predictor.py", line 158, in predict
predictions_df = self._predict_pandas(data_df, **kwargs)
File "/Users/balaji/Documents/GitHub/ray/python/ray/train/_internal/dl_predictor.py", line 67, in _predict_pandas
tensors = convert_pandas_to_batch_type(
File "/Users/balaji/Documents/GitHub/ray/python/ray/air/util/data_batch_conversion.py", line 89, in convert_pandas_to_batch_type
data = _cast_ndarray_columns_to_tensor_extension(data)
File "/Users/balaji/Documents/GitHub/ray/python/ray/air/util/data_batch_conversion.py", line 217, in _cast_ndarray_columns_to_tensor_extension
for col_name, col in df.items():
AttributeError: 'numpy.ndarray' object has no attribute 'items'
Could we address this in a follow-up PR? I can create an issue to track.
Ah, I forgot that the preprocessor is going to be applied within the predictor, which doesn't have the NumPy narrow waist merged yet.
Since we need to convert NumPy ndarray batches to pandas DataFrame batches with `read_images()` now returning a tensor dataset, I suppose this is fine as-is, with the expectation that whichever PR is merged second will need to resolve merge conflicts and converge to what I gave above (ndarray in, ndarray out, vectorized torchvision transform).
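For context, the interim row-per-image pandas layout can be sketched with plain pandas (a plain object column here is a stand-in for AIR's `TensorArray` extension column):

```python
import numpy as np
import pandas as pd

# A batch of 4 images, each 8x8 with 3 channels.
batch = np.zeros((4, 8, 8, 3), dtype=np.uint8)

# One ndarray per row: this is the row-per-image shape the predictor path
# expects before the NumPy narrow waist is merged.
df = pd.DataFrame({"image": list(batch)})
```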
root = "example://image-folders/simple"
dataset = ray.data.read_datasource(
    ImageFolderDatasource(), root=root, size=(32, 32)

class TestReadImages:
Nit: Any reason for making this a test class rather than a flat set of module-level functions? Usually this is done if there's some shared setup or state.
Couple reasons:
- Better organization. `test_dataset_formats` uses a flat structure, and related tests are often far apart. For example, `read_text` and `read_text_remote_args` were 1500 lines apart. By grouping `read_images` tests in a class, we ensure that `read_images`-related tests are located close to each other.
- Shorter test names. `test_invalid_size` is easier to read than `test_read_images_invalid_size`.

Also, we do something similar with the checkpoint tests, so there's at least some precedent for this:

ray/python/ray/air/tests/test_checkpoints.py, line 159 in 2217f0c:

class CheckpointsConversionTest(unittest.TestCase):
Those points are understandable, but both could be solved (and are currently solved in this `test_dataset_image` case) by scoping a test module to a single abstraction. E.g., `test_dataset_image` is for the `ray.data.read_images()` API, so you wouldn't have the organization issue and you could have shorter test names, since the context is provided by the module. Right now you have a "test read images" module that has a single "test read images" class, which is a bit of a redundant hierarchy.

Also, note that `CheckpointsConversionTest` has setup and teardown code, so that test class is actually doing some shared stateful initialization that's reused by all of the tests. Doing this kind of stateful setup and teardown code in a test class is actually considered an anti-pattern in pytest land, since it promotes poor test isolation; this is instead supposed to be done via test fixtures.

In any case, we might add non-`ray.data.read_images()` tests to this module, and the redundant test class isn't adding much complexity, so I certainly wouldn't block on this, but I'd like to see us codify best practices in a guide before we start cargo-culting anti-patterns or redundant hierarchies as Ray testing idioms.
IMO we should only group tests into a class if it benefits from one of the following motivators called out in the pytest docs:
- Test organization
- Sharing fixtures for tests only in that particular class
- Applying marks at the class level and having them implicitly apply to all tests
(1) only applies if logically grouping tests within a module and a nested grouping of tests within a class makes sense (doesn't apply here yet for the reasons I gave above), (2) is nice for avoiding polluting the module namespace or conftest.py with a fixture that's only needed for a small subgroup of tests (we haven't worried much about this so far, as you can tell from our conftest.py), and (3) is a pattern that I think we should adopt instead of having global scoped variables and reusing across parametrization decorators.
For a general guide, I'd be in favor of something like:
- source module --> test module
- source abstraction (function or class) --> test module if single abstraction in module, otherwise test class
- source class method --> test class (pytest supports nested test classes AFAIK)
I think that patterns of shared fixtures and the like should emerge naturally from that mapping.
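A minimal sketch of motivators (2) and (3), assuming pytest is available (the `slow` mark, `image_dir` fixture, and test names are all hypothetical):

```python
import pytest

@pytest.mark.slow  # class-level mark: implicitly applies to every test in the class
class TestReadImages:
    @pytest.fixture
    def image_dir(self, tmp_path):
        # A fixture visible only to tests in this class, replacing
        # unittest-style setUp/tearDown; pytest tears down tmp_path for us.
        d = tmp_path / "images"
        d.mkdir()
        return d

    def test_dir_exists(self, image_dir):
        assert image_dir.exists()
```

This keeps the fixture out of the module namespace and conftest.py while still sharing it across the grouped tests.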
> In any case, we might add non-`ray.data.read_images()` tests to this module

Yeah, this is what I'm concerned about.

If we add non-`ray.data.read_images` tests, I doubt we're going to go back and group the `read_images` tests in a class. Whereas, if we start with a class grouping, I figure people will pattern-match the existing tests and group new tests under a class. This should avoid the `dataset_formats` situation where we have many disorganized tests.

> For a general guide, I'd be in favor of something like:
> - source module --> test module
> - source abstraction (function or class) --> test module if single abstraction in module, otherwise test class
> - source class method --> test class (pytest supports nested test classes AFAIK)

I agree with this except for the "test module if single abstraction in module". I think in practice people will always pattern-match existing tests, so if we start with a flat structure, people will continue to use that (even if we test unrelated abstractions).
LGTM! Should merge master since that may fix existing test failures.
Closing in favor of #29177
Signed-off-by: Balaji Veeramani <[email protected]>
Depends on:
- `partitioning` parameter to `read_` functions (#28413)

Why are these changes needed?

Users can't discover `ImageFolderDatasource`. This PR adds a more-discoverable way to read images.

Related issue number

Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.