
[AIR] Add experimental read_images #28256

Closed

Conversation

@bveeramani (Member) commented Sep 2, 2022

Signed-off-by: Balaji Veeramani [email protected]

Depends on:

Why are these changes needed?

Users can't discover ImageFolderDatasource. This PR adds a more-discoverable way to read images.
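
For example, reading images could look like this (a minimal sketch; the API is experimental, so the exact signature may differ — the example path is taken from this PR's tests):

import ray

# Hypothetical usage of the new API; exact signature may change while it's experimental.
dataset = ray.data.read_images("example://image-folders/simple")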

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@bveeramani bveeramani marked this pull request as ready for review October 4, 2022 01:45
@bveeramani bveeramani added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Oct 4, 2022
@jiaodong (Member) left a comment

Overall LGTM! This API looks much simpler, hiding ImageSource away from the user, and +1 on saving images as an Arrow type vs. pandas TensorArray.

Depending on the merging sequence, you might want to double-check the expected behavior of loading images of different sizes; AFAIK it should be supported after Clark's ragged tensor PR, which adds variable-shape Arrow tensor support.

Others are nits.

@clarkzinzow (Contributor) left a comment

LGTM overall, mostly nits and small changes!

@@ -23,20 +23,17 @@ def preprocess(df: pd.DataFrame) -> pd.DataFrame:
             transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
         ]
     )
-    df["image"] = [preprocess(x).numpy() for x in df["image"]]
-    return df
+    return pd.DataFrame({"image": [preprocess(image) for image in batch]})
@clarkzinzow (Contributor):
If we're showing off a NumPy-only UDF, we shouldn't return a pandas DataFrame; instead, we can return a single ndarray (or a dict of ndarrays, if we want to change to a human-readable column name), which Datasets will convert back into a tabular format. This is both better UX for the UDF developer and should be more efficient under the hood (Datasets will represent the imagery tensor column in an Arrow Table rather than a pandas DataFrame, which is more reliably zero-copy and has a smaller wire footprint).

Suggested change:
-return pd.DataFrame({"image": [preprocess(image) for image in batch]})
+return np.array([preprocess(image).numpy() for image in batch])
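
For the dict-of-ndarrays option mentioned above, a minimal sketch (the "image" column name is illustrative):

return {"image": np.array([preprocess(image).numpy() for image in batch])}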

@clarkzinzow (Contributor):

In fact, could we match what @jiaodong is doing in their NumPy narrow waist for prediction PR, where the torchvision transform is vectorized over the input ndarray? That should be doable with the current API; we just need to do the same transpose as in that PR: https://github.com/ray-project/ray/pull/28917/files#diff-e2bccb297d421f0dcff1892c4f23993064f52b17710787c41c3a2ae9dbc84159

@clarkzinzow (Contributor):

I.e. basically this:

import numpy as np
import torch
from torchvision import transforms


def preprocess(image_batch: np.ndarray) -> np.ndarray:
    """
    User PyTorch code to transform a batch of images, with the outer dimension being batch size.
    """
    composed = transforms.Compose(
        [
            # Torchvision's ToTensor does not accept an outer batch dimension
            transforms.CenterCrop(224),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        ]
    )
    # Outer dimension is batch size, e.g. (8, 256, 256, 3) -> (8, 3, 256, 256)
    image_batch = torch.as_tensor(image_batch.transpose(0, 3, 1, 2))
    return composed(image_batch).numpy()

@bveeramani (Member, Author):

I can't return an ndarray, because then I get:

Traceback (most recent call last):
  File "/Users/balaji/Documents/GitHub/ray/doc/source/ray-air/examples/torch_image_batch_pretrained.py", line 39, in <module>
    predictor.predict(dataset)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/train/batch_predictor.py", line 228, in predict
    prediction_results = data.map_batches(
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/dataset.py", line 561, in map_batches
    return Dataset(plan, self._epoch, self._lazy)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/dataset.py", line 217, in __init__
    self._plan.execute(allow_clear_input_blocks=False)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/plan.py", line 308, in execute
    blocks, stage_info = stage(
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/plan.py", line 662, in __call__
    blocks = compute._apply(
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/compute.py", line 378, in _apply
    raise e
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/compute.py", line 366, in _apply
    new_metadata = ray.get(new_metadata)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/_private/worker.py", line 2279, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): ray::BlockWorker.map_block_nosplit() (pid=18412, ip=127.0.0.1, repr=<ray.data._internal.compute.BlockWorker object at 0x11fa37010>)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/compute.py", line 274, in map_block_nosplit
    return _map_block_nosplit(
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/compute.py", line 439, in _map_block_nosplit
    for new_block in block_fn(block, *fn_args, **fn_kwargs):
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/dataset.py", line 523, in transform
    applied = batch_fn(view, *fn_args, **fn_kwargs)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/train/batch_predictor.py", line 202, in __call__
    prediction_output = self._predictor.predict(
  File "/Users/balaji/Documents/GitHub/ray/python/ray/train/torch/torch_predictor.py", line 198, in predict
    return super(TorchPredictor, self).predict(data=data, dtype=dtype)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/train/predictor.py", line 158, in predict
    predictions_df = self._predict_pandas(data_df, **kwargs)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/train/_internal/dl_predictor.py", line 67, in _predict_pandas
    tensors = convert_pandas_to_batch_type(
  File "/Users/balaji/Documents/GitHub/ray/python/ray/air/util/data_batch_conversion.py", line 89, in convert_pandas_to_batch_type
    data = _cast_ndarray_columns_to_tensor_extension(data)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/air/util/data_batch_conversion.py", line 217, in _cast_ndarray_columns_to_tensor_extension
    for col_name, col in df.items():
AttributeError: 'numpy.ndarray' object has no attribute 'items'

Could we address this in a follow-up PR? I can create an issue to track.

@clarkzinzow (Contributor):

Ah, I forgot that the preprocessor is going to be applied within the predictor, which doesn't have the NumPy narrow waist merged yet.

Since we need to convert NumPy ndarray batches to pandas DataFrame batches now that read_images() returns a tensor dataset, I suppose this is fine as-is, with the expectation that whichever PR is merged second will need to resolve merge conflicts and converge on what I gave above (ndarray in, ndarray out, vectorized torchvision transform).

root = "example://image-folders/simple"
dataset = ray.data.read_datasource(
ImageFolderDatasource(), root=root, size=(32, 32)
class TestReadImages:
@clarkzinzow (Contributor) commented Oct 5, 2022

Nit: Any reason for making this a test class rather than a flat set of module-level functions? Usually this is done if there's some shared setup or state.

@bveeramani (Member, Author):

A couple of reasons:

  • Better organization. test_dataset_formats uses a flat structure, and related tests are often far apart. For example, read_text and read_text_remote_args were 1500 lines apart. By grouping read_images tests in a class, we ensure that read_images-related tests are located close to each other.
  • Shorter test names. test_invalid_size is easier to read than test_read_images_invalid_size.

Also, we do something similar with the checkpoint tests, so there's at least some precedent for this:

class CheckpointsConversionTest(unittest.TestCase):

@clarkzinzow (Contributor) commented Oct 6, 2022

Those points are understandable, but both could be solved (and are currently solved in this test_dataset_image case) by scoping the tests in a module to a single abstraction. E.g. test_dataset_image is for the ray.data.read_images() API, so you wouldn't have the organization issue and you could have shorter test names, since the context is provided by the module. Right now you have a "test read images" module containing a single "test read images" class, which is a bit of a redundant hierarchy.

Also, note that CheckpointsConversionTest has setup and teardown code, so that test class is actually doing shared stateful initialization that's reused by all of its tests. Doing this kind of stateful setup and teardown in a test class is considered an anti-pattern in pytest land, since it promotes poor test isolation; it's instead supposed to be done via test fixtures.

In any case, we might add non-ray.data.read_images() tests to this module, and the redundant test class isn't adding much complexity, so I certainly wouldn't block on this, but I'd like to see us codify best practices in a guide before we start cargo-culting anti-patterns or redundant hierarchies as Ray testing idioms.

IMO we should only group tests into a class if it benefits from one of the following motivators called out in the pytest docs:

  1. Test organization
  2. Sharing fixtures for tests only in that particular class
  3. Applying marks at the class level and having them implicitly apply to all tests

(1) only applies if logically grouping tests within a module, with a nested grouping of tests within a class, makes sense (which doesn't apply here yet, for the reasons I gave above). (2) is nice for avoiding polluting the module namespace or conftest.py with a fixture that's only needed for a small subgroup of tests (we haven't worried much about this so far, as you can tell from our conftest.py). And (3) is a pattern I think we should adopt instead of having globally scoped variables reused across parametrization decorators.
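
As a minimal sketch of motivators (2) and (3), assuming plain pytest (the fixture and mark names are illustrative, and a custom mark like slow would also need to be registered, e.g. in pytest.ini):

import pytest


@pytest.mark.slow  # (3) a class-level mark implicitly applies to every test in the class
class TestReadImagesSketch:
    @pytest.fixture
    def image_dir(self, tmp_path):
        # (2) a fixture defined on the class is only visible to tests in this class
        (tmp_path / "image.png").touch()
        return tmp_path

    def test_finds_files(self, image_dir):
        assert any(image_dir.iterdir())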

For a general guide, I'd be in favor of something like:

  • source module --> test module
  • source abstraction (function or class) --> test module if single abstraction in module, otherwise test class
  • source class method --> test class (pytest supports nested test classes AFAIK)

I think that patterns of shared fixtures and the like should emerge naturally from that mapping.

@bveeramani (Member, Author):

In any case, we might add non-ray.data.read_images() tests to this module

Yeah, this is what I'm concerned about.

If we add non-ray.data.read_images tests, I doubt we're going to go back and group the read_images tests in a class. Whereas, if we start with a class grouping, I figure people will pattern-match the existing tests and group new tests under a class. This should avoid the dataset_formats situation where we have many disorganized tests.

For a general guide, I'd be in favor of something like:

  • source module --> test module
  • source abstraction (function or class) --> test module if single abstraction in module, otherwise test class
  • source class method --> test class (pytest supports nested test classes AFAIK)

I agree with this, except for the "test module if single abstraction in module" part. I think in practice people will always pattern-match existing tests, so if we start with a flat structure, people will continue to use it (even if we test unrelated abstractions).

@matthewdeng matthewdeng added this to the Ray 2.1 milestone Oct 7, 2022
@c21 c21 added the Ray 2.1 label Oct 7, 2022
@clarkzinzow (Contributor) left a comment

LGTM! Should merge master since that may fix existing test failures.

@clarkzinzow (Contributor) commented:

Closing in favor of #29177

@clarkzinzow clarkzinzow closed this Oct 7, 2022