[AIR] Add experimental read_images
#28256
Conversation
Overall LGTM! This API looks much simpler — it hides `ImageSource` away from the user — and +1 on saving images as an Arrow type vs. pandas `TensorArray`.
Depending on the merge sequence, you might want to double-check the expected behavior of loading images of different sizes; AFAIK it should be supported after Clark's ragged tensor PR, which adds variable-shape Arrow support.
The rest are nits.
LGTM overall, mostly nits and small changes!
@@ -23,20 +23,17 @@ def preprocess(df: pd.DataFrame) -> pd.DataFrame:
         transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
     ]
 )
-df["image"] = [preprocess(x).numpy() for x in df["image"]]
-return df
+return pd.DataFrame({"image": [preprocess(image) for image in batch]})
If we're showing off a NumPy-only UDF, we shouldn't return a pandas DataFrame; instead, we can return a single ndarray (or dict of ndarrays, if we're wanting to change to a human-readable column name), which Datasets will convert back into a tabular format. This is both better UX for the UDF developer and should be more efficient under-the-hood (Datasets will represent the imagery tensor column in an Arrow Table rather than a Pandas DataFrame, which is more reliably zero-copy and has a smaller wire footprint).
Suggested change:

-return pd.DataFrame({"image": [preprocess(image) for image in batch]})
+return np.array([preprocess(image).numpy() for image in batch])
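As a sketch of the dict-of-ndarrays variant mentioned above (the `preprocess_batch` name and the pixel-scaling transform are illustrative stand-ins, not the PR's actual UDF):

```python
import numpy as np

def preprocess_batch(batch: np.ndarray) -> dict:
    # Hypothetical NumPy-only UDF: ndarray batch in, dict of ndarrays out.
    # Datasets converts the dict back into a tabular (Arrow) format, so the
    # "image" key becomes a tensor column without a pandas round-trip.
    # Placeholder transform: scale uint8 pixel values into [0, 1].
    return {"image": batch.astype(np.float32) / 255.0}

batch = np.zeros((8, 32, 32, 3), dtype=np.uint8)
out = preprocess_batch(batch)
```

Returning a dict (rather than a bare ndarray) also lets the UDF pick a human-readable column name.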
In fact, could we match what @jiaodong is doing in their NumPy narrow waist for prediction PR, where the torchvision transform is vectorized over the input ndarray? That should be doable with the current API, just need to do the same transpose as in that PR: https://github.com/ray-project/ray/pull/28917/files#diff-e2bccb297d421f0dcff1892c4f23993064f52b17710787c41c3a2ae9dbc84159
I.e. basically this:
import numpy as np
import torch
from torchvision import transforms

def preprocess(image_batch: np.ndarray) -> np.ndarray:
    """
    User PyTorch code to transform a user image batch, with an outer batch dimension.
    """
    preprocess = transforms.Compose(
        [
            # Torchvision's ToTensor does not accept an outer batch dimension
            transforms.CenterCrop(224),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        ]
    )
    # Outer dimension is batch size, e.g. (8, 256, 256, 3) -> (8, 3, 256, 256)
    image_batch = torch.as_tensor(image_batch.transpose(0, 3, 1, 2))
    return preprocess(image_batch).numpy()
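The transpose step alone can be sanity-checked with plain NumPy (the shapes here are illustrative):

```python
import numpy as np

# NHWC batch (batch, height, width, channels) -> NCHW (batch, channels,
# height, width), which is the layout torchvision transforms like
# Normalize expect for tensor input.
nhwc = np.zeros((8, 256, 256, 3))
nchw = nhwc.transpose(0, 3, 1, 2)
```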
I can't return an ndarray, because then I get:
Traceback (most recent call last):
File "/Users/balaji/Documents/GitHub/ray/doc/source/ray-air/examples/torch_image_batch_pretrained.py", line 39, in <module>
predictor.predict(dataset)
File "/Users/balaji/Documents/GitHub/ray/python/ray/train/batch_predictor.py", line 228, in predict
prediction_results = data.map_batches(
File "/Users/balaji/Documents/GitHub/ray/python/ray/data/dataset.py", line 561, in map_batches
return Dataset(plan, self._epoch, self._lazy)
File "/Users/balaji/Documents/GitHub/ray/python/ray/data/dataset.py", line 217, in __init__
self._plan.execute(allow_clear_input_blocks=False)
File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/plan.py", line 308, in execute
blocks, stage_info = stage(
File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/plan.py", line 662, in __call__
blocks = compute._apply(
File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/compute.py", line 378, in _apply
raise e
File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/compute.py", line 366, in _apply
new_metadata = ray.get(new_metadata)
File "/Users/balaji/Documents/GitHub/ray/python/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/Users/balaji/Documents/GitHub/ray/python/ray/_private/worker.py", line 2279, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): ray::BlockWorker.map_block_nosplit() (pid=18412, ip=127.0.0.1, repr=<ray.data._internal.compute.BlockWorker object at 0x11fa37010>)
File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/compute.py", line 274, in map_block_nosplit
return _map_block_nosplit(
File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/compute.py", line 439, in _map_block_nosplit
for new_block in block_fn(block, *fn_args, **fn_kwargs):
File "/Users/balaji/Documents/GitHub/ray/python/ray/data/dataset.py", line 523, in transform
applied = batch_fn(view, *fn_args, **fn_kwargs)
File "/Users/balaji/Documents/GitHub/ray/python/ray/train/batch_predictor.py", line 202, in __call__
prediction_output = self._predictor.predict(
File "/Users/balaji/Documents/GitHub/ray/python/ray/train/torch/torch_predictor.py", line 198, in predict
return super(TorchPredictor, self).predict(data=data, dtype=dtype)
File "/Users/balaji/Documents/GitHub/ray/python/ray/train/predictor.py", line 158, in predict
predictions_df = self._predict_pandas(data_df, **kwargs)
File "/Users/balaji/Documents/GitHub/ray/python/ray/train/_internal/dl_predictor.py", line 67, in _predict_pandas
tensors = convert_pandas_to_batch_type(
File "/Users/balaji/Documents/GitHub/ray/python/ray/air/util/data_batch_conversion.py", line 89, in convert_pandas_to_batch_type
data = _cast_ndarray_columns_to_tensor_extension(data)
File "/Users/balaji/Documents/GitHub/ray/python/ray/air/util/data_batch_conversion.py", line 217, in _cast_ndarray_columns_to_tensor_extension
for col_name, col in df.items():
AttributeError: 'numpy.ndarray' object has no attribute 'items'
Could we address this in a follow-up PR? I can create an issue to track.
Ah, I forgot that the preprocessor is going to be applied within the predictor, which doesn't have the NumPy narrow waist merged yet.
Since we need to convert NumPy ndarray batches to pandas DataFrame batches with `read_images()` now returning a tensor dataset, I suppose this is fine as-is, with the expectation that whichever PR is merged second will need to resolve merge conflicts and converge to what I gave above (ndarray in, ndarray out, vectorized torchvision transform).
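For context, the interim row-per-image pandas layout can be sketched with plain pandas (a plain object column here is a stand-in for AIR's `TensorArray` extension column):

```python
import numpy as np
import pandas as pd

# A batch of 4 images, each 8x8 with 3 channels.
batch = np.zeros((4, 8, 8, 3), dtype=np.uint8)

# One ndarray per row: this is the row-per-image shape the predictor path
# expects before the NumPy narrow waist is merged.
df = pd.DataFrame({"image": list(batch)})
```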
root = "example://image-folders/simple"
dataset = ray.data.read_datasource(
    ImageFolderDatasource(), root=root, size=(32, 32)

class TestReadImages:
Nit: Any reason for making this a test class rather than a flat set of module-level functions? Usually this is done if there's some shared setup or state.
Couple reasons:
- Better organization. `test_dataset_formats` uses a flat structure, and related tests are often far apart. For example, `read_text` and `read_text_remote_args` were 1500 lines apart. By grouping `read_images` tests in a class, we ensure that `read_images`-related tests are located close to each other.
- Shorter test names. `test_invalid_size` is easier to read than `test_read_images_invalid_size`.

Also, we do something similar with the checkpoint tests, so there's at least some precedent for this:

ray/python/ray/air/tests/test_checkpoints.py, line 159 in 2217f0c:

class CheckpointsConversionTest(unittest.TestCase):
Those points are understandable, but both could be solved (and are currently solved in this `test_dataset_image` case) by scoping a test module to a single abstraction. E.g., `test_dataset_image` is for the `ray.data.read_images()` API, so you wouldn't have the organization issue and you could have shorter test names, since the context is provided by the module. Right now you have a "test read images" module that has a single "test read images" class, which is a bit of a redundant hierarchy.

Also, note that `CheckpointsConversionTest` has setup and teardown code, so that test class is actually doing some shared stateful initialization that's reused by all of the tests. Doing this kind of stateful setup and teardown code in a test class is actually considered an anti-pattern in pytest land, since it promotes poor test isolation; this is instead supposed to be done via test fixtures.

In any case, we might add non-`ray.data.read_images()` tests to this module, and the redundant test class isn't adding much complexity, so I certainly wouldn't block on this, but I'd like to see us codify best practices in a guide before we start cargo-culting anti-patterns or redundant hierarchies as Ray testing idioms.
IMO we should only group tests into a class if it benefits from one of the following motivators called out in the pytest docs:
- Test organization
- Sharing fixtures for tests only in that particular class
- Applying marks at the class level and having them implicitly apply to all tests
(1) only applies if logically grouping tests within a module and a nested grouping of tests within a class makes sense (doesn't apply here yet for the reasons I gave above), (2) is nice for avoiding polluting the module namespace or conftest.py with a fixture that's only needed for a small subgroup of tests (we haven't worried much about this so far, as you can tell from our conftest.py), and (3) is a pattern that I think we should adopt instead of having global scoped variables and reusing across parametrization decorators.
For a general guide, I'd be in favor of something like:
- source module --> test module
- source abstraction (function or class) --> test module if single abstraction in module, otherwise test class
- source class method --> test class (pytest supports nested test classes AFAIK)
I think that patterns of shared fixtures and the like should emerge naturally from that mapping.
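A minimal sketch of motivators (2) and (3), assuming pytest is available (the `slow` mark, `image_dir` fixture, and test names are all hypothetical):

```python
import pytest

@pytest.mark.slow  # class-level mark: implicitly applies to every test in the class
class TestReadImages:
    @pytest.fixture
    def image_dir(self, tmp_path):
        # A fixture visible only to tests in this class, replacing
        # unittest-style setUp/tearDown; pytest tears down tmp_path for us.
        d = tmp_path / "images"
        d.mkdir()
        return d

    def test_dir_exists(self, image_dir):
        assert image_dir.exists()
```

This keeps the fixture out of the module namespace and conftest.py while still sharing it across the grouped tests.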
> In any case, we might add non-`ray.data.read_images()` tests to this module

Yeah, this is what I'm concerned about.

If we add non-`ray.data.read_images` tests, I doubt we're going to go back and group the `read_images` tests in a class. Whereas, if we start with a class grouping, I figure people will pattern-match the existing tests and group new tests under a class. This should avoid the `dataset_formats` situation where we have many disorganized tests.

> For a general guide, I'd be in favor of something like:
> - source module --> test module
> - source abstraction (function or class) --> test module if single abstraction in module, otherwise test class
> - source class method --> test class (pytest supports nested test classes AFAIK)

I agree with this except for the "test module if single abstraction in module". I think in practice people will always pattern-match existing tests, so if we start with a flat structure, people will continue to use that (even if we test unrelated abstractions).
LGTM! Should merge master since that may fix existing test failures.
Closing in favor of #29177
Signed-off-by: Balaji Veeramani <[email protected]>
Depends on:
- `partitioning` parameter to `read_` functions (#28413)

Why are these changes needed?

Users can't discover `ImageFolderDatasource`. This PR adds a more-discoverable way to read images.

Related issue number

Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.