[AIR] Add experimental read_images #28256

Closed
80 commits
5e50b46
Add experimental `read_images`
bveeramani Sep 2, 2022
675ca6c
Merge branch 'master' into bveeramani/read-images
bveeramani Sep 6, 2022
b8d3974
Mark as experimental
bveeramani Sep 6, 2022
4f1d5d7
Rename `PathPartitionScheme` as `Partitioning`
bveeramani Sep 9, 2022
9afc041
Update input_output.rst
bveeramani Sep 9, 2022
d6b2667
Update partitioning.py
bveeramani Sep 9, 2022
517c390
Update partitioning.py
bveeramani Sep 9, 2022
d7a2ae3
Add CSV tests
bveeramani Sep 9, 2022
9416d3c
Merge remote-tracking branch 'upstream/master' into bveeramani/partition
bveeramani Sep 9, 2022
e9a9c5c
Merge remote-tracking branch 'upstream/master' into bveeramani/partition
bveeramani Sep 9, 2022
644878f
Support `None` field name
bveeramani Sep 9, 2022
9c65eb9
Update test_partitioning.py
bveeramani Sep 9, 2022
7372987
Merge branch 'bveeramani/dir-partitioning' into bveeramani/partition
bveeramani Sep 9, 2022
6980079
Merge stuff
bveeramani Sep 9, 2022
2253c47
Move code to `FileBasedDatasource`
bveeramani Sep 9, 2022
d34acc9
Delete tmp.csv
bveeramani Sep 9, 2022
0cfeb58
Merge remote-tracking branch 'upstream/master' into bveeramani/partition
bveeramani Sep 15, 2022
38ba956
Add files
bveeramani Sep 15, 2022
308bc68
Appease lint
bveeramani Sep 15, 2022
a8432e4
Update csv_datasource.py
bveeramani Sep 15, 2022
b5657a8
Delete test_csv_partitioning.py
bveeramani Sep 15, 2022
f96a498
Update file_based_datasource.py
bveeramani Sep 15, 2022
44ec745
Rename
bveeramani Sep 15, 2022
00aac7d
Make changes
bveeramani Sep 15, 2022
a2f2ab0
Appease lint
bveeramani Sep 15, 2022
3fd0aac
Update read_api.py
bveeramani Sep 15, 2022
e0cb06a
Add Numpy
bveeramani Sep 15, 2022
4f08b73
Update files
bveeramani Sep 15, 2022
a839514
Update read_api.py
bveeramani Sep 16, 2022
fc087f1
Update files
bveeramani Sep 16, 2022
bca3925
Merge remote-tracking branch 'upstream/master' into bveeramani/read-i…
bveeramani Sep 16, 2022
5f7ea9f
Merge branch 'bveeramani/partition' into bveeramani/read-images
bveeramani Sep 16, 2022
34b016f
Update read_api.py
bveeramani Sep 19, 2022
e4eb840
Update error messages
bveeramani Sep 19, 2022
3f1c361
Temp
bveeramani Sep 19, 2022
9924029
Merge branch 'bveeramani/partition' into bveeramani/read-images
bveeramani Sep 19, 2022
5d7b7fe
Update files
bveeramani Sep 19, 2022
e4a2cb9
Bug fix and lint
bveeramani Sep 19, 2022
0715fc8
Update files
bveeramani Sep 19, 2022
d7fccfa
Appease lint and fix install
bveeramani Sep 19, 2022
7f88436
Merge branch 'bveeramani/partition' into bveeramani/read-images
bveeramani Sep 19, 2022
edf1b9f
Fix parameter
bveeramani Sep 19, 2022
578edc2
Update creating-datasets.rst
bveeramani Sep 19, 2022
249bafc
Fix test
bveeramani Sep 20, 2022
27d9a59
Address review comments
bveeramani Sep 23, 2022
c993f2d
Update test_dataset_formats.py
bveeramani Sep 23, 2022
65dc78f
Merge branch 'master' into bveeramani/partition
bveeramani Sep 23, 2022
92d6af5
Update test_dataset_formats.py
bveeramani Sep 23, 2022
8dc0501
Update test_dataset_formats.py
bveeramani Sep 23, 2022
343c995
Merge branch 'master' into bveeramani/partition
bveeramani Sep 26, 2022
29ed734
Update test_dataset_formats.py
bveeramani Sep 26, 2022
0ef5585
Update python/ray/data/datasource/text_datasource.py
bveeramani Sep 28, 2022
2fb3451
Update python/ray/data/tests/test_dataset_formats.py
bveeramani Sep 28, 2022
baf096e
Address review comments
bveeramani Sep 28, 2022
a3d5729
Update test_partitioning.py
bveeramani Sep 28, 2022
ef2e79e
Address review comments
bveeramani Sep 28, 2022
fbf2bb1
Merge remote-tracking branch 'upstream/master' into bveeramani/partition
bveeramani Sep 28, 2022
01be922
Merge branch 'master' into bveeramani/read-images
bveeramani Sep 29, 2022
6f6855d
Update test_dataset_image.py
bveeramani Sep 29, 2022
c3cdf7b
Merge branch 'master' into bveeramani/partition
bveeramani Sep 29, 2022
5eaa52b
Tests
bveeramani Sep 29, 2022
0604d3a
Delete x.npy
bveeramani Sep 29, 2022
50f99ca
Appease lint
bveeramani Sep 29, 2022
b1d9b33
Merge branch 'bveeramani/partition' into bveeramani/read-images
bveeramani Sep 29, 2022
2f65750
Delete model
bveeramani Sep 29, 2022
2d23510
Update pytorch_training_e2e.py
bveeramani Sep 29, 2022
3138d7b
Merge branch 'master' into bveeramani/read-images
bveeramani Oct 4, 2022
2dfd0fd
Appease lint
bveeramani Oct 4, 2022
ad8f81c
Minor fixes
bveeramani Oct 4, 2022
151309b
Update documentation
bveeramani Oct 4, 2022
d827ccb
Remove references
bveeramani Oct 4, 2022
5d6af8b
Update creating-datasets.rst
bveeramani Oct 4, 2022
46f0292
Update read_benchmark.py
bveeramani Oct 4, 2022
9c1c277
Minor fixes
bveeramani Oct 4, 2022
208089b
Fix CI
bveeramani Oct 4, 2022
ddd342f
Update read_api.py
bveeramani Oct 4, 2022
0dc6dbe
Address review comments
bveeramani Oct 6, 2022
0bf9734
Merge branch 'master' into bveeramani/read-images
bveeramani Oct 6, 2022
bdae9f4
Merge branch 'master' into bveeramani/read-images
bveeramani Oct 6, 2022
4c98cf8
Update test_dataset_image.py
bveeramani Oct 6, 2022
Binary file added doc/model
Binary file not shown.
7 changes: 6 additions & 1 deletion doc/source/data/api/input_output.rst
@@ -48,6 +48,11 @@ Text

.. autofunction:: ray.data.read_text

Images (experimental)
---------------------

.. autofunction:: ray.data.read_images

Binary
------

@@ -214,4 +219,4 @@ MetadataProvider API
:members:

.. autoclass:: ray.data.datasource.FastFileMetadataProvider
:members:
:members:
37 changes: 24 additions & 13 deletions doc/source/data/creating-datasets.rst
@@ -162,6 +162,30 @@ Supported File Formats

See the API docs for :func:`read_text() <ray.data.read_text>`.

.. tabbed:: Images (experimental)

If your directory structure is:

.. code-block::

root/dog/xxx.png
root/dog/xxy.png
root/dog/[...]/xxz.png

root/cat/123.png
root/cat/nsdf3.png
root/cat/[...]/asd932_.png

Then call :func:`~ray.data.read_images` to load your images into a ``Dataset``.

.. literalinclude:: ./doc_code/creating_datasets.py
:language: python
:start-after: __read_images_begin__
:end-before: __read_images_end__

For more information on working with tensors, see our
:ref:`tensor data guide <datasets_tensor_support>`.
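
The directory layout in this tab suggests that each image's label is the name of the first directory below `root`, even when the file sits in a deeper subdirectory (`root/dog/[...]/xxz.png`). The following is a minimal sketch of that mapping under that assumption. The helper name `label_from_path` is hypothetical and is not part of Ray's API; it only illustrates the idea with the standard library.

```python
from pathlib import PurePosixPath


def label_from_path(path: str, root: str) -> str:
    """Return the first directory name below ``root`` (hypothetical helper)."""
    # relative_to() raises ValueError if ``path`` is not located under ``root``.
    relative = PurePosixPath(path).relative_to(PurePosixPath(root))
    if len(relative.parts) < 2:
        # A file directly under the root has no class directory.
        raise ValueError(f"{path!r} is not inside a class directory under {root!r}")
    return relative.parts[0]


paths = [
    "root/dog/xxx.png",
    "root/dog/nested/xxz.png",
    "root/cat/123.png",
]
labels = [label_from_path(p, "root") for p in paths]
print(labels)  # ['dog', 'dog', 'cat']
```

Taking only the first path component keeps nested files in the same class, which matches the `[...]` entries in the layout.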

.. tabbed:: Binary

Read binary files into a ``Dataset``. Each binary file will be treated as a single row
@@ -518,19 +542,6 @@ converts it into a Ray Dataset directly.
ray_datasets["train"].take(2)
# [{'text': ''}, {'text': ' = Valkyria Chronicles III = \n'}]

.. _datasets_from_images:

-------------------------------
From Image Files (experimental)
-------------------------------

Load image data stored as individual files using :py:class:`~ray.data.datasource.ImageFolderDatasource`:

.. literalinclude:: ./doc_code/tensor.py
:language: python
:start-after: __create_images_begin__
:end-before: __create_images_end__

.. _datasets_custom_datasource:

------------------
2 changes: 1 addition & 1 deletion doc/source/data/dataset-tensor-support.rst
@@ -89,7 +89,7 @@ This section shows how to create single and multi-column Tensor datasets.

.. tabbed:: Images (experimental)

Load image data stored as individual files using :py:class:`~ray.data.datasource.ImageFolderDatasource`:
Load image data stored as individual files using :func:`~ray.data.read_images`:

**Image and label columns**:

5 changes: 4 additions & 1 deletion doc/source/data/dataset.rst
@@ -110,7 +110,7 @@ Advanced users can refer directly to the Ray Datasets :ref:`API reference <data-
:text: Start Using Ray Datasets
:classes: btn-outline-info btn-block
---

**Examples**
^^^

@@ -200,6 +200,9 @@ Supported Input Formats
* - Text Files
- :func:`ray.data.read_text()`
- ✅
* - Image Files (experimental)
- :func:`ray.data.read_images()`
- 🚧
* - Binary Files
- :func:`ray.data.read_binary_files()`
- ✅
27 changes: 27 additions & 0 deletions doc/source/data/doc_code/creating_datasets.py
@@ -149,6 +149,33 @@
# __from_numpy_end__
# fmt: on

# fmt: off
# __read_images_begin__
ds = ray.data.read_images(root="example://image-folders/simple", size=(128, 128))
# -> Dataset(num_blocks=3, num_rows=3,
# schema={image: TensorDtype(shape=(128, 128, 3), dtype=uint8),
# label: object})

ds.take(1)
# -> [{'image':
# array([[[ 92, 71, 57],
# [107, 87, 72],
# ...,
# [141, 161, 185],
# [139, 158, 184]],
#
# ...,
#
# [[135, 135, 109],
# [135, 135, 108],
# ...,
# [167, 150, 89],
# [165, 146, 90]]], dtype=uint8),
# 'label': 'cat',
# }]
# __read_images_end__
# fmt: on

# fmt: off
# __from_numpy_mult_begin__
import numpy as np
5 changes: 1 addition & 4 deletions doc/source/data/doc_code/tensor.py
@@ -194,10 +194,7 @@ def cast_udf(block: pa.Table) -> pa.Table:
ds.fully_executed()

# __create_images_begin__
from ray.data.datasource import ImageFolderDatasource

ds = ray.data.read_datasource(
ImageFolderDatasource(), root="example://image-folders/simple", size=(128, 128))
ds = ray.data.read_images(root="example://image-folders/simple", size=(128, 128))
# -> Dataset(num_blocks=3, num_rows=3,
# schema={image: TensorDtype(shape=(128, 128, 3), dtype=uint8),
# label: object})
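
The schema in this hunk reports `shape=(128, 128, 3)`, and the `read_images` docstring describes tensors as `(H, W)` for grayscale images and `(H, W, C)` otherwise. The sketch below shows how the requested `size` and a Pillow mode would combine into that shape. It is an illustration only: `expected_shape` and `BANDS` are hypothetical names, the table covers just three common Pillow modes, and it mirrors how NumPy lays out a converted Pillow image rather than any code in this PR.

```python
from typing import Tuple

# Number of bands (channels) for a few common Pillow modes.
BANDS = {"L": 1, "RGB": 3, "RGBA": 4}


def expected_shape(size: Tuple[int, int], mode: str) -> Tuple[int, ...]:
    """Return the array shape for an image of ``size`` (height, width) in ``mode``."""
    height, width = size
    bands = BANDS[mode]
    # Single-band (grayscale) images have no channel axis.
    return (height, width) if bands == 1 else (height, width, bands)


print(expected_shape((128, 128), "RGB"))  # (128, 128, 3)
print(expected_shape((128, 128), "L"))  # (128, 128)
```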
@@ -7,7 +7,6 @@
from ray.train.torch import TorchCheckpoint, TorchPredictor
from ray.train.batch_predictor import BatchPredictor
from ray.data.preprocessors import BatchMapper
from ray.data.datasource import ImageFolderDatasource


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
@@ -29,9 +28,7 @@ def preprocess(df: pd.DataFrame) -> pd.DataFrame:

data_url = "s3://anonymous@air-example-data-2/1G-image-data-synthetic-raw"
print(f"Running GPU batch prediction with 1GB data from {data_url}")
dataset = ray.data.read_datasource(
ImageFolderDatasource(), root=data_url, size=(256, 256)
)
dataset = ray.data.read_images(root=data_url, size=(256, 256))

model = resnet18(pretrained=True)

2 changes: 2 additions & 0 deletions python/ray/data/__init__.py
@@ -29,6 +29,7 @@
read_binary_files,
read_csv,
read_datasource,
read_images,
read_json,
read_numpy,
read_parquet,
@@ -70,6 +71,7 @@
"read_binary_files",
"read_csv",
"read_datasource",
"read_images",
"read_json",
"read_numpy",
"read_parquet",
74 changes: 2 additions & 72 deletions python/ray/data/datasource/image_folder_datasource.py
@@ -40,85 +40,15 @@

@DeveloperAPI
class ImageFolderDatasource(BinaryDatasource):
"""A datasource that lets you read datasets like `ImageNet <https://www.image-net.org/>`_.

This datasource works with any dataset where images are arranged in this way:

.. code-block::

root/dog/xxx.png
root/dog/xxy.png
root/dog/[...]/xxz.png

root/cat/123.png
root/cat/nsdf3.png
root/cat/[...]/asd932_.png

Datasets read with this datasource contain two columns: ``'image'`` and ``'label'``.

* The ``'image'`` column is of type
:py:class:`~ray.air.util.tensor_extensions.pandas.TensorDtype`. The shape of the
tensors are :math:`(H, W)` if the images are grayscale and :math:`(H, W, C)`
otherwise.
* The ``'label'`` column contains strings representing class names (e.g., 'cat').

Examples:
>>> import ray
>>> from ray.data.datasource import ImageFolderDatasource
>>> ds = ray.data.read_datasource( # doctest: +SKIP
... ImageFolderDatasource(),
... root="/data/imagenet/train",
... size=(224, 224)
... )
>>> sample = ds.take(1)[0] # doctest: +SKIP
>>> sample["image"].to_numpy().shape # doctest: +SKIP
(224, 224, 3)
>>> sample["label"] # doctest: +SKIP
'n01443537'

To convert class labels to integer-valued targets, use
:py:class:`~ray.data.preprocessors.OrdinalEncoder`.

>>> import ray
>>> from ray.data.preprocessors import OrdinalEncoder
>>> ds = ray.data.read_datasource( # doctest: +SKIP
... ImageFolderDatasource(),
... root="/data/imagenet/train",
... size=(224, 224)
... )
>>> oe = OrdinalEncoder(columns=["label"]) # doctest: +SKIP
>>> ds = oe.fit_transform(ds) # doctest: +SKIP
>>> sample = ds.take(1)[0] # doctest: +SKIP
>>> sample["label"] # doctest: +SKIP
71
""" # noqa: E501
"""A datasource that lets you read datasets like ImageNet."""

def create_reader(
self,
root: str,
size: Optional[Tuple[int, int]] = None,
mode: Optional[str] = None,
) -> "Reader[T]":
"""Return a :py:class:`~ray.data.datasource.Reader` that reads images.

.. warning::
If your dataset contains images of varying sizes and you don't specify
``size``, this datasource will error. To prevent errors, specify ``size``
or :ref:`disable tensor extension casting <disable_tensor_extension_casting>`.

Args:
root: Path to the dataset root.
size: The desired height and width of loaded images. If unspecified, images
retain their original shape.
mode: A `Pillow mode <https://pillow.readthedocs.io/en/stable/handbook/concepts.html#modes>`_
describing the desired type and depth of pixels. If unspecified, image
modes are inferred by
`Pillow <https://pillow.readthedocs.io/en/stable/index.html>`_.

Raises:
ValueError: if ``size`` contains non-positive numbers.
ValueError: if ``mode`` is unsupported.
""" # noqa: E501
"""Return a :py:class:`~ray.data.datasource.Reader` that reads images."""
if size is not None and len(size) != 2:
raise ValueError(
"Expected `size` to contain 2 integers for height and width, "
70 changes: 70 additions & 0 deletions python/ray/data/read_api.py
@@ -27,6 +27,7 @@
DefaultFileMetadataProvider,
DefaultParquetMetadataProvider,
FastFileMetadataProvider,
ImageFolderDatasource,
JSONDatasource,
NumpyDatasource,
ParquetBaseDatasource,
@@ -377,6 +378,75 @@ def read_parquet(
)


@PublicAPI(stability="alpha")
def read_images(
    root: str, size: Optional[Tuple[int, int]] = None, mode: Optional[str] = None
):
    """Read datasets like `ImageNet <https://www.image-net.org/>`_.

    This function works with any directory where images are arranged in this way:

    .. code-block::

        root/dog/xxx.png
        root/dog/xxy.png
        root/dog/[...]/xxz.png

        root/cat/123.png
        root/cat/nsdf3.png
        root/cat/[...]/asd932_.png

    Datasets read with this function contain two columns: ``'image'`` and ``'label'``.

    * The ``'image'`` column is of type
      :py:class:`~ray.air.util.tensor_extensions.pandas.TensorDtype`. The shape of
      each tensor is :math:`(H, W)` if the images are grayscale and :math:`(H, W, C)`
      otherwise.
    * The ``'label'`` column contains strings representing class names (e.g., 'cat').

    .. warning::
        If your dataset contains images of varying sizes and you don't specify
        ``size``, this function raises an error. To prevent errors, specify ``size``
        or :ref:`disable tensor extension casting <disable_tensor_extension_casting>`.

    Examples:
        >>> import ray
        >>> ds = ray.data.read_images("/data/imagenet/train", size=(224, 224))  # doctest: +SKIP
        >>> sample = ds.take(1)[0]  # doctest: +SKIP
        >>> sample["image"].to_numpy().shape  # doctest: +SKIP
        (224, 224, 3)
        >>> sample["label"]  # doctest: +SKIP
        'n01443537'

        To convert class labels to integer-valued targets, use
        :class:`~ray.data.preprocessors.OrdinalEncoder`.

        >>> from ray.data.preprocessors import OrdinalEncoder
        >>> oe = OrdinalEncoder(columns=["label"])  # doctest: +SKIP
        >>> ds = oe.fit_transform(ds)  # doctest: +SKIP
        >>> sample = ds.take(1)[0]  # doctest: +SKIP
        >>> sample["label"]  # doctest: +SKIP
        71

    Args:
        root: Path to the dataset root.
        size: The desired height and width of loaded images. If unspecified, images
            retain their original shape.
        mode: A `Pillow mode <https://pillow.readthedocs.io/en/stable/handbook/concepts.html#modes>`_
            describing the desired type and depth of pixels. If unspecified, image
            modes are inferred by
            `Pillow <https://pillow.readthedocs.io/en/stable/index.html>`_.

    Returns:
        A :class:`~ray.data.Dataset` containing image and label columns.

    Raises:
        ValueError: if ``size`` contains non-positive numbers.
        ValueError: if ``mode`` is unsupported.
    """  # noqa: E501
    return read_datasource(ImageFolderDatasource(), root=root, size=size, mode=mode)
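
`read_images` forwards `size` unchanged to `ImageFolderDatasource`, whose `create_reader` checks that `size` has two entries, and the docstring above also promises a `ValueError` for non-positive numbers. A minimal sketch of those two checks follows. `validate_size` is a hypothetical standalone helper, and the error messages are illustrative wording rather than the messages used in this PR.

```python
from typing import Optional, Tuple


def validate_size(size: Optional[Tuple[int, int]]) -> None:
    """Raise ValueError if ``size`` is not a pair of positive integers."""
    if size is None:
        # No resizing requested; images keep their original shape.
        return
    if len(size) != 2:
        raise ValueError(
            f"`size` must contain exactly 2 integers (height, width); got {len(size)}."
        )
    if any(dim <= 0 for dim in size):
        raise ValueError(f"`size` must contain positive integers; got {size}.")


validate_size(None)  # accepted: no resizing
validate_size((224, 224))  # accepted: positive height and width
```

Validating before any files are read means that a bad argument fails quickly instead of surfacing inside a remote read task.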


@PublicAPI
def read_parquet_bulk(
paths: Union[str, List[str]],