[Datasets] Add `ImageFolderDatasource` #24641

bveeramani · 2022-05-10T06:06:47Z

Why are these changes needed?

Popular datasets like ImageNet and Tiny ImageNet are arranged in a specific layout like this:

root/dog/xxx.png
root/dog/xxy.png
root/dog/[...]/xxz.png

root/cat/123.png
root/cat/nsdf3.png
root/cat/[...]/asd932_.png

This PR adds a datasource that reads such datasets.
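
To make the layout concrete, here is a minimal standard-library sketch that indexes such a tree by its top-level directory. It is only an illustration of the directory convention, not the datasource's implementation; `group_images_by_class` is a hypothetical helper name, and matching only `*.png` files is an assumption made to keep the example short.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def group_images_by_class(root):
    """Map each top-level directory name under `root` to its image files.

    Hypothetical helper for illustration; not part of Ray's API.
    """
    root = Path(root)
    groups = defaultdict(list)
    for path in sorted(root.rglob("*.png")):
        relative = path.relative_to(root)
        if len(relative.parts) < 2:
            continue  # File sits directly under root, so it has no class directory.
        # The first path component below root is treated as the class name,
        # no matter how deeply the file is nested beneath it.
        groups[relative.parts[0]].append(relative.as_posix())
    return dict(groups)


# Build a tiny tree that mirrors the layout above and index it.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["dog/xxx.png", "dog/nested/xxz.png", "cat/123.png"]:
        file = Path(tmp) / name
        file.parent.mkdir(parents=True, exist_ok=True)
        file.touch()
    groups = group_images_by_class(tmp)

print(groups)
```

Files nested more deeply, such as `dog/nested/xxz.png`, still land under `dog` in this sketch, which matches the `root/dog/[...]/xxz.png` entry in the layout.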

Related issue number

Closes #23977

Checks

- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

python/ray/data/impl/pandas_block.py

python/ray/data/datasource/file_meta_provider.py

doc/source/data/package-ref.rst

python/ray/data/datasource/image_folder_datasource.py

richardliaw · 2022-07-15T01:57:10Z

For the CUJ that Amog posted, can we add it as an example to test in CI?

python/ray/data/datasource/image_folder_datasource.py

python/ray/data/read_api.py

bveeramani · 2022-07-15T17:21:09Z

> For the CUJ that Amog posted, can we add it as an example to test in CI?

No. There are issues with applying TorchVision transformations that are completely unrelated to this PR.

To make this code snippet work, you need to add several workarounds.

Original snippet:

def preprocess(df):
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    df["image"] = df["image"].map(preprocess)
    return df

Snippet with workarounds:

import numpy as np
from ray.data.extensions import TensorArray
from torchvision import transforms

def preprocess(df):
    preprocess = transforms.Compose([
        # Convert the Ray tensor extension element to a NumPy array.
        lambda ray_tensor: ray_tensor.to_numpy(),
        # ToTensor runs first so Resize and CenterCrop operate on a torch.Tensor.
        transforms.ToTensor(),
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        # Convert back to a float32 NumPy array so it can be stored in a TensorArray.
        lambda torch_tensor: torch_tensor.numpy().astype(np.float32),
    ])
    df["image"] = TensorArray([preprocess(image) for image in df["image"]])
    return df

richardliaw · 2022-07-15T17:24:28Z

@bveeramani sorry, what I mean is we should have an end-to-end test such as Amog's example (with whatever modifications you want to make).

Can you please do that before merging?

bveeramani · 2022-07-15T18:16:31Z

> Can you please do that before merging?

@richardliaw Added an E2E test. Wasn't sure what to test other than that there are no errors.

python/ray/data/datasource/image_folder_datasource.py

clarkzinzow

LGTM, great work!

python/ray/data/tests/test_dataset_formats.py

Signed-off-by: Richard Liaw <[email protected]>

jiaodong · 2022-07-16T20:05:25Z

python/ray/data/datasource/image_folder_datasource.py

+        path, data = records[0]
+
+        image = iio.imread(data)
+        label = _get_class_from_path(path, self.root)


Do we have any docs or past discussion about this part? Basically we're assuming we get the label from the user's file path, which has to be structured in a certain way to yield the correct label, with no knobs for passing in a custom label file or a join?

For example, if I read an S3 bucket with filenames of "dog.jpg", "dog_2.jpg", my dataloader will end up getting these string values by default.

bveeramani · 2022-07-16

> Basically we're assuming we get the label based on user file path, which has to be structured in certain way in order to get the correct one without knobs needed to pass in custom label file or join ?

Yeah, that's right. The datasource assumes that the layout is structured in the same way as ImageNet. The functionality of the datasource is based on that of TorchVision's ImageFolder.

> For example, if i read a s3 bucket with filenames of "dog.jpg", "dog_2.jpg" my dataloader will end up getting these string values by default.

Yeah, you're right. We don't validate that the label corresponds to a directory. In this case, we could raise an error stating that the folder isn't structured correctly.

Alternatively, if images aren't stored in a directory, we could set the label to None.

If images aren't stored in a sub-directory, then the image's label will be set to `None`.

.. code-block::

    root/dog/xxx.png  # Label is 'dog'
    root/123.jpg  # Label is `None`
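
The labeling rule discussed in this thread can be sketched in a few lines. This is a hedged illustration only: `label_from_path` is a hypothetical stand-in for the PR's `_get_class_from_path`, whose actual implementation is not shown here and may differ. It assumes POSIX-style paths and treats the first directory below the root as the label.

```python
import posixpath


def label_from_path(path, root):
    """Return the first directory under `root` on the way to `path`, else None.

    Hypothetical stand-in for the PR's `_get_class_from_path`; the real
    implementation may differ.
    """
    relative = posixpath.relpath(path, root)
    parts = relative.split("/")
    if parts[0] == ".." or len(parts) < 2:
        return None  # Outside root, or stored directly under root.
    return parts[0]


print(label_from_path("root/dog/xxx.png", "root"))
print(label_from_path("root/dog/nested/xxz.png", "root"))
print(label_from_path("root/123.jpg", "root"))
```

Returning `None` rather than raising keeps flat folders readable, at the cost of silently producing unlabeled rows; raising an error, as also suggested above, would surface a mis-structured folder earlier.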

Co-authored-by: matthewdeng <[email protected]> Co-authored-by: Richard Liaw <[email protected]> Signed-off-by: Xiaowei Jiang <[email protected]>

Co-authored-by: matthewdeng <[email protected]> Co-authored-by: Richard Liaw <[email protected]> Signed-off-by: Stefan van der Kleij <[email protected]>

Add files

f6eb66e

bveeramani requested review from ericl, scv119, clarkzinzow and jjyao as code owners May 10, 2022 06:06

bveeramani changed the title ~~[Datasets] Add ImageFolderDatasource~~ [Datasets] [WIP] Add ImageFolderDatasource May 10, 2022

bveeramani marked this pull request as draft May 10, 2022 06:07

bveeramani added 4 commits May 11, 2022 01:35

Add files

8fb8d75

Fix stuff

083c9bc

Rename file

4f0b8ce

Rename file

cd6e25e

bveeramani commented May 11, 2022

python/ray/data/impl/pandas_block.py (outdated)

Update image_folder_datasource.py

98b07f7

bveeramani changed the title ~~[Datasets] [WIP] Add ImageFolderDatasource~~ [Datasets] Add ImageFolderDatasource May 11, 2022

bveeramani assigned matthewdeng, clarkzinzow and amogkam May 11, 2022

bveeramani marked this pull request as ready for review May 11, 2022 08:58

bveeramani added 2 commits May 11, 2022 02:15

Update docs

6f7b2eb

Update file_meta_provider.py

db05d35

bveeramani commented May 11, 2022

python/ray/data/datasource/file_meta_provider.py (outdated)

bveeramani added 2 commits May 11, 2022 02:27

Update image_folder_datasource.py

52ae3c8

Update Makefile

813f5de

ericl added the @author-action-required label ("The PR author is responsible for the next step. Remove tag to send back to the reviewer.") May 11, 2022

matthewdeng reviewed May 12, 2022

Update python/ray/data/datasource/image_folder_datasource.py

d68fe1d

Co-authored-by: matthewdeng <[email protected]>

bveeramani requested a review from maxpumperla as a code owner May 18, 2022 08:58

bveeramani added 2 commits May 18, 2022 02:09

Re-add warning

45989c2

Merge branch 'image-datasource' of https://github.com/bveeramani/ray into image-datasource

cd079d9

bveeramani mentioned this pull request May 18, 2022

[Datasets] Remove FastFileMetadataProvider warning #24909

Closed

6 tasks

matthewdeng reviewed Jul 15, 2022

python/ray/data/datasource/image_folder_datasource.py (outdated)

python/ray/data/datasource/image_folder_datasource.py (outdated)

python/ray/data/read_api.py (outdated)

Update documentation and add test

7c0a317

Remove target column

e07d1f8

richardliaw reviewed Jul 15, 2022

python/ray/data/datasource/image_folder_datasource.py (outdated)

bveeramani added 3 commits July 15, 2022 11:25

Change error type from ValueError to ImportError

298c144

Remove read_image_folder

dc9c6e2

Merge branch 'master' into image-datasource

3d8e005

ericl approved these changes Jul 15, 2022

clarkzinzow approved these changes Jul 15, 2022

bveeramani added 3 commits July 15, 2022 13:04

Add API annotation

1d4d920

Add missing DeveloperAPI import

0f5e089

Skip doctests

18dd566

ericl mentioned this pull request Jul 15, 2022

[AIR][CUJ] Add GPU bench prediction benchmark #26614

Merged

6 tasks

matthewdeng reviewed Jul 16, 2022

python/ray/data/tests/test_dataset_formats.py (outdated)

update

70a47d7

Signed-off-by: Richard Liaw <[email protected]>

richardliaw merged commit 34cf1f1 into ray-project:master Jul 16, 2022

jiaodong reviewed Jul 16, 2022

bveeramani deleted the image-datasource branch July 16, 2022 23:05

Riatre mentioned this pull request Jul 18, 2022

Revert "Revert "Bump pytest from 5.4.3 to 7.0.1"" #26525

Merged

6 tasks
