
[Batch Prediction] [Doc] Output jumps from 1/200 to 1/1 and "tench" output is suspicious #39028

Closed
architkulkarni opened this issue Aug 28, 2023 · 9 comments · Fixed by #40075
Labels
bug (Something that is supposed to be working, but isn't) · data (Ray Data-related issues) · docs (An issue or change related to documentation) · P1 (Issue that should be fixed within a few weeks)

Comments

@architkulkarni
Contributor

architkulkarni commented Aug 28, 2023

What happened + What you expected to happen

When running the tutorial at https://docs.ray.io/en/latest/data/examples/huggingface_vit_batch_prediction.html, the output should confirm that the prediction worked and that it used GPUs. But the current output is a little suspicious:

  • the progress bar jumped from 1/200 to 1/1
  • the five sample images were all labeled "tench", and there are only two distinct memory locations among the five (0x7B37546CF7F0 and 0x7B37546AE430). I'm no longer worried about this, but I'm still curious how different images could end up at the same memory location.

I'm not sure whether this is a setup issue, a problem with the script, a bug in Ray, or if everything is working as expected.

Running: 62.0/64.0 CPU, 4.0/4.0 GPU, 955.57 MiB/12.83 GiB object_store_memory:   0%|          | 0/200 [00:05<?, ?it/s]
Running: 61.0/64.0 CPU, 4.0/4.0 GPU, 999.41 MiB/12.83 GiB object_store_memory:   0%|          | 0/200 [00:05<?, ?it/s]
Running: 61.0/64.0 CPU, 4.0/4.0 GPU, 999.41 MiB/12.83 GiB object_store_memory:   0%|          | 1/200 [00:05<17:04,  5.15s/it]
Running: 61.0/64.0 CPU, 4.0/4.0 GPU, 1008.68 MiB/12.83 GiB object_store_memory:   0%|          | 1/200 [00:05<17:04,  5.15s/it]
Running: 61.0/64.0 CPU, 4.0/4.0 GPU, 1008.68 MiB/12.83 GiB object_store_memory: 100%|██████████| 1/1 [00:05<00:00,  5.15s/it]  
                                                                                                                             
2023-08-22 15:48:33,905 WARNING actor_pool_map_operator.py:267 -- To ensure full parallelization across an actor pool of size 4, the specified batch size should be at most 5. Your configured batch size for this operator was 16.
<PIL.Image.Image image mode=RGB size=500x375 at 0x7B37546CF7F0>
Label:  tench, Tinca tinca
<PIL.Image.Image image mode=RGB size=500x375 at 0x7B37546AE430>
Label:  tench, Tinca tinca
<PIL.Image.Image image mode=RGB size=500x375 at 0x7B37546CF430>
Label:  tench, Tinca tinca
<PIL.Image.Image image mode=RGB size=500x375 at 0x7B37546AE430>
Label:  tench, Tinca tinca
<PIL.Image.Image image mode=RGB size=500x375 at 0x7B37546CF7F0>
Label:  tench, Tinca tinca

Versions / Dependencies

Ray 2.6.3, KubeRay 0.6.0

Reproduction script

For more details about how it was run, see ray-project/kuberay#1361. I know the GPUs were available and being used (for example, we got CUDA out-of-memory errors until we reduced the batch size). The use case is a KubeRay tutorial that uses this workload as an example of how to run a real workload on KubeRay.

import ray

s3_uri = "s3://anonymous@air-example-data-2/imagenette2/val/"
ds = ray.data.read_images(s3_uri, mode="RGB")
ds  # Displays the dataset summary when run in a notebook.

from typing import Dict

import numpy as np
from PIL import Image
from transformers import pipeline

BATCH_SIZE = 16

class ImageClassifier:
    def __init__(self):
        # If doing CPU inference, set `device="cpu"` instead.
        self.classifier = pipeline(
            "image-classification", model="google/vit-base-patch16-224", device=0
        )

    def __call__(self, batch: Dict[str, np.ndarray]):
        # Convert the numpy array of images into a list of PIL images,
        # which is the format the HF pipeline expects.
        outputs = self.classifier(
            [Image.fromarray(image_array) for image_array in batch["image"]],
            top_k=1,
            batch_size=BATCH_SIZE,
        )

        # `outputs` is a list of length-one lists. For example:
        # [[{'score': '...', 'label': '...'}], ..., [{'score': '...', 'label': '...'}]]
        batch["score"] = [output[0]["score"] for output in outputs]
        batch["label"] = [output[0]["label"] for output in outputs]
        return batch

predictions = ds.map_batches(
    ImageClassifier,
    compute=ray.data.ActorPoolStrategy(size=4),  # Change this number based on the number of GPUs in your cluster.
    num_gpus=1,  # Specify 1 GPU per model replica.
    batch_size=BATCH_SIZE,  # Use the largest batch size that fits on your GPUs.
)

prediction_batch = predictions.take_batch(5)

print("A few sample predictions:")
for image, prediction in zip(prediction_batch["image"], prediction_batch["label"]):
    img = Image.fromarray(image)
    # Display the image.
    img.show()
    print("Label: ", prediction)

# Write to local disk or external storage, e.g. S3:
# ds.write_parquet("s3://my_bucket/my_folder")

Issue Severity

None

architkulkarni added the bug and triage labels on Aug 28, 2023
@architkulkarni
Contributor Author

Reproducible; this time I got:

Running: 54.0/54.0 CPU, 4.0/4.0 GPU, 769.36 MiB/4.01 GiB object_store_memory:   0%|          | 0/200 [00:04<?, ?it/s]
Running: 52.0/54.0 CPU, 4.0/4.0 GPU, 780.87 MiB/4.01 GiB object_store_memory:   0%|          | 0/200 [00:04<?, ?it/s]
Running: 52.0/54.0 CPU, 4.0/4.0 GPU, 799.0 MiB/4.01 GiB object_store_memory:   0%|          | 0/200 [00:05<?, ?it/s] 
Running: 52.0/54.0 CPU, 4.0/4.0 GPU, 799.0 MiB/4.01 GiB object_store_memory:   0%|          | 1/200 [00:05<16:39,  5.02s/it]
Running: 54.0/54.0 CPU, 4.0/4.0 GPU, 782.99 MiB/4.01 GiB object_store_memory:   0%|          | 1/200 [00:05<16:39,  5.02s/it]
Running: 54.0/54.0 CPU, 4.0/4.0 GPU, 782.99 MiB/4.01 GiB object_store_memory: 100%|██████████| 1/1 [00:05<00:00,  5.02s/it]  

This time I printed out more sample predictions and saw the same memory address for two images which are definitely different:

<PIL.Image.Image image mode=RGB size=153x202 at 0x795AE0904910>
Label:  coho, cohoe, coho salmon, blue jack, silver salmon, Oncorhynchus kisutch
<PIL.Image.Image image mode=RGB size=470x329 at 0x795AE0904910>
Label:  tench, Tinca tinca

So I wonder whether the printed memory location is actually the location of the batch (or something like that), not the location of the image.
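
For what it's worth, the repeated addresses can plausibly be explained without Ray at all: the "at 0x..." in the repr is just id(), and CPython may reuse an address as soon as the previous object is garbage-collected, so two different PIL images created one after another in a loop can print the same value. A minimal sketch of that reuse (plain Python, with a throwaway class standing in for the PIL image):

class FakeImage:
    pass

addresses = []
for _ in range(5):
    img = FakeImage()  # rebinding `img` frees the previous FakeImage
    addresses.append(hex(id(img)))

print(addresses)  # frequently shows the same address repeated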

xieus added the data label on Sep 25, 2023
@scottjlee
Contributor

We have a related PR that may end up resolving the 1/1 progress bar issue: #39828
We can revisit after that PR is merged to see whether the progress bar issue is still present.

@scottjlee
Contributor

Ah, actually I think the 1/1 progress bar may be coming from the .take_batch() call, because this adds a Limit operator with 1 task (which creates the 1/1 progress bar). This is a bit confusing, so we will discuss internally how to clarify this view.
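
A rough sketch of what that looks like (a hypothetical stand-in pipeline, not the ViT workload, and assuming the Ray Data 2.x API): .take_batch() starts its own short execution that only needs enough blocks for the requested rows, separate from the main 0/200 map_batches run.

import ray

# Stand-in pipeline; `range` and the identity map_batches are placeholders.
ds = ray.data.range(200)
mapped = ds.map_batches(lambda batch: batch)  # lazy; nothing runs yet

# take_batch(5) plans a small execution that stops after enough blocks for 5 rows,
# which surfaces as a separate, single-task progress bar (the confusing 1/1 above).
sample = mapped.take_batch(5)
print(sample)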

Regarding the memory location, I don't think Ray is modifying the memory location here; we think it's purely related to PIL. Are you able to see otherwise when reading the images with just PIL and not through Ray? @architkulkarni

scottjlee added the @author-action-required label on Oct 2, 2023
@architkulkarni
Contributor Author

Sure, maybe the memory location is a distraction. The main point of the issue is that the output is baffling for a first-time user reading through the tutorial: it prints out five identical "tench" lines. If it's working as intended, the doc or the sample script should be updated to make it more obvious that it worked, so the user can feel successful.

Also, the display(img) call in the tutorial doesn't work out of the box (where is display defined?).

architkulkarni removed the @author-action-required label on Oct 2, 2023
@scottjlee
Contributor

The 5 "tench" lines come from displaying each of the 5 examples from predictions.take_batch(5). On the actual example page output itself, it shows the example image then the label (which makes more sense I think):
[Screenshot: rendered example page output, showing each sample image followed by its predicted label]

Would the most helpful addition here be something like "Successfully loaded 5 samples" at the end? Or do you have any other suggestion for making it obvious that it succeeded (without relying on displaying images on screen, which may or may not be possible depending on how the user is running the code)?

I think display is from IPython.display, which we need for rendering the sample images in the Jupyter notebook containing the example code.
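
One possible way to make that dependency explicit (a hypothetical tweak, not necessarily the change that later landed in #40075): import display up front and fall back to plain printing when the code runs outside IPython/Jupyter. This assumes prediction_batch from the tutorial code above.

from PIL import Image

try:
    from IPython.display import display  # available in Jupyter/IPython sessions
except ImportError:
    display = None

for image, prediction in zip(prediction_batch["image"], prediction_batch["label"]):
    img = Image.fromarray(image)
    if display is not None:
        display(img)  # renders inline in a notebook
    else:
        print(img)  # plain-text fallback: just print the repr
    print("Label: ", prediction)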

@architkulkarni
Contributor Author

architkulkarni commented Oct 2, 2023

I see, I missed the fact that you can scroll through the images (there's an invisible scroll bar).

For display(img), I think the average user is just going to copy-paste the code and expect it to work. So maybe we can explicitly include the IPython.display import and tell the user to use IPython or Jupyter.

The progress bar is still confusing though, because it goes from 0/200 to 1/200 and then stops (even assuming the 1/1 is from an unrelated call as you suggested).

@architkulkarni
Contributor Author

Ideally there would be more than one type of fish in the output, so we can see it classifying different fish. Not sure if there's a way to guarantee that, though.
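
One option (an assumption on my part, not something the tutorial currently does) would be to shuffle the predictions before sampling, since the images appear to be read in path order and therefore grouped by class. This assumes predictions from the tutorial code above.

# Hypothetical tweak: a full shuffle so the sampled batch spans several classes.
shuffled = predictions.random_shuffle()
varied_batch = shuffled.take_batch(5)

for label in varied_batch["label"]:
    print("Label:", label)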

anyscalesam added the P1 and docs labels and removed the triage label on Oct 3, 2023
@anyscalesam
Contributor

@architkulkarni, can we chat in person to determine priority?

@scottjlee
Contributor

I think updating the example will be pretty quick; we just need to figure out the best way to show that it succeeded.
For the progress bar issue, I'll create a new issue with a Ray Data-only reproducible example.
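
One possible shape for that confirmation (just a sketch, not the wording that ultimately shipped): summarize the sampled labels so success is visible even without rendered images. This assumes prediction_batch from the tutorial code above.

from collections import Counter

labels = list(prediction_batch["label"])
print(f"Batch prediction succeeded: classified {len(labels)} sample images.")
for label, count in Counter(labels).most_common():
    print(f"  {count} x {label}")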
