
[Batch Prediction] [Doc] Output jumps from 1/200 to 1/1 and "tench" output is suspicious #39028

Closed
architkulkarni opened this issue Aug 28, 2023 · 9 comments · Fixed by #40075
Labels
bug (Something that is supposed to be working, but isn't) · data (Ray Data-related issues) · docs (An issue or change related to documentation) · P1 (Issue that should be fixed within a few weeks)

Comments

@architkulkarni
Contributor

architkulkarni commented Aug 28, 2023

What happened + What you expected to happen

When running the tutorial at https://docs.ray.io/en/latest/data/examples/huggingface_vit_batch_prediction.html, the output should confirm that the prediction worked and that it used GPUs. But the current output is a little suspicious:

  • the progress bar jumped from 1/200 to 1/1
  • the five sample images were all labeled "tench", and there are only two distinct memory locations among the five (0x7B37546CF7F0 and 0x7B37546AE430). I'm no longer worried about this, but I'm still curious how different images could end up at the same memory location.

I'm not sure whether this is a setup issue, a problem with the script, a bug in Ray, or if everything is working as expected.

Running: 62.0/64.0 CPU, 4.0/4.0 GPU, 955.57 MiB/12.83 GiB object_store_memory:   0%|          | 0/200 [00:05<?, ?it/s]
Running: 61.0/64.0 CPU, 4.0/4.0 GPU, 999.41 MiB/12.83 GiB object_store_memory:   0%|          | 0/200 [00:05<?, ?it/s]
Running: 61.0/64.0 CPU, 4.0/4.0 GPU, 999.41 MiB/12.83 GiB object_store_memory:   0%|          | 1/200 [00:05<17:04,  5.15s/it]
Running: 61.0/64.0 CPU, 4.0/4.0 GPU, 1008.68 MiB/12.83 GiB object_store_memory:   0%|          | 1/200 [00:05<17:04,  5.15s/it]
Running: 61.0/64.0 CPU, 4.0/4.0 GPU, 1008.68 MiB/12.83 GiB object_store_memory: 100%|██████████| 1/1 [00:05<00:00,  5.15s/it]  
                                                                                                                             
2023-08-22 15:48:33,905 WARNING actor_pool_map_operator.py:267 -- To ensure full parallelization across an actor pool of size 4, the specified batch size should be at most 5. Your configured batch size for this operator was 16.
<PIL.Image.Image image mode=RGB size=500x375 at 0x7B37546CF7F0>
Label:  tench, Tinca tinca
<PIL.Image.Image image mode=RGB size=500x375 at 0x7B37546AE430>
Label:  tench, Tinca tinca
<PIL.Image.Image image mode=RGB size=500x375 at 0x7B37546CF430>
Label:  tench, Tinca tinca
<PIL.Image.Image image mode=RGB size=500x375 at 0x7B37546AE430>
Label:  tench, Tinca tinca
<PIL.Image.Image image mode=RGB size=500x375 at 0x7B37546CF7F0>
Label:  tench, Tinca tinca

Versions / Dependencies

Ray 2.6.3, KubeRay 0.6.0

Reproduction script

For more details about how it was run, see ray-project/kuberay#1361. I know the GPUs were available and being used (for example, we got CUDA out-of-memory errors until we reduced the batch size). The use case is a KubeRay tutorial that uses this workload as an example of how to run a real workload on KubeRay.

import ray

s3_uri = "s3://anonymous@air-example-data-2/imagenette2/val/"
ds = ray.data.read_images(s3_uri, mode="RGB")
ds  # Displays the dataset summary when run in a notebook.

from typing import Dict

import numpy as np
from PIL import Image
from transformers import pipeline

BATCH_SIZE = 16

class ImageClassifier:
    def __init__(self):
        # If doing CPU inference, set `device="cpu"` instead.
        self.classifier = pipeline(
            "image-classification", model="google/vit-base-patch16-224", device=0
        )

    def __call__(self, batch: Dict[str, np.ndarray]):
        # Convert the numpy array of images into a list of PIL images,
        # which is the format the HF pipeline expects.
        outputs = self.classifier(
            [Image.fromarray(image_array) for image_array in batch["image"]],
            top_k=1,
            batch_size=BATCH_SIZE,
        )

        # `outputs` is a list of length-one lists. For example:
        # [[{'score': '...', 'label': '...'}], ..., [{'score': '...', 'label': '...'}]]
        batch["score"] = [output[0]["score"] for output in outputs]
        batch["label"] = [output[0]["label"] for output in outputs]
        return batch

predictions = ds.map_batches(
    ImageClassifier,
    compute=ray.data.ActorPoolStrategy(size=4),  # Change this number based on the number of GPUs in your cluster.
    num_gpus=1,  # Specify 1 GPU per model replica.
    batch_size=BATCH_SIZE,  # Use the largest batch size that fits on your GPUs.
)

prediction_batch = predictions.take_batch(5)

print("A few sample predictions:")
for image, prediction in zip(prediction_batch["image"], prediction_batch["label"]):
    img = Image.fromarray(image)
    # Display the image.
    img.show()
    print("Label: ", prediction)

# Write to local disk or external storage, e.g. S3:
# ds.write_parquet("s3://my_bucket/my_folder")

Issue Severity

None

architkulkarni added the bug and triage labels on Aug 28, 2023
@architkulkarni
Contributor Author

Reproducible; this time I got:

Running: 54.0/54.0 CPU, 4.0/4.0 GPU, 769.36 MiB/4.01 GiB object_store_memory:   0%|          | 0/200 [00:04<?, ?it/s]
Running: 52.0/54.0 CPU, 4.0/4.0 GPU, 780.87 MiB/4.01 GiB object_store_memory:   0%|          | 0/200 [00:04<?, ?it/s]
Running: 52.0/54.0 CPU, 4.0/4.0 GPU, 799.0 MiB/4.01 GiB object_store_memory:   0%|          | 0/200 [00:05<?, ?it/s] 
Running: 52.0/54.0 CPU, 4.0/4.0 GPU, 799.0 MiB/4.01 GiB object_store_memory:   0%|          | 1/200 [00:05<16:39,  5.02s/it]
Running: 54.0/54.0 CPU, 4.0/4.0 GPU, 782.99 MiB/4.01 GiB object_store_memory:   0%|          | 1/200 [00:05<16:39,  5.02s/it]
Running: 54.0/54.0 CPU, 4.0/4.0 GPU, 782.99 MiB/4.01 GiB object_store_memory: 100%|██████████| 1/1 [00:05<00:00,  5.02s/it]  

This time I printed out more sample predictions and saw the same memory address for two images which are definitely different:

<PIL.Image.Image image mode=RGB size=153x202 at 0x795AE0904910>
Label:  coho, cohoe, coho salmon, blue jack, silver salmon, Oncorhynchus kisutch
<PIL.Image.Image image mode=RGB size=470x329 at 0x795AE0904910>
Label:  tench, Tinca tinca

So I wonder whether the printed memory location is actually the location of the batch (or something like that), not the location of the image.
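
For what it's worth, the repeated addresses can plausibly be explained without Ray at all: the "at 0x..." in the repr is just id(), and CPython may reuse an address as soon as the previous object is garbage-collected, so two different PIL images created one after another in a loop can print the same value. A minimal sketch of that reuse (plain Python, with a throwaway class standing in for the PIL image):

class FakeImage:
    pass

addresses = []
for _ in range(5):
    img = FakeImage()  # rebinding `img` frees the previous FakeImage
    addresses.append(hex(id(img)))

print(addresses)  # frequently shows the same address repeated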

xieus added the data label on Sep 25, 2023
@scottjlee
Contributor

We have a related PR that may end up resolving the 1/1 progress bar issue: #39828
We can revisit after that PR is merged to see whether the progress bar issue is still present.

@scottjlee
Contributor

Ah, actually I think the 1/1 progress bar may be coming from the .take_batch() call, because this adds a Limit operator with 1 task (which creates the 1/1 progress bar). This is a bit confusing, so we will discuss internally how to clarify this view.
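
A rough sketch of what that looks like (a hypothetical stand-in pipeline, not the ViT workload, and assuming the Ray Data 2.x API): .take_batch() starts its own short execution that only needs enough blocks for the requested rows, separate from the main 0/200 map_batches run.

import ray

# Stand-in pipeline; `range` and the identity map_batches are placeholders.
ds = ray.data.range(200)
mapped = ds.map_batches(lambda batch: batch)  # lazy; nothing runs yet

# take_batch(5) plans a small execution that stops after enough blocks for 5 rows,
# which surfaces as a separate, single-task progress bar (the confusing 1/1 above).
sample = mapped.take_batch(5)
print(sample)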

Regarding the memory location, I don't think Ray is modifying the memory location here; we think it's purely related to PIL. Are you able to see otherwise when reading the images with just PIL and not through Ray? @architkulkarni

scottjlee added the @author-action-required label on Oct 2, 2023
@architkulkarni
Contributor Author

Sure, maybe the memory location is a distraction. The main point of the issue is that the output is baffling for a first-time user reading through the tutorial: it prints out five identical "tench" lines. If it's working as intended, the doc or the sample script should be updated to make it more obvious that it worked, so the user can feel successful.

Also, the display(img) call in the tutorial doesn't work out of the box (where is display defined?).

architkulkarni removed the @author-action-required label on Oct 2, 2023
@scottjlee
Contributor

The 5 "tench" lines come from displaying each of the 5 examples from predictions.take_batch(5). On the actual example page output itself, it shows the example image then the label (which makes more sense I think):
[Screenshot: rendered example page output, showing each sample image followed by its predicted label]

Would the most helpful addition here be something like "Successfully loaded 5 samples" at the end? Or do you have any other suggestion for making it obvious that it succeeded (without relying on displaying images on screen, which may or may not be possible depending on how the user is running the code)?

I think display is from IPython.display, which we need for rendering the sample images in the Jupyter notebook containing the example code.
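
One possible way to make that dependency explicit (a hypothetical tweak, not necessarily the change that later landed in #40075): import display up front and fall back to plain printing when the code runs outside IPython/Jupyter. This assumes prediction_batch from the tutorial code above.

from PIL import Image

try:
    from IPython.display import display  # available in Jupyter/IPython sessions
except ImportError:
    display = None

for image, prediction in zip(prediction_batch["image"], prediction_batch["label"]):
    img = Image.fromarray(image)
    if display is not None:
        display(img)  # renders inline in a notebook
    else:
        print(img)  # plain-text fallback: just print the repr
    print("Label: ", prediction)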

@architkulkarni
Contributor Author

architkulkarni commented Oct 2, 2023

I see, I missed the fact that you can scroll through the images (there's an invisible scroll bar).

For display(img), I think the average user is just going to copy-paste the code and expect it to work. So maybe we can explicitly include the IPython.display import and tell the user to use IPython or Jupyter.

The progress bar is still confusing though, because it goes from 0/200 to 1/200 and then stops (even assuming the 1/1 is from an unrelated call as you suggested).

@architkulkarni
Contributor Author

Ideally there would be more than one type of fish in the output, so we can see it classifying different fish. Not sure if there's a way to guarantee that, though.
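
One option (an assumption on my part, not something the tutorial currently does) would be to shuffle the predictions before sampling, since the images appear to be read in path order and therefore grouped by class. This assumes predictions from the tutorial code above.

# Hypothetical tweak: a full shuffle so the sampled batch spans several classes.
shuffled = predictions.random_shuffle()
varied_batch = shuffled.take_batch(5)

for label in varied_batch["label"]:
    print("Label:", label)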

anyscalesam added the P1 and docs labels and removed the triage label on Oct 3, 2023
@anyscalesam
Contributor

@architkulkarni, can we chat in person to determine priority?

@scottjlee
Contributor

I think updating the example will be pretty quick; we just need to figure out the best way to show that it succeeded.
For the progress bar issue, I'll create a new issue with a Ray Data-only reproducible example.
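
One possible shape for that confirmation (just a sketch, not the wording that ultimately shipped): summarize the sampled labels so success is visible even without rendered images. This assumes prediction_batch from the tutorial code above.

from collections import Counter

labels = list(prediction_batch["label"])
print(f"Batch prediction succeeded: classified {len(labels)} sample images.")
for label, count in Counter(labels).most_common():
    print(f"  {count} x {label}")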
