[Data][Docs] Fix hf_quick_start.py (ray-project#35240)
`hf_quick_start.py` was failing with

> ValueError: ArrowVariableShapedTensorArray only supports heterogeneous-shaped tensor collections, not arbitrarily nested ragged tensors. Got arrays: [('dtype=object', 'shape=(1,)'), ('dtype=object', 'shape=(1,)')]

This is because we're returning an object that looks like

```python
{"output":
    [[{'generated_text': 'Complete this page to stay up to date with our latest news in aviation related news. You can also'}],
     [{'generated_text': "for me. We could use those resources as time goes on. We'll get to it in the"}]]
}
```

from a UDF.
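Each element of the returned column is itself a one-element list of dicts. NumPy can only store such a row as an object array, which matches the `('dtype=object', 'shape=(1,)')` pairs in the error above. A quick check, no Ray required (the sample string is made up):

```python
import numpy as np

# One row of the old UDF's "output" column: a one-element list of dicts.
row = [{"generated_text": "sample generated text"}]

arr = np.asarray(row)
print(arr.dtype, arr.shape)  # object (1,)
```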

This PR updates the UDF so it returns an object like

```python
{"output": [
    'Complete this page to stay up to date with our latest news in aviation related news. You can also',
    "for me. We could use those resources as time goes on. We'll get to it in the"
]}
```

Signed-off-by: Balaji Veeramani <[email protected]>
Signed-off-by: e428265 <[email protected]>
bveeramani authored and arvind-chandra committed Aug 31, 2023
1 parent d3eadb0 commit ac77b10
Showing 1 changed file with 6 additions and 4 deletions.
doc/source/data/doc_code/hf_quick_start.py
```diff
@@ -15,8 +15,9 @@ def __init__(self):
         self.model = pipeline("text-generation", model="gpt2")
 
     def __call__(self, batch: Dict[str, np.ndarray]):
-        model_out = self.model(list(batch["data"]), max_length=20)
-        return {"output": model_out}
+        model_out = self.model(list(batch["data"]), max_length=20, num_return_sequences=1)
+        batch["output"] = [sequence[0]["generated_text"] for sequence in model_out]
+        return batch
 
 scale = ray.data.ActorPoolStrategy(size=2)
 predictions = ds.map_batches(HuggingFacePredictor, compute=scale)
```
```diff
@@ -54,8 +55,9 @@ def __init__(self):  # <1>
         self.model = pipeline("text-generation", model="gpt2")
 
     def __call__(self, batch: Dict[str, np.ndarray]):  # <2>
-        model_out = self.model(list(batch["data"]), max_length=20)
-        return {"output": np.asarray(model_out)}
+        model_out = self.model(list(batch["data"]), max_length=20, num_return_sequences=1)
+        batch["output"] = [sequence[0]["generated_text"] for sequence in model_out]
+        return batch
 # __hf_quickstart_model_end__
```
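The fixed `__call__` can be exercised without Ray or GPT-2 by stubbing the pipeline. `FakePipeline` below is a test double we invented for illustration; only the body of `__call__` mirrors the patched file:

```python
from typing import Dict

import numpy as np


class FakePipeline:
    """Stub mimicking a transformers text-generation pipeline's return shape."""

    def __call__(self, prompts, max_length=20, num_return_sequences=1):
        # One inner list per prompt, one dict per returned sequence.
        return [
            [{"generated_text": str(p) + "..."}] * num_return_sequences
            for p in prompts
        ]


class HuggingFacePredictor:
    def __init__(self):
        self.model = FakePipeline()

    def __call__(self, batch: Dict[str, np.ndarray]):
        model_out = self.model(list(batch["data"]), max_length=20, num_return_sequences=1)
        batch["output"] = [sequence[0]["generated_text"] for sequence in model_out]
        return batch


batch = {"data": np.array(["hello", "world"])}
result = HuggingFacePredictor()(batch)
print(result["output"])  # flat list of strings, one per input row
```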

