
[OA] Fixes for Batch Inference Basics template #156

Merged
merged 8 commits into main on Mar 28, 2024

Conversation

scottjlee (Contributor) commented Mar 27, 2024

Address feedback / fixes from dogfooding batch LLM template:

  • Fix bug with the `text` vs `item` column from the `from_items()` call (see the sketch below)
  • Use the Mistral model by default, so users are not required to supply a HF token
  • Use single quotes instead of triple quotes for prompt data
  • Move the scaling sections to after step 4; vLLM requires GPUs, so we need to cover GPUs in the toy setup as well
  • Clean up title + headers
  • Add accelerator type (A10G or L4)

Alongside https://github.com/anyscale/product/pull/27262
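For reference, a minimal sketch of the column-naming behavior behind the first bullet (the example row is illustrative, not taken from the template):

```python
import ray

# from_items() on plain Python objects produces records keyed by "item", not "text",
# so downstream code must read row["item"] (or rename the column).
ds = ray.data.from_items(["What is the capital of France?"])
print(ds.take(1))  # [{'item': 'What is the capital of France?'}]
```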

Signed-off-by: Scott Lee <[email protected]>
Comment on lines 215 to 219
"## Scaling with GPUs\n",
"\n",
"Apply batch inference for all input data with the Ray Data [`map_batches`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.map_batches.html) method. When using vLLM, LLM instances require GPUs; here, we will demonstrate how to configure Ray Data to scale the number of LLM instances and GPUs needed.\n",
"\n",
"To use GPUs for inference in the Workspace, we can specify `num_gpus` and `concurrency` in the `ds.map_batches()` call below to indicate the number of LLM instances and the number of GPUs per LLM instance, respectively. For example, with `concurrency=4` and `num_gpus=1`, we have 4 LLM instances, each using 1 GPU, so we need 4 GPUs total."
scottjlee (Contributor Author) commented:
Note: since vLLM requires GPUs, I had to put this section before "Scaling to a larger dataset", since we will need GPUs even for the toy setup.

scottjlee marked this pull request as ready for review March 27, 2024 23:13
@@ -262,7 +212,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Apply batch inference for all input data with the Ray Data [`map_batches`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.map_batches.html) method. Here, you can easily configure Ray Data to scale the number of LLM instances and compute (number of GPUs to use)."
"## Scaling with GPUs\n",
Contributor commented:
Since this is under step 4, use "###".

@@ -371,7 +387,25 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Summary\n",
"## Submitting an Anyscale Job\n",
Contributor commented:
Remove this for now, the jobs tutorial isn't ready yet. When it is, we can link to that instead of repeating the same content in each template.

Contributor commented:
Also, this will be "ray job submit" within workspaces.

"cell_type": "markdown",
"metadata": {},
"source": [
"## Scaling to a larger dataset\n",
Contributor commented:
Similar comment below, if these are all under section 4 they need to be one level deeper as headings.

" # Specify the number of GPUs required per LLM instance.\n",
" num_gpus=num_gpus_per_instance,\n",
" num_gpus=1,\n",
ericl (Contributor) commented Mar 27, 2024:
When I ran this, I got

"""
raise ValueError(
ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla T4 GPU has compute capability 7.5. You can use float16 instead by explicitly setting the `dtype` flag in CLI, for example: --dtype=half.
"""

Similar to #148, I think you need to set `accelerator_type: A10G` and/or make a function that returns A10G or L4 depending on AWS or GCP.

Or, set the dtype=half.

scottjlee (Contributor Author) commented:
Ah, my bad; I was testing this on a custom workspace with A10s already configured, so that makes sense. Adding a similar function as in #148 which gets A10G/L4 depending on the cloud platform (sketched below).
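A rough sketch of what such a helper could look like (the environment-variable-based cloud detection below is an assumption for illustration; the actual template and #148 may detect the cloud differently):

```python
import os

def get_accelerator_type() -> str:
    """Return a GPU accelerator type available on the current cloud:
    A10G on AWS, L4 on GCP."""
    # Hypothetical detection mechanism; a real implementation might query
    # Anyscale/cloud metadata instead of an environment variable.
    provider = os.environ.get("CLOUD_PROVIDER", "aws").lower()
    return "A10G" if provider == "aws" else "L4"

# The result could then be forwarded via the Ray remote args of map_batches, e.g.:
# ds.map_batches(LLMPredictor, concurrency=4, num_gpus=1,
#                accelerator_type=get_accelerator_type(), ...)
```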

ericl (Contributor) commented Mar 27, 2024

Please ping when it runs correctly in OA, still doesn't work for me

Signed-off-by: Scott Lee <[email protected]>
scottjlee (Contributor Author) commented:
@ericl tested on OA workspace (link) with serverless, ready for another look. Thanks!

scottjlee requested a review from ericl March 28, 2024 00:18
ericl (Contributor) left a comment:

Nice, works e2e now, thanks!

Btw, I couldn't access the workspace you linked, probably because it wasn't in the staging dogfood org; for sharing workspaces, you probably want to use the "Try new UI" function in staging.

ericl merged commit 20613b8 into main on Mar 28, 2024
1 check passed
anmscale pushed a commit that referenced this pull request Jun 22, 2024
[OA] Fixes for Batch Inference Basics template