[OA] Fixes for Batch Inference Basics template #156
Conversation
Signed-off-by: Scott Lee <[email protected]>
templates/batch-llm/README.ipynb
Outdated
"## Scaling with GPUs\n", | ||
"\n", | ||
"Apply batch inference for all input data with the Ray Data [`map_batches`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.map_batches.html) method. When using vLLM, LLM instances require GPUs; here, we will demonstrate how to configure Ray Data to scale the number of LLM instances and GPUs needed.\n", | ||
"\n", | ||
"To use GPUs for inference in the Workspace, we can specify `num_gpus` and `concurrency` in the `ds.map_batches()` call below to indicate the number of LLM instances and the number of GPUs per LLM instance, respectively. For example, with `concurrency=4` and `num_gpus=1`, we have 4 LLM instances, each using 1 GPU, so we need 4 GPUs total." |
note, since vLLM requires GPUs, i had to put this section before the "scaling to larger dataset," since we will need GPUs for even the toy setup.
templates/batch-llm/README.ipynb
Outdated
@@ -262,7 +212,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Apply batch inference for all input data with the Ray Data [`map_batches`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.map_batches.html) method. Here, you can easily configure Ray Data to scale the number of LLM instances and compute (number of GPUs to use)."
"## Scaling with GPUs\n",
Since this is under step 4, use "###".
templates/batch-llm/README.ipynb
Outdated
@@ -371,7 +387,25 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Summary\n",
"## Submitting an Anyscale Job\n",
Remove this for now, the jobs tutorial isn't ready yet. When it is, we can link to that instead of repeating the same content in each template.
Also, this will be `ray job submit` within workspaces.
templates/batch-llm/README.ipynb
Outdated
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Scaling to a larger dataset\n", |
Similar comment below, if these are all under section 4 they need to be one level deeper as headings.
" # Specify the number of GPUs required per LLM instance.\n", | ||
" num_gpus=num_gpus_per_instance,\n", | ||
" num_gpus=1,\n", |
When I ran this, I got
"""
raise ValueError(
ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla T4 GPU has compute capability 7.5. You can use float16 instead by explicitly setting the `dtype` flag in CLI, for example: --dtype=half.
"""
Similar to #148, I think you need to set `accelerator_type: A10G`, and/or make a function that returns A10G or L4 depending on AWS or GCP. Or, set `dtype=half`.
Ah, my bad, I was testing this on a custom workspace with A10s already configured, so that makes sense. Adding a similar function to the one in #148 which gets A10G/L4 depending on the cloud platform.
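Roughly, such a helper could look like the sketch below. This is an illustrative assumption, not the template's actual code: the helper name and the provider check (an environment variable here) are hypothetical; the real detection logic lives in #148.

```python
import os

def get_a10g_or_l4_accelerator_type() -> str:
    """Pick a GPU type with compute capability >= 8.0 (needed for bfloat16).

    Hypothetical sketch: assumes a CLOUD_PROVIDER environment variable as the
    AWS-vs-GCP signal; the actual template detects the cloud differently.
    """
    if os.environ.get("CLOUD_PROVIDER", "AWS").upper() == "GCP":
        return "L4"   # GCP clouds typically provide NVIDIA L4 GPUs.
    return "A10G"     # AWS clouds typically provide NVIDIA A10G GPUs.

# The accelerator type can then be requested per LLM instance, e.g.:
# ds.map_batches(LLMPredictor, concurrency=4, num_gpus=1,
#                accelerator_type=get_a10g_or_l4_accelerator_type(), ...)
#
# Alternatively, keep the T4s and construct vLLM with half precision:
# llm = LLM(model=..., dtype="half")
```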
Please ping when it runs correctly in OA, still doesn't work for me.
Signed-off-by: Scott Lee <[email protected]>
Nice, works e2e now, thanks!
Btw, I couldn't access the workspace you linked probably b/c it wasn't in the staging dogfood org, for sharing workspaces you probably want to use the "Try new UI" function in staging.
Signed-off-by: Scott Lee <[email protected]>
[OA] Fixes for Batch Inference Basics template
Address feedback / fixes from dogfooding batch LLM template:
- `text` vs `item` column from the `from_items()` call (see the note below)
- Move scaling sections to after step 4
- vLLM requires GPUs, so need to talk about GPUs in the toy setup as well.

Alongside https://github.com/anyscale/product/pull/27262
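For context on the first item, a small illustration of `from_items()` behavior (the prompt strings here are made up):

```python
import ray

# Plain (non-dict) items are wrapped in a single "item" column...
ds = ray.data.from_items(["a prompt", "another prompt"])
print(ds.take(1))  # [{'item': 'a prompt'}]

# ...while dict items keep their keys, so a "text" column can be used directly.
ds = ray.data.from_items([{"text": "a prompt"}, {"text": "another prompt"}])
print(ds.take(1))  # [{'text': 'a prompt'}]
```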