Add LLM Pipeline #137

Merged
merged 15 commits into livepeer:main on Sep 30, 2024

Conversation

kyriediculous
Contributor

No description provided.

@ad-astra-video
Collaborator

ad-astra-video commented Sep 21, 2024

@rickstaa I have reviewed this and confirmed it works. The code needed to be rebased with the new code-gen updates from recent SDK releases; @kyriediculous can update this PR or we can move to the other PR.

Some brief research showed that there are other implementations for serving LLM pipelines, which was also briefly discussed with @kyriediculous. We settled on the approach that alternative implementations can be researched and tested if the need arises from user feedback. The LLM SPE will continue to support and enhance this pipeline to suit the network's requirements for the LLM pipeline as the network evolves.

Notes from review/testing:

  • I like that the streamed response simply starts a second thread to run the inference, using a pre-built text streamer from the transformers library to send the text chunks back. Note that the API for this class may change in the future, per a note in the transformers documentation. A minimal sketch of the pattern is shown below.
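
A minimal sketch of that pattern, assuming the standard TextIteratorStreamer API from transformers (the model id and generation kwargs are placeholders, not the PR's exact code):

from threading import Thread

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

# Placeholder model id; the pipeline loads its own configured model.
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def stream_generate(prompt: str):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    # Run generation on a second thread so the caller can consume chunks as they arrive.
    thread = Thread(
        target=model.generate,
        kwargs={**inputs, "streamer": streamer, "max_new_tokens": 256},
    )
    thread.start()
    for text_chunk in streamer:  # yields decoded text pieces as they are produced
        yield text_chunk
    thread.join()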

There were only a couple of small changes I made in addition to the changes needed to rebase this PR:

  1. Moved check_torch_cuda.py to the dev folder, since it only provides a helper to check the CUDA version.
  2. Fixed the logic for returning containers to the pool for managed containers. For streamed responses, the container was returned right after the stream was started, which allowed another request to reach the GPU and could significantly slow down the first request while it was still processing. I would suggest we start with one request in flight per GPU for managed containers and target a future enhancement to increase this once thorough testing and documentation of multiple requests in flight on one GPU can be completed; a rough sketch of the intended behavior follows this list.
    • Note, external containers are not limited to one request in flight at a time. It is expected that external containers have their own load-balancing logic and return a 500 error when overloaded. Also, external containers slow down in tokens/second as each concurrent request is added; I experienced connections closing/timing out when overloading the GPU too much while testing locally with 5 concurrent requests on a 3080.
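
A rough Python sketch of the intent behind item 2 (the actual change is in the Go worker; container_pool, acquire, stream, and release are hypothetical names used only for illustration):

from typing import Any, AsyncIterator

async def run_streaming_request(container_pool: Any, request: Any) -> AsyncIterator[str]:
    # One request in flight per GPU: acquire a managed container exclusively.
    container = await container_pool.acquire()  # hypothetical pool API
    try:
        async for chunk in container.stream(request):  # hypothetical streaming call
            yield chunk
    finally:
        # Return the container only after the stream is fully consumed, so a second
        # request cannot land on the same GPU while the first is still generating.
        container_pool.release(container)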

@kyriediculous
Contributor Author

All comments have been addressed and the commit history has been cleaned up.

2024-09-27 13:32:48,339 INFO:     Started server process [1]
2024-09-27 13:32:48,339 INFO:     Waiting for application startup.
2024-09-27 13:32:55,774 - app.pipelines.llm - INFO - Local model path: /models/models--meta-llama--Meta-Llama-3.1-8B-Instruct
2024-09-27 13:32:55,774 - app.pipelines.llm - INFO - Directory contents: ['snapshots', 'refs', 'blobs']
2024-09-27 13:32:55,774 - app.pipelines.llm - INFO - Using fp16/bf16 precision
2024-09-27 13:32:55,798 - app.pipelines.llm - INFO - Max memory configuration: {0: '23GiB', 1: '23GiB', 'cpu': '26GiB'}
Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00,  6.13it/s]
2024-09-27 13:33:04,805 - app.pipelines.llm - INFO - Model loaded and distributed. Device map: {'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0, 'model.layers.9': 0, 'model.layers.10': 0, 'model.layers.11': 0, 'model.layers.12': 0, 'model.layers.13': 0, 'model.layers.14': 1, 'model.layers.15': 1, 'model.layers.16': 1, 'model.layers.17': 1, 'model.layers.18': 1, 'model.layers.19': 1, 'model.layers.20': 1, 'model.layers.21': 1, 'model.layers.22': 1, 'model.layers.23': 1, 'model.layers.24': 1, 'model.layers.25': 1, 'model.layers.26': 1, 'model.layers.27': 1, 'model.layers.28': 1, 'model.layers.29': 1, 'model.layers.30': 1, 'model.layers.31': 1, 'model.norm': 1, 'model.rotary_emb': 1, 'lm_head': 1}
/root/.pyenv/versions/3.11.10/lib/python3.11/site-packages/pydantic/_internal/_fields.py:160: UserWarning: Field "model_id" has conflict with protected namespace "model_".

You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.
  warnings.warn(
2024-09-27 13:33:04,869 - app.main - INFO - Started up with pipeline LLMPipeline(model_id=meta-llama/Meta-Llama-3.1-8B-Instruct)
2024-09-27 13:33:04,869 INFO:     Application startup complete.
2024-09-27 13:33:04,870 INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
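
Aside: the pydantic warning above can typically be resolved as the message suggests. A minimal sketch, assuming pydantic v2 (the class name is illustrative, not the pipeline's actual model):

from pydantic import BaseModel, ConfigDict

class HardwareInfo(BaseModel):  # hypothetical model name, for illustration only
    # Disable the protected "model_" namespace so a "model_id" field no longer warns.
    model_config = ConfigDict(protected_namespaces=())

    model_id: str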

@@ -58,6 +58,11 @@ class TextResponse(BaseModel):
chunks: List[chunk] = Field(..., description="The generated text chunks.")


class LlmResponse(BaseModel):
Collaborator

Can we make this LLMResponse, since LLM is an abbreviation?
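
A minimal sketch of the suggested rename (the class body is omitted, as it stays the same as in the diff above):

class LLMResponse(BaseModel):  # fully capitalized, since LLM is an abbreviation
    ...  # fields unchanged from the original LlmResponse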

worker/worker.go Outdated
@@ -341,7 +341,7 @@ func (w *Worker) LLM(ctx context.Context, req LlmLlmPostFormdataRequestBody) (in
if err != nil {
return nil, err
}
return w.handleNonStreamingResponse(resp)
Collaborator
@rickstaa rickstaa Sep 27, 2024

Could you also apply the following commits done by @ad-astra-video to your branch?

We're planning to move the other pipelines into their own Docker containers to make the interface more generic and the base container leaner. With that in mind, I think it would make sense to also include the LLM pipeline in this new setup. The orchestrator experience will remain unchanged once PR #200 is merged.

Could you also take a look at this commit to see if it is needed?

Lastly, please add SDK tags where you think they fit, using this commit by @ad-astra-video as an example.

Collaborator
@rickstaa rickstaa Sep 27, 2024

Apart from the small naming comment, everything else can also be addressed in a separate pull request, so approved for now 👍.

Contributor Author

I get the intention of the commits, but I have several points of feedback on them. Usually PR feedback is left to the author to address, and commits aren't added outside of the author's purview.

  • 240e16d -> bench.py is also in the root folder; it's also not a runtime piece of code but a dev-tool script. Either move both or neither.

  • b6a790d -> I don't really see what this accomplishes other than nuking CI? Models can be downloaded separately regardless.

  • 9e1c48a -> Probably not the most idiomatic way to do this, but fine for now. It slipped through the cracks because I mainly tested with an external container.

  • 9461530 -> I made the changes, but this naming convention isn't good. Studio should not dictate what our naming conventions should be; I know I'm repeating myself here.

Overall, there are too many breaking changes being made in this repo that affect open PRs, and there needs to be a better culture around getting PRs across the finish line and respecting the dynamics of author and reviewer.

@@ -36,7 +36,8 @@ var containerHostPorts = map[string]string{
"image-to-video": "8200",
"upscale": "8300",
"audio-to-text": "8400",
"segment-anything-2": "8500",
"llm": "8500",
Collaborator

@kyriediculous Could we keep the order the same?

Contributor Author

LLM was added before segment anything and had been using port 8005 prior to this.

Collaborator
@rickstaa rickstaa left a comment

I’m out of the office and don’t have time for a deep dive, but I took a quick look to keep things moving. I’ve left a few comments, but overall, it looks good to merge 🎉.

Comment on lines 24 to 26
@router.post("/llm",
response_model=LLMResponse, responses=RESPONSES, operation_id="genLLM",)
@router.post("/llm/", response_model=LLMResponse, responses=RESPONSES, include_in_schema=False)
Member

Can you add metadata for the OpenAPI docs & SDK? Here's a suggestion to make it easy:

Suggested change

Current:

@router.post("/llm",
response_model=LLMResponse, responses=RESPONSES, operation_id="genLLM",)
@router.post("/llm/", response_model=LLMResponse, responses=RESPONSES, include_in_schema=False)

Suggested:

@router.post(
    "/llm",
    response_model=LLMResponse,
    responses=RESPONSES,
    operation_id="llm",
    description="Generate text using a language model.",
    summary="LLM",
    tags=["generate"],
    openapi_extra={"x-speakeasy-name-override": "llm"},
)
@router.post(
    "/llm/",
    response_model=LLMResponse,
    responses=RESPONSES,
    include_in_schema=False,
)

I changed the operation_id to just llm since Rick mentioned you didn't want the gen prefix here. As I mentioned on Discord, this operation_id doesn't matter for the SDK, so feel free to skip it in your API if you think that's more important than consistency.

@rickstaa rickstaa merged commit 6b00498 into livepeer:main Sep 30, 2024