Add LLM Pipeline #137

Merged
merged 15 commits into livepeer:main on Sep 30, 2024

Conversation

kyriediculous
Contributor

No description provided.

@ad-astra-video
Collaborator

ad-astra-video commented Sep 21, 2024

@rickstaa I have reviewed this and confirmed it works. The code needed to be rebased with the new code-gen updates from recent SDK releases; @kyriediculous can update this PR or we can move to the other PR.

Some brief research showed that there are other implementations for serving LLM pipelines, which was also briefly discussed with @kyriediculous. We settled on the approach that alternative implementations can be researched and tested if the need arises from user feedback. The LLM SPE will continue to support and enhance this pipeline to suit the network's requirements for the LLM pipeline as the network evolves.

Notes from review/testing:

  • I like that the streamed response simply starts a second thread to run the inference, using a pre-built text streamer from the transformers library to send the text chunks back. Note that the API for this class may change in the future, per a note in the transformers documentation. A minimal sketch of the pattern is shown below.
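
A minimal sketch of that pattern, assuming the standard TextIteratorStreamer API from transformers (the model id and generation kwargs are placeholders, not the PR's exact code):

from threading import Thread

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

# Placeholder model id; the pipeline loads its own configured model.
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def stream_generate(prompt: str):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    # Run generation on a second thread so the caller can consume chunks as they arrive.
    thread = Thread(
        target=model.generate,
        kwargs={**inputs, "streamer": streamer, "max_new_tokens": 256},
    )
    thread.start()
    for text_chunk in streamer:  # yields decoded text pieces as they are produced
        yield text_chunk
    thread.join()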

There were only a couple of small changes I made in addition to the changes needed to rebase this PR:

  1. Moved check_torch_cuda.py to the dev folder, since it only provides a helper to check the CUDA version.
  2. Fixed the logic for returning containers to the pool for managed containers. For streamed responses, the container was returned right after the stream was started, which allowed another request to reach the GPU and could significantly slow down the first request while it was still processing. I would suggest we start with one request in flight per GPU for managed containers and target a future enhancement to increase this once thorough testing and documentation of multiple requests in flight on one GPU can be completed; a rough sketch of the intended behavior follows this list.
    • Note, external containers are not limited to one request in flight at a time. It is expected that external containers have their own load-balancing logic and return a 500 error when overloaded. Also, external containers slow down in tokens/second as each concurrent request is added; I experienced connections closing/timing out when overloading the GPU too much while testing locally with 5 concurrent requests on a 3080.
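
A rough Python sketch of the intent behind item 2 (the actual change is in the Go worker; container_pool, acquire, stream, and release are hypothetical names used only for illustration):

from typing import Any, AsyncIterator

async def run_streaming_request(container_pool: Any, request: Any) -> AsyncIterator[str]:
    # One request in flight per GPU: acquire a managed container exclusively.
    container = await container_pool.acquire()  # hypothetical pool API
    try:
        async for chunk in container.stream(request):  # hypothetical streaming call
            yield chunk
    finally:
        # Return the container only after the stream is fully consumed, so a second
        # request cannot land on the same GPU while the first is still generating.
        container_pool.release(container)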

@kyriediculous
Contributor Author

All comments have been addressed and the commit history has been cleaned up.

2024-09-27 13:32:48,339 INFO:     Started server process [1]
2024-09-27 13:32:48,339 INFO:     Waiting for application startup.
2024-09-27 13:32:55,774 - app.pipelines.llm - INFO - Local model path: /models/models--meta-llama--Meta-Llama-3.1-8B-Instruct
2024-09-27 13:32:55,774 - app.pipelines.llm - INFO - Directory contents: ['snapshots', 'refs', 'blobs']
2024-09-27 13:32:55,774 - app.pipelines.llm - INFO - Using fp16/bf16 precision
2024-09-27 13:32:55,798 - app.pipelines.llm - INFO - Max memory configuration: {0: '23GiB', 1: '23GiB', 'cpu': '26GiB'}
Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00,  6.13it/s]
2024-09-27 13:33:04,805 - app.pipelines.llm - INFO - Model loaded and distributed. Device map: {'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0, 'model.layers.9': 0, 'model.layers.10': 0, 'model.layers.11': 0, 'model.layers.12': 0, 'model.layers.13': 0, 'model.layers.14': 1, 'model.layers.15': 1, 'model.layers.16': 1, 'model.layers.17': 1, 'model.layers.18': 1, 'model.layers.19': 1, 'model.layers.20': 1, 'model.layers.21': 1, 'model.layers.22': 1, 'model.layers.23': 1, 'model.layers.24': 1, 'model.layers.25': 1, 'model.layers.26': 1, 'model.layers.27': 1, 'model.layers.28': 1, 'model.layers.29': 1, 'model.layers.30': 1, 'model.layers.31': 1, 'model.norm': 1, 'model.rotary_emb': 1, 'lm_head': 1}
/root/.pyenv/versions/3.11.10/lib/python3.11/site-packages/pydantic/_internal/_fields.py:160: UserWarning: Field "model_id" has conflict with protected namespace "model_".

You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.
  warnings.warn(
2024-09-27 13:33:04,869 - app.main - INFO - Started up with pipeline LLMPipeline(model_id=meta-llama/Meta-Llama-3.1-8B-Instruct)
2024-09-27 13:33:04,869 INFO:     Application startup complete.
2024-09-27 13:33:04,870 INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
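
Aside: the pydantic warning above can typically be resolved as the message suggests. A minimal sketch, assuming pydantic v2 (the class name is illustrative, not the pipeline's actual model):

from pydantic import BaseModel, ConfigDict

class HardwareInfo(BaseModel):  # hypothetical model name, for illustration only
    # Disable the protected "model_" namespace so a "model_id" field no longer warns.
    model_config = ConfigDict(protected_namespaces=())

    model_id: str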

@@ -58,6 +58,11 @@ class TextResponse(BaseModel):
chunks: List[chunk] = Field(..., description="The generated text chunks.")


class LlmResponse(BaseModel):
Collaborator

Can we make this LLMResponse, since LLM is an abbreviation?
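
A minimal sketch of the suggested rename (the class body is omitted, as it stays the same as in the diff above):

class LLMResponse(BaseModel):  # fully capitalized, since LLM is an abbreviation
    ...  # fields unchanged from the original LlmResponse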

worker/worker.go Outdated
@@ -341,7 +341,7 @@ func (w *Worker) LLM(ctx context.Context, req LlmLlmPostFormdataRequestBody) (in
if err != nil {
return nil, err
}
return w.handleNonStreamingResponse(resp)
Collaborator
@rickstaa rickstaa Sep 27, 2024

Could you also apply the following commits done by @ad-astra-video to your branch?

We're planning to move the other pipelines into their own Docker containers to make the interface more generic and the base container leaner. With that in mind, I think it would make sense to also include the LLM pipeline in this new setup. The orchestrator experience will remain unchanged once PR #200 is merged.

Could you also take a look at this commit to see if it is needed?

Lastly, please add SDK tags where you think they fit, using this commit by @ad-astra-video as an example.

Collaborator
@rickstaa rickstaa Sep 27, 2024

Apart from the small naming comment, everything else can also be addressed in a separate pull request, so approved for now 👍.

Contributor Author

I get the intention of the commits, but I have several points of feedback on them. Usually PR feedback is left to the author to address, and commits aren't added outside of the author's purview.

  • 240e16d -> bench.py is also in the root folder; it's also not a runtime piece of code but a dev-tool script. Either move both or neither.

  • b6a790d -> I don't really see what this accomplishes other than nuking CI? Models can be downloaded separately regardless.

  • 9e1c48a -> Probably not the most idiomatic way to do this, but fine for now. It slipped through the cracks because I mainly tested with an external container.

  • 9461530 -> I made the changes, but this naming convention isn't good. Studio should not dictate what our naming conventions should be; I know I'm repeating myself here.

Overall, there are too many breaking changes being made in this repo that affect open PRs, and there needs to be a better culture around getting PRs across the finish line and respecting the dynamics of author and reviewer.

@@ -36,7 +36,8 @@ var containerHostPorts = map[string]string{
"image-to-video": "8200",
"upscale": "8300",
"audio-to-text": "8400",
"segment-anything-2": "8500",
"llm": "8500",
Collaborator

@kyriediculous Could we keep the order the same?

Contributor Author

LLM was added before segment anything and had been using port 8005 prior to this.

Collaborator
@rickstaa rickstaa left a comment

I’m out of the office and don’t have time for a deep dive, but I took a quick look to keep things moving. I’ve left a few comments, but overall, it looks good to merge 🎉.

Comment on lines 24 to 26
@router.post("/llm",
response_model=LLMResponse, responses=RESPONSES, operation_id="genLLM",)
@router.post("/llm/", response_model=LLMResponse, responses=RESPONSES, include_in_schema=False)
Member

Can you add metadata for the OpenAPI docs & SDK? Here's a suggestion to make it easy:

Suggested change

Current:

@router.post("/llm",
response_model=LLMResponse, responses=RESPONSES, operation_id="genLLM",)
@router.post("/llm/", response_model=LLMResponse, responses=RESPONSES, include_in_schema=False)

Suggested:

@router.post(
    "/llm",
    response_model=LLMResponse,
    responses=RESPONSES,
    operation_id="llm",
    description="Generate text using a language model.",
    summary="LLM",
    tags=["generate"],
    openapi_extra={"x-speakeasy-name-override": "llm"},
)
@router.post(
    "/llm/",
    response_model=LLMResponse,
    responses=RESPONSES,
    include_in_schema=False,
)

I changed the operation_id to just llm since Rick mentioned you didn't want the gen prefix here. As I mentioned on Discord, this operation_id doesn't matter for the SDK, so feel free to skip it in your API if you think that's more important than consistency.

@rickstaa rickstaa merged commit 6b00498 into livepeer:main Sep 30, 2024