Remote AI Worker (Livepool version) #3106
Closed
kyriediculous
wants to merge
47
commits into
livepeer:ai-video
from
Livepool-io:remote-ai-worker-rebased
kyriediculous
changed the title
Merge check
Remote AI Worker (Livepool version) merge check
Jul 27, 2024
kyriediculous
changed the title
Remote AI Worker (Livepool version) merge check
[DRAFT] Remote AI Worker (Livepool version) merge check
Jul 28, 2024
kyriediculous
force-pushed
the
remote-ai-worker-rebased
branch
from
July 30, 2024 03:14
4258462
to
0410787
Compare
kyriediculous
changed the title
[DRAFT] Remote AI Worker (Livepool version) merge check
Remote AI Worker (Livepool version)
Jul 30, 2024
rickstaa
force-pushed
the
ai-video-rebase
branch
2 times, most recently
from
August 2, 2024 10:09
4d54872
to
8e654d7
Compare
Closing this one since an alternative Remote Worker implementation was merged: #3168
What does this pull request do? Explain your changes. (required)
This PR adds a Remote AI worker to the go-livepeer repo, integrated on the transcoder. Its control flow and networking layer are akin to that of the remote O+T setup.
It allows an Orchestrator to run in standalone mode and dispatch AI jobs to connected standalone transcoders in which the AI worker is integrated. Remote AI workers register with one or more orchestrators over gRPC and receive a one-directional stream in return, on which AI job notifications arrive. The transcoder runs an AI worker and runner to perform the tasks. Once a job is completed, the remote AI worker POSTs the results back to the orchestrator.
Remote Orchestrator+AI Worker Specification
Abstract
This document outlines the specification for integrating a remote AI worker setup with an Orchestrator, akin to the existing remote Orchestrator + Transcoder configuration. The goal is to enable dynamic orchestrator capabilities while maintaining flexibility for other potential use cases, such as public pools.
Architecture
The architecture mirrors that of the remote transcoder setup for an orchestrator. Instead of hosting AI workers locally, the orchestrator will function as a message broker and session manager, delegating tasks to remote workers.
Design Goals
O<>W Wire Protocol
The communication between the orchestrator (O) and remote AI workers, integrated in the transcoder node (W), will use gRPC to register workers and notify them of new AI jobs over an open stream. The orchestrator will also run an HTTP server to receive results from remote workers. This approach aligns with the existing remote transcoding setup, facilitating ease of understanding and maintenance. Using the same networking stack eliminates the need for additional servers or microservices.
A major benefit of this design is that the remote AI worker doesn't have to run a server; it acts purely as a gRPC and HTTP client. The host therefore doesn't need to set up port forwarding, which matters a great deal for public pools.
gRPC
RegisterAIWorker
The existing Transcoder service will be extended with a new RPC method, RegisterAIWorker, to register remote AI workers, specifying their capabilities, and maintaining an open connection to receive AI tasks.
NotifyAIJob
Remote AI workers will receive AI tasks through the NotifyAIJob stream. Upon completion, results will be sent back to the orchestrator via HTTP.
HTTP
/aiResults
A new route on the orchestrator's HTTP server to receive AI task results from remote AI workers.
The route expects the following headers:
Authorization: "Livepeer-Transcoder-1.0"
Credentials: orchestrator's transcoder secret
The POST request body should be JSON that unmarshals into a RemoteAIWorkerResult.
(... Data into string, not the wrapped interface error type)
Orchestrator
RemoteAIWorkerManager
The RemoteAIWorkerManager is a newly implemented class that adheres to the AI interface. It is responsible for managing connections with remote workers, tracking ongoing tasks, and routing tasks to the appropriate remote workers.
Connection Management
Remote workers establish a client->server connection by sending a RegisterAIWorker request over gRPC. Active workers are kept in the liveWorkers mapping and removed upon connection loss.
Capability Management
On startup, the orchestrator specifies pipelines and models it wishes to have supported by its remote AI workers, along with a price if the set-up is on-chain.
Capability management works similarly to capacity tracking for remote transcoding. When a remote AI worker registers, the orchestrator increments its capacity by 1 for each AI capability and constraint that the worker supports.
An orchestrator has an AI capability if its capacity for that capability is greater than 0.
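The counting rule above can be sketched as a small per-capability counter. This is a simplified illustration, not the PR's implementation: capKey, capacityTracker, and the method names are hypothetical, and decrementing when a worker disconnects is an assumption based on the connection-loss behaviour described earlier.

```go
package main

import (
	"fmt"
	"sync"
)

// capKey identifies one pipeline/model pair (hypothetical type).
type capKey struct {
	Pipeline string
	ModelID  string
}

// capacityTracker counts how many connected workers support each capability.
type capacityTracker struct {
	mu     sync.Mutex
	counts map[capKey]int
}

func newCapacityTracker() *capacityTracker {
	return &capacityTracker{counts: make(map[capKey]int)}
}

// addWorker increments the capacity of every capability the worker supports.
func (c *capacityTracker) addWorker(caps []capKey) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, k := range caps {
		c.counts[k]++
	}
}

// removeWorker decrements capacity when a worker disconnects (assumption).
func (c *capacityTracker) removeWorker(caps []capKey) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, k := range caps {
		if c.counts[k] > 0 {
			c.counts[k]--
		}
		if c.counts[k] == 0 {
			delete(c.counts, k)
		}
	}
}

// has reports whether the orchestrator currently has the capability,
// which is the case when its capacity is greater than zero.
func (c *capacityTracker) has(k capKey) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[k] > 0
}

func main() {
	t := newCapacityTracker()
	k := capKey{Pipeline: "text-to-image", ModelID: "example/model"}
	t.addWorker([]capKey{k})
	fmt.Println("after register:", t.has(k))
	t.removeWorker([]capKey{k})
	fmt.Println("after disconnect:", t.has(k))
}
```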
Task Management
For each AI task, a channel is created, identified by an incrementing TaskID (similar to taskIDs and channels for remote transcoding). When the orchestrator receives AI task results via HTTP, it uses the TaskID to look up the corresponding channel and forwards the result to the original caller that made the request.
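The TaskID-to-channel bookkeeping described above might look roughly like the following. All names here (taskRegistry, taskResult, add, deliver, remove) are hypothetical, and the one-slot buffered channel is a choice made for this sketch so that delivering a result never blocks the HTTP handler.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// taskResult is a hypothetical stand-in for the result posted by a worker.
type taskResult struct {
	TaskID int64
	Data   []byte
	Err    string
}

// taskRegistry maps incrementing task IDs to result channels.
type taskRegistry struct {
	mu     sync.Mutex
	nextID int64
	chans  map[int64]chan taskResult
}

func newTaskRegistry() *taskRegistry {
	return &taskRegistry{chans: make(map[int64]chan taskResult)}
}

// add allocates the next task ID and a channel for its result.
func (r *taskRegistry) add() (int64, chan taskResult) {
	r.mu.Lock()
	defer r.mu.Unlock()
	id := r.nextID
	r.nextID++
	ch := make(chan taskResult, 1)
	r.chans[id] = ch
	return id, ch
}

// remove forgets a task once the caller is done waiting.
func (r *taskRegistry) remove(id int64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.chans, id)
}

// deliver forwards a result to the waiting caller. It returns false if the
// task is unknown or a result for it has already been delivered.
func (r *taskRegistry) deliver(res taskResult) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	ch, ok := r.chans[res.TaskID]
	if !ok {
		return false
	}
	select {
	case ch <- res:
		return true
	default:
		return false
	}
}

func main() {
	reg := newTaskRegistry()
	id, ch := reg.add()
	defer reg.remove(id)

	// In the real flow this call would come from the HTTP results handler.
	go reg.deliver(taskResult{TaskID: id, Data: []byte("image bytes")})

	select {
	case res := <-ch:
		fmt.Printf("task %d returned %d bytes\n", res.TaskID, len(res.Data))
	case <-time.After(time.Second):
		fmt.Println("timed out waiting for task", id)
	}
}
```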
Selection
Selection is performed on a round-robin basis, maintaining a queue. Workers are filtered based on the requested pipeline and model. The first worker in the filtered array is selected for the job. If it fails or is at capacity, it is moved to the back of the queue, and the task is retried with the next worker.
More elaborate selection strategies, for example ones based on scoring, could be implemented later. No such strategy currently exists for a normal remote transcoding setup either, since in a private setup poorly performing transcoders can simply be taken out of rotation.
Retries
When a remote AI worker fails a job, we rotate to the next worker in the remoteWorkers array for the given pipeline and models. This process is repeated until the context is cancelled, all workers have been tried, or a successful job has been returned to the Orchestrator.
RemoteAIWorker
RemoteAIWorker is a class used to manage remote worker state on the RemoteAIWorkerManager. It mainly contains information on the remote worker's host address and capabilities.
Remote AI Worker (transcoder)
The remote AI worker will be responsible for spinning up containers to perform AI tasks through the AI worker and runner.
It connects to the orchestrator via a gRPC client, maintaining an open stream for job requests. The worker translates NotifyAIJob messages into a format suitable for the AI worker pipeline and sends results back to the orchestrator as RemoteAIWorkerResult over HTTP.
Local setup
Orchestrator
Remote Transcoder / AI node
Gateway
Example request
How did you test each of these updates (required)
Does this pull request close any open issues?
Checklist:
make runs successfully
./test.sh passes