fix(ai): update ai-video selection suspension #3033

ad-astra-video · 2024-04-28T14:22:56Z

What does this pull request do? Explain your changes. (required)
Draft of ai-video selection algo fix.

Suspension was not working because the penalty was always 3. This logic was carried over from transcoding, where the suspender always started at a refresh count of 0 because a new session manager was created for each stream. For AI, we reuse the session manager and the suspender, so the refresh count does not reset between requests. The fix is to take the current refresh count into account when calculating the penalty, so that it is 3 more than the suspender's current refresh count.
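
To illustrate the problem, here is a minimal sketch using a toy model rather than the real suspender. The names `toySuspender`, `suspendFixed`, `suspendRelative` and `refreshesToRelease` are hypothetical, and the sketch assumes each entry stores the refresh count at which the orchestrator is released. On a reused suspender that has already seen 10 refreshes, a fixed penalty of 3 expires on the very next refresh, while a penalty relative to the current count lasts 3 refreshes.

```go
package main

import "fmt"

// toySuspender is a simplified, hypothetical stand-in for the suspender.
// list maps an orchestrator to the refresh count at which it is released
// (assumed semantics for this sketch).
type toySuspender struct {
	list  map[string]int
	count int
}

func newToySuspender() *toySuspender {
	return &toySuspender{list: make(map[string]int)}
}

// suspendFixed models the old behavior: the release count is just the
// penalty, regardless of how many refreshes have already happened.
func (s *toySuspender) suspendFixed(orch string, penalty int) {
	s.list[orch] = penalty
}

// suspendRelative models the fix: the release count is the current
// refresh count plus the penalty.
func (s *toySuspender) suspendRelative(orch string, penalty int) {
	s.list[orch] = s.count + penalty
}

// signalRefresh increments the refresh count and drops every entry whose
// release count has been reached.
func (s *toySuspender) signalRefresh() {
	s.count++
	for orch, releaseAt := range s.list {
		if releaseAt <= s.count {
			delete(s.list, orch)
		}
	}
}

func (s *toySuspender) suspended(orch string) bool {
	_, ok := s.list[orch]
	return ok
}

// refreshesToRelease refreshes until orch is no longer suspended and
// returns how many refreshes that took.
func refreshesToRelease(s *toySuspender, orch string) int {
	n := 0
	for s.suspended(orch) {
		s.signalRefresh()
		n++
	}
	return n
}

func main() {
	s := newToySuspender()
	s.count = 10 // a reused suspender that has already seen 10 refreshes

	s.suspendFixed("orch-a", 3)
	fmt.Println(refreshesToRelease(s, "orch-a")) // 1

	s.suspendRelative("orch-b", 3)
	fmt.Println(refreshesToRelease(s, "orch-b")) // 3
}
```

With a fresh suspender (count 0), both variants behave the same, which is why the fixed penalty was fine for transcoding.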

There was also an issue where discoveryPoolSize was always 100, so with only a few orchestrators providing models, sessions were refreshed on every request. I added an initialPoolSize field that tracks the pool size at the last refresh and is used in the shouldRefreshSelector logic instead of 100. This stabilizes the suspender and allows more orchestrators to be tried with each Select call.
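
A rough sketch of why a pool size of 100 forces a refresh on every request. This is not the code in server/ai_session.go: the value 8 for `maxRefreshSessionsThreshold` and the round-up of the half pool size are assumptions made for illustration, and `refreshThreshold` and `shouldRefresh` are hypothetical helpers.

```go
package main

import "fmt"

// maxRefreshSessionsThreshold is an assumed value, used only for illustration.
const maxRefreshSessionsThreshold = 8

// refreshThreshold returns the smaller of maxRefreshSessionsThreshold and
// half the discovery pool size (rounded up, which is an assumption).
func refreshThreshold(discoveryPoolSize int) int {
	half := (discoveryPoolSize + 1) / 2
	if half < maxRefreshSessionsThreshold {
		return half
	}
	return maxRefreshSessionsThreshold
}

// shouldRefresh reports whether the number of sessions across the warm and
// cold pools has fallen below the refresh threshold.
func shouldRefresh(numSessions, discoveryPoolSize int) bool {
	return numSessions < refreshThreshold(discoveryPoolSize)
}

func main() {
	// Two orchestrators serve the model, but the pool size is reported as 100:
	// the threshold is 8, so 2 sessions always trigger a refresh.
	fmt.Println(shouldRefresh(2, 100)) // true

	// Using the pool size seen at the last refresh (2) gives a threshold of 1,
	// so 2 sessions do not trigger a refresh.
	fmt.Println(shouldRefresh(2, 2)) // false
}
```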

The last update moves the suspender's signalRefresh() call, which increments the suspender's refresh counter, into the Refresh function. This makes suspension more stable: every time we refresh sessions, the suspender's refresh count is incremented.
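
The intent can be sketched as follows, with toy types that are not the real AISessionSelector or suspender (`toySelector` and `refreshCounter` are hypothetical names): the refresh counter advances exactly once per session refresh, independent of how many Select calls happen in between.

```go
package main

import "fmt"

// refreshCounter is a hypothetical stand-in for the suspender's refresh count.
type refreshCounter struct{ count int }

func (c *refreshCounter) signalRefresh() { c.count++ }

// toySelector is a hypothetical stand-in for the session selector.
type toySelector struct {
	suspender *refreshCounter
	sessions  []string
}

// Refresh replaces the session list and signals the suspender once, so the
// refresh count tracks the number of refreshes.
func (sel *toySelector) Refresh(discovered []string) {
	sel.sessions = append([]string(nil), discovered...)
	sel.suspender.signalRefresh()
}

func main() {
	sel := &toySelector{suspender: &refreshCounter{}}
	sel.Refresh([]string{"orch-a", "orch-b"})
	sel.Refresh([]string{"orch-a"})
	fmt.Println(sel.suspender.count, len(sel.sessions)) // 2 1
}
```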

Happy to split some of these changes into separate PRs. The suspension fixes can be added separately, with no dependency on the ai-worker PR.

Specific updates (required)

Updates the selector to use the suspender's current refresh count when suspending.
Moves penalty to the AISessionSelector so it is easier to update and is available when calculating the suspension needed.
Releases all orchestrators (Os) when there are none in the warm and cold pools.
Adds an option to not use managed containers.

How did you test each of these updates (required)

I have been running these updates on my gateway. Tested 1-200 requests with 5-10 workers sending to the gateway. All requests completed with 1-2 orchestrators providing the Bytedance model.

Does this pull request close any open issues?

Checklist:

Read the contribution guide
make runs successfully
All tests in ./test.sh pass
README and other documentation updated
Pending changelog updated

victorges

Feel like I don't have context to officially approve this, but left some comments. Only nits tho, the implementation makes sense for the PR description.

victorges · 2024-09-18T19:46:49Z

server/ai_session.go

-	// as well
+	// as well.  Since AISessionManager re-uses the pools the suspension
+	// penalty needs to consider the current suspender count to set the penalty
+	last_count, ok := pool.suspender.list[sess.Transcoder()]


nit/lint: Vars in Go should be camelCase

victorges · 2024-09-18T19:50:20Z

server/ai_session.go

 		// Refresh if the # of sessions across warm and cold pools falls below the smaller of the maxRefreshSessionsThreshold and
 		// 1/2 the total # of orchs that can be queried during discovery


This comment seems out of place now, can you move it closer to L247?

victorges · 2024-09-18T19:51:02Z

server/ai_session.go

@@ -222,7 +233,17 @@ func (sel *AISessionSelector) Select(ctx context.Context) *AISession {
 	shouldRefreshSelector := func() bool {
 		// Refresh if the # of sessions across warm and cold pools falls below the smaller of the maxRefreshSessionsThreshold and
 		// 1/2 the total # of orchs that can be queried during discovery
-		discoveryPoolSize := sel.node.OrchestratorPool.Size()
+		discoveryPoolSize := int(math.Min(float64(sel.node.OrchestratorPool.Size()), float64(sel.initialPoolSize)))


Why do we need this Min now? Can the pool grow from its initial size?

leszko · 2024-09-23T11:22:40Z

server/ai_session.go

+	// penalty needs to consider the current suspender count to set the penalty
+	last_count, ok := pool.suspender.list[sess.Transcoder()]
+	if ok {
+		penalty = pool.suspender.count - last_count + pool.penalty


I'm a little lost with this suspension logic. So, I see that:

the pool.suspender.count is increased every time signalRefresh() is called

pool.penalty is always set to 3

last_count is always set to suspender.count + 3

So that logic would mean we don't select the suspended orchestrator until signalRefresh() has been called 3 times. Is that the idea of this suspension mechanism, that a given O cannot be selected during the next 3 session refreshes?

leszko · 2024-09-23T11:25:10Z

server/ai_session.go

+			// if there are no orchestrators in the pools
+			clog.Infof(ctx, "refreshing sessions, no orchestrators in pools")
+			for i := 0; i < sel.penalty; i++ {
+				sel.suspender.signalRefresh()


release all orchestrators

Shouldn't we then just remove them from suspender.list rather than calling signalRefresh()? My understanding is that if penalty = 3, we would need to call signalRefresh() 3 times in order to "release all orchestrators from suspension".
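
To make the two options concrete, here is a rough sketch with a toy suspender (hypothetical names, and assuming each list entry holds the refresh count at which the orchestrator is released). Both approaches empty the list. The difference is that signalling advances the refresh count by the penalty, while clearing leaves the count untouched.

```go
package main

import "fmt"

// toySuspender is a simplified, hypothetical model of the suspender.
// list maps an orchestrator to the refresh count at which it is released.
type toySuspender struct {
	list  map[string]int
	count int
}

// signalRefresh increments the refresh count and drops expired entries.
func (s *toySuspender) signalRefresh() {
	s.count++
	for orch, releaseAt := range s.list {
		if releaseAt <= s.count {
			delete(s.list, orch)
		}
	}
}

// releaseBySignalling releases everything by signalling penalty refreshes,
// as the loop in the diff does.
func releaseBySignalling(s *toySuspender, penalty int) {
	for i := 0; i < penalty; i++ {
		s.signalRefresh()
	}
}

// releaseByClearing releases everything by removing the entries directly.
func releaseByClearing(s *toySuspender) {
	for orch := range s.list {
		delete(s.list, orch)
	}
}

func main() {
	a := &toySuspender{list: map[string]int{"orch-a": 8, "orch-b": 7}, count: 5}
	releaseBySignalling(a, 3)
	fmt.Println(len(a.list), a.count) // 0 8

	b := &toySuspender{list: map[string]int{"orch-a": 8, "orch-b": 7}, count: 5}
	releaseByClearing(b)
	fmt.Println(len(b.list), b.count) // 0 5
}
```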

github-actions bot added the AI label (Issues and PR related to the AI-video branch.) Apr 28, 2024

This was referenced Apr 30, 2024

Allow Gateways to Specify the Selection retry Timeout #3037

Closed

Fix external containers livepeer/ai-worker#72

Closed

ad-astra-video force-pushed the ai-video-fix-selection-pr branch 2 times, most recently from 494b5d9 to 2504355 Compare May 7, 2024 11:24

ad-astra-video force-pushed the ai-video-fix-selection-pr branch from 2504355 to 959ae10 Compare July 20, 2024 10:54

ad-astra-video marked this pull request as ready for review July 22, 2024 12:19

ad-astra-video requested a review from rickstaa as a code owner July 22, 2024 12:19

ad-astra-video added 3 commits July 22, 2024 07:22

move signalRefresh() to Refresh

1e274e9

add log line for session selected

a725208

fix suspension

d94d62b

ad-astra-video force-pushed the ai-video-fix-selection-pr branch from 187dcd4 to d94d62b Compare July 22, 2024 12:23

ad-astra-video changed the title ~~Ai video fix selection pr~~ fix(ai): update ai-video selection suspension Aug 27, 2024

Merge branch 'ai-video' into ai-video-fix-selection-pr

b9e0fe2

rickstaa mentioned this pull request Aug 29, 2024

Call-01 Agenda - 2024-08-29 livepeer/project-management#73

Open

fix penalty def and comment

b965778

victorges reviewed Sep 18, 2024

leszko reviewed Sep 23, 2024

Merge branch 'ai-video' into ai-video-fix-selection-pr

6dae336

rickstaa force-pushed the ai-video branch from 4a66b22 to 2c50134 Compare October 21, 2024 09:13
