Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Batched Packed and Distributed Allocations #804

Merged
merged 5 commits into from
Jun 12, 2019

Conversation

markmandel
Copy link
Member

This implements a batching algorithm for Packed and Distributed GameServerAllocations (GSA).

This ensures that we can still get high throughout on GSA's, while retaining a tight packing of Allocated GameServers on each node with the Packed strategy.

The Distributed strategy is now implemented with a randomised Allocation process, so ensure distributed load under this strategy.

Closes #783

@markmandel markmandel added kind/feature New features for Agones area/performance Anything to do with Agones being slow, or making it go faster. labels Jun 4, 2019
list = nil
requestCount = 0
// slow down cpu churn, and allow items to batch
time.Sleep(batchWaitTime)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/cc @jkowalski @roberthbailey @ilkercelikyilmaz

Curious to see your take on the Sleep approach above. I wanted to avoid nested select statements as much as possible, to keep the code as simple/understandable as I could (since it's lots of async processing) - and this seemed like the best way forward.

Would love your feedback!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When queue depth is less than 100 (maxBatchQueueSize) and we don't have any more requests, how long will the allocations in the queue will wait?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The longest they will wait <= batchWaitTime (500ms) because we might be in the sleep code) - but that will only be for the first request. After that, each request in the batch (c.pendingRequests) is processed immediately.

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 739e91f8-12ac-4a20-875a-b72964b79b07

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@markmandel
Copy link
Member Author

@ilkercelikyilmaz - likely unsurprisingly, would love it if you could run your long running perf tests against this, see how it holds up 😄 🤞

@markmandel
Copy link
Member Author

markmandel commented Jun 4, 2019

Also realised - this fixes a bug in the current GameServerAllocation that has yet to rear it's head. The currently incarnation searches for GameServers in any namespace (:fearful:) - this implementation will only search for GameServers in the namespace the GameServerAllocation is created in - as it really should be.

@pooneh-m we should talk about this, because this new logic looks to break the current installation of the allocator service (since it's installed in agones-system and therefore creates the GameServerAllocation there -- which worked when GSA wasn't namespace aware), which I don't think was what was intended? Wanted to talk to you about your intent there before submitting any PRs to fix.

@ilkercelikyilmaz
Copy link
Contributor

@ilkercelikyilmaz - likely unsurprisingly, would love it if you could run your long running perf tests against this, see how it holds up

I will run the tests. Apparently I ran only Distrubuted before. I will run both of them.

func (c *Controller) runLocalAllocations(updateWorkerCount int) {
updateQueue := make(chan response)
// setup workers for allocation updates
c.allocationUpdateWorkers(updateQueue, updateWorkerCount)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be more readable to create updateQueue, launch workers and pass the queue to this function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be clear - do you mean, pass the queue to runLocalAllocations(...) ?

Or do you mean that c.allocationUpdateWorkers should return the upateQueue (which I also quite like - since they are quite coupled together, an we could return an insert only channel from the function - which is tighter encapsulation)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated this to return the channel, I think this is better, and feel more like a canonical go construct to me.

key, _ := cache.MetaNamespaceKeyFunc(gs)
if ok := c.readyGameServers.Delete(key); !ok {
// this seems unlikely, but lets handle it just in case
req.response <- response{request: req, gs: nil, err: ErrConflictInGameServerSelection}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

who's going to retry this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gets retried here - which is the original retry logic that @ilkercelikyilmaz implemented.

var list []*stablev1alpha1.GameServer
requestCount := 0

for {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this deserves a lengthy comment explaining what's going on

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! Hope it makes sense 👍

list = c.listSortedReadyGameServers()
}

gs, index, err := findGameServerForAllocation(req.gsa, list)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand where you're doing the batching.

If I'm reading the code correctly, it's marshaling all requests through a single thread, which is equivalent to a global mutex.

With batching you would have the following code structure:

for {
   batch := getNextBatch();
  // read from c.pendingRequests until it blocks, but no less than 1 item
  allocations := processBatch(requests)
  updateAllGameServers(allocations)
  sendResponses()
}

You need to exploit the fact that you're allocating N items and not one by one, to counter the effects of serialization.

Back to my "bus" analogy - once all requests are on the bus, you see what game servers are available (once), sort them (once) and as people leave the bus you give each exiting passenger a game server from the list and mark it as Allocated.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So c.pendingRequests is our bus - when we see we have items on that we start batching.

once all requests are on the bus, you see what game servers are available (once),
sort them (once)

Yep, we do that. If list is null, we grab the list of game servers and sort them. We then keep this list for a while (I figured 100 allocations was a good size, or until no more are coming in for the batch/c.pendingRequests)

Then we loop through the requests in the batch/bus (through c.pendingRequests), which is our bus, and match gameservers to the GSA's selectors.

Then we send them all to the updateQueue where 100 workers update each gameserver separately to Allocated, and send the response back to the original request concurrently.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we came to conclusion that this was not batching exactly 😄 but you had a wonderful analogy (if i remember it correct).

It's a lemonade stand in which the seller makes 100 cups of lemonade up front, and waits for customers to come through, so they can deliver it to customers more quickly once they place their order.

It's basically the fast food for gameserver allocation 😆 🍔 (or at least how it was before they started "making it fresh").

In future PR's we could look at groupinng the same GSA's together from the queue, and processing them in one go (rather than looping for each one), but looks like performance is "good enough" for now.

This implements a batching algorithm for Packed and Distributed
GameServerAllocations (GSA).

This ensures that we can still get high throughout on GSA's, while
retaining a tight packing of Allocated GameServers on each node with
the Packed strategy.

The Distributed strategy is now implemented with a randomised Allocation
process, so ensure distributed load under this strategy.

Closes googleforgames#783
@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: bf8f07cf-6c2e-4ab8-87d1-bfc784724212

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 6c4a965e-a3ea-4374-b6fe-659b05bbca4f

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@ilkercelikyilmaz
Copy link
Contributor

ilkercelikyilmaz commented Jun 5, 2019

@ilkercelikyilmaz - likely unsurprisingly, would love it if you could run your long running perf tests against this, see how it holds up

I will run the tests. Apparently I ran only Distrubuted before. I will run both of them.
I ran the test for Distributed and it looks good.
image

One interesting observation is that the allocation throughput is no consistent. It varies between 120-210 allocation/sec (see below). This might caused by GKE auto scaling or load test client being impacted by some other processes running on my machine.
image

I will run the packed but as I mentioned I don't have a baseline.

@markmandel
Copy link
Member Author

Can we compare it to what the current implementation is doing - I think we defined that performance as "good enough".?

@pooneh-m
Copy link
Contributor

pooneh-m commented Jun 5, 2019

@pooneh-m we should talk about this, because this new logic looks to break the current installation of the allocator service (since it's installed in agones-system and therefore creates the GameServerAllocation there -- which worked when GSA wasn't namespace aware), which I don't think was what was intended? Wanted to talk to you about your intent there before submitting any PRs to fix.

Yes we should fix this in the allocator to remove the logic that gets the namespace from the environment.
I created an issue: #809

@ilkercelikyilmaz
Copy link
Contributor

Can we compare it to what the current implementation is doing - I think we defined that performance as "good enough".?

Here are the Packed results(see below). Somehow the the servers do not start as fast so we can't get back to 7000 servers (the number of replicas) between the tests.
image

The allocations throughput is not bad (not as good as distributed as expected). They vary between 80-160.
image

@markmandel
Copy link
Member Author

I expect scheduling is slowing down pod creating in packed. Once we can move to 1.12, this should improve considerably (there is a massive scheduler improvement in 1.12).

@markmandel
Copy link
Member Author

That seems like decent performance - and actually higher than what I was seeing through the locust tests (I was getting ~67 allocations per second - but I only ran the one test to 1400 allocations, yours is likely far more accurate and comprehensive due to its long running nature).

At the low end (packed, 80 Allocations a second), that 4800 gameservers allocated per minute. Seems like a good starting number - and we have nice and tightly packed allocations now as well!

@ilkercelikyilmaz
Copy link
Contributor

Can we compare it to what the current implementation is doing - I think we defined that performance as "good enough".?

Here are the Packed results(see below). Somehow the the servers do not start as fast so we can't get back to 7000 servers (the number of replicas) between the tests.
image

The allocations throughput is not bad (not as good as distributed as expected). They vary between 80-160.
image

I ran the Packed with the latest version from master and results seems comparable.
image

The allocation throughput is also between 80-150 per sec.
image

@markmandel
Copy link
Member Author

markmandel commented Jun 5, 2019

I ran the Packed with the latest version from master and results seems comparable.

Awesome. That is super promising.

Will need to resolve the issue #809 to be resolved so the e2e tests can pass.

I think I can also make this loop construct far simpler as well (do a copy and a shuffle, rather than this index shenanigans) - I need to double check that I don't accidentally shuffle the cached list. I should be able to do comparative locust tests to see if there is any performance difference.

Nope lied. There was method to my original madness. Ignore the previous comment.

@ilkercelikyilmaz
Copy link
Contributor

I ran the Packed with the latest version from master and results seems comparable.

Awesome. That is super promising.

Will need to resolve the issue #809 to be resolved so the e2e tests can pass.

I think I can also make this loop construct far simpler as well (do a copy and a shuffle, rather than this index shenanigans) - I need to double check that I don't accidentally shuffle the cached list. I should be able to do comparative locust tests to see if there is any performance difference.

Nope lied. There was method to my original madness. Ignore the previous comment.

I agree that performance of this change is still acceptable (if not better).

@markmandel
Copy link
Member Author

I agree that performance of this change is still acceptable (if not better).

Fantastic. I'm waiting on #812 and #814 to go through, and then the tests on this should pass.

If anyone has commentary on the code, would definitely appreciate it 👍

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: 935e1d8c-4c55-4d59-8389-d142e4d93bb3

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/804/head:pr_804 && git checkout pr_804
  • helm install install/helm/agones --namespace agones-system --name agones --set agones.image.tag=0.11.0-60639f4

@ilkercelikyilmaz
Copy link
Contributor

I agree that performance of this change is still acceptable (if not better).

Fantastic. I'm waiting on #812 and #814 to go through, and then the tests on this should pass.

If anyone has commentary on the code, would definitely appreciate it 👍

This looks good to me. I can approve it but since Jarek has some comments, I don't want to approve it until he is OK with your changes/responses.
@jkowalski , What do you think?

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: 93cf8fef-c4cb-45a4-8860-e722b4d87f90

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/804/head:pr_804 && git checkout pr_804
  • helm install install/helm/agones --namespace agones-system --name agones --set agones.image.tag=0.11.0-bfb7b53

@markmandel markmandel merged commit 8e40cd2 into googleforgames:master Jun 12, 2019
@markmandel markmandel deleted the feature/batch-allocation branch June 12, 2019 17:59
@markmandel markmandel added this to the 0.11.0 milestone Jun 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/performance Anything to do with Agones being slow, or making it go faster. kind/feature New features for Agones
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Packed Allocation is very not packed
5 participants