Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Health Check Callbacks #142

Closed
markmandel opened this issue Mar 15, 2018 · 7 comments
Closed

Health Check Callbacks #142

markmandel opened this issue Mar 15, 2018 · 7 comments
Labels
good first issue These are great first issues. If you are looking for a place to start, start here! help wanted We would love help on these issues. Please come help us! kind/design Proposal discussing new features / fixes and how they should be implemented kind/feature New features for Agones wontfix Sorry, but we're not going to do that.

Comments

@markmandel
Copy link
Member

Impetus

Right now the Health check Ping is very basic - it requires the sdk to do a Health() call within the timeframe that is configured via the GameServer config.

The issue with this is, that means that the game server code needs to know what the configuration is at the GameServer configuration level. If the GameServer configuration ever changes, this means the game server binary and image will need to be recompiled and rebuilt. This is less than ideal.

Design

I would suggest the following solution:

The GameServer container that run the gameserver binary is passed an environment variable (AGONES_HEALTH_PERIOD_SECONDS) that is the configired periodSeconds from GameServer > health.

The SDK (for all languages) would have custom code that would parse the above environment variable, and run a callback every AGONES_HEALTH_PERIOD_SECONDS / 2. If the callback returns true, then the SDK will send the manual ping.

This takes away some of the control around how to run the Healh checking (i.e how to thread, etc), but would be far more convenient for users - and there is no reason we couldn't leave the more manual method if people so desired.

An example API in Go

s, _ := sdk.NewSDK()
// add a health check
s.RunHealth(func() {
  // check returns a boolean if passing or not
  return true
});
agones::SDK *sdk = new agones::SDK();
bool connected = sdk->Connect();

/// no idea what can you even lambda in C++? Is that even canonical C++ Help!

Research

Questions

  • Not sure how to translate this to C++. Help wanted please.
  • Should the API be a single callback, or can you register multiple callbacks. Something like sdk.AddHealthCheck(func() bool). Thinking of leaving at single for now - simple to start with, and see how things go.
@markmandel markmandel added kind/feature New features for Agones kind/design Proposal discussing new features / fixes and how they should be implemented labels Mar 15, 2018
@markmandel
Copy link
Member Author

/cc @sylvainduchesne - I figure you would have some thoughts on this.

This was the design I thought would combat some of your very valid concerns.

I also expect your C++ is way better than mine, so would love your thoughts on how to do a callback in C++ (or if that even make sense)

@sylvainduchesne
Copy link

That seems reasonable.

I pushed this tentative code for the cpp
https://github.com/GoogleCloudPlatform/agones/compare/master...sylvainduchesne:cpphealthcallback?expand=1
on my fork.
There are a few open questions about the health check.
Should there be a bool (healthy|unhealthy) on the check? Because I suppose if there are internal metrics the GS can check, the only way it has to say it is unhealthy (but responsive) is to ignore the health check. It wouldn't be possible to do so either with a callback (well with my current implementation) since the sdk takes care of calling the actual rpc.
Also, I wasn't sure of the actual interval time to respect. Should it be every x seconds, or every x seconds after the last health check occured.
Can the period be modified at runtime?

@sylvainduchesne
Copy link

I was thinking about this over the weekend, and perhaps I was over thinking the problem.
Slight modification to how the callback result is treated:
https://github.com/GoogleCloudPlatform/agones/compare/master...sylvainduchesne:cpphealthcallbackv2?expand=1

I think there is perhaps still some clarification needed around what it means to be unhealthy, as the documentation doesn't really make mention of it (what are the consequences?)

@markmandel
Copy link
Member Author

Just at GDC, so haven't had too much time to look in depth (yet!)
At first glace the overall direction makes sense to me.

Some quick answers/feedback to your questions:

Re: Health checking. The base level for a GameServer, if it become Unhealthy, it's currently up to a managing layer to do something about that (should likely document this better).
You can see an example of this in the design for Fleets. Maybe later if we find health checking is the same across all managing layers, we could potentially pull this down to the GameServer level , but figured this was a good first stab until we see what higher level constructs we end up with.

Health check period won't be modifiable at runtime and we can restrict that at the GameServer level. (Pretty sure it's locked down at a Pod level for sure).

Minor thing that I saw at first pass - I'd suggest taking the health check out of the SDK constructor and move it into some kind of SDK::RunHealth(&check) (not wedded to the name). Since Health checking is optional, we shouldn't force a check on people, and let's also give users some flexibility on where / when they want to start the health check vs starting an instance of the healthcheck - or if they just want to use the really basic SDK::Health() ping if they so desire. WDYT?

I'll definitely go deeper - but overall I think this is going to work out well - just need to work out some of the finer details.

@kwiesmueller
Copy link

Why isn't the Kubernetes native health and liveness feature used for this?
Wouldn't using this make the entire process much more k8s "native"?
For the check period it would then be easy to just define this there and I probably would favor pull over push for this anyways.
Or where did the current design come from?

@sylvainduchesne
Copy link

Hope GDC went well.
I did the code kind of quickly, so nothing set in stone there (eg: health in constructor was simple a convenience).
Oh, I wasn't aware the actual ping was optional. Since I wasn't too sure on the 'what happens when...' part, I figured it wasn't.
In that case, I would agree there could probably be some more flexibility by adding a start/stop health check. If they choose to implement a check, then perhaps that callback should return a bool as suggested by my branch? Otherwise, use the basic ping (which I think is what I coded as well).

@markmandel markmandel added good first issue These are great first issues. If you are looking for a place to start, start here! help wanted We would love help on these issues. Please come help us! labels Sep 25, 2019
@markmandel
Copy link
Member Author

Reviewing old issues, and doing some cleanup.

Since the SDK can now retrieve the GameServer configuration, and adjust health pings if needed I'm going to close this issue.

Please feel free to reopen, or file a new ticket if this is inaccurate.

@markmandel markmandel added the wontfix Sorry, but we're not going to do that. label Jan 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue These are great first issues. If you are looking for a place to start, start here! help wanted We would love help on these issues. Please come help us! kind/design Proposal discussing new features / fixes and how they should be implemented kind/feature New features for Agones wontfix Sorry, but we're not going to do that.
Projects
None yet
Development

No branches or pull requests

3 participants