
Keepalive bugfixes and unify timers strategies between client and server #2760

Merged

Conversation

davidfiala
Contributor

Improve server-side keepalives and possibly resolve a bug where keepalive errors were not being treated as errors.

  • Bugfix: Ensure that if session.ping returns false we correctly treat the keepalive as failed and fail the connection
  • Bugfix: Ensure that if the interval between keepalives fires faster than the prior keepalive's timeout, we do not overwrite the reference to the prior timeout. Before this change we could, in theory, have prevented a valid keepalive timeout from clearing itself because its object reference was replaced with a newer timeout. This rewrite keeps every timeout as a local (vs. shared state per session). Even if a timeout outlives the session, we still guard against errors by checking that the parent interval is not falsy (see the sketch below). I reckon this could result in a short-term memory leak per session, bounded by keepaliveTimeoutMs. Even with that potential for a short reference hold, I think the implementation proposed here is more correct. One alternative would be to keep a list of pending timeouts, but that is complex for a rare situation that will self-resolve anyway once keepaliveTimeoutMs is reached. So I'm going with the cheaper, more common case.
  • Bugfix: keepalive intervals were previously being cleared with an incorrect clearTimeout. Not sure whether this was causing interval leaks in some Node.js versions or not (v20.13.1 seems to accept the mismatch without issue).
  • Rename variables for clarity, to prevent future bugs like swapping clearInterval and clearTimeout.
  • Trace on GOAWAY frames and ensure we drop the connection
  • Implementation is repeated in two places, per the warning in Serverside keepalive error detection and cleanups #2756 (comment)
  • This commit supersedes the prior PR, Serverside keepalive error detection and cleanups #2756, whose branch was out of date with master.

Aims to help resolve #2734
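
To make the second bullet above concrete, here is a minimal sketch of the "keep each ping timeout local" idea, assuming Node's http2 session API; the function and parameter names are my own placeholders, not the PR's actual code:

```ts
import type { ServerHttp2Session } from 'http2';

// Sketch only: each interval tick owns its own ping timeout, so a slow ping
// response can never overwrite or clear a newer ping's timeout handle.
function startKeepalive(
  session: ServerHttp2Session,
  keepaliveTimeMs: number,
  keepaliveTimeoutMs: number
): () => void {
  let keepaliveInterval: NodeJS.Timeout | null = setInterval(() => {
    // Local to this tick; a later tick gets its own handle.
    const pingTimeout = setTimeout(() => {
      // Guard: only act if keepalives have not already been torn down.
      if (keepaliveInterval) {
        session.close();
      }
    }, keepaliveTimeoutMs);
    const pingSent = session.ping(err => {
      clearTimeout(pingTimeout);
      if (err && keepaliveInterval) {
        session.close();
      }
    });
    if (!pingSent) {
      // ping() returning false is a failure, not a no-op.
      clearTimeout(pingTimeout);
      session.close();
    }
  }, keepaliveTimeMs);
  // The caller clears the interval when the session closes; any still-pending
  // pingTimeout self-resolves within keepaliveTimeoutMs (the short reference
  // hold described above).
  return () => {
    if (keepaliveInterval) {
      clearInterval(keepaliveInterval);
      keepaliveInterval = null;
    }
  };
}
```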

@davidfiala
Contributor Author

I realize that a second solution to my reference conundrum is to use a WeakRef to hold the session and channelz tracer in the timeout lambda, which might reduce the risk of the short-term memory leak. I'm not sure whether the rest of the code base is that cautious about these kinds of things. If you'd prefer that I track all outstanding timeouts or use a WeakRef, please let me know. Happy to adjust.
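
For reference, a hedged sketch of that WeakRef alternative (illustrative only; the tracer shape and names here are placeholders, not grpc-js internals):

```ts
import type { ServerHttp2Session } from 'http2';

// Hypothetical shape of a channelz-style trace sink, for illustration only.
interface TraceSink {
  addTrace(severity: 'CT_INFO' | 'CT_WARNING' | 'CT_ERROR', text: string): void;
}

function armPingTimeout(
  session: ServerHttp2Session,
  tracer: TraceSink,
  keepaliveTimeoutMs: number
): NodeJS.Timeout {
  // Hold only weak references, so a timeout that outlives the session does not
  // keep the session or tracer alive for up to keepaliveTimeoutMs.
  const weakSession = new WeakRef(session);
  const weakTracer = new WeakRef(tracer);
  return setTimeout(() => {
    const liveSession = weakSession.deref();
    if (!liveSession || liveSession.destroyed) {
      return; // Session already torn down (or collected); nothing to do.
    }
    weakTracer.deref()?.addTrace('CT_WARNING', 'Keepalive ping timed out');
    liveSession.close();
  }, keepaliveTimeoutMs);
}
```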

Member

@murgatroid99 left a comment


All per-line comments apply to both implementations.

channelzTrace.addTrace isn't a replacement for trace. channelzTrace.addTrace only traces to channelz, so that will only be seen by a channelz client. You still want to call trace to log to the console or whatever log file.

Also, keepalive logs may be verbose compared to other server logs, so I think it would be a good idea to have a separate tracer, as the client does. It's probably fine to use the same keepalive tracer name that the client uses.
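
A rough illustration of that split, with hypothetical helper names rather than grpc-js internals: keepalive events go through the regular tracer (which GRPC_TRACE/GRPC_VERBOSITY surface in the logs) and additionally to channelz, which only a channelz client sees.

```ts
// Stand-in for console/file tracing under a 'keepalive' tracer name; the real
// library routes this through its logging module instead of console.debug.
function keepaliveTrace(text: string): void {
  console.debug(`keepalive | ${text}`);
}

// Hypothetical channelz sink shape, for illustration only.
interface ChannelzTrace {
  addTrace(severity: 'CT_INFO' | 'CT_WARNING' | 'CT_ERROR', text: string): void;
}

function reportKeepaliveFailure(channelzTrace: ChannelzTrace, text: string): void {
  keepaliveTrace(text);                     // visible in normal logs
  channelzTrace.addTrace('CT_ERROR', text); // visible only to a channelz client
}
```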

Review comment threads on packages/grpc-js/src/server.ts (resolved).
@davidfiala
Contributor Author

Thanks for your prompt help and code review! I appreciate all the pointers, especially given that this is my first pass through this part of your code base.

  • Addressed your comments as best I could.
  • Unified the tracer name and messages between client and server. The server's channelz path now traces to both channelz and the keepalive tracer. I kept the original text for the channelz traces.
  • Unified server and client to both start the ping timer only after a ping response, so there is never more than one PING in flight now. Matches your request, I believe.
  • As part of unifying the code, I wanted to simplify the conditions to check, so I went with a single timer reference that is used for both types of timeouts (waiting to send a ping, and waiting for a ping to time out). I think this makes the code easier to reason about (see the sketch after this list).
  • Unified the server to match the client: if the ping time is negative, pinging is turned off.
  • Attempted to make the server ping code feel like the client's original code style: the server code mimics the client by using functions for things like starting a ping, starting a timer, and clearing a timer. I had to use lambdas (vs. class methods in the client) given that my context was a function rather than a class, but I rather like it this way now that I see the finished product.
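
A minimal sketch of that single-timer approach, assuming my own naming and Node's http2 session.ping as the underlying primitive; it illustrates the idea in the list above rather than reproducing the merged code:

```ts
import type { ServerHttp2Session } from 'http2';

function setupKeepalive(
  session: ServerHttp2Session,
  keepaliveTimeMs: number,
  keepaliveTimeoutMs: number
): void {
  // One shared handle serves both the "wait before the next ping" delay and
  // the "wait for the ping response" timeout.
  let keepaliveTimer: NodeJS.Timeout | null = null;

  const clearKeepaliveTimer = () => {
    if (keepaliveTimer) {
      clearTimeout(keepaliveTimer);
      keepaliveTimer = null;
    }
  };

  const sendPing = () => {
    // Re-arm the shared timer as the ping timeout.
    clearKeepaliveTimer();
    keepaliveTimer = setTimeout(() => session.close(), keepaliveTimeoutMs);
    const pingSent = session.ping(err => {
      clearKeepaliveTimer();
      if (err) {
        session.close();
      } else {
        maybeStartKeepalivePingTimer(); // next ping only after a successful response
      }
    });
    if (!pingSent) {
      clearKeepaliveTimer();
      session.close();
    }
  };

  const maybeStartKeepalivePingTimer = () => {
    // Matching the bullet above: a negative ping time disables pinging.
    if (keepaliveTimeMs < 0) {
      return;
    }
    // Re-arm the shared timer as the delay before the next ping.
    clearKeepaliveTimer();
    keepaliveTimer = setTimeout(sendPing, keepaliveTimeMs);
  };

  maybeStartKeepalivePingTimer();
  session.on('close', clearKeepaliveTimer);
}
```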

Question:

  • Would you like me to trace to channelz when a ping goes out? What about when a ping response is received successfully? At present, only bad outcomes go to channelz. IIRC, the original code didn't log good outcomes to channelz either.

Thanks again for the help.

@davidfiala changed the title from "Serverside keepalive error detection and cleanups and GOAWAY handling" to "Keepalive bugfixes and unify timers strategies between client and server" on May 29, 2024
Member

@murgatroid99 left a comment


I think the channelz traces are fine the way they are.

Review comment threads on packages/grpc-js/src/transport.ts (resolved).
@davidfiala
Contributor Author

Thank you for the review.

I've rewritten the server code so that it handles the timers more like the client code did.

I think the only dangling issue for us to align on is the private keepaliveDisabled = false; comment. I know it looks verbose, but unless I'm sure otherwise, I'd prefer to lean against any risk of races. Let me know what you think.

@davidfiala
Contributor Author

Per request, keepaliveDisabled state management is removed.

…ger. This is a necessary change to fit with having removed the keepaliveDisabled boolean. I manually inspected the test logs for both server.ts and transport.ts to verify that both types of keepalives are operating correctly.
@davidfiala
Contributor Author

I've been successfully running the PR branch for many days now on a staging environment with bidi streaming RPCs that sit idle for long periods.

Tweaking the keepalive values on both client and server to be either under or above 10 minutes, I've been able to reproduce the keepalive both (a) working to keep connections alive without errors, and (b) catching connections that were dropped by intermediate firewalls after a period of idle timeout

...specifically including GCP's 10 minute idle limit firewall ;)

I've used GRPC_TRACE=transport,keepalive GRPC_VERBOSITY=DEBUG to monitor and verify activity under differing configurations, to ensure that the keepalives were what was responsible for both (a) keeping connections alive and (b) detecting silently dropped TCP connections, per the above.
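
For reference, this is roughly the kind of configuration used for that testing, assuming @grpc/grpc-js channel options; the values are illustrative, not recommendations:

```ts
import * as grpc from '@grpc/grpc-js';

// Keepalive pings well under a 10-minute idle firewall, with a short timeout
// so silently dropped TCP connections are detected quickly.
const keepaliveOptions: grpc.ChannelOptions = {
  'grpc.keepalive_time_ms': 5 * 60 * 1000, // ping after 5 minutes of inactivity
  'grpc.keepalive_timeout_ms': 20 * 1000,  // fail if no ack within 20 seconds
};

// Server side: pass the options to the Server constructor.
const server = new grpc.Server(keepaliveOptions);

// Client side: pass similar options when constructing a channel or client.
// 'example.test:443' is a placeholder target.
const channel = new grpc.Channel('example.test:443', grpc.credentials.createSsl(), {
  ...keepaliveOptions,
  // Allow keepalive pings even when there are no active calls.
  'grpc.keepalive_permit_without_calls': 1,
});

// Run with GRPC_TRACE=transport,keepalive GRPC_VERBOSITY=DEBUG in the
// environment to observe the keepalive activity in the logs.
```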

@davidfiala
Contributor Author

Hi gRPC team: Gentle bump. Let me know if I can provide anything else.

@murgatroid99
Member

Sorry about the delay, I had to focus on GHSA-7v5v-9h63-cj86. I can get this out soon.

@murgatroid99 merged commit 5c0226d into grpc:@grpc/[email protected] on Jun 18, 2024
4 of 5 checks passed
@murgatroid99
Member

This is now out in version 1.10.10.
