-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
core: add panic mode for ManagedChannelImpl #4023
Conversation
Channel enters this mode whenever there is an uncaught throwable from its ChannelExecutor, which is where most channel internal states are mutated, such as load-balancing. In panic mode, the channel will always report TRANSIENT_FAILURE as its state, and will fail RPCs with an INTERNAL error code with the uncaught throwable as the cause, which is helpful for investigating bugs within gRPC and 3rd-party LoadBalancer implementations.
caught by ChannelExecutor instead of routing it back to LoadBalancer.
return delayedTransport; | ||
SubchannelPicker pickerCopy = subchannelPicker; | ||
PickResult pickResult = panicPickResult; | ||
if (pickResult == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not wild that we have yet another variable of state to consider here. I'd rather something closer to one of these:
- Set
subchannelPicker
to the dropping picker. And then I guess put a check for panic mode inexitIdleMode()
. - Actually go SHUTDOWN, but plumb in the failing Status into delayedTransport so RPCs would still fail with useful message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Option 1, how do you check for panic mode without having a state dedicated to panic mode?
For Option 2, do you mean we will call shutdown()
(or more likely shutdownNow()
)? The shutdown mechanism may be a little too sophisticated for entering panic mode. What if the bug is in the shutdown code? IMHO, the panic mode should be implemented in a way as simple as possible, so that it is the last functionality to fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For option 1, we could have a boolean panicMode
that is only needed by exitIdleMode()
(or similar). What I don't want is the variable to add another dimension of the various states that we have to consider frequently.
For option 2, at the very least I think we should:
- Set the
shutdown
AtomicBoolean to true - Transition
channelStateManager
to SHUTDOWN - Shut down delayedTransport
Shutting down delayedTransport is the most-risky of those the throw an exception, but it does the important stuff first (set shutdownStatus). So as long as we prevent loops (say, by doing nothing if shutdown.get() == true
), then that seems fine. And looking at shutdown()
, it basically does exactly that. The only extra thing it does is cancelIdleTimer()
. So that seems quite nice.
(Retries do add more complexity here, but I think that's actually an argument to KISS and just shutdown(), since shutdown() already needs to work properly with retries.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went with Option 1.
I am not going with Option 2 because it makes assumption of what delayedTransport shutdown() would do, which is not good separation of concern.
@ejona86 PTAL
Set a failing subchannelPicker for panic mode instead of creating a separate picker field for panic mode. Use a boolean "panicMode" to indicate this mode. Refactored out shutdownNameResolverAndLoadBalancer(), called from three code paths: enterIdleMode(), delayedTransport.transportTerminated(), and panic(). Refactored out updateSubchannelPicker(), called from both updateBalancingState() and panic(). Added tests to cover: 1. Call exitIdleMode() in panic mode 2. Call shutdown() in panic mode 3. Calls buffered in delayedTransport are failed when channel panics
when I update grpc version from 1.6.1 to 1.17.1, i encountered this problem. |
@jylsoccer, the channel should never enter panic mode. If it does, it is a bug and we want to fix the problem. We added panic mode to 1) reduce the "blast radius" when such a bug occurs and 2) clearly communicate the exception to the user so they can file a useful bug report. So please file an issue with the exception that you are seeing that enters the channel into panic mode. |
Channel enters this mode whenever there is an uncaught throwable from
its ChannelExecutor, which is where most channel internal states are
mutated, such as load-balancing.
In panic mode, the channel will always report TRANSIENT_FAILURE as its
state, and will fail new RPCs with an INTERNAL error code with the
uncaught throwable as the cause, which is helpful for investigating
bugs within gRPC and 3rd-party LoadBalancer implementations.
If LoadBalancer.handleResolvedAddressGroups() throws, channel will panic
instead of routing the exception back to
LoadBalancer.handleNameResolutionError()
Resolves #3293