Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core: add panic mode for ManagedChannelImpl #4023

Merged
merged 6 commits into from
Feb 22, 2018

Conversation

zhangkun83
Copy link
Contributor

@zhangkun83 zhangkun83 commented Jan 31, 2018

Channel enters this mode whenever there is an uncaught throwable from
its ChannelExecutor, which is where most channel internal states are
mutated, such as load-balancing.

In panic mode, the channel will always report TRANSIENT_FAILURE as its
state, and will fail new RPCs with an INTERNAL error code with the
uncaught throwable as the cause, which is helpful for investigating
bugs within gRPC and 3rd-party LoadBalancer implementations.

If LoadBalancer.handleResolvedAddressGroups() throws, channel will panic
instead of routing the exception back to
LoadBalancer.handleNameResolutionError()

Resolves #3293

Channel enters this mode whenever there is an uncaught throwable from
its ChannelExecutor, which is where most channel internal states are
mutated, such as load-balancing.

In panic mode, the channel will always report TRANSIENT_FAILURE as its
state, and will fail RPCs with an INTERNAL error code with the
uncaught throwable as the cause, which is helpful for investigating
bugs within gRPC and 3rd-party LoadBalancer implementations.
return delayedTransport;
SubchannelPicker pickerCopy = subchannelPicker;
PickResult pickResult = panicPickResult;
if (pickResult == null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not wild that we have yet another variable of state to consider here. I'd rather something closer to one of these:

  1. Set subchannelPicker to the dropping picker. And then I guess put a check for panic mode in exitIdleMode().
  2. Actually go SHUTDOWN, but plumb in the failing Status into delayedTransport so RPCs would still fail with useful message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Option 1, how do you check for panic mode without having a state dedicated to panic mode?

For Option 2, do you mean we will call shutdown() (or more likely shutdownNow())? The shutdown mechanism may be a little too sophisticated for entering panic mode. What if the bug is in the shutdown code? IMHO, the panic mode should be implemented in a way as simple as possible, so that it is the last functionality to fail.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For option 1, we could have a boolean panicMode that is only needed by exitIdleMode() (or similar). What I don't want is the variable to add another dimension of the various states that we have to consider frequently.

For option 2, at the very least I think we should:

  • Set the shutdown AtomicBoolean to true
  • Transition channelStateManager to SHUTDOWN
  • Shut down delayedTransport

Shutting down delayedTransport is the most-risky of those the throw an exception, but it does the important stuff first (set shutdownStatus). So as long as we prevent loops (say, by doing nothing if shutdown.get() == true), then that seems fine. And looking at shutdown(), it basically does exactly that. The only extra thing it does is cancelIdleTimer(). So that seems quite nice.

(Retries do add more complexity here, but I think that's actually an argument to KISS and just shutdown(), since shutdown() already needs to work properly with retries.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with Option 1.

I am not going with Option 2 because it makes assumption of what delayedTransport shutdown() would do, which is not good separation of concern.

@ejona86 PTAL

Set a failing subchannelPicker for panic mode instead of creating a
separate picker field for panic mode.  Use a boolean "panicMode" to
indicate this mode.

Refactored out shutdownNameResolverAndLoadBalancer(), called from
three code paths: enterIdleMode(),
delayedTransport.transportTerminated(), and panic().

Refactored out updateSubchannelPicker(), called from both
updateBalancingState() and panic().

Added tests to cover:
1. Call exitIdleMode() in panic mode
2. Call shutdown() in panic mode
3. Calls buffered in delayedTransport are failed when channel panics
@zhangkun83 zhangkun83 merged commit 46c1133 into grpc:master Feb 22, 2018
@zhangkun83 zhangkun83 deleted the panic_mode branch February 22, 2018 17:28
@jylsoccer
Copy link

when I update grpc version from 1.6.1 to 1.17.1, i encountered this problem.
I cache a managedChannel object, when the channel entered panic mode, I can not use it anymore.
how can I tell if the channel is in panic mode? because when it happens, i have to create a new channel.

@ejona86
Copy link
Member

ejona86 commented Jan 14, 2019

@jylsoccer, the channel should never enter panic mode. If it does, it is a bug and we want to fix the problem. We added panic mode to 1) reduce the "blast radius" when such a bug occurs and 2) clearly communicate the exception to the user so they can file a useful bug report. So please file an issue with the exception that you are seeing that enters the channel into panic mode.

@lock lock bot locked as resolved and limited conversation to collaborators Apr 14, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Go to "permanent error mode" if ChannelExecutor throws exception
3 participants