core: add panic mode for ManagedChannelImpl #4023

zhangkun83 · 2018-01-31T00:38:01Z

The channel enters panic mode whenever an uncaught throwable escapes its
ChannelExecutor, which is where most internal channel state, such as
load-balancing state, is mutated.

In panic mode, the channel will always report TRANSIENT_FAILURE as its
state, and will fail new RPCs with an INTERNAL error code with the
uncaught throwable as the cause, which is helpful for investigating
bugs within gRPC and 3rd-party LoadBalancer implementations.

If LoadBalancer.handleResolvedAddressGroups() throws, the channel will
panic instead of routing the exception back to
LoadBalancer.handleNameResolutionError().

Resolves #3293
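
For readers who want the behavior in miniature, here is a toy sketch of what the description above implies. It is not the code in this PR: ToyChannel, RpcFailedException, and the string "INTERNAL" are hypothetical stand-ins for the real channel, the failed-RPC status, and the status code.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy model only: every type here is hypothetical, not a gRPC class.
final class ToyChannel {
  enum State { READY, TRANSIENT_FAILURE }

  /** Stand-in for a failed RPC that carries a status code and the original cause. */
  static final class RpcFailedException extends RuntimeException {
    final String code;

    RpcFailedException(String code, String message, Throwable cause) {
      super(message, cause);
      this.code = code;
    }
  }

  private final Queue<Runnable> tasks = new ArrayDeque<>();
  private State state = State.READY;
  private Throwable panicCause; // non-null once the channel has panicked

  /** Queues a state-mutating task and drains the queue, like a serializing executor. */
  void execute(Runnable task) {
    tasks.add(task);
    Runnable next;
    while ((next = tasks.poll()) != null) {
      try {
        next.run();
      } catch (Throwable t) {
        panic(t); // an uncaught throwable puts the channel into panic mode
      }
    }
  }

  private void panic(Throwable t) {
    if (panicCause != null) {
      return; // keep the first cause; panic mode is permanent
    }
    panicCause = t;
    state = State.TRANSIENT_FAILURE;
  }

  State getState() {
    return state;
  }

  /** New RPCs fail with INTERNAL and the original throwable as the cause. */
  String newCall(String request) {
    if (panicCause != null) {
      throw new RpcFailedException("INTERNAL", "channel panicked", panicCause);
    }
    return "response to " + request;
  }
}
```

In this model a buggy task passed to execute() flips getState() to TRANSIENT_FAILURE, and every later newCall() throws with the original exception attached as its cause.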


ejona86 · 2018-02-09T16:24:31Z

core/src/main/java/io/grpc/internal/ManagedChannelImpl.java

-        return delayedTransport;
+      SubchannelPicker pickerCopy = subchannelPicker;
+      PickResult pickResult = panicPickResult;
+      if (pickResult == null) {


I'm not wild that we have yet another variable of state to consider here. I'd rather something closer to one of these:

1. Set subchannelPicker to the dropping picker. And then I guess put a check for panic mode in exitIdleMode().

2. Actually go SHUTDOWN, but plumb the failing Status into delayedTransport so RPCs would still fail with a useful message.

zhangkun83 · 2018-02-16

For Option 1, how do you check for panic mode without having a state dedicated to panic mode?

For Option 2, do you mean we will call shutdown() (or more likely shutdownNow())? The shutdown mechanism may be a little too sophisticated for entering panic mode. What if the bug is in the shutdown code? IMHO, panic mode should be implemented as simply as possible, so that it is the last functionality to fail.

ejona86 · 2018-02-17

For option 1, we could have a boolean panicMode that is only needed by exitIdleMode() (or similar). What I don't want is for the variable to add another dimension to the various states that we have to consider frequently.

For option 2, at the very least I think we should:

1. Set the shutdown AtomicBoolean to true

2. Transition channelStateManager to SHUTDOWN

3. Shut down delayedTransport

Shutting down delayedTransport is the most likely of those to throw an exception, but it does the important stuff first (set shutdownStatus). So as long as we prevent loops (say, by doing nothing if shutdown.get() == true), then that seems fine. And looking at shutdown(), it basically does exactly that. The only extra thing it does is cancelIdleTimer(). So that seems quite nice.

(Retries do add more complexity here, but I think that's actually an argument to KISS and just shutdown(), since shutdown() already needs to work properly with retries.)
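
To make the three steps and the loop guard concrete, here is a rough sketch under stated assumptions. ShutdownStylePanic, its fields, and transportCleanup are hypothetical names invented for illustration, and the guard uses compareAndSet rather than a plain shutdown.get() check; none of this is the real ManagedChannelImpl or DelayedClientTransport.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical model of "Option 2"; not the actual channel implementation.
final class ShutdownStylePanic {
  final AtomicBoolean shutdown = new AtomicBoolean(false);
  String channelState = "READY";
  String delayedTransportStatus; // null until the delayed transport is shut down
  int panicRuns = 0;             // counts how often the panic body actually ran
  private final Runnable transportCleanup;

  ShutdownStylePanic(Runnable transportCleanup) {
    this.transportCleanup = transportCleanup;
  }

  void panic(Throwable cause) {
    // Step 1: flip the flag. If it was already set, do nothing, which prevents loops.
    if (!shutdown.compareAndSet(false, true)) {
      return;
    }
    panicRuns++;
    // Step 2: report SHUTDOWN.
    channelState = "SHUTDOWN";
    // Step 3: shut down the delayed transport with the failing status.
    try {
      shutDownDelayedTransport("INTERNAL: " + cause.getMessage());
    } catch (RuntimeException e) {
      panic(e); // re-entry is a no-op because the flag is already set
    }
  }

  private void shutDownDelayedTransport(String status) {
    // Record the status first so RPCs still fail usefully if the cleanup throws.
    delayedTransportStatus = status;
    transportCleanup.run();
  }
}
```

Even if the cleanup step throws, the status has already been recorded and the second panic() call returns immediately.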

zhangkun83 · 2018-02-21

I went with Option 1.

I am not going with Option 2 because it makes assumptions about what delayedTransport's shutdown() would do, which is not a good separation of concerns.

@ejona86 PTAL

Set a failing subchannelPicker for panic mode instead of creating a separate picker field for panic mode. Use a boolean "panicMode" to indicate this mode.

Refactored out shutdownNameResolverAndLoadBalancer(), called from three code paths: enterIdleMode(), delayedTransport.transportTerminated(), and panic().

Refactored out updateSubchannelPicker(), called from both updateBalancingState() and panic().

Added tests to cover:

1. Call exitIdleMode() in panic mode
2. Call shutdown() in panic mode
3. Calls buffered in delayedTransport are failed when channel panics
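
The shape of that change can be sketched roughly as follows. This is a simplified model, not the merged code: PickerSwapChannel, the Picker interface, and the string outcomes are assumptions made up for illustration. Only the method names panic(), exitIdleMode(), updateBalancingState(), updateSubchannelPicker(), and shutdownNameResolverAndLoadBalancer() come from the commit message, and their bodies here are guesses at the general idea.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the picker-swap approach; not the real ManagedChannelImpl.
final class PickerSwapChannel {
  /** A picker decides the outcome of a call; here the outcome is just a string. */
  interface Picker {
    String pick(String call);
  }

  final List<String> outcomes = new ArrayList<>();
  boolean balancerActive;
  private final List<String> bufferedCalls = new ArrayList<>();
  private Picker picker;     // null while no picker exists, so calls are buffered
  private boolean panicMode;

  void exitIdleMode() {
    if (panicMode) {
      return; // the only place that consults panicMode: never restart after a panic
    }
    balancerActive = true;
  }

  void newCall(String call) {
    exitIdleMode();
    if (picker == null) {
      bufferedCalls.add(call);
    } else {
      outcomes.add(picker.pick(call));
    }
  }

  /** What a load balancer would call to publish a new picker. */
  void updateBalancingState(Picker newPicker) {
    updateSubchannelPicker(newPicker);
  }

  void panic(Throwable t) {
    if (panicMode) {
      return;
    }
    panicMode = true;
    shutdownNameResolverAndLoadBalancer();
    // Install a picker that fails every call, carrying the cause's message.
    updateSubchannelPicker(call -> call + ": INTERNAL (" + t.getMessage() + ")");
  }

  private void shutdownNameResolverAndLoadBalancer() {
    balancerActive = false;
  }

  private void updateSubchannelPicker(Picker newPicker) {
    picker = newPicker;
    // Re-run buffered calls against the new picker, so a panic fails them too.
    List<String> pending = new ArrayList<>(bufferedCalls);
    bufferedCalls.clear();
    for (String call : pending) {
      outcomes.add(newPicker.pick(call));
    }
  }
}
```

Because panic() reuses the same picker-update path as the load balancer, calls that were buffered before the panic fail the same way as calls made after it.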

jylsoccer · 2019-01-14T06:22:11Z

When I updated gRPC from 1.6.1 to 1.17.1, I encountered this problem.
I cache a ManagedChannel object, and once the channel enters panic mode I cannot use it anymore.
How can I tell if the channel is in panic mode? When that happens, I have to create a new channel.

ejona86 · 2019-01-14T17:11:59Z

@jylsoccer, the channel should never enter panic mode. If it does, it is a bug and we want to fix the problem. We added panic mode to 1) reduce the "blast radius" when such a bug occurs and 2) clearly communicate the exception to the user so they can file a useful bug report. So please file an issue with the exception that you are seeing that enters the channel into panic mode.

zhangkun83 requested a review from ejona86 January 31, 2018 00:38

zhangkun83 added 2 commits January 30, 2018 17:15

If LoadBalancer throws in handleResolvedAddressGroups(), make it caught by ChannelExecutor instead of routing it back to LoadBalancer.

19f5856

Merge branch 'master' into panic_mode

42191bc

ejona86 reviewed Feb 9, 2018


zhangkun83 mentioned this pull request Feb 15, 2018

FutureStub hangs when wrong netty version is used #2976

Closed

zhangkun83 added 3 commits February 21, 2018 09:32

Merge branch 'master' into panic_mode

11a4ec5

Fix minor issues

ed2dd00

ejona86 approved these changes Feb 22, 2018


zhangkun83 merged commit 46c1133 into grpc:master Feb 22, 2018

zhangkun83 added the Type: Feature label Feb 22, 2018

zhangkun83 deleted the panic_mode branch February 22, 2018 17:28

lock bot locked as resolved and limited conversation to collaborators Apr 14, 2019


ejona86 Feb 9, 2018

zhangkun83 Feb 16, 2018

ejona86 Feb 17, 2018

zhangkun83 Feb 21, 2018
