Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gRPC-stream does not work intermittently #7375

Closed
emeroad opened this issue Oct 28, 2020 · 4 comments
Closed

gRPC-stream does not work intermittently #7375

emeroad opened this issue Oct 28, 2020 · 4 comments

Comments

@emeroad
Copy link
Member

emeroad commented Oct 28, 2020

isReady() on grpc stream returns false, state is not restored.

log

2020-10-23 11:17:11 [INFO ](c.n.p.p.s.g.PingStreamContext      ) PingStream-2 success:
2020-10-23 11:18:11 [INFO ](c.n.p.p.s.g.PingStreamContext      ) PingStream-2 success:
2020-10-23 11:19:11 [INFO ](c.n.p.p.s.g.PingStreamContext      ) PingStream-2 success:
2020-10-23 11:20:11 [INFO ](c.n.p.p.s.g.PingStreamContext      ) PingStream-2 success:
2020-10-23 11:21:11 [INFO ](c.n.p.p.s.g.PingStreamContext      ) PingStream-2 success:
2020-10-23 11:21:57 [INFO ](c.n.p.p.s.g.SpanGrpcDataSender     ) Discard PSpanMessage message, stream not ready. discardCount:1
2020-10-23 11:21:57 [INFO ](c.n.p.p.s.g.SpanGrpcDataSender     ) Discard PSpanMessage message, stream not ready. discardCount:2
2020-10-23 11:21:57 [INFO ](c.n.p.p.s.g.SpanGrpcDataSender     ) Discard PSpanMessage message, stream not ready. discardCount:3
2020-10-23 11:21:57 [INFO ](c.n.p.p.s.g.SpanGrpcDataSender     ) Discard PSpanMessage message, stream not ready. discardCount:4
2020-10-23 11:21:57 [INFO ](c.n.p.p.s.g.SpanGrpcDataSender     ) Discard PSpanMessage message, stream not ready. discardCount:5
2020-10-23 11:21:57 [INFO ](c.n.p.p.s.g.SpanGrpcDataSender     ) Discard PSpanMessage message, stream not ready. discardCount:6
2020-10-23 11:21:57 [INFO ](c.n.p.p.s.g.SpanGrpcDataSender     ) Discard PSpanMessage message, stream not ready. discardCount:7
2020-10-23 11:21:57 [INFO ](c.n.p.p.s.g.SpanGrpcDataSender     ) Discard PSpanMessage message, stream not ready. discardCount:8
2020-10-23 11:21:57 [INFO ](c.n.p.p.s.g.SpanGrpcDataSender     ) Discard PSpanMessage message, stream not ready. discardCount:9
2020-10-23 11:21:57 [INFO ](c.n.p.p.s.g.SpanGrpcDataSender     ) Discard PSpanMessage message, stream not ready. discardCount:10
2020-10-23 11:21:57 [INFO ](c.n.p.p.s.g.SpanGrpcDataSender     ) Discard PSpanMessage message, stream not ready. discardCount:11

DiscardClientCall.java

    private boolean readyState() {
        // skip pending queue state : DelayedStream
        if (this.onReadyState.get() == false) {
            final long pendingCount = this.pendingCounter.incrementAndGet();
            if (pendingCount > this.maxPendingThreshold) {
                return false;
            }
            return true;
        }
        return isReady(); <<----  return false
    }

Bug estimation

  1. There is a bug in the throttling(flow control) of the collector.
    • StreamExecutorServerInterceptorFactory.java
    • Not easy to reproduce.
  2. Using too old grpc version
    • grpc-14.0
    • Minimum jdk version must be 1.7 for version up
    • I analyzed the commit log of grpc, but it's hard to find anything suspicious.

Plan

  • Add logs to the agent for problem analysis.
  • Change minimum jdk version to 1.7?
emeroad added a commit to emeroad/pinpoint that referenced this issue Oct 29, 2020
emeroad added a commit to emeroad/pinpoint that referenced this issue Oct 30, 2020
emeroad added a commit to emeroad/pinpoint that referenced this issue Nov 4, 2020
emeroad added a commit to emeroad/pinpoint that referenced this issue Nov 4, 2020
jaehong-kim added a commit that referenced this issue Nov 10, 2020
emeroad added a commit to emeroad/pinpoint that referenced this issue Nov 11, 2020
emeroad added a commit to emeroad/pinpoint that referenced this issue Nov 11, 2020
emeroad added a commit to emeroad/pinpoint that referenced this issue Nov 11, 2020
emeroad added a commit that referenced this issue Nov 11, 2020
@emeroad
Copy link
Member Author

emeroad commented Nov 11, 2020

Add grpc log option to agent.
-Dpinpoint.profiler.grpc.log.enable=true

NOTE

JDK logger is shared with the application, so it can affect the log
However, if the logger is isolated with a classloader like tomcat, there is no side effect.

  • eg) org.apache.juli.ClassLoaderLogManager

emeroad added a commit to emeroad/pinpoint that referenced this issue Nov 11, 2020
emeroad added a commit to emeroad/pinpoint that referenced this issue Nov 11, 2020
emeroad added a commit to emeroad/pinpoint that referenced this issue Nov 11, 2020
emeroad added a commit to emeroad/pinpoint that referenced this issue Nov 12, 2020
emeroad added a commit that referenced this issue Nov 12, 2020
emeroad added a commit to emeroad/pinpoint that referenced this issue Nov 12, 2020
emeroad added a commit to emeroad/pinpoint that referenced this issue Nov 12, 2020
emeroad added a commit that referenced this issue Nov 12, 2020
emeroad added a commit to emeroad/pinpoint that referenced this issue Nov 13, 2020
emeroad added a commit that referenced this issue Nov 13, 2020
emeroad added a commit that referenced this issue Nov 16, 2020
@koo-taejin koo-taejin added this to the 2.1.1 milestone Nov 18, 2020
emeroad added a commit to emeroad/pinpoint that referenced this issue Nov 27, 2020
emeroad added a commit to emeroad/pinpoint that referenced this issue Dec 1, 2020
emeroad added a commit to emeroad/pinpoint that referenced this issue Dec 4, 2020
emeroad added a commit to emeroad/pinpoint that referenced this issue Dec 4, 2020
@emeroad
Copy link
Member Author

emeroad commented Dec 4, 2020

Add channelz metric for grpc debug.

Log directory : $PINPOINT_AGENT/logs/pinpoint-stat.log

RootChannel-1 {target=127.0.0.1:9993, stat=READY}                                                                                                                                                                                                                                                                 
RootChannel-1 {callsFailed=0, callsStarted=1, callsSucceeded=0, lastCallStartedNanos(ms)=1607071084311000000}                                                                                                                                                                                                     
RootChannel-1 subchannels:[3]                                                                                                                                                                                                                                                                                     
RootChannel-1 channelTrace {numEventsLogged=0, creationTimeNanos=1607071084246, events.size=7}                                                                                                                                                                                                                    
RootChannel-1 channelTrace.events:Event{description=Channel for '127.0.0.1:9993' created, severity=CT_INFO, timestampNanos(ms)=1607071084246}                                                                                                                                                                     
RootChannel-1 channelTrace.events:Event{description=Exiting idle mode, severity=CT_INFO, timestampNanos(ms)=1607071084290}                                                                                                                                                                                        
RootChannel-1 channelTrace.events:Event{description=Address resolved: [[[/127.0.0.1:9993]/{}]], severity=CT_INFO, timestampNanos(ms)=1607071084301}                                                                                                                                                               
RootChannel-1 channelTrace.events:Event{description=Child Subchannel started, severity=CT_INFO, timestampNanos(ms)=1607071084307, subchannelRef=InternalSubchannel{logId=3, addressGroups=[[[/127.0.0.1:9993]/{}]]}}                                                                                              
RootChannel-1 channelTrace.events:Event{description=Entering CONNECTING state with picker: Picker{result=PickResult{subchannel=Subchannel<3>: (127.0.0.1:9993), streamTracerFactory=null, status=Status{code=OK, description=null, cause=null}, drop=false}}, severity=CT_INFO, timestampNanos(ms)=1607071084308} 
RootChannel-1 channelTrace.events:Event{description=Entering CONNECTING state with picker: Picker{result=PickResult{subchannel=null, streamTracerFactory=null, status=Status{code=OK, description=null, cause=null}, drop=false}}, severity=CT_INFO, timestampNanos(ms)=1607071084770}                            
RootChannel-1 channelTrace.events:Event{description=Entering READY state with picker: Picker{result=PickResult{subchannel=Subchannel<3>: (127.0.0.1:9993), streamTracerFactory=null, status=Status{code=OK, description=null, cause=null}, drop=false}}, severity=CT_INFO, timestampNanos(ms)=1607071084816}      
1-SubChannel-3 {target=[[[/127.0.0.1:9993]/{}]], stat=READY}                                                                                                                                                                                                                                                      
1-SubChannel-3 {callsFailed=0, callsStarted=1, callsSucceeded=0, lastCallStartedNanos(ms)=1607071084826000000}                                                                                                                                                                                                    
1-SubChannel-3 sockets:[6]                                                                                                                                                                                                                                                                                        
1-SubChannel-3 channelTrace {numEventsLogged=0, creationTimeNanos=1607071084304, events.size=4}                                                                                                                                                                                                                   
1-SubChannel-3 channelTrace.events:Event{description=Subchannel for [[[/127.0.0.1:9993]/{}]] created, severity=CT_INFO, timestampNanos(ms)=1607071084304}                                                                                                                                                         
1-SubChannel-3 channelTrace.events:Event{description=CONNECTING as requested, severity=CT_INFO, timestampNanos(ms)=1607071084308}                                                                                                                                                                                 
1-SubChannel-3 channelTrace.events:Event{description=Started transport NettyClientTransport<6>: (/127.0.0.1:9993), severity=CT_INFO, timestampNanos(ms)=1607071084770}                                                                                                                                            
1-SubChannel-3 channelTrace.events:Event{description=READY, severity=CT_INFO, timestampNanos(ms)=1607071084815}                                                                                                                                                                                                   
3-Socket-6 {local=/127.0.0.1:3155, remote=/127.0.0.1:9993, security=null}                                                                                                                                                                                                                                         
3-Socket-6 {streamsStarted=1, lastLocalStreamCreatedTimeNanos(ms)=1607071084837000000, lastRemoteStreamCreatedTimeNanos(ms)=0}                                                                                                                                                                                    
3-Socket-6 {streamsSucceeded=0, streamsFailed=0, messagesSent=0, messagesReceived=0}                                                                                                                                                                                                                              
3-Socket-6 {keepAlivesSent=0, lastMessageSentTimeNanos=0, lastMessageReceivedTimeNanos=0, localFlowControlWindow=1048576, remoteFlowControlWindow=1048576}                                                                                                                                                        

emeroad added a commit to emeroad/pinpoint that referenced this issue Dec 4, 2020
@emeroad emeroad modified the milestones: 2.1.1, 2.2.1 Dec 4, 2020
emeroad added a commit to emeroad/pinpoint that referenced this issue Dec 4, 2020
emeroad added a commit to emeroad/pinpoint that referenced this issue Dec 7, 2020
emeroad added a commit to emeroad/pinpoint that referenced this issue Dec 7, 2020
emeroad added a commit that referenced this issue Dec 7, 2020
emeroad added a commit to emeroad/pinpoint that referenced this issue Dec 9, 2020
emeroad added a commit to emeroad/pinpoint that referenced this issue Dec 9, 2020
emeroad added a commit to emeroad/pinpoint that referenced this issue Dec 11, 2020
emeroad added a commit that referenced this issue Dec 11, 2020
@emeroad
Copy link
Member Author

emeroad commented Dec 22, 2020

Two patches have been applied to 2.2.0

  1. gRPC versionup
  2. If something goes wrong with the grpc stream, recreate the stream.

emeroad added a commit to emeroad/pinpoint that referenced this issue Jan 4, 2021
emeroad added a commit to emeroad/pinpoint that referenced this issue Jan 4, 2021
emeroad added a commit to emeroad/pinpoint that referenced this issue Jan 4, 2021
emeroad added a commit to emeroad/pinpoint that referenced this issue Jan 4, 2021
@emeroad
Copy link
Member Author

emeroad commented Jan 20, 2021

I am not sure that this problem has been completely resolved.
We plan to continue investigating this issue.

@emeroad emeroad closed this as completed Jan 20, 2021
messi-gao pushed a commit to messi-gao/pinpoint that referenced this issue Jan 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants