Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

interface/grpc/core: ListenTask gets canceled #549

Closed
ilgooz opened this issue Oct 23, 2018 · 16 comments · Fixed by #738
Closed

interface/grpc/core: ListenTask gets canceled #549

ilgooz opened this issue Oct 23, 2018 · 16 comments · Fixed by #738
Assignees
Labels
bug Something isn't working help wanted Extra attention is needed
Milestone

Comments

@ilgooz
Copy link
Contributor

ilgooz commented Oct 23, 2018

This happened to me with a service that uses go-service. And several times more in back with some other services that I don't remember (maybe webhook service). I'm not sure if I faced with this issue with mesg-js. So this might be relavent with go-service or can be caused from something totally else. The problem occurs a while after (<=40mins) starting the service which causes service to not react further task executions.

✔ Core stopped
✔ Core started
time="2018-10-23T09:18:11Z" level=info msg="deploying (2) system services..."
time="2018-10-23T09:18:12Z" level=info msg="'resolver' system service deployed"
time="2018-10-23T09:18:19Z" level=info msg="'workflow' system service deployed"
time="2018-10-23T09:18:19Z" level=info msg="starting system services..."
time="2018-10-23T09:18:20Z" level=info msg="all system services started"
time="2018-10-23T09:18:20Z" level=info msg="starting MESG Core version local"
time="2018-10-23T09:18:20Z" level=info msg="server listens on [::]:50052"
time="2018-10-23T09:18:27Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=CreateWorkflow grpc.service=api.Core grpc.start_time="2018-10-23T09:18:26Z" grpc.time_ms=1019.393 span.kind=server system=grpc
time="2018-10-23T09:18:27Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=SubmitResult grpc.request.deadline="2018-10-23T09:18:37Z" grpc.service=api.Service grpc.start_time="2018-10-23T09:18:27Z" grpc.time_ms=0.07 span.kind=server system=grpc
time="2018-10-23T09:18:36Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=SubmitResult grpc.request.deadline="2018-10-23T09:18:46Z" grpc.service=api.Service grpc.start_time="2018-10-23T09:18:36Z" grpc.time_ms=0.087 span.kind=server system=grpc
time="2018-10-23T09:18:36Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=CreateWorkflow grpc.service=api.Core grpc.start_time="2018-10-23T09:18:35Z" grpc.time_ms=1012.981 span.kind=server system=grpc
>> time="2018-10-23T09:59:56Z" level=error msg="finished streaming call with code Unknown" error="context canceled" grpc.code=Unknown grpc.method=ListenTask grpc.service=api.Service grpc.start_time="2018-10-23T09:18:21Z" grpc.time_ms=2.4979815e+06 span.kind=server system=grpc
@ilgooz ilgooz added bug Something isn't working high priority labels Oct 23, 2018
@ilgooz ilgooz changed the title interface/grpc/core: ListenTask gets cancaled interface/grpc/core: ListenTask gets canceled Oct 23, 2018
@antho1404
Copy link
Member

I think i've got that issue too, some execution were not executed. I looked at the logs but didn't see any logs with the stream closing but I had many logs so probably missed it.

This one definitely need some investigations

@ilgooz
Copy link
Contributor Author

ilgooz commented Nov 7, 2018

After some more experiments, as I remember this is not directly relavent with elapsed time or maybe it is but not too long like 40mins and it doesn't happen by itself, i think it gets triggered by a task execution. Cancellation logs will be shown depending on that. And I'm sure that this happens to me with service-webhook so it's not relevant with go-service client.

@antho1404
Copy link
Member

I got this issue also working with only javascript services. This is the actual project I was working on https://github.com/antho1404/pegasys-hackathon but I couldn't find a way to reproduce yet

@NicolasMahe
Copy link
Member

NicolasMahe commented Nov 7, 2018 via email

@ilgooz
Copy link
Contributor Author

ilgooz commented Nov 7, 2018

@NicolasMahe yes, we should check that one as well.

@NicolasMahe NicolasMahe added this to the v0.5.0 milestone Nov 27, 2018
@ilgooz
Copy link
Contributor Author

ilgooz commented Dec 3, 2018

This is not relevant with mac going to sleep. I started a service and haven't put my mac on sleep. After one hour I tried to execute a task on that service and here I got the listen task canceled error at the bottom:

✔ Core started
time="2018-12-03T07:35:29Z" level=info msg="deploying (0) system services..."
time="2018-12-03T07:35:29Z" level=info msg="starting system services..."
time="2018-12-03T07:35:29Z" level=info msg="all system services have been started"
time="2018-12-03T07:35:29Z" level=info msg="starting MESG Core version local"
time="2018-12-03T07:35:29Z" level=info msg="server listens on [::]:50052"
time="2018-12-03T07:35:57Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=ListServices grpc.service=api.Core grpc.start_time="2018-12-03T07:35:57Z" grpc.time_ms=14.637 span.kind=server system=grpc
time="2018-12-03T07:36:04Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=StartService grpc.service=api.Core grpc.start_time="2018-12-03T07:36:02Z" grpc.time_ms=2241.19 span.kind=server system=grpc
time="2018-12-03T07:36:17Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GetService grpc.service=api.Core grpc.start_time="2018-12-03T07:36:17Z" grpc.time_ms=4.547 span.kind=server system=grpc
time="2018-12-03T07:36:18Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=ExecuteTask grpc.service=api.Core grpc.start_time="2018-12-03T07:36:18Z" grpc.time_ms=16.073 span.kind=server system=grpc
time="2018-12-03T07:36:18Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=SubmitResult grpc.request.deadline="2018-12-03T07:36:28Z" grpc.service=api.Service grpc.start_time="2018-12-03T07:36:18Z" grpc.time_ms=1.131 span.kind=server system=grpc
time="2018-12-03T07:36:18Z" level=error msg="finished streaming call with code Unknown" error="context canceled" grpc.code=Unknown grpc.method=ListenResult grpc.service=api.Core grpc.start_time="2018-12-03T07:36:17Z" grpc.time_ms=1037.138 span.kind=server system=grpc
time="2018-12-03T08:32:14Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GetService grpc.service=api.Core grpc.start_time="2018-12-03T08:32:14Z" grpc.time_ms=6.482 span.kind=server system=grpc
time="2018-12-03T08:32:15Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=ExecuteTask grpc.service=api.Core grpc.start_time="2018-12-03T08:32:15Z" grpc.time_ms=16.376 span.kind=server system=grpc
time="2018-12-03T08:32:15Z" level=error msg="finished streaming call with code Unknown" error="context canceled" grpc.code=Unknown grpc.method=ListenTask grpc.service=api.Service grpc.start_time="2018-12-03T07:36:04Z" grpc.time_ms=3.375066e+06 span.kind=server system=grpc

@NicolasMahe NicolasMahe removed this from the v0.5.0 milestone Dec 4, 2018
@NicolasMahe NicolasMahe added the help wanted Extra attention is needed label Dec 4, 2018
@krhubert krhubert self-assigned this Dec 11, 2018
@NicolasMahe NicolasMahe added this to the v0.6 milestone Dec 11, 2018
@antho1404
Copy link
Member

For information, there is an error not returned here. If there is an error at the creation of the stream this error is not shown, maybe this is related.

https://github.com/mesg-foundation/core/blob/925cd85930975f2171aa0786cb1fb4bfab643173/commands/provider/service_provider.go#L186-L192

@NicolasMahe NicolasMahe modified the milestones: v0.6, v0.7 Dec 29, 2018
@krhubert
Copy link
Contributor

krhubert commented Jan 4, 2019

Long story short:

INFO: 2018/12/18 20:56:14 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
WARNING: 2018/12/18 21:03:52 transport: http2Server.HandleStreams failed to read frame: read tcp 10.255.0.46:50052->10.255.0.2:36348: read: connection reset by peer

The INFO tells us an interesting thing - the error type is ErrConnClosing (https://github.com/grpc/grpc-go/blob/master/internal/transport/transport.go#L676)
and the grpc stream treats this as NORMAL stream close - https://github.com/grpc/grpc-go/blob/master/internal/transport/controlbuf.go#L425

In such situation, client sees this as close stream.

A solution, implement reconnection policy in the client, but this is the funniest part, not only in case of error, but also in case of normal connection close!

func listen() {
  for {
    // increas sleep with every faild connection (restart after successful conneciton0)
    // first try 1s, 5s, 20s
    time.Sleep(1 * time.Second)

    // (re)connect listen task api
    stream, err := ListenTask(context.Background())
    if err != nil {
      // connect to listen task failed
      continue
    }

    // buffered channel, because it might happen that ctx.Done
    // will be notifed first before stream.Recv in for loop.
    done := make(chan struct{}, 1)

    go func(ctx context.Context) {
      select {
      case <-ctx.Done():
        // log error from  ctx.Err()
      case <-done:
        // do nothing
      }
    }(stream.Context())

    for {
      td, err := stream.Recv()
      if err != nil {
        // notify goroutine on any kind of error
        done <- struct{}{}
        break
      }
      // normal path with submit errors/results
    }

    // NOTE: even if listen task will be canceled without error
    // it will still reconnect. So we need to create issue in grpc how to solve this (haven't seen)
    // or implement this in our rpc protocol layer
  }
}

other solution: maybe there is an api that will allow us to distinguish normal connection close and ErrConnClosing
above code dosne't allow us to remove this behavior (it will always exists - even without docker) so we need to be prepared
one more thing, cli execute should never hang forever. Core dosen't have any mechanism to recognize if someone is listening for specific event (for example list of ListenTask caller)

@NicolasMahe
Copy link
Member

NicolasMahe commented Jan 4, 2019

@krhubert nice summary of the problem 👍

In the current implementation of stream in client side, there is nothing about the context (eg: <-context.Done() and context.Err()).
@krhubert do you think the client could get the error from context.Err() and know if it needs to retry?

I think the first implementation that needs to be done is to add support of <-context.Done() and context.Err().
Then if it not enough, we can implement an acknowledgement system. See end of this message.

@krhubert do we have a way to reproduce the error? even if it required to change the source of gRPC, we need to reproduce the error so we can actually test that the fix actually work.


I moved the content of this message to the forum because it was not only related to this issue.
See: https://forum.mesg.com/t/acknowledgement-system-on-stream/172

With the proposed acknowledgement system, the server could also send last message (as a header or trailer) on the gRPC stream that indicates the stream will be close gracefully.
With this knowledge, the client knows if it needs to retry the connection or not.


Take a look at trailer: https://github.com/grpc/grpc-go/blob/master/Documentation/grpc-metadata.md#streaming-call-2

@ilgooz
Copy link
Contributor Author

ilgooz commented Jan 27, 2019

we don't have keepAlive set for underlying tcp connections so this issue might be relevant with that too. i'm not expecting this but maybe Docker's network is configured in a way to disconnect the idle connections –or might be a bug?

@NicolasMahe
Copy link
Member

What do you mean keepAlive? is it a option in gRPC?

@ilgooz
Copy link
Contributor Author

ilgooz commented Jan 28, 2019

There is an option in net.Dialer to enable sending keep alive pings. I don't know if grpc-go client uses that or implements its own but we have an option for it.

https://godoc.org/google.golang.org/grpc#WithKeepaliveParams
https://godoc.org/google.golang.org/grpc#KeepaliveParams

https://github.com/grpc/grpc-go/blob/dfd7708d35ae1a95e1b85ae3d8f3fdce5fb60a92/internal/transport/http2_client.go#L1267

I'll try this and provide the results.

@ilgooz
Copy link
Contributor Author

ilgooz commented Jan 28, 2019

I'm testing the keepAlive, in the meanwhile, I've a few more findings:

  • the issue is definitely not specific to mac (Docker Machine) nor OS going to sleep because same problem happens on DO droplets too.

  • issue might be relevant to inactivity in gRPC streams. if keepAlive doesn't work, i'll do an another test by creating a script to execute a task for each minute and see if the problem occur again in one hour period. because this problem is happening by executing a task after like 30mins inactivity. stream is only getting canceled after executing a task but not by itself. because execute task requests will make Core to send a message to listenTask's stream but when stream is closed, sending operation will return with an error.

@ilgooz
Copy link
Contributor Author

ilgooz commented Jan 28, 2019

Good news, I tested client side keep alive by waiting for 40mins. After that I executed a task on the service and there are no stream cancelation, it works!

My configs were:

			grpc.WithKeepaliveParams(keepalive.ClientParameters{
				Time:                time.Minute * 2,
				Timeout:             time.Minute * 3,
				PermitWithoutStream: true,
			}),

I'll try the same configs in server side. That way when can apply the configs on server side once, server sends the keep alive pings and clients doesn't need to worry about setting such a config.

@ilgooz
Copy link
Contributor Author

ilgooz commented Jan 28, 2019

fixed by #738 6e52544 🎊 🎊

@ilgooz ilgooz closed this as completed Jan 28, 2019
@antho1404
Copy link
Member

antho1404 commented Jan 28, 2019

The way to reproduce this issue:

  • Create a service that does not emit any event
  • Run this service with ./dev-cli service dev ...
  • Wait 15 min...
  • Execute a task from this service

You will get an error on the server with

ERRO[2330] finished streaming call with code Unknown     error="context canceled" grpc.code=Unknown grpc.method=ListenTask grpc.service=api.Service grpc.start_time="2019-01-28T09:30:21Z" grpc.time_ms=1.2310282e+06 span.kind=server system=grpc

The dev command will not receive any informations

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants