
Verify GRPC channel connectivity state READY to IDLE transition #332

Closed
CMCDragonkai opened this issue Feb 14, 2022 · 16 comments
Assignees: CMCDragonkai
Labels: development (Standard development), epic (Big issue with multiple subissues), r&d:polykey:core activity 1 (Secret Vault Sharing and Secret History Management)

Comments

@CMCDragonkai
Member

CMCDragonkai commented Feb 14, 2022

Specification

In #224 and #310 we introduced the ability for our GRPCClient to react to state changes in the underlying client's channel. In particular, we perform a this.destroy() when the underlying channel is no longer live.

One state transition we were not able to test for is when the underlying channel state goes from READY to IDLE.

According to #310 (comment), this can occur in one of two ways:

  1. when the server closes the connection (already have a test for this)
  2. when the underlying GRPC TTL times out (no test for this yet)

So the goal of this issue is to trigger the GRPC TTL and verify that this transition actually occurs.

Does this GRPC TTL actually exist? According to the C++ core docs, it is controlled by this parameter https://grpc.github.io/grpc/core/group__grpc__arg__keys.html#ga51ab062269cd81298f5adb6fd9a45e99, and the default value is 30 minutes. However, when @tegefaulkes tried to trigger it, he did not observe the transition after 6 minutes.

Furthermore, the transition table https://github.com/grpc/grpc/blob/master/doc/connectivity-semantics-and-api.md indicates that READY becomes IDLE after the idle timeout passes.

In the source code of grpc-js, there is no mention of an idle timeout parameter among the channel options, so it's possible that this hasn't been exposed in grpc-js, or perhaps not implemented at all.

  /**
   * An interface that contains options used when initializing a Channel instance.
   */
  export interface ChannelOptions {
    'grpc.ssl_target_name_override'?: string;
    'grpc.primary_user_agent'?: string;
    'grpc.secondary_user_agent'?: string;
    'grpc.default_authority'?: string;
    'grpc.keepalive_time_ms'?: number;
    'grpc.keepalive_timeout_ms'?: number;
    'grpc.keepalive_permit_without_calls'?: number;
    'grpc.service_config'?: string;
    'grpc.max_concurrent_streams'?: number;
    'grpc.initial_reconnect_backoff_ms'?: number;
    'grpc.max_reconnect_backoff_ms'?: number;
    'grpc.use_local_subchannel_pool'?: number;
    'grpc.max_send_message_length'?: number;
    'grpc.max_receive_message_length'?: number;
    'grpc.enable_http_proxy'?: number;
    'grpc.http_connect_target'?: string;
    'grpc.http_connect_creds'?: string;
    'grpc.default_compression_algorithm'?: CompressionAlgorithms;
    'grpc.enable_channelz'?: number;
    'grpc-node.max_session_memory'?: number;
    // eslint-disable-next-line @typescript-eslint/no-explicit-any
    [key: string]: any;
  }
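
For context, this READY to IDLE transition can be observed from the client side with grpc-js's existing channel API (getConnectivityState / watchConnectivityState). A minimal sketch (not our actual GRPCClient channel watcher) of watching for the channel leaving READY:

    import { ConnectivityState } from '@grpc/grpc-js';
    import type { Channel } from '@grpc/grpc-js';

    // Watch the channel until it leaves READY (e.g. READY -> IDLE), then hand the
    // new state to a callback; a channel watcher that calls this.destroy() would
    // hook into something shaped like this.
    function watchUntilNotReady(
      channel: Channel,
      onNotReady: (state: ConnectivityState) => void,
    ): void {
      const currentState = channel.getConnectivityState(false);
      channel.watchConnectivityState(currentState, Infinity, (error) => {
        if (error != null) return; // watch was cancelled or the deadline passed
        const nextState = channel.getConnectivityState(false);
        if (currentState === ConnectivityState.READY && nextState !== ConnectivityState.READY) {
          onNotReady(nextState);
        } else {
          watchUntilNotReady(channel, onNotReady);
        }
      });
    }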

A question about this has been posted: https://groups.google.com/g/grpc-io/c/yq4pmBGaXOQ

@joshuakarp please also create an issue on the grpc-js GitHub requesting this as a feature addition.

Additional context

  • Upstream feature request: Set a channel's idle timeout grpc/grpc-node#2046

  • Propagate networking connection error handling into NodeConnection error handling #224 - original issue about exposing channel state to our GRPC abstractions, with the following state diagram:

    @startuml
    ' Inactive connection
    state "Inactive connection" as NC1 {
      NC1 : running: false
      NC1 : destroyed: false
      NC1 : status: starting
      state "Client" as Client1 {
        Client1 : IDLE
      }
    }
    
    NC1 --> NC2 : attempt connection
    
    ' Connecting to target
    state "Connecting to target" as NC2 {
      NC2 : running: false
      NC2 : destroyed: false
      NC2 : status: starting
      state "Client" as Client2 {
        Client2 : CONNECTING
      }
    }
     
    NC2 --> NC3 : successfully connected to target
    
    ' Active connection
    state "Active connection" as NC3 {
      NC3 : running: true
      NC3 : destroyed: false
      NC3 : status: starting
      state "Client" as Client3 {
        Client3 : READY
      }
    }
    
    NC3 --> NC3 : gRPC call performed
    'NC3 --> NC8 : gRPC TTL expires
    NC3 --> NC9 : server closes connection
    NC3 --> NC5 : error during gRPC call
    
    ' Idle connection
    'state "Idle connection" as NC8 {
    '  NC8 : running: true
    '  NC8 : destroyed: false
    '  NC8 : status: starting
    '  state "Client" as Client8 {
    '    Client8 : IDLE
    '  }
    '}
    
    'NC8 --> NC3 : gRPC call invoked
    
    ' Dropped connection
    state "Dropped connection" as NC9 {
      NC9 : running: true
      NC9 : destroyed: false
      NC9 : status: destroying
      state "Client" as Client9 {
        Client9 : IDLE
      }
    }
    
    NC2 --> NC5 : timed out | failed to validate certificate
    NC9 --> NC7 : channel watcher destroys\n client and shuts down
    
    ' Failed certificate validation
    state "Connection error" as NC5 {
      NC5 : running: true
      NC5 : destroyed: false
      NC5 : status: starting
      state "Client" as Client5 {
        Client5 : TRANSIENT_FAILURE
      }
    }
    
    NC3 --> NC6 : destroy connection | connection TTL expired
    NC5 --> NC7 : destroy client
    
    ' Destroy NodeConnection
    state "Destroying connection" as NC6 {
      NC6 : running: true
      NC6 : destroyed: false
      NC6 : status: destroying
      state "Client" as Client6 {
        Client6 : READY
      }
    }
    
    NC6 --> NC7 : destroy client
    
    ' Destroy client
    state "Destroyed connection" as NC7 {
      NC7 : running: false
      NC7 : destroyed: true
      NC7 : status: destroying
      state "Client" as Client7 {
        Client7 : SHUTDOWN
      }
    }
    
    ' Destroy client
    'state "Destroyed connection" as NC10 {
    '  NC10 : running: false
    '  NC10 : destroyed: true
    '  NC10 : status: destroying
    '  state "Client" as Client10 {
    '    Client10 : IDLE
    '  }
    '}
    @enduml
    

Tasks

  1. ...
  2. ...
  3. ...
@joshuakarp
Contributor

According to https://groups.google.com/g/grpc-io/c/yq4pmBGaXOQ, this isn't currently supported, so I'll post an issue upstream.

@joshuakarp
Contributor

I've created an upstream issue for this.

@CMCDragonkai
Member Author

Does that mean there's no IDLE timeout and therefore the connection may stay READY forever if there is no activity on the connection?

If so, how does this impact our model of connection failure such as when the underlying TCP socket is destroyed from the server side?

@CMCDragonkai
Member Author

I'd also like to know how the existing GRPC keepalive parameters work. Can they help in this area as well? Or will they just duplicate the functionality of our NodeConnection TTL?

Please add these into the feature request issue for clarification.

@joshuakarp
Contributor

Does that mean there's no IDLE timeout and therefore the connection may stay READY forever if there is no activity on the connection?

@tegefaulkes said that he'd tested this previously, and had seen the connection go from READY to IDLE after approximately 6 minutes of no activity, so I don't believe this is the case.

I'd also like to know how the existing GRPC keepalive parameters work. Can they help in this area as well? Or will they just duplicate the functionality of our NodeConnection TTL?

Please add these into the feature request issue for clarification.

Added.

@tegefaulkes
Contributor

Either it will stay READY forever, or it will time out after 30 minutes or some other arbitrary timeout. In terms of how we handle our connection logic, it shouldn't make a difference unless our own TTL exceeds the GRPC timeout, in which case we treat it as a normal connection failure.

I'm just making an assumption here, but I believe the keepalive parameters relate to TCP keepalive pings: the time spacing of the pings, and how long the connection waits after a missed keepalive ping before closing the connection.

@tegefaulkes
Contributor

I tested to see how long a timeout would take, assuming it would be 5 minutes. I gave up after 6 minutes without seeing it return to IDLE, so I don't know how long the timeout is, but it's longer than 6 minutes.

@joshuakarp
Contributor

I tested to see how long a timeout would take, assuming it would be 5 minutes. I gave up after 6 minutes without seeing it return to IDLE, so I don't know how long the timeout is, but it's longer than 6 minutes.

Ah my bad - I had thought that you'd seen it transition after 6 minutes. This makes more sense in the context of everything.

@CMCDragonkai
Member Author

CMCDragonkai commented Feb 15, 2022

So we need a test for when the TCP socket is destroyed on the server side without shutting down the GRPC server. #224 (comment)

  1. With an active connection, close only the TCP layer. We can do this by grabbing the socket itself (may require a monkeypatch), calling end() and destroy() on the socket, and giving it a timeout of 1 second. We expect the client to go IDLE as well; see the sketch after the diagram:
     ┌──────┐    gRPC   ┌──────┐
     │      │◄─────────►│      │
     │Client│           │Server│
     │      │◄─────────x│      │
     └──────┘    TCP    └──────┘
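
A hedged sketch of what such a test could use to assert the transition within a timeout, relying on grpc-js's channel API (waitForState is a hypothetical helper, not something in our codebase):

    import { ConnectivityState } from '@grpc/grpc-js';
    import type { Channel } from '@grpc/grpc-js';

    // Hypothetical test helper: resolves once `channel` reaches `target`, or
    // rejects if `timeout` milliseconds pass without that transition happening.
    async function waitForState(
      channel: Channel,
      target: ConnectivityState,
      timeout: number,
    ): Promise<void> {
      const deadline = Date.now() + timeout;
      let state = channel.getConnectivityState(false);
      while (state !== target) {
        await new Promise<void>((resolve, reject) => {
          channel.watchConnectivityState(state, deadline, (error) =>
            error != null ? reject(error) : resolve(),
          );
        });
        state = channel.getConnectivityState(false);
      }
    }

    // In the test, after destroying the server-side TCP socket:
    // await waitForState(client.getChannel(), ConnectivityState.IDLE, 1000);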
    

In this case, since there's no idle timeout functionality in grpc-js, one of two things can happen:

  1. GRPC's keepalive timeout triggers
  2. Our NodeConnection TTL triggers

What is the default for GRPC's keepalive timeout? And does it result in the same READY to IDLE transition? If a default exists, we should at least configure it: either disable it and rely on our NodeConnection timeout, or make it shorter for some other reason.

We want to avoid duplicate functionality here as it complicates debugging in the future.

    'grpc.keepalive_time_ms'?: number;
    'grpc.keepalive_timeout_ms'?: number;
    'grpc.keepalive_permit_without_calls'?: number;
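
As a point of reference, a hedged sketch of how these keepalive options would be passed as channel options when constructing a grpc-js client (TestServiceClient is a placeholder for any generated client class, and the values are illustrative):

    import { ChannelCredentials } from '@grpc/grpc-js';

    // `TestServiceClient` stands in for any generated grpc-js client class.
    const client = new TestServiceClient(
      '127.0.0.1:50051',
      ChannelCredentials.createInsecure(),
      {
        'grpc.keepalive_time_ms': 60_000,         // send a keepalive ping every 60 seconds
        'grpc.keepalive_timeout_ms': 10_000,      // drop the connection if no ping ack within 10 seconds
        'grpc.keepalive_permit_without_calls': 1, // keep pinging even when there are no active calls
      },
    );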

@CMCDragonkai
Member Author

If GRPC's keepalive timeout already does what we want from the NodeConnection TTL, then we have duplicate functionality here. I would prefer we keep our NodeConnection TTL, simply because we have better control over it, and somehow disable the GRPC keepalive. However, if GRPC keepalive has some special ability that our NodeConnection TTL doesn't, then we need to work that out.

@CMCDragonkai
Member Author

My guess is that GRPC's keepalive would keep the GRPC connection alive as long as the underlying socket is still live, meaning that even if I don't make any calls on it, it will be kept in the READY state.

Then that means our NodeConnection TTL works at the level of calls, timing out when no calls are made, whereas GRPC keepalive ensures that the GRPC connection itself is still alive.

Both of these can work simultaneously, but of course the GRPC keepalive timeout should be much shorter than our NodeConnection TTL.

Finally, the network UTP keepalive is completely independent of both, but it should be shorter than the GRPC keepalive timeout.

@CMCDragonkai
Member Author

@tegefaulkes did you only test the GRPC connections over TCP, or also over the network domain's UTP? If you haven't tested the latter, you should have a separate set of tests doing the same thing but over the network domain's UTP, ensuring the forward proxy and reverse proxy are running. Then you can simulate:

  1. NodeConnection TTL at 10 minutes
  2. GRPC keep alive timeout at 1 minute (60 seconds)
  3. UTP keepalive timeout at 20 seconds

We need both the GRPC over TCP and GRPC over network UTP to be sure.
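
A small sketch of that layered test configuration (the names are illustrative, not actual config keys in our codebase):

    // Layering: network (UTP) keepalive < GRPC keepalive < NodeConnection TTL
    const timeouts = {
      nodeConnectionTtl: 10 * 60 * 1000,   // 10 minutes
      grpcKeepaliveTimeoutMs: 60 * 1000,   // 1 minute
      utpKeepaliveIntervalMs: 20 * 1000,   // 20 seconds
    };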

@tegefaulkes
Contributor

Currently all of the tests are going through the proxies. I don't know if that's testing with TCP or UTP. I don't know the details of that.

@CMCDragonkai
Member Author

CMCDragonkai commented Feb 15, 2022

Ok, in that case it's simple: if you want to test how the GRPC keepalive works, you can do it directly over TCP instead.

Doing it over TCP will be important for solving #276.

tegefaulkes added a commit that referenced this issue Mar 24, 2022
Added a callback to the `Proxy` that is called when a `ForwardConnection` or `ReverseConnection` is established and authenticated. It is called with the following connection information: `remoteNodeId`, `remoteHost`, `remotePort` and `type`. The type signifies whether it was a forward or reverse connection. Note that this is only triggered by composed connections.

Added a test for whether the callback is called when a `ReverseConnection` is established.

Relates #332
Relates #344
CMCDragonkai added the r&d:polykey:core activity 1 Secret Vault Sharing and Secret History Management label Jul 24, 2022
@CMCDragonkai
Member Author

As we transition to the JSON RPC system, the underlying transport protocol no longer matters relative to the RPC. Instead, the ReadableWritablePair stream is what will be used by the RPC system. That object is what is needed to communicate errors or "stream state".

Right now, the object would contain both a readable stream and a writable stream. This means it's possible for there to be separate errors for each: you could have an error for the readable stream but not for the writable stream, and vice versa.

It seems we still need to be able to connect to the underlying state changes by adding an event handler to the stream, so that the RPC client abstraction can also be errored out.

We have to standardise on how we do this push-based dataflow. There are 2 ways:

  • Setting up custom callbacks during construction, or setters
  • Extending the EventTarget and making use of the standard ways of registering and deregistering for events

It is important that any error thrown in an event handler gets caught and re-emitted as an error event.

It's possible to convert any event-emitting object into an observable (specifically a subject), which may be more elegant for composition purposes; there's an issue exploring this in #444.
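
A minimal sketch of the EventTarget approach, assuming we catch handler errors ourselves (the class name and WeakMap bookkeeping are illustrative, not our actual abstraction):

    // Wraps registered handlers so that anything they throw is re-emitted as an
    // 'error' event on the same target instead of escaping the dispatch.
    type Handler = (evt: Event) => void;

    class SafeEventTarget extends EventTarget {
      protected wrappedHandlers = new WeakMap<Handler, Handler>();

      addEventListener(type: string, handler: Handler, options?: AddEventListenerOptions): void {
        const wrapped: Handler = (evt) => {
          try {
            handler(evt);
          } catch (e) {
            this.dispatchEvent(new CustomEvent('error', { detail: e }));
          }
        };
        this.wrappedHandlers.set(handler, wrapped);
        super.addEventListener(type, wrapped, options);
      }

      removeEventListener(type: string, handler: Handler, options?: EventListenerOptions): void {
        super.removeEventListener(type, this.wrappedHandlers.get(handler) ?? handler, options);
      }
    }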

@CMCDragonkai
Member Author

This is no longer relevant, as we are moving to using our own transport-agnostic RPC.

However, both the websocket and quic transports have events that correspond to state changes of the transport/stream/connection.

For the websocket transport, there will be a close event that occurs on the socket object (this is for the WS library).

For the quic transport, there are separate events corresponding to the connection and to the stream.

With respect to the RPCClient, it only has a factory function that acquires a stream when a method call is made.

await rpcClient.methods.unary();

So internally it calls the stream factory (the callback could take parameters) to acquire a stream.

That stream is then used to do the actual operation.

This stream only exists for the lifetime of the call itself. Otherwise the RPC client does not manage any additional state.

Which means the RPCClient has no need to keep track of state transitions of the underlying transport.

So what happens when the connection fails? It would mean that the next time one calls await rpcClient.methods.unary(), it would call the stream factory callback, and that callback would throw an exception indicating that the connection is broken.

This is great because our RPCClient is now stateless. It's not keeping any state, and therefore it doesn't need any of this complicated transition tracking.

The complexity of connection state transition tracking is now moved to a separate layer of the system. Specifically the actual transport implementations.
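
A rough sketch of that shape with illustrative names (the real RPCClient and its stream factory signature may differ):

    // The client only holds a factory that produces a fresh stream per call; a
    // broken transport surfaces as the factory throwing on the next call, so no
    // connection state needs to be tracked inside the client itself.
    type StreamFactory = () => Promise<ReadableWritablePair<Uint8Array, Uint8Array>>;

    class StatelessRPCClient {
      constructor(protected streamFactory: StreamFactory) {}

      async unary(request: Uint8Array): Promise<Uint8Array> {
        const stream = await this.streamFactory(); // throws here if the connection is broken
        const writer = stream.writable.getWriter();
        await writer.write(request);
        await writer.close();
        const reader = stream.readable.getReader();
        const { value } = await reader.read();
        return value ?? new Uint8Array();
      }
    }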

CMCDragonkai closed this as not planned Feb 17, 2023
CMCDragonkai self-assigned this Jul 10, 2023