
Verify GRPC channel connectivity state READY to IDLE transition #332

Closed
CMCDragonkai opened this issue Feb 14, 2022 · 16 comments
Assignees: CMCDragonkai
Labels: development (Standard development), epic (Big issue with multiple subissues), r&d:polykey:core activity 1 (Secret Vault Sharing and Secret History Management)

Comments

@CMCDragonkai
Member

CMCDragonkai commented Feb 14, 2022

Specification

In #224 and #310 we introduced the ability for our GRPCClient to react to state changes in the underlying client's channel. In particular, we perform a this.destroy() when the underlying channel is no longer live.

One state transition we were not able to test for is when the underlying channel state goes from READY to IDLE.

According to #310 (comment), this can occur in one of two ways:

  1. when the server closes the connection (already have a test for this)
  2. when the underlying GRPC TTL times out (no test for this yet)

So the goal of this issue is to trigger the GRPC TTL and verify that this transition actually occurs.

Does this GRPC TTL actually exist? According to the C++ core docs, it is controlled by this parameter https://grpc.github.io/grpc/core/group__grpc__arg__keys.html#ga51ab062269cd81298f5adb6fd9a45e99, and the default value is 30 minutes. However, when @tegefaulkes tried to trigger it, he did not observe the transition after 6 minutes.

Furthermore, the transition table https://github.com/grpc/grpc/blob/master/doc/connectivity-semantics-and-api.md indicates that READY becomes IDLE after the idle timeout passes.

In the source code of grpc-js, there is no mention of an idle timeout parameter among the channel options, so it's possible that this hasn't been exposed in grpc-js, or perhaps not implemented at all.

  /**
   * An interface that contains options used when initializing a Channel instance.
   */
  export interface ChannelOptions {
    'grpc.ssl_target_name_override'?: string;
    'grpc.primary_user_agent'?: string;
    'grpc.secondary_user_agent'?: string;
    'grpc.default_authority'?: string;
    'grpc.keepalive_time_ms'?: number;
    'grpc.keepalive_timeout_ms'?: number;
    'grpc.keepalive_permit_without_calls'?: number;
    'grpc.service_config'?: string;
    'grpc.max_concurrent_streams'?: number;
    'grpc.initial_reconnect_backoff_ms'?: number;
    'grpc.max_reconnect_backoff_ms'?: number;
    'grpc.use_local_subchannel_pool'?: number;
    'grpc.max_send_message_length'?: number;
    'grpc.max_receive_message_length'?: number;
    'grpc.enable_http_proxy'?: number;
    'grpc.http_connect_target'?: string;
    'grpc.http_connect_creds'?: string;
    'grpc.default_compression_algorithm'?: CompressionAlgorithms;
    'grpc.enable_channelz'?: number;
    'grpc-node.max_session_memory'?: number;
    // eslint-disable-next-line @typescript-eslint/no-explicit-any
    [key: string]: any;
  }
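
For context, this READY to IDLE transition can be observed from the client side with grpc-js's existing channel API (getConnectivityState / watchConnectivityState). A minimal sketch (not our actual GRPCClient channel watcher) of watching for the channel leaving READY:

    import { ConnectivityState } from '@grpc/grpc-js';
    import type { Channel } from '@grpc/grpc-js';

    // Watch the channel until it leaves READY (e.g. READY -> IDLE), then hand the
    // new state to a callback; a channel watcher that calls this.destroy() would
    // hook into something shaped like this.
    function watchUntilNotReady(
      channel: Channel,
      onNotReady: (state: ConnectivityState) => void,
    ): void {
      const currentState = channel.getConnectivityState(false);
      channel.watchConnectivityState(currentState, Infinity, (error) => {
        if (error != null) return; // watch was cancelled or the deadline passed
        const nextState = channel.getConnectivityState(false);
        if (currentState === ConnectivityState.READY && nextState !== ConnectivityState.READY) {
          onNotReady(nextState);
        } else {
          watchUntilNotReady(channel, onNotReady);
        }
      });
    }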

A question about this has been posted: https://groups.google.com/g/grpc-io/c/yq4pmBGaXOQ

@joshuakarp please also create an issue on the grpc-js GitHub requesting this as a feature addition.

Additional context

  • Upstream feature request: Set a channel's idle timeout grpc/grpc-node#2046

  • Propagate networking connection error handling into NodeConnection error handling #224 - original issue about exposing channel state to our GRPC abstractions, with the following state diagram:

    @startuml
    ' Inactive connection
    state "Inactive connection" as NC1 {
      NC1 : running: false
      NC1 : destroyed: false
      NC1 : status: starting
      state "Client" as Client1 {
        Client1 : IDLE
      }
    }
    
    NC1 --> NC2 : attempt connection
    
    ' Connecting to target
    state "Connecting to target" as NC2 {
      NC2 : running: false
      NC2 : destroyed: false
      NC2 : status: starting
      state "Client" as Client2 {
        Client2 : CONNECTING
      }
    }
     
    NC2 --> NC3 : successfully connected to target
    
    ' Active connection
    state "Active connection" as NC3 {
      NC3 : running: true
      NC3 : destroyed: false
      NC3 : status: starting
      state "Client" as Client3 {
        Client3 : READY
      }
    }
    
    NC3 --> NC3 : gRPC call performed
    'NC3 --> NC8 : gRPC TTL expires
    NC3 --> NC9 : server closes connection
    NC3 --> NC5 : error during gRPC call
    
    ' Idle connection
    'state "Idle connection" as NC8 {
    '  NC8 : running: true
    '  NC8 : destroyed: false
    '  NC8 : status: starting
    '  state "Client" as Client8 {
    '    Client8 : IDLE
    '  }
    '}
    
    'NC8 --> NC3 : gRPC call invoked
    
    ' Dropped connection
    state "Dropped connection" as NC9 {
      NC9 : running: true
      NC9 : destroyed: false
      NC9 : status: destroying
      state "Client" as Client9 {
        Client9 : IDLE
      }
    }
    
    NC2 --> NC5 : timed out | failed to validate certificate
    NC9 --> NC7 : channel watcher destroys\n client and shuts down
    
    ' Failed certificate validation
    state "Connection error" as NC5 {
      NC5 : running: true
      NC5 : destroyed: false
      NC5 : status: starting
      state "Client" as Client5 {
        Client5 : TRANSIENT_FAILURE
      }
    }
    
    NC3 --> NC6 : destroy connection | connection TTL expired
    NC5 --> NC7 : destroy client
    
    ' Destroy NodeConnection
    state "Destroying connection" as NC6 {
      NC6 : running: true
      NC6 : destroyed: false
      NC6 : status: destroying
      state "Client" as Client6 {
        Client6 : READY
      }
    }
    
    NC6 --> NC7 : destroy client
    
    ' Destroy client
    state "Destroyed connection" as NC7 {
      NC7 : running: false
      NC7 : destroyed: true
      NC7 : status: destroying
      state "Client" as Client7 {
        Client7 : SHUTDOWN
      }
    }
    
    ' Destroy client
    'state "Destroyed connection" as NC10 {
    '  NC10 : running: false
    '  NC10 : destroyed: true
    '  NC10 : status: destroying
    '  state "Client" as Client10 {
    '    Client10 : IDLE
    '  }
    '}
    @enduml
    

Tasks

  1. ...
  2. ...
  3. ...
@joshuakarp
Contributor

According to https://groups.google.com/g/grpc-io/c/yq4pmBGaXOQ, this isn't currently supported, so I'll post an issue upstream.

@joshuakarp
Contributor

I've created an upstream issue for this.

@CMCDragonkai
Member Author

Does that mean there's no IDLE timeout and therefore the connection may stay READY forever if there is no activity on the connection?

If so, how does this impact our model of connection failure such as when the underlying TCP socket is destroyed from the server side?

@CMCDragonkai
Member Author

I'd also like to know how the existing GRPC keepalive parameters work. Can they help in this area as well? Or will they just duplicate the functionality of our NodeConnection TTL?

Please add these into the feature request issue for clarification.

@joshuakarp
Contributor

Does that mean there's no IDLE timeout and therefore the connection may stay READY forever if there is no activity on the connection?

@tegefaulkes said that he'd tested this previously, and had seen the connection go from READY to IDLE after approximately 6 minutes of no activity, so I don't believe this is the case.

I'd also like to know how the existing GRPC keepalive parameters work. Can they help in this area as well? Or will they just duplicate the functionality of our NodeConnection TTL?

Please add these into the feature request issue for clarification.

Added.

@tegefaulkes
Contributor

Either it will stay READY forever, or it will time out after 30 minutes or some other arbitrary timeout. In terms of how we handle our connection logic, it shouldn't make a difference unless our own TTL exceeds the GRPC timeout, in which case we treat it as a normal connection failure.

I'm just making an assumption here, but I believe the keepalive parameters relate to TCP keepalive pings: the time spacing of the pings, and how long the connection waits after a missed keepalive ping before closing the connection.

@tegefaulkes
Contributor

I tested to see how long a timeout would take, assuming it would be 5 minutes. I gave up after 6 minutes without seeing it return to IDLE, so I don't know how long the timeout is, but it's longer than 6 minutes.

@joshuakarp
Contributor

I tested to see how long a timeout would take, assuming it would be 5 minutes. I gave up after 6 minutes without seeing it return to IDLE, so I don't know how long the timeout is, but it's longer than 6 minutes.

Ah my bad - I had thought that you'd seen it transition after 6 minutes. This makes more sense in the context of everything.

@CMCDragonkai
Member Author

CMCDragonkai commented Feb 15, 2022

So we need a test for when the TCP socket is destroyed on the server side without shutting down the GRPC server. #224 (comment)

  1. With an active connection, close only the TCP layer. We can do this by grabbing the socket itself (may require a monkeypatch), calling end() and destroy() on the socket, and giving it a timeout of 1 second. We expect the client to go IDLE as well; see the sketch after the diagram:
     ┌──────┐    gRPC   ┌──────┐
     │      │◄─────────►│      │
     │Client│           │Server│
     │      │◄─────────x│      │
     └──────┘    TCP    └──────┘
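
A hedged sketch of what such a test could use to assert the transition within a timeout, relying on grpc-js's channel API (waitForState is a hypothetical helper, not something in our codebase):

    import { ConnectivityState } from '@grpc/grpc-js';
    import type { Channel } from '@grpc/grpc-js';

    // Hypothetical test helper: resolves once `channel` reaches `target`, or
    // rejects if `timeout` milliseconds pass without that transition happening.
    async function waitForState(
      channel: Channel,
      target: ConnectivityState,
      timeout: number,
    ): Promise<void> {
      const deadline = Date.now() + timeout;
      let state = channel.getConnectivityState(false);
      while (state !== target) {
        await new Promise<void>((resolve, reject) => {
          channel.watchConnectivityState(state, deadline, (error) =>
            error != null ? reject(error) : resolve(),
          );
        });
        state = channel.getConnectivityState(false);
      }
    }

    // In the test, after destroying the server-side TCP socket:
    // await waitForState(client.getChannel(), ConnectivityState.IDLE, 1000);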
    

In this case, since there's no idle timeout functionality in grpc-js, one of two things can happen:

  1. GRPC's keepalive timeout triggers
  2. Our NodeConnection TTL triggers

What is the default for GRPC's keepalive timeout? And does it result in the same READY to IDLE transition? If a default exists, we should at least configure it: either disable it and rely on our NodeConnection timeout, or make it shorter for some other reason.

We want to avoid duplicate functionality here as it complicates debugging in the future.

    'grpc.keepalive_time_ms'?: number;
    'grpc.keepalive_timeout_ms'?: number;
    'grpc.keepalive_permit_without_calls'?: number;
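
As a point of reference, a hedged sketch of how these keepalive options would be passed as channel options when constructing a grpc-js client (TestServiceClient is a placeholder for any generated client class, and the values are illustrative):

    import { ChannelCredentials } from '@grpc/grpc-js';

    // `TestServiceClient` stands in for any generated grpc-js client class.
    const client = new TestServiceClient(
      '127.0.0.1:50051',
      ChannelCredentials.createInsecure(),
      {
        'grpc.keepalive_time_ms': 60_000,         // send a keepalive ping every 60 seconds
        'grpc.keepalive_timeout_ms': 10_000,      // drop the connection if no ping ack within 10 seconds
        'grpc.keepalive_permit_without_calls': 1, // keep pinging even when there are no active calls
      },
    );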

@CMCDragonkai
Member Author

If GRPC's keepalive timeout already does what we want from the NodeConnection TTL, then we have duplicate functionality here. I would prefer we keep our NodeConnection TTL, simply because we have better control over it, and somehow disable the GRPC keepalive. However, if GRPC keepalive has some special ability that our NodeConnection TTL doesn't, then we need to work that out.

@CMCDragonkai
Member Author

My guess is that GRPC's keepalive would keep the GRPC connection alive as long as the underlying socket is still live, meaning that even if I don't make any calls on it, it will be kept in the READY state.

Then that means our NodeConnection TTL works at the level of calls, timing out when no calls are made, whereas GRPC keepalive ensures that the GRPC connection itself is still alive.

Both of these can work simultaneously, but of course the GRPC keepalive timeout should be much shorter than our NodeConnection TTL.

Finally, the network UTP keepalive is completely independent of both, but it should be shorter than the GRPC keepalive timeout.

@CMCDragonkai
Member Author

@tegefaulkes did you only test the GRPC connections over TCP, or also over the network domain's UTP? If you haven't tested the latter, you should have a separate set of tests doing the same thing but over the network domain's UTP, ensuring the forward proxy and reverse proxy are running. Then you can simulate:

  1. NodeConnection TTL at 10 minutes
  2. GRPC keep alive timeout at 1 minute (60 seconds)
  3. UTP keepalive timeout at 20 seconds

We need both the GRPC over TCP and GRPC over network UTP to be sure.
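
A small sketch of that layered test configuration (the names are illustrative, not actual config keys in our codebase):

    // Layering: network (UTP) keepalive < GRPC keepalive < NodeConnection TTL
    const timeouts = {
      nodeConnectionTtl: 10 * 60 * 1000,   // 10 minutes
      grpcKeepaliveTimeoutMs: 60 * 1000,   // 1 minute
      utpKeepaliveIntervalMs: 20 * 1000,   // 20 seconds
    };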

@tegefaulkes
Contributor

Currently all of the tests are going through the proxies. I don't know if that's testing with TCP or UTP. I don't know the details of that.

@CMCDragonkai
Member Author

CMCDragonkai commented Feb 15, 2022

Ok, in that case it's simple: if you want to test how the GRPC keepalive works, you can do it directly over TCP instead.

Doing it over TCP will be important for solving #276.

tegefaulkes added a commit that referenced this issue Mar 24, 2022
Added a callback to the `Proxy` that is called when a `ForwardConnection` or `ReverseConnection` is established and authenticated. It is called with the following connection information: `remoteNodeId`, `remoteHost`, `remotePort` and `type`. The type signifies whether it was a forward or reverse connection. Note that this is only triggered by composed connections.

Added a test for whether the callback is called when a `ReverseConnection` is established.

Relates #332
Relates #344
CMCDragonkai added the r&d:polykey:core activity 1 Secret Vault Sharing and Secret History Management label Jul 24, 2022
@CMCDragonkai
Member Author

As we transition to the JSON RPC system, the underlying transport protocol no longer matters relative to the RPC. Instead, the ReadableWritablePair stream is what will be used by the RPC system. That object is what is needed to communicate errors or "stream state".

Right now, the object would contain both a readable stream and a writable stream. This means it's possible for there to be separate errors for each: you could have an error for the readable stream but not for the writable stream, and vice versa.

It seems we still need to be able to connect to the underlying state changes by adding an event handler to the stream, so that the RPC client abstraction can also be errored out.

We have to standardise on how we do this push-based dataflow. There are 2 ways:

  • Setting up custom callbacks during construction, or setters
  • Extending the EventTarget and making use of the standard ways of registering and deregistering for events

It is important that any error thrown in an event handler gets caught and re-emitted as an error event.

It's possible to convert any event-emitting object into an observable (specifically a subject), which may be more elegant for composition purposes; there's an issue exploring this in #444.
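
A minimal sketch of the EventTarget approach, assuming we catch handler errors ourselves (the class name and WeakMap bookkeeping are illustrative, not our actual abstraction):

    // Wraps registered handlers so that anything they throw is re-emitted as an
    // 'error' event on the same target instead of escaping the dispatch.
    type Handler = (evt: Event) => void;

    class SafeEventTarget extends EventTarget {
      protected wrappedHandlers = new WeakMap<Handler, Handler>();

      addEventListener(type: string, handler: Handler, options?: AddEventListenerOptions): void {
        const wrapped: Handler = (evt) => {
          try {
            handler(evt);
          } catch (e) {
            this.dispatchEvent(new CustomEvent('error', { detail: e }));
          }
        };
        this.wrappedHandlers.set(handler, wrapped);
        super.addEventListener(type, wrapped, options);
      }

      removeEventListener(type: string, handler: Handler, options?: EventListenerOptions): void {
        super.removeEventListener(type, this.wrappedHandlers.get(handler) ?? handler, options);
      }
    }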

@CMCDragonkai
Member Author

This is no longer relevant, as we are moving to using our own transport-agnostic RPC.

However, both the websocket and quic transports have events that correspond to state changes of the transport/stream/connection.

For the websocket transport, there will be a close event that occurs on the socket object (this is for the WS library).

For the quic transport, there are separate events corresponding to the connection and to the stream.

With respect to the RPCClient, it only has a factory function that acquires a stream when a method call is made.

await rpcClient.methods.unary();

So internally it calls the stream factory (the callback could take parameters) to acquire a stream.

That stream is then used to do the actual operation.

This stream only exists for the lifetime of the call itself. Otherwise the RPC client does not manage any additional state.

Which means the RPCClient has no need to keep track of state transitions of the underlying transport.

So what happens when the connection fails? It would mean that the next time one calls await rpcClient.methods.unary(), it would call the stream factory callback, and that callback would throw an exception indicating that the connection is broken.

This is great because our RPCClient is now stateless. It's not keeping any state, and therefore it doesn't need any of this complicated transition tracking.

The complexity of connection state transition tracking is now moved to a separate layer of the system. Specifically the actual transport implementations.
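
A rough sketch of that shape with illustrative names (the real RPCClient and its stream factory signature may differ):

    // The client only holds a factory that produces a fresh stream per call; a
    // broken transport surfaces as the factory throwing on the next call, so no
    // connection state needs to be tracked inside the client itself.
    type StreamFactory = () => Promise<ReadableWritablePair<Uint8Array, Uint8Array>>;

    class StatelessRPCClient {
      constructor(protected streamFactory: StreamFactory) {}

      async unary(request: Uint8Array): Promise<Uint8Array> {
        const stream = await this.streamFactory(); // throws here if the connection is broken
        const writer = stream.writable.getWriter();
        await writer.write(request);
        await writer.close();
        const reader = stream.readable.getReader();
        const { value } = await reader.read();
        return value ?? new Uint8Array();
      }
    }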

CMCDragonkai closed this as not planned Feb 17, 2023
CMCDragonkai self-assigned this Jul 10, 2023