QP reverse proxy gets stuck when the failure of the underlying tcp connection is not detected #14776

Closed
skonto opened this issue Jan 10, 2024 · 4 comments
Labels
area/networking kind/bug Categorizes issue or PR as related to a bug. triage/accepted Issues which should be fixed (post-triage)

Comments

@skonto
Contributor

skonto commented Jan 10, 2024

/area networking

What version of Knative?

1.10+

Expected Behavior

Downstream we face an issue similar to golang/go#59690 (see the related blog post here). It happens with a faulty request that kills the connection, but the reverse proxy in the QP never gets informed about it. I followed the workaround in golang/go#59690 (comment) and it allows the connection to be re-established; however, once a connection is deliberately failed it takes some time until it gets re-created, and that time depends on the configuration (a few seconds).
Specifically, I changed https://github.com/knative/serving/blob/main/vendor/knative.dev/pkg/network/h2c.go#L49
to:

		AllowHTTP:          true,
		DisableCompression: disableCompression,
		//// Enable healthchecks by setting the idle timeout
		ReadIdleTimeout: time.Second * 10,
		//// Change from the default of 15s
		PingTimeout: time.Second * 2,
		DialTLS: func(netw, addr string, _ *tls.Config) (net.Conn, error) {
			return DialWithBackOff(context.Background(),
				netw, addr)
		},

Details about the fields:

	// ReadIdleTimeout is the timeout after which a health check using ping
	// frame will be carried out if no frame is received on the connection.
	// Note that a ping response will be considered a received frame, so if
	// there is no other traffic on the connection, the health check will
	// be performed every ReadIdleTimeout interval.
	// If zero, no health check is performed.
	ReadIdleTimeout time.Duration

	// PingTimeout is the timeout after which the connection will be closed
	// if a response to Ping is not received.
	// Defaults to 15s.
	PingTimeout time.Duration
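
For reference, here is a minimal, self-contained sketch (not the Knative code itself) of an h2c client transport with the ping-based health check enabled. The timeout values are illustrative, a plain net.Dialer with a 5s keepalive stands in for DialWithBackOff, and the backend address and path are placeholders:

	package main

	import (
		"crypto/tls"
		"io"
		"net"
		"net/http"
		"time"

		"golang.org/x/net/http2"
	)

	func newH2CTransport() *http2.Transport {
		return &http2.Transport{
			AllowHTTP: true,
			// Send a ping if no frame has been received for 10s...
			ReadIdleTimeout: 10 * time.Second,
			// ...and close the connection if the ping is not answered within 2s.
			PingTimeout: 2 * time.Second,
			// h2c: ignore the TLS config and dial a plain TCP connection.
			DialTLS: func(netw, addr string, _ *tls.Config) (net.Conn, error) {
				return (&net.Dialer{KeepAlive: 5 * time.Second}).Dial(netw, addr)
			},
		}
	}

	func main() {
		client := &http.Client{Transport: newH2CTransport()}
		resp, err := client.Get("http://127.0.0.1:8085/healthz")
		if err == nil {
			io.Copy(io.Discard, resp.Body)
			resp.Body.Close()
		}
	}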

Here is the output of the QP without and with the workaround:
flan-t5-small-caikit-predictor-00001-deployment-84bc77fd7d5v4qj-queue-no-workaround.log

flan-t5-small-caikit-predictor-00001-deployment-9bb78c88c-2g6jx-queue-proxy-with-work.log

The critical part is this:

2023/12/22 16:18:16 http2: Transport received PING flags=ACK len=8 ping="f\xf7\x8e6\xe4ԟ'"
2023/12/22 16:18:16 http2: Framer 0xc0001077a0: read GOAWAY len=22 LastStreamID=45 ErrCode=ENHANCE_YOUR_CALM Debug="too_many_pings"
2023/12/22 16:18:16 http2: Transport received GOAWAY len=22 LastStreamID=45 ErrCode=ENHANCE_YOUR_CALM Debug="too_many_pings"
2023/12/22 16:18:16 transport got GOAWAY with error code = ENHANCE_YOUR_CALM
2023/12/22 16:18:16 http2: Transport readFrame error on conn 0xc000528180: (*errors.errorString) EOF

The last message shows that the connection is over. Later on you also see new connections being created via entries like:

2023/12/22 16:18:23 http2: Transport creating client conn 0xc000528c00 to 127.0.0.1:8085

As a side note, we already have a 5s keepalive set on the Dialer (https://github.com/knative/serving/blob/main/vendor/knative.dev/pkg/network/transports.go#L87), but the extra pings make the server send the GOAWAY reply.

In any case we need to protect the reverse proxy from these kinds of issues, and until this is fixed on the Golang side we should at least provide some mechanism for working around it. For now I propose we expose the transport configuration, e.g. via env vars or an annotation, so that users are able to fix this (to some extent). The downside is that the right transport configuration depends on the app traffic rate; if there is no traffic, this will keep re-creating the connection. Also, if many requests are in flight when the issue hits, they will all fail until the connection recovers. In any case it does not protect apps from someone exploiting this deliberately.
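
As a sketch of what exposing that configuration could look like (the env var names below are hypothetical, not an existing Knative API), the queue-proxy could read the two timeouts from the environment and plug them into the transport shown above:

	package main

	import (
		"fmt"
		"os"
		"time"
	)

	// durationFromEnv returns the parsed duration from the named env var,
	// falling back to the given default when unset or invalid.
	func durationFromEnv(key string, def time.Duration) time.Duration {
		if v := os.Getenv(key); v != "" {
			if d, err := time.ParseDuration(v); err == nil {
				return d
			}
		}
		return def
	}

	func main() {
		// Defaults mirror x/net/http2: health checking disabled, 15s ping timeout.
		// The env var names are made up for illustration.
		readIdle := durationFromEnv("QUEUE_PROXY_READ_IDLE_TIMEOUT", 0)
		ping := durationFromEnv("QUEUE_PROXY_PING_TIMEOUT", 15*time.Second)
		fmt.Println(readIdle, ping)
	}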

Actual Behavior

QP gets stuck and can't serve traffic. Note here that restarting either the app or the QP container resolves the issue, since the connection is re-established (see more at https://issues.redhat.com/browse/RHOAIENG-165).

Steps to Reproduce the Problem

The reproducer is quite complex; you need to follow the steps here: https://github.com/skonto/debug-caikit-serving/tree/main. It has been observed on OCP so far. Locally I tried to create a small reproducer as well (check here), but the connection does not seem to get stuck and I am still working on it.

@skonto skonto added the kind/bug Categorizes issue or PR as related to a bug. label Jan 10, 2024
@skonto
Contributor Author

skonto commented Jan 10, 2024

cc @ReToCode @dprotaso

@skonto skonto changed the title QP reverse proxy gets stuck when underlying failed tcp connection is not detected QP reverse proxy gets stuck when the underlying failed tcp connection is not detected Jan 10, 2024
@skonto skonto changed the title QP reverse proxy gets stuck when the underlying failed tcp connection is not detected QP reverse proxy gets stuck when the failure of the underlying tcp connection is not detected Jan 10, 2024
@dprotaso dprotaso added the triage/accepted Issues which should be fixed (post-triage) label Jan 10, 2024
@dprotaso
Member

Reading the golang issues, it sounds like we can mitigate this failure by setting either ReadIdleTimeout or PingTimeout, no? Setting those will start health checking the connection. Would we want to do that for all workloads?

@skonto
Contributor Author

skonto commented Jan 15, 2024

@dprotaso hi,

I managed to debug the python grpc server too.

I tested the following scenario:
a) send a successful grpc request

b) issue the bad http request (see the sketch after this list for what such a request might look like)

c) send another grpc request, which now fails
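
For illustration only, here is my assumption of what the bad request in step (b) could look like: a plain HTTP/2 (h2c) POST with a non-grpc content type sent directly to the grpc port (the address and path are placeholders):

	package main

	import (
		"crypto/tls"
		"net"
		"net/http"
		"strings"

		"golang.org/x/net/http2"
	)

	func main() {
		// h2c client: HTTP/2 over plain TCP, no TLS.
		client := &http.Client{Transport: &http2.Transport{
			AllowHTTP: true,
			DialTLS: func(netw, addr string, _ *tls.Config) (net.Conn, error) {
				return net.Dial(netw, addr)
			},
		}}
		req, _ := http.NewRequest(http.MethodPost, "http://127.0.0.1:8085/some/path", strings.NewReader("{}"))
		req.Header.Set("Content-Type", "application/json") // not application/grpc
		if resp, err := client.Do(req); err == nil {
			resp.Body.Close()
		}
	}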

transformer log:
flan-t5-small-caikit-predictor-00001-deployment-b598c9f69-l4xpw-transformer-container.log

QP log:
flan-t5-small-caikit-predictor-00001-deployment-b598c9f69-l4xpw-queue-proxy.log

The bad request produces the following in the python logs:

E0111 11:29:04.370439388 30 hpack_parser.cc:999] Error parsing 'content-type' metadata: invalid value
E0111 11:29:04.370465569 30 parsing.cc:913] INCOMING[0x7ff9a80023c0;0x7ff9a8015b80]: Parse failed with INTERNAL: Error parsing 'content-type' metadata [type.googleapis.com/grpc.status.int.stream_id='0']
D0111 11:29:04.370478619 30 chttp2_transport.cc:2372] MARK_STREAM_CLOSED: t=0x7ff9a80023c0 s=0x7ff9a8015b80(id=7) read+write [INTERNAL:Error parsing 'content-type' metadata

Then all requests fail with:

E0111 11:29:07.103099567 30 parsing.cc:913] INCOMING[0x7ff9a80023c0;0x7ff9a8015b80]: Parse failed with INTERNAL: Error parsing 'content-type' metadata [type.googleapis.com/grpc.status.int.stream_id='0']

This looks very similar to grpc/grpc#34721: an http request breaks the grpc connection state on the python side.
Check the comment here: grpc/grpc#34721 (comment)

There are two ways to fix this:
a) block the http2 request if it is not a grpc call (maybe Istio can do that based on the content-type header of the request, or even the http reverse proxy could do it; see the sketch below). Needs verification.
b) fix the python issue

I will try to discuss this on the grpc project side.
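
As a sketch of option (a) above (an assumption of how the gating could be done in front of a grpc-only backend, not an existing Knative feature), a reverse-proxy middleware could reject HTTP/2 requests whose Content-Type is not a grpc one before they reach the app:

	package main

	import (
		"net/http"
		"net/http/httputil"
		"net/url"
		"strings"
	)

	// rejectNonGRPC blocks HTTP/2 requests that do not carry a grpc content type,
	// so a stray plain-http request never reaches a grpc-only backend.
	func rejectNonGRPC(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			if r.ProtoMajor == 2 && !strings.HasPrefix(r.Header.Get("Content-Type"), "application/grpc") {
				http.Error(w, "unsupported content type", http.StatusUnsupportedMediaType)
				return
			}
			next.ServeHTTP(w, r)
		})
	}

	func main() {
		backend, _ := url.Parse("http://127.0.0.1:8085") // placeholder backend address
		proxy := httputil.NewSingleHostReverseProxy(backend)
		http.ListenAndServe(":8080", rejectNonGRPC(proxy))
	}

In practice the serving side would also need h2c support (e.g. the golang.org/x/net/http2/h2c handler) for this to see cleartext HTTP/2 traffic at all.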

> Reading the golang issues, it sounds like we can mitigate this failure by setting either ReadIdleTimeout or PingTimeout, no? Setting those will start health checking the connection. Would we want to do that for all workloads?

An "ugly" workaround, but not a fix, is to restart connections on the golang side by deliberately making the server send GOAWAY and terminate the current connection. This happens because we send pings at different rates, e.g. the dialer has a 5s keepalive and above I also set a 10s keepalive at the transport level (my assumption here).
Also, a malicious user can always keep sending bad requests.

However, we might need to set appropriate ReadIdleTimeout and PingTimeout values, as in Azure/azure-sdk-for-go-extensions#29, for all workloads in order to deal with golang/go#59690. That issue is caused by a dropped tcp connection, while the current one is not, but the same settings are useful in both scenarios for different reasons. It needs investigation whether the same setting values work around both issues at once.

To summarize:
a) golang/go#59690 should be worked around with the ReadIdleTimeout and PingTimeout settings.
b) the current issue has some potential fixes:
- block the http2 request if it is not a grpc call (maybe Istio can do that based on the content-type header of the request, or even the http reverse proxy could do it). Needs verification.
- fix the python issue
and a potential workaround (hack): set ReadIdleTimeout and PingTimeout so that the server sends GOAWAY and the connection gets re-created.

@dtrifiro

As mentioned in the grpcio issue (grpc/grpc#34721 (comment)), the issue is not present anymore, at least with grpcio>=1.62.0.

@skonto skonto closed this as completed Jun 14, 2024