QP reverse proxy gets stuck when the failure of the underlying tcp connection is not detected #14776

Closed
skonto opened this issue Jan 10, 2024 · 4 comments
Labels
area/networking kind/bug Categorizes issue or PR as related to a bug. triage/accepted Issues which should be fixed (post-triage)

Comments

@skonto
Contributor

skonto commented Jan 10, 2024

/area networking

What version of Knative?

1.10+

Expected Behavior

Downstream we face an issue similar to golang/go#59690 (see the related blog post here). It happens with a faulty request that kills the connection, but the reverse proxy in the QP never gets informed about it. I followed the workaround in golang/go#59690 (comment) and it allows the connection to be re-established; however, once a connection is deliberately failed it takes some time until it gets re-created, and that time depends on the configuration (a few seconds).
Specifically, I changed https://github.com/knative/serving/blob/main/vendor/knative.dev/pkg/network/h2c.go#L49
to:

		AllowHTTP:          true,
		DisableCompression: disableCompression,
		//// Enable healthchecks by setting the idle timeout
		ReadIdleTimeout: time.Second * 10,
		//// Change from the default of 15s
		PingTimeout: time.Second * 2,
		DialTLS: func(netw, addr string, _ *tls.Config) (net.Conn, error) {
			return DialWithBackOff(context.Background(),
				netw, addr)
		},

Details about the fields:

	// ReadIdleTimeout is the timeout after which a health check using ping
	// frame will be carried out if no frame is received on the connection.
	// Note that a ping response will be considered a received frame, so if
	// there is no other traffic on the connection, the health check will
	// be performed every ReadIdleTimeout interval.
	// If zero, no health check is performed.
	ReadIdleTimeout time.Duration

	// PingTimeout is the timeout after which the connection will be closed
	// if a response to Ping is not received.
	// Defaults to 15s.
	PingTimeout time.Duration
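
For reference, here is a minimal, self-contained sketch (not the Knative code itself) of an h2c client transport with the ping-based health check enabled. The timeout values are illustrative, a plain net.Dialer with a 5s keepalive stands in for DialWithBackOff, and the backend address and path are placeholders:

	package main

	import (
		"crypto/tls"
		"io"
		"net"
		"net/http"
		"time"

		"golang.org/x/net/http2"
	)

	func newH2CTransport() *http2.Transport {
		return &http2.Transport{
			AllowHTTP: true,
			// Send a ping if no frame has been received for 10s...
			ReadIdleTimeout: 10 * time.Second,
			// ...and close the connection if the ping is not answered within 2s.
			PingTimeout: 2 * time.Second,
			// h2c: ignore the TLS config and dial a plain TCP connection.
			DialTLS: func(netw, addr string, _ *tls.Config) (net.Conn, error) {
				return (&net.Dialer{KeepAlive: 5 * time.Second}).Dial(netw, addr)
			},
		}
	}

	func main() {
		client := &http.Client{Transport: newH2CTransport()}
		resp, err := client.Get("http://127.0.0.1:8085/healthz")
		if err == nil {
			io.Copy(io.Discard, resp.Body)
			resp.Body.Close()
		}
	}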

Here is the output of the QP without and with the workaround:
flan-t5-small-caikit-predictor-00001-deployment-84bc77fd7d5v4qj-queue-no-workaround.log

flan-t5-small-caikit-predictor-00001-deployment-9bb78c88c-2g6jx-queue-proxy-with-work.log

The critical part is this:

2023/12/22 16:18:16 http2: Transport received PING flags=ACK len=8 ping="f\xf7\x8e6\xe4ԟ'"
2023/12/22 16:18:16 http2: Framer 0xc0001077a0: read GOAWAY len=22 LastStreamID=45 ErrCode=ENHANCE_YOUR_CALM Debug="too_many_pings"
2023/12/22 16:18:16 http2: Transport received GOAWAY len=22 LastStreamID=45 ErrCode=ENHANCE_YOUR_CALM Debug="too_many_pings"
2023/12/22 16:18:16 transport got GOAWAY with error code = ENHANCE_YOUR_CALM
2023/12/22 16:18:16 http2: Transport readFrame error on conn 0xc000528180: (*errors.errorString) EOF

The last message shows that the connection is over. Later on you also see new connections being created via entries like:

2023/12/22 16:18:23 http2: Transport creating client conn 0xc000528c00 to 127.0.0.1:8085

As a side note, we already have a 5s keepalive set on the Dialer (https://github.com/knative/serving/blob/main/vendor/knative.dev/pkg/network/transports.go#L87), but the extra pings make the server send the GOAWAY reply.

In any case we need to protect the reverse proxy from these kinds of issues, and until this is fixed on the Golang side we should at least provide some mechanism for working around it. For now I propose we expose the transport configuration, e.g. via env vars or an annotation, so that users are able to fix this (to some extent). The downside is that the right transport configuration depends on the app traffic rate; if there is no traffic, this will keep re-creating the connection. Also, if many requests are in flight when the issue hits, they will all fail until the connection recovers. In any case it does not protect apps from someone exploiting this deliberately.
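
As a sketch of what exposing that configuration could look like (the env var names below are hypothetical, not an existing Knative API), the queue-proxy could read the two timeouts from the environment and plug them into the transport shown above:

	package main

	import (
		"fmt"
		"os"
		"time"
	)

	// durationFromEnv returns the parsed duration from the named env var,
	// falling back to the given default when unset or invalid.
	func durationFromEnv(key string, def time.Duration) time.Duration {
		if v := os.Getenv(key); v != "" {
			if d, err := time.ParseDuration(v); err == nil {
				return d
			}
		}
		return def
	}

	func main() {
		// Defaults mirror x/net/http2: health checking disabled, 15s ping timeout.
		// The env var names are made up for illustration.
		readIdle := durationFromEnv("QUEUE_PROXY_READ_IDLE_TIMEOUT", 0)
		ping := durationFromEnv("QUEUE_PROXY_PING_TIMEOUT", 15*time.Second)
		fmt.Println(readIdle, ping)
	}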

Actual Behavior

QP gets stuck and can't serve traffic. Note here that restarting either the app or the QP container resolves the issue, since the connection is re-established (see more at https://issues.redhat.com/browse/RHOAIENG-165).

Steps to Reproduce the Problem

The reproducer is quite complex; you need to follow the steps here: https://github.com/skonto/debug-caikit-serving/tree/main. It has been observed on OCP so far. Locally I tried to create a small reproducer as well (check here), but the connection does not seem to get stuck and I am still working on it.

@skonto skonto added the kind/bug Categorizes issue or PR as related to a bug. label Jan 10, 2024
@skonto
Contributor Author

skonto commented Jan 10, 2024

cc @ReToCode @dprotaso

@skonto skonto changed the title QP reverse proxy gets stuck when underlying failed tcp connection is not detected QP reverse proxy gets stuck when the underlying failed tcp connection is not detected Jan 10, 2024
@skonto skonto changed the title QP reverse proxy gets stuck when the underlying failed tcp connection is not detected QP reverse proxy gets stuck when the failure of the underlying tcp connection is not detected Jan 10, 2024
@dprotaso dprotaso added the triage/accepted Issues which should be fixed (post-triage) label Jan 10, 2024
@dprotaso
Member

Reading the golang issues, it sounds like we can mitigate this failure by setting either ReadIdleTimeout or PingTimeout, no? Setting those will start health checking the connection. Would we want to do that for all workloads?

@skonto
Contributor Author

skonto commented Jan 15, 2024

@dprotaso hi,

I managed to debug the python grpc server too.

I tested the following scenario:
a) send a successful grpc request

b) issue the bad http request (see the sketch after this list for what such a request might look like)

c) send another grpc request, which now fails
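
For illustration only, here is my assumption of what the bad request in step (b) could look like: a plain HTTP/2 (h2c) POST with a non-grpc content type sent directly to the grpc port (the address and path are placeholders):

	package main

	import (
		"crypto/tls"
		"net"
		"net/http"
		"strings"

		"golang.org/x/net/http2"
	)

	func main() {
		// h2c client: HTTP/2 over plain TCP, no TLS.
		client := &http.Client{Transport: &http2.Transport{
			AllowHTTP: true,
			DialTLS: func(netw, addr string, _ *tls.Config) (net.Conn, error) {
				return net.Dial(netw, addr)
			},
		}}
		req, _ := http.NewRequest(http.MethodPost, "http://127.0.0.1:8085/some/path", strings.NewReader("{}"))
		req.Header.Set("Content-Type", "application/json") // not application/grpc
		if resp, err := client.Do(req); err == nil {
			resp.Body.Close()
		}
	}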

transformer log:
flan-t5-small-caikit-predictor-00001-deployment-b598c9f69-l4xpw-transformer-container.log

QP log:
flan-t5-small-caikit-predictor-00001-deployment-b598c9f69-l4xpw-queue-proxy.log

The bad request produces the following in the python logs:

E0111 11:29:04.370439388 30 hpack_parser.cc:999] Error parsing 'content-type' metadata: invalid value
E0111 11:29:04.370465569 30 parsing.cc:913] INCOMING[0x7ff9a80023c0;0x7ff9a8015b80]: Parse failed with INTERNAL: Error parsing 'content-type' metadata [type.googleapis.com/grpc.status.int.stream_id='0']
D0111 11:29:04.370478619 30 chttp2_transport.cc:2372] MARK_STREAM_CLOSED: t=0x7ff9a80023c0 s=0x7ff9a8015b80(id=7) read+write [INTERNAL:Error parsing 'content-type' metadata

Then all requests fail with:

E0111 11:29:07.103099567 30 parsing.cc:913] INCOMING[0x7ff9a80023c0;0x7ff9a8015b80]: Parse failed with INTERNAL: Error parsing 'content-type' metadata [type.googleapis.com/grpc.status.int.stream_id='0']

This looks very similar to grpc/grpc#34721: an http request breaks the grpc connection state on the python side.
Check the comment here: grpc/grpc#34721 (comment)

There are two ways to fix this:
a) block the http2 request if it is not a grpc call (maybe Istio can do that based on the content-type header of the request, or even the http reverse proxy could do it; see the sketch below). Needs verification.
b) fix the python issue

I will try to discuss this on the grpc project side.
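
As a sketch of option (a) above (an assumption of how the gating could be done in front of a grpc-only backend, not an existing Knative feature), a reverse-proxy middleware could reject HTTP/2 requests whose Content-Type is not a grpc one before they reach the app:

	package main

	import (
		"net/http"
		"net/http/httputil"
		"net/url"
		"strings"
	)

	// rejectNonGRPC blocks HTTP/2 requests that do not carry a grpc content type,
	// so a stray plain-http request never reaches a grpc-only backend.
	func rejectNonGRPC(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			if r.ProtoMajor == 2 && !strings.HasPrefix(r.Header.Get("Content-Type"), "application/grpc") {
				http.Error(w, "unsupported content type", http.StatusUnsupportedMediaType)
				return
			}
			next.ServeHTTP(w, r)
		})
	}

	func main() {
		backend, _ := url.Parse("http://127.0.0.1:8085") // placeholder backend address
		proxy := httputil.NewSingleHostReverseProxy(backend)
		http.ListenAndServe(":8080", rejectNonGRPC(proxy))
	}

In practice the serving side would also need h2c support (e.g. the golang.org/x/net/http2/h2c handler) for this to see cleartext HTTP/2 traffic at all.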

> Reading the golang issues, it sounds like we can mitigate this failure by setting either ReadIdleTimeout or PingTimeout, no? Setting those will start health checking the connection. Would we want to do that for all workloads?

An "ugly" workaround, but not a fix, is to restart connections on the golang side by deliberately making the server send GOAWAY and terminate the current connection. This happens because we send pings at different rates, e.g. the dialer has a 5s keepalive and above I also set a 10s keepalive at the transport level (my assumption here).
Also, a malicious user can always keep sending bad requests.

However, we might need to set appropriate ReadIdleTimeout and PingTimeout values, as in Azure/azure-sdk-for-go-extensions#29, for all workloads in order to deal with golang/go#59690. That issue is caused by a dropped tcp connection, while the current one is not, but the same settings are useful in both scenarios for different reasons. It needs investigation whether the same setting values work around both issues at once.

To summarize:
a) golang/go#59690 should be worked around with the ReadIdleTimeout and PingTimeout settings.
b) the current issue has some potential fixes:
- block the http2 request if it is not a grpc call (maybe Istio can do that based on the content-type header of the request, or even the http reverse proxy could do it). Needs verification.
- fix the python issue
and a potential workaround (hack): set ReadIdleTimeout and PingTimeout so that the server sends GOAWAY and the connection gets re-created.

@dtrifiro

As mentioned in the grpcio issue (grpc/grpc#34721 (comment)), the issue is not present anymore, at least with grpcio>=1.62.0.

@skonto skonto closed this as completed Jun 14, 2024