Web client connection pool may have connections with invalid HttpObjectEncoder state #177
I can confirm this issue. It is hard to recreate, but it often happens when the
Hi, is it possible to test with the latest available Netty, netty-4.1.16.Final? Thanks.
Please ignore that; I saw in the stack trace that you are using it already.
@alexmaddriver could you show some snippet that's representative of how the
There might be a timing issue with the cleanup action that gets subscribed. When I was working on #176, I noticed a similar issue when I first had an invalid fix for the memory leak I had found. Coincidentally, that fix might also fix this issue, since it would remove the possibility of the cleanup action getting signalled out of order.
Could you please try out #176, @alexmaddriver? It might not be the most optimal fix for this issue, or even fix this issue at all, but it would be interesting to see if it changes anything. :)
Hello, we use the latest spring-webflux 5.0.0.RELEASE.
@lnxmad: we disable the connection pool when creating the WebClient.
@rstoyanchev: we do not use the bodyToXXX methods; instead we call "targetResponse.body(toDataBuffers())" directly, because of a recent bug where the bodyToXXX methods used to close connections as a side effect. As that was fixed in 0.7.0.RELEASE, we can now certainly switch to bodyToFlux. The code snippet:
protected Mono<ResponseEntity<Flux>> proxyPostRequest(ServerHttpRequest request) {
private Function<ClientResponse, ResponseEntity<Flux>> toResponseEntity() {
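For reference, a minimal sketch of how the pool can be disabled with the Spring 5.0 / reactor-netty 0.7 APIs; this is an assumed reconstruction of the shape of the omitted snippet (class and method names are illustrative), not the author's actual code:

import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.web.reactive.function.client.WebClient;

public class PoolDisabledWebClient {
    // Build a WebClient whose reactor-netty backend has connection pooling
    // disabled, so every request opens a fresh connection instead of reusing
    // one that may carry a broken HttpObjectEncoder state.
    static WebClient create() {
        return WebClient.builder()
                .clientConnector(new ReactorClientHttpConnector(
                        options -> options.disablePool()))
                .build();
    }
}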
@lhotari, thank you for the explanation. Sure, we can give it a try and will come back with results. In fact, we put quite some effort into switching from Vert.x to the WebFlux client to unify our product stack, and we are now very close to the RTM version, so the workaround of disabling the connection pool is not good for us. Because of that, we are willing to put in the effort to contribute to fixing the problem, or at least to fully understand when the issue happens :)
@alexmaddriver with regards to
Hi, as we have had several fixes recently that might be relevant, is it possible to test with the latest 0.7.1.BUILD-SNAPSHOT version? Thanks.
Just tried the latest build and the problem is still there. It is easily reproduced after a few HTTP errors, e.g. 404. After that, MessageToMessageEncoder gets into an invalid state.
We use a reactor-netty + WebFlux-client-based proxy server between clients and a content server. We also wrote a client emulation tool able to generate any load using both HTTP and HTTPS connectors, with any content size, including huge packages uploaded / downloaded in parallel. Such an emulator is very handy for discovering performance, concurrency, and memory-leak types of issues. If you want, our QEs can share more details so you can set up such testing in-house.
@alexmaddriver I tried to use the snippet above to produce a reproducible example, but I'm still not able to do that. If you have something, can you share it with us? And yes, we are interested in the tool that you are developing; if you can share more details, that would be great.
@violetagg I have a test based on
@lhotari this is an API issue we aim to fix in 0.8 by fully reworking the way you consume a response. Right now nothing tells the flatMapMany code that extracts your status code that it should also consume the body (and call receive().subscribe()). So we never fully consume it, even though you have already returned a code.
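In other words, under the 0.7 API the caller has to drain the body explicitly even when only the status is of interest. A minimal sketch of that shape (class and method names are illustrative; it assumes reactor-netty 0.7's ByteBufFlux.aggregate()):

import io.netty.handler.codec.http.HttpResponseStatus;
import reactor.core.publisher.Mono;
import reactor.ipc.netty.http.client.HttpClient;

public class StatusOnly {
    // Extract only the status code but still drain the body completely,
    // so a pooled connection is not returned with unread content.
    static Mono<HttpResponseStatus> statusOf(HttpClient client, String uri) {
        return client.get(uri)
                .flatMap(res -> res.receive()   // body as ByteBufFlux
                        .aggregate()            // consume it fully
                        .then(Mono.just(res.status())));
    }
}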
We've put quite some effort into trying to understand what is going wrong. We can easily reproduce this on our machines; however, it seems to be a race condition, so we are still unable to write a 100% reproducible case. I just want to summarize our findings; maybe you can give some hints about where the bug could hide. The problem with HttpObjectEncoder occurs because it does not receive a LastHttpContent message. The last content message should be generated inside the HttpClientOperations.onOutboundComplete() method, which is called by ChannelOperations.onComplete(). However, sometimes onComplete() is not called, and after this the encoder is broken. According to our tests, the highest probability that onComplete() is not called for client requests occurs in the following conditions:
@smaldini Do you have any idea why ChannelOperations.onComplete() might not be called? According to HttpClientOperations.checkResponseCode(), client errors are ignored, so what might the difference be between handling 404 vs 200, and how might it affect the flow?
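The broken state machine described above can be demonstrated in isolation with Netty's EmbeddedChannel; a minimal sketch (independent of reactor-netty, class name hypothetical) that triggers the exact exception reported in this issue:

import io.netty.channel.embedded.EmbeddedChannel;
import io.netty.handler.codec.http.DefaultHttpRequest;
import io.netty.handler.codec.http.HttpMethod;
import io.netty.handler.codec.http.HttpRequestEncoder;
import io.netty.handler.codec.http.HttpVersion;

public class EncoderStateDemo {
    public static void main(String[] args) {
        EmbeddedChannel channel = new EmbeddedChannel(new HttpRequestEncoder());

        // First request: the encoder leaves ST_INIT and now expects HttpContent
        // messages terminated by a LastHttpContent before the next HttpRequest.
        channel.writeOutbound(new DefaultHttpRequest(
                HttpVersion.HTTP_1_1, HttpMethod.POST, "/upload"));

        try {
            // No LastHttpContent was written (the missing onOutboundComplete()
            // case described above), so a second request on the same channel
            // fails inside the encoder.
            channel.writeOutbound(new DefaultHttpRequest(
                    HttpVersion.HTTP_1_1, HttpMethod.POST, "/next"));
        } catch (Exception e) {
            // Prints: io.netty.handler.codec.EncoderException:
            // java.lang.IllegalStateException: unexpected message type: DefaultHttpRequest
            System.out.println(e);
        }
        channel.finishAndReleaseAll();
    }
}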
@alexmaddriver we're closing another couple of issues for 0.7.2, but we need to find a way to reproduce this one. In 0.7.1 we tried fixing this problem with a client by subscribing, whereas in 0.7.0 there was a similar issue to the one you just mentioned (hence why we insisted on the version): we were interrupting but not fully consuming the body, allowing the next request to acquire a previously unfinished request. Another thing I'm thinking about is similar to what you posted in #196: we might just not read enough, so the last packet is not being received. What I don't get, though, is that if that were the case, you wouldn't acquire this connection in a next request, because it's still not marked as released until complete.
With the latest 0.7.2.BUILD-SNAPSHOT I can't reproduce with 1000 requests (max 256 in parallel). What sample size do you run, and how is the 404 response formed (does it have a body? what headers does it present)? I will reuse that to produce a smoke test. Also, what kind of remote server? Tomcat? Something public we can hit?
@Test
public void httpStatusCode404IsHandledByTheClient() {
NettyContext server =
HttpServer.create(0)
.newRouter(r -> r.post("/test", (req, res) -> res.status(404)
.send(req.receive().retain().log("server-received"))))
.block(Duration.ofSeconds(30));
HttpClient client = HttpClient.create("localhost", server.address().getPort());
Mono<String> content = client.post("/test", req ->
req.addHeader("Content-Type", "text/plain")
.failOnClientError(false)
.sendString(Flux.just("Hello").log("client-send"))
)
.flatMap(res -> res.receive().asString().log("client-received").last());
StepVerifier.create(Flux.range(1, 1000).flatMap(d -> content).count())
.expectNext(1000L)
.expectComplete()
.verify(Duration.ofSeconds(30));
server.dispose();
}
Hello, I will try to help reproduce the issue on Wednesday when I'm back from a business trip.
Hello, apologies for the late response. I just tested with 0.7.2 and unfortunately the issue is still reproduced with our product. I am attaching the reactor-netty logs. Inside are two folders: good (everything fine) and bad (the situation with an invalid encoder). In both cases we have small (24 bytes) and big (~48 KB) files uploaded to a server; the only difference is the order in which the files are sent. When the small file goes first, everything is fine and any number of bigger files can subsequently be uploaded successfully. However, when the first uploaded file is the big one, it breaks the encoder. All subsequent operations using a connection from the pool fail (the 409 errors in the logs). The involved entities are as follows:
The only difference we spotted in the logs is the time at which the 404 error is received by the HttpClient. In the bad case, it is received and sent back to curl in the middle of the proxying operation. In the good case, this happens at the very end, after the transferred content has been fully proxied. We checked the content server: it behaves identically in both cases, always returning 404 immediately after receiving the HTTP POST request.
A few more notes on the previous post:
Hi, I can confirm this bug on my side. Similar scenario: uploading a file via HTTPS; switching to HTTP solves my problems.
The bug occurs after:
Normally it returns:
@violetagg @smaldini I was able to reproduce a problem which might be similar to what was originally reported in this issue. The exception is the same,
Steps to reproduce (tested with Ubuntu 16.04): starting the test application, reactor-fibonacci:
then creating some requests in another terminal:
In my case, I get this error as the first error in
The next errors are
and
(lots of these in the log)
There is an option to disable the connection pool. In that case these errors get printed to the log:
That could be a hint of the problem. One important detail is that reproducing the issue consistently required switching to use
With debug logging there are a lot of these errors, which could be another hint that helps the investigation:
This is logged in HttpServerHandler. I hope this helps.
@lhotari thanks, I can reproduce it on Ubuntu. Debugging ...
@violetagg What would be a good way to debug this type of problem?
@lhotari this one is a tough one. I put several very tiny logs here and there to understand what's going on. The important thing is to correlate the connection id. The problem is really weird and I'm still debugging it.
@lhotari the test seems to be wrong here: https://github.com/lhotari/reactor-fibonacci/blob/master/src/main/java/reactor/ipc/netty/profiling/ReactorFibonacci.java#L121; it should be flatMap, not map. Currently nothing is subscribing to the server request.receive() and consuming the response, meaning that the response is terminated before the body is fully read; identified separately as #323 and #324. We worked out something locally with @violetagg today and got rid of the IllegalStateException. However, even with our local fix the test starts hanging, so we're not out of the water yet :(
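For readers hitting the same pattern, a minimal standalone Reactor sketch of the map-vs-flatMap difference (not the actual reactor-fibonacci code): with map the inner publisher is never subscribed, which is exactly how a request.receive() can be left unconsumed.

import reactor.core.publisher.Mono;

public class MapVsFlatMap {
    public static void main(String[] args) {
        Mono<String> body = Mono.just("payload")
                .doOnSubscribe(s -> System.out.println("body subscribed"));

        // map: the inner Mono is treated as a plain value; nothing subscribes
        // to it, so "body subscribed" is never printed.
        Mono.just(1).map(i -> body).block();

        // flatMap: the inner Mono is subscribed as part of the chain,
        // so the body is actually consumed ("body subscribed" is printed).
        Mono.just(1).flatMap(i -> body).block();
    }
}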
Thanks @smaldini and @violetagg. I fixed the problem you mentioned and also another bug in the test: lhotari/reactor-fibonacci@06463f9. I also see the test hanging with both http (non-ssl) and https (ssl). I wonder if the test app now reproduces issue #316?
@violetagg Yes, #327 fixes the problem where the test app started hanging. Great work!
@violetagg The
@lhotari Yes, I believe so.
I tested the fix #343 and I wasn't able to reproduce the
I'm seeing this issue with the 0.7.7.RELEASE version (using Spring Boot 2.0.2). Before seeing this exception, there's this one in the logs:
and then:
@madgnome There might be different causes that lead to this. Thanks.
I tried to reproduce it locally but had no luck. Disabling the pool fixed the issue in prod. @violetagg Do you have any tips for reproducing this type of issue?
I've turned on more logging, but not all of it, because it's unfortunately too verbose.
Successful log sequences:
Last log sequences before the first exception:
Shit starts to hit the fan here.
Log when the SSL exception happens, just before getting into the bad state:
Bad state log statements:
Hopefully that helps.
I've managed to reproduce it with the HttpClient logging handler logs enabled.
Last successful log sequences:
Shit starts to hit the fan from here.
Log when the SSL exception happens:
It looks like the SSL connection gets closed but the channel doesn't.
Looking at previous logs following a
I've created a separate issue to track this: #361.
We had to disable the connection pool too. When the POST requests contain a large body, it's easy to reproduce this issue. It's frustrating: without the pool, the number of requests per second is very limited because of the limited number of available ports.
@anonymousXrandom What is the Reactor Netty version that you use? We have fixes for this issue.
While testing WebClient acting as a proxy under load, we can reproduce a situation where the connection pool holds some connections whose HttpObjectEncoder is in an invalid state.
When the client gets such a connection from the pool and uses it to make a request to a destination server, the request immediately fails because the DefaultHttpRequest cannot be encoded by the HttpObjectEncoder: the encoder's state is set to ST_CONTENT_NON_CHUNK (or some other value) while ST_INIT is expected. A kind of broken state machine inside the encoder.
The issue is mainly observed when the client proxies heavy upstream data (uploads content to a destination server).
Disabling the connection pool works around the problem.
Observed exception:
io.netty.handler.codec.EncoderException: java.lang.IllegalStateException: unexpected message type: DefaultHttpRequest
at io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:106) ~[netty-codec-4.1.16.Final.jar:4.1.16.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738) ~[netty-transport-4.1.16.Final.jar:4.1.16.Final]
...