Benchmark: HTTP2 fix cancel_stream_error #2612

Draft
wants to merge 2 commits into main
Conversation

@dsmiley commented Aug 1, 2024

Fixes Jetty HTTP2 cancel_stream_error by removing the flow control limit.

I was encountering this issue with CloudIndexing, esp when using single-node, single-replica, single-shard -- like 30K ops/sec. It'd routinely fail. I bumped this to 1000, then still hit it sometimes, then bumped to 2000. I haven't run it much since then to be honest. Then I saw how to disable it, which looks a little weird TBH (the default method returns a NO_RATE_CONTROL impl).

Fixes Jetty HTTP2 cancel_stream_error by increasing the flow control limit.
jetty.http2.rateControl.maxEventsPerSecond=2000
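(For reference, a minimal sketch of what the two approaches look like against Jetty 10's server API. This is an illustration under the assumption that the benchmark's Jetty setup exposes the HTTP/2 connection factory; it is not the exact change in this PR.)

    import org.eclipse.jetty.http2.parser.RateControl;
    import org.eclipse.jetty.http2.parser.WindowRateControl;
    import org.eclipse.jetty.http2.server.HTTP2CServerConnectionFactory;
    import org.eclipse.jetty.server.HttpConfiguration;

    public class RateControlSketch {
        static HTTP2CServerConnectionFactory newHttp2Factory(HttpConfiguration config) {
            HTTP2CServerConnectionFactory http2 = new HTTP2CServerConnectionFactory(config);

            // Raise the limit: this is what jetty.http2.rateControl.maxEventsPerSecond=2000
            // corresponds to in code.
            http2.setRateControlFactory(new WindowRateControl.Factory(2000));

            // Or disable rate control entirely: the Factory interface's default
            // newRateControl(...) returns RateControl.NO_RATE_CONTROL, hence the
            // odd-looking empty anonymous class.
            http2.setRateControlFactory(new RateControl.Factory() {});

            return http2;
        }
    }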
@dsmiley left a comment

DRAFT; I still see the error and I'm confused

@iamsanjay commented

Could you please walk me through how to run the specific benchmark that produces the error, such as a command or anything else that helps me reproduce it on my system?

@dsmiley commented Aug 15, 2024

crave run -- ./jmh.sh CloudIndexing -jvmArgsAppend -Dsolr.bench.seed=823716671780822 -p useSmallDocs=true -p nodeCount=1 -p numShards=1 -p numReplicas=1 -I 8

The benchmark module uses JettySolrRunner via MiniClusterState.

Googling for this is inconclusive as to the cause. It seems the server is choosing to cancel the client's stream, but we don't know why. I dug into this before and thought it was a rate-limiting mechanism, so this PR disables that. Perhaps the next debugging step is to figure out what Jetty config/logging would explain it. I'd want to avoid debug level on everything -- too much output.

@iamsanjay commented

I recall encountering a somewhat similar error in the IndexFetcher class when we transitioned to HTTP2. The IndexFetcher class was prematurely closing the client connection and sending a GOAWAY packet to the server. When the flow control limit was reached, it triggered an exception.

I suspect the same issue might be occurring in this case. Unfortunately, I can't run the crave script anymore since upgrading to the new Apple Silicon chip. I have to look into that.

@dsmiley marked this pull request as draft August 22, 2024 19:16
@iamsanjay commented

Is there any way to enable more logging for the benchmark? I wonder if it's too much overhead with the current Jetty configuration.

org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:57165/solr/testCollection/update?wt=javabin&version=2: Async exception during distributed update: Error from server at http://127.0.0.1:57164/solr: null



request: http://127.0.0.1:57164/solr
Remote error message: Server error writing document id 5234 to the index.
        at org.apache.solr.client.solrj.impl.HttpSolrClientBase.processErrorsAndResponse(HttpSolrClientBase.java:280)
        at org.apache.solr.client.solrj.impl.Http2SolrClient.processErrorsAndResponse(Http2SolrClient.java:549)
        at org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:505)
        at org.apache.solr.bench.index.CloudIndexing.indexDoc(CloudIndexing.java:138)
        at org.apache.solr.bench.index.jmh_generated.CloudIndexing_indexDoc_jmhTest.indexDoc_thrpt_jmhStub(CloudIndexing_indexDoc_jmhTest.java:242)
        at org.apache.solr.bench.index.jmh_generated.CloudIndexing_indexDoc_jmhTest.indexDoc_Throughput(CloudIndexing_indexDoc_jmhTest.java:111)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:527)

@dsmiley commented Sep 19, 2024

Maybe you are unfamiliar with Log4j, but look at log4j2-bench.xml in this module.
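For example, a logger entry along these lines in log4j2-bench.xml (a sketch, not tested against that file) should enable debug output for just Jetty's HTTP/2 code instead of everything:

    <!-- inside the <Loggers> element -->
    <Logger name="org.eclipse.jetty.http2" level="debug"/>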

@iamsanjay commented

MaxConnectionsPerDestination 8

Benchmark               (nodeCount)  (numReplicas)  (numShards)  (preGenerate)  (useSmallDocs)   Mode  Cnt      Score      Error  Units
CloudIndexing.indexDoc            4              1            4          50000            true  thrpt    4  16140.035 ± 1902.453  ops/s
CloudIndexing.indexDoc            4              1            4          50000           false  thrpt    4   1818.103 ±   42.214  ops/s
CloudIndexing.indexDoc            4              3            4          50000            true  thrpt    4   7280.289 ± 1774.366  ops/s
CloudIndexing.indexDoc            4              3            4          50000           false  thrpt    4   1546.771 ± 2012.373  ops/s

MaxConnectionsPerDestination 16

Benchmark               (nodeCount)  (numReplicas)  (numShards)  (preGenerate)  (useSmallDocs)   Mode  Cnt      Score      Error  Units
CloudIndexing.indexDoc            4              1            4          50000            true  thrpt    4  16374.754 ± 1354.421  ops/s
CloudIndexing.indexDoc            4              1            4          50000           false  thrpt    4   2053.636 ±  625.117  ops/s
CloudIndexing.indexDoc            4              3            4          50000            true  thrpt    4   7919.480 ± 4504.359  ops/s
CloudIndexing.indexDoc            4              3            4          50000           false  thrpt    4   1760.070 ± 2226.050  ops/s

MaxConnectionsPerDestination 32

Benchmark               (nodeCount)  (numReplicas)  (numShards)  (preGenerate)  (useSmallDocs)   Mode  Cnt      Score      Error  Units
CloudIndexing.indexDoc            4              1            4          50000            true  thrpt    4  15566.625 ± 1666.612  ops/s
CloudIndexing.indexDoc            4              1            4          50000           false  thrpt    4   2084.775 ±  239.749  ops/s
CloudIndexing.indexDoc            4              3            4          50000            true  thrpt    4   7505.612 ± 4698.342  ops/s
CloudIndexing.indexDoc            4              3            4          50000           false  thrpt    4   1684.325 ± 2826.760  ops/s

MaxConnectionsPerDestination 64

Benchmark               (nodeCount)  (numReplicas)  (numShards)  (preGenerate)  (useSmallDocs)   Mode  Cnt      Score      Error  Units
CloudIndexing.indexDoc            4              1            4          50000            true  thrpt    4  15576.925 ± 1727.535  ops/s
CloudIndexing.indexDoc            4              1            4          50000           false  thrpt    4   2111.832 ±  189.408  ops/s
CloudIndexing.indexDoc            4              3            4          50000            true  thrpt    4   8069.933 ± 4096.516  ops/s
CloudIndexing.indexDoc            4              3            4          50000           false  thrpt    4   1706.464 ± 2694.685  ops/s

@iamsanjay commented Sep 19, 2024

One thing I noticed: the moment I increased maxConnectionsPerDestination to 8, and kept increasing it up to 64, I didn't see any exception and was able to run the full benchmark without any issue.

Below I ran it with 2, and the first benchmark immediately failed!

MaxConnectionsPerDestination 2

Benchmark               (nodeCount)  (numReplicas)  (numShards)  (preGenerate)  (useSmallDocs)   Mode  Cnt     Score      Error  Units
CloudIndexing.indexDoc            4              1            4          50000           false  thrpt    4  1730.811 ±   99.685  ops/s
CloudIndexing.indexDoc            4              3            4          50000            true  thrpt    4  8144.242 ± 7693.263  ops/s

@iamsanjay commented

MaxConnectionsPerDestination 128

Benchmark               (nodeCount)  (numReplicas)  (numShards)  (preGenerate)  (useSmallDocs)   Mode  Cnt      Score      Error  Units
CloudIndexing.indexDoc            4              1            4          50000            true  thrpt    4  15310.374 ± 2095.507  ops/s
CloudIndexing.indexDoc            4              1            4          50000           false  thrpt    4   2066.279 ±   64.590  ops/s
CloudIndexing.indexDoc            4              3            4          50000            true  thrpt    4   7189.295 ± 5208.593  ops/s
CloudIndexing.indexDoc            4              3            4          50000           false  thrpt    4   1774.028 ± 2590.350  ops/s

@dsmiley commented Sep 19, 2024

Wow; good find! What happens with setMaxConnectionsPerDestination(1)? I wonder what Jetty is supposed to do when this limit is approached -- is cancel_stream_error expected in some sense, or should it be doing something else? I suspect a Jetty issue.

I should mention that fairly soon, in #2410, we'll have maxConnectionsPerDestination set more appropriately (not 4!), and then this issue may no longer show up.

@iamsanjay commented

Actually, I am still not convinced that the issue is related to the maxConnectionsPerDestination setting, because when I ran it again and again today, I hit the issue with higher values as well, such as 8, 16, and 32. Since a few runs cannot tell which value is better, we could set up a dedicated benchmark that runs with different values of maxConnectionsPerDestination and maxConcurrentStreams (though the latter is a server-level setting that I am not yet sure how to configure in the benchmark); see the sketch below.
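Here is a rough, untested sketch (Jetty 10 APIs) of where the two knobs live; how MiniClusterState would accept them as benchmark parameters is not shown:

    import org.eclipse.jetty.client.HttpClient;
    import org.eclipse.jetty.http2.server.HTTP2CServerConnectionFactory;
    import org.eclipse.jetty.server.HttpConfiguration;

    public class Http2KnobsSketch {
        // Client side: caps how many connections the Jetty client opens per destination.
        static void configureClient(HttpClient httpClient, int maxConnectionsPerDestination) {
            httpClient.setMaxConnectionsPerDestination(maxConnectionsPerDestination);
        }

        // Server side: caps how many concurrent HTTP/2 streams each connection may carry.
        static HTTP2CServerConnectionFactory configureServer(int maxConcurrentStreams) {
            HTTP2CServerConnectionFactory http2 = new HTTP2CServerConnectionFactory(new HttpConfiguration());
            http2.setMaxConcurrentStreams(maxConcurrentStreams);
            return http2;
        }
    }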

There is a possibility that Jetty is cancelling the stream because one of the shards went into recovery!

Server exception => org.apache.solr.common.SolrException: ClusterState says we are the leader (http://127.0.0.1:54734/solr/testCollection_shard4_replica_n4), but locally we don't think so. Request came from http://127.0.0.1:54736/solr/testCollection_shard2_replica_n16/
        at org.apache.solr.update.processor.DistributedZkUpdateProcessor.doDefensiveChecks(DistributedZkUpdateProcessor.java:1144)
org.apache.solr.common.SolrException: ClusterState says we are the leader (http://127.0.0.1:54734/solr/testCollection_shard4_replica_n4), but locally we don't think so. Request came from http://127.0.0.1:54736/solr/testCollection_shard2_replica_n16/

request: http://127.0.0.1:54734/solr
Remote error message: ClusterState says we are the leader (http://127.0.0.1:54734/solr/testCollection_shard4_replica_n4), but locally we don't think so. Request came from http://127.0.0.1:54736/solr/testCollection_shard2_replica_n16/
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.sendUpdateStream(ConcurrentUpdateHttp2SolrClient.java:283) ~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT 549b46da0c7030249c81f827b801bb24bd6b57cd [snapshot build, details omitted]]
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.run(ConcurrentUpdateHttp2SolrClient.java:184) ~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT 549b46da0c7030249c81f827b801bb24bd6b57cd [snapshot build, details omitted]]
        at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:212) ~[metrics-core-4.2.26.jar:4.2.26]
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$1(ExecutorUtil.java:449) ~[solr-solrj-10.0.0-SNAPSHOT 549b46da0c7030249c81f827b801bb24bd6b57cd [snapshot build, details omitted]]
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.base/java.lang.Thread.run(Thread.java:829) [?:?]

2024-09-20 21:58:32.200 ERROR (updateExecutor-38-thread-11-processing-testCollection_shard3_replica_n2 null-30262 core_node5 null 127.0.0.1:54735_solr testCollection shard3) [c:testCollection s:shard3 r:core_node5 x:testCollection_shard3_replica_n2] o.a.s.u.ErrorReportingConcurrentUpdateSolrClient
          Error when calling SolrCmdDistributor$Req: cmd=add{,id=(null)}; node=ForwardNode: http://127.0.0.1:54734/solr/testCollection_shard4_replica_n4/ to http://127.0.0.1:54734/solr/testCollection_shard4_replica_n4/ => org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:54734/solr: null

This one looks different!

 Client exception => org.apache.solr.common.SolrException: can not use FieldCache on a field w/o docValues unless it is indexed uninvertible=true and the type supports Uninversion: _version_
        at org.apache.solr.schema.SchemaField.checkFieldCacheSource(SchemaField.java:295)
org.apache.solr.common.SolrException: can not use FieldCache on a field w/o docValues unless it is indexed uninvertible=true and the type supports Uninversion: _version_

@iamsanjay commented

We can test the setup where we are trying to index a document while a shard is down (rewriting the ZooKeeper cluster state) and see if we get the same cancel_stream_error.

Also, in some cases I encountered an OutOfMemory exception, suggesting we might need to scale it down a bit, as the current setup seems somewhat unstable.

@HoustonPutman commented

This one looks different!

 Client exception => org.apache.solr.common.SolrException: can not use FieldCache on a field w/o docValues unless it is indexed uninvertible=true and the type supports Uninversion: _version_
        at org.apache.solr.schema.SchemaField.checkFieldCacheSource(SchemaField.java:295)
org.apache.solr.common.SolrException: can not use FieldCache on a field w/o docValues unless it is indexed uninvertible=true and the type supports Uninversion: _version_

This error is not related; just make sure your _version_ field has docValues enabled.
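For reference, a typical schema declaration (assuming a plong field type is defined, as in the default configset) would be:

    <field name="_version_" type="plong" indexed="false" stored="false" docValues="true"/>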

@dsmiley commented Sep 20, 2024

There is a possibility that Jetty is cancelling the stream because one of the shards went into recovery!

I'm surprised/suspicious that app-level concerns/errors (Solr returning an HTTP error to the client) would result in a lower-level HTTP/2 stream cancellation of some kind.

@iamsanjay commented

We are using jetty 10, right?
jetty/jetty.project#10485

@dsmiley commented Sep 24, 2024

Yes; our versions are in versions.props

@iamsanjay commented Sep 26, 2024

Latest benchmark with different values for maxConnectionsPerDestination

Benchmark               (maxConnectionPerHost)  (nodeCount)  (numReplicas)  (numShards)  (preGenerate)  (useSmallDocs)   Mode  Cnt     Score      Error  Units
CloudIndexing.indexDoc                       1            4              3            4          50000           false  thrpt    4  1118.424 ±  847.096  ops/s
CloudIndexing.indexDoc                       2            4              3            4          50000           false  thrpt    4  1701.146 ± 1943.553  ops/s
CloudIndexing.indexDoc                       4            4              3            4          50000           false  thrpt    4  1726.712 ± 2448.684  ops/s
CloudIndexing.indexDoc                       8            4              3            4          50000           false  thrpt    4  1650.018 ± 2238.615  ops/s
CloudIndexing.indexDoc                      16            4              3            4          50000           false  thrpt    4  1651.497 ± 2117.253  ops/s
CloudIndexing.indexDoc                      32            4              3            4          50000           false  thrpt    4  1655.557 ± 2347.372  ops/s
CloudIndexing.indexDoc                      64            4              3            4          50000           false  thrpt    4  1695.647 ± 2139.217  ops/s
CloudIndexing.indexDoc                     128            4              3            4          50000           false  thrpt    4  1737.919 ± 2381.445  ops/s

It's a good improvement as we increase maxConnectionsPerDestination.

The pattern I see so far is that if we increase maxConnectionsPerDestination then yes, there are failures, but all of them are related to OutOfMemory exceptions rather than the cancel stream error.

Clearly, increasing maxConnectionsPerDestination also requires more memory, which is understandable.

EDIT

Wait, I was looking at the wrong value; actually there is no improvement as we increase maxConnectionsPerDestination. Nothing!

@iamsanjay commented

There should be a way to expose a setting that lets the user control this and experiment with it. We do have maxConnectionsPerDestination, but that is used by the HTTP/1 connection, and its current value is 100000, which in my opinion would be too much for HTTP/2 while we are still experimenting with it.

We could also introduce another setting for HTTP/2 in UpdateShardHandler; a rough sketch follows.
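A purely illustrative sketch of what that could look like; the system property name is hypothetical, and where the Jetty HttpClient behind the HTTP/2 update client gets built is not shown:

    import org.eclipse.jetty.client.HttpClient;

    public class Http2UpdateClientSettings {
        // Hypothetical property name; not an existing Solr setting.
        private static final String MAX_CONN_PER_DEST_PROP = "solr.http2.maxConnectionsPerDestination";

        // Apply the limit to the Jetty HttpClient that backs the HTTP/2 update client,
        // defaulting to the current effective value of 4.
        static void apply(HttpClient jettyHttpClient) {
            int max = Integer.getInteger(MAX_CONN_PER_DEST_PROP, 4);
            jettyHttpClient.setMaxConnectionsPerDestination(max);
        }
    }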

@iamsanjay commented

Yes; our versions are in versions.props

I inquired about the Jetty version because I posted a question regarding the cancel_stream_error on the Jetty forum, and they informed me that Jetty 10 is at end of support and asked me to try Jetty 12 and report back if the problem still happens.

Note that Jetty 12 requires JDK 17!

@iamsanjay commented

Benchmark               (maxConnectionPerHost)  (nodeCount)  (numReplicas)  (numShards)  (preGenerate)  (useSmallDocs)   Mode  Cnt     Score      Error  Units
CloudIndexing.indexDoc                       1            4              3            4          50000           false  thrpt    4  1843.723 ± 2943.271  ops/s
CloudIndexing.indexDoc                       2            4              3            4          50000           false  thrpt    4  1756.116 ± 2323.291  ops/s
CloudIndexing.indexDoc                       4            4              3            4          50000           false  thrpt    4  1693.267 ± 2690.380  ops/s
CloudIndexing.indexDoc                       8            4              3            4          50000           false  thrpt    4  1598.918 ± 2505.014  ops/s
CloudIndexing.indexDoc                      16            4              3            4          50000           false  thrpt    4  1597.113 ± 2100.961  ops/s
CloudIndexing.indexDoc                      32            4              3            4          50000           false  thrpt    4  1694.114 ± 2307.802  ops/s
CloudIndexing.indexDoc                      64            4              3            4          50000           false  thrpt    4  1622.504 ± 2663.477  ops/s
CloudIndexing.indexDoc                     128            4              3            4          50000           false  thrpt    4  1633.025 ± 2277.204  ops/s
CloudIndexing.indexDoc                     256            4              3            4          50000           false  thrpt    4  1514.132 ± 2218.589  ops/s
CloudIndexing.indexDoc                     512            4              3            4          50000           false  thrpt    4  1645.532 ± 2375.068  ops/s
CloudIndexing.indexDoc                    1024            4              3            4          50000           false  thrpt    4  1590.969 ± 2128.108  ops/s

@HoustonPutman commented

Thanks for getting those numbers; very interesting that it had no effect.

@dsmiley commented Sep 27, 2024

This is a low-priority thing; feel free to abandon it until we get to Jetty 12 and JDK 17. In the meantime, we know maxConnectionsPerDestination is misconfigured to be too low; it will be increased by default quite a bit -- Houston, I'm looking at you :-)
