[CI] multiple tests started failing with NoNodeAvailableException #37275

Closed
alpar-t opened this issue Jan 9, 2019 · 13 comments · Fixed by #39168
Labels: :Delivery/Build (Build or test infrastructure) · Team:Delivery (Meta label for Delivery team) · >test-failure (Triaged test failures from CI) · v7.2.0 · v8.0.0-alpha1

Comments

alpar-t (Contributor) commented Jan 9, 2019

This seems to be triggered by the first test to run in the class.

Example build failure

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+matrix-java-periodic/ES_BUILD_JAVA=openjdk12,ES_RUNTIME_JAVA=java8,nodes=virtual&&linux/170/console

Reproduction line

does not reproduce locally

./gradlew :x-pack:plugin:security:unitTest \
  -Dtests.seed=7030014DD5B5A0EB \
  -Dtests.class=org.elasticsearch.xpack.security.authc.TokenAuthIntegTests \
  -Dtests.method="testRefreshAsDifferentUser" \
  -Dtests.security.manager=true \
  -Dtests.locale=th \
  -Dtests.timezone=Pacific/Apia \
  -Dcompiler.java=12 \
  -Druntime.java=8

Example relevant log:

NoNodeAvailableException[None of the configured nodes are available: [{#transport#-1}{tZDl3ofZSsybqfW_cz1-iw}{127.0.0.1}{127.0.0.1:35230}]]
	at __randomizedtesting.SeedInfo.seed([7030014DD5B5A0EB:E8D192FF96C2C519]:0)
	at org.elasticsearch.client.transport.TransportClientNodesService.ensureNodesAreAvailable(TransportClientNodesService.java:351)
	at org.elasticsearch.client.transport.TransportClientNodesService.execute(TransportClientNodesService.java:249)
	at org.elasticsearch.client.transport.TransportProxyClient.execute(TransportProxyClient.java:57)
	at org.elasticsearch.client.transport.TransportClient.doExecute(TransportClient.java:386)
	at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:393)
	at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:382)
	at org.elasticsearch.client.support.AbstractClient$IndicesAdmin.execute(AbstractClient.java:1217)
	at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:45)
	at org.elasticsearch.action.ActionRequestBuilder.get(ActionRequestBuilder.java:52)
	at org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertAcked(ElasticsearchAssertions.java:113)
	at org.elasticsearch.test.TestCluster.wipeIndices(TestCluster.java:142)
	at org.elasticsearch.test.TestCluster.wipe(TestCluster.java:79)
	at org.elasticsearch.test.ESIntegTestCase.lambda$beforeInternal$0(ESIntegTestCase.java:378)
	at com.carrotsearch.randomizedtesting.RandomizedContext.runWithPrivateRandomness(RandomizedContext.java:187)
	at com.carrotsearch.randomizedtesting.RandomizedContext.runWithPrivateRandomness(RandomizedContext.java:211)
	at org.elasticsearch.test.ESIntegTestCase.beforeInternal(ESIntegTestCase.java:386)
	at org.elasticsearch.test.ESIntegTestCase.setupTestCluster(ESIntegTestCase.java:2217)
	at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:972)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
	at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
	at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at java.lang.Thread.run(Thread.java:748)

Frequency

Up to 5 times per day.

@alpar-t alpar-t added >test-failure Triaged test failures from CI v7.0.0 :Security/Authentication Logging in, Usernames/passwords, Realms (Native/LDAP/AD/SAML/PKI/etc) v6.7.0 labels Jan 9, 2019
elasticmachine (Collaborator) commented

Pinging @elastic/es-security

@alpar-t alpar-t changed the title [CI] TokenAuthIntegTests fails with NoNodeAvailableException [CI] multiple tests started failing with NoNodeAvailableException Jan 9, 2019
@alpar-t alpar-t closed this as completed Jan 9, 2019
@alpar-t alpar-t reopened this Jan 9, 2019
@alpar-t alpar-t added :Delivery/Build Build or test infrastructure :Core/Infra/Core Core issues without another label labels Jan 9, 2019
elasticmachine (Collaborator) commented

Pinging @elastic/es-core-infra

@alpar-t alpar-t removed the :Core/Infra/Core Core issues without another label label Jan 9, 2019
ywelsch (Contributor) commented Jan 10, 2019

The interesting log lines here seem to indicate a problem with establishing a connection. @tbrooks8 do you have an idea what might be going on here?

1> [2019-01-10T07:55:41,225][INFO ][o.e.x.s.a.l.LoggingAuditTrail] [node_sd1] action="cluster:monitor/nodes/liveness" event.action="access_granted" event.type="transport" host.ip="127.0.0.1" origin.address="127.0.0.1:56657" origin.type="transport" request.id="uog1J730SHSoWN1AV5tU-g" request.name="LivenessRequest" user.name="test_trans_client_user" user.realm="file" user.roles="["transport_client","user"]"
  1> [2019-01-10T07:55:51,830][WARN ][o.e.t.TcpTransport       ] [transport_client_node_sd1] send message failed [channel: Netty4TcpChannel{localAddress=0.0.0.0/0.0.0.0:56675, remoteAddress=null}]
  1> java.nio.channels.ClosedChannelException: null
  1> 	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[netty-transport-4.1.32.Final.jar:4.1.32.Final]
  1> [2019-01-10T07:55:51,846][WARN ][o.e.t.TcpTransport       ] [node_sd1] exception caught on transport layer [TcpNioSocketChannel{localAddress=0.0.0.0/0.0.0.0:35230, remoteAddress=/127.0.0.1:56674}], closing connection
  1> javax.net.ssl.SSLException: Closed engine without completely sending the close alert message.
  1> 	at org.elasticsearch.xpack.security.transport.nio.SSLDriver.close(SSLDriver.java:165) ~[main/:?]
  1> 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:61) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.xpack.security.transport.nio.SSLChannelContext.closeFromSelector(SSLChannelContext.java:164) ~[main/:?]
  1> 	at org.elasticsearch.nio.EventHandler.handleClose(EventHandler.java:208) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.EventHandler.postHandling(EventHandler.java:168) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:233) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:158) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:123) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_192]
  1> [2019-01-10T07:55:51,848][WARN ][o.e.t.TcpTransport       ] [node_sd1] exception caught on transport layer [TcpNioSocketChannel{localAddress=0.0.0.0/0.0.0.0:35230, remoteAddress=/127.0.0.1:56678}], closing connection
  1> javax.net.ssl.SSLException: Closed engine without completely sending the close alert message.
  1> 	at org.elasticsearch.xpack.security.transport.nio.SSLDriver.close(SSLDriver.java:165) ~[main/:?]
  1> 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:61) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.xpack.security.transport.nio.SSLChannelContext.closeFromSelector(SSLChannelContext.java:164) ~[main/:?]
  1> 	at org.elasticsearch.nio.EventHandler.handleClose(EventHandler.java:208) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.EventHandler.postHandling(EventHandler.java:168) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:233) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:158) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:123) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_192]
  1> [2019-01-10T07:55:51,848][WARN ][o.e.t.TcpTransport       ] [node_sd1] exception caught on transport layer [TcpNioSocketChannel{localAddress=0.0.0.0/0.0.0.0:35230, remoteAddress=/127.0.0.1:56676}], closing connection
  1> javax.net.ssl.SSLException: Closed engine without completely sending the close alert message.
  1> 	at org.elasticsearch.xpack.security.transport.nio.SSLDriver.close(SSLDriver.java:165) ~[main/:?]
  1> 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:61) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.xpack.security.transport.nio.SSLChannelContext.closeFromSelector(SSLChannelContext.java:164) ~[main/:?]
  1> 	at org.elasticsearch.nio.EventHandler.handleClose(EventHandler.java:208) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.EventHandler.postHandling(EventHandler.java:168) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:233) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:158) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:123) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_192]
  1> [2019-01-10T07:55:51,849][WARN ][o.e.t.TcpTransport       ] [node_sd1] exception caught on transport layer [TcpNioSocketChannel{localAddress=0.0.0.0/0.0.0.0:35230, remoteAddress=/127.0.0.1:56672}], closing connection
  1> javax.net.ssl.SSLException: Closed engine without completely sending the close alert message.
  1> 	at org.elasticsearch.xpack.security.transport.nio.SSLDriver.close(SSLDriver.java:165) ~[main/:?]
  1> 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:61) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.xpack.security.transport.nio.SSLChannelContext.closeFromSelector(SSLChannelContext.java:164) ~[main/:?]
  1> 	at org.elasticsearch.nio.EventHandler.handleClose(EventHandler.java:208) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.EventHandler.postHandling(EventHandler.java:168) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:233) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:158) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:123) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_192]
  1> [2019-01-10T07:55:51,849][WARN ][o.e.t.TcpTransport       ] [node_sd1] exception caught on transport layer [TcpNioSocketChannel{localAddress=0.0.0.0/0.0.0.0:35230, remoteAddress=/127.0.0.1:56696}], closing connection
  1> javax.net.ssl.SSLException: Closed engine without completely sending the close alert message.
  1> 	at org.elasticsearch.xpack.security.transport.nio.SSLDriver.close(SSLDriver.java:165) ~[main/:?]
  1> 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:61) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.xpack.security.transport.nio.SSLChannelContext.closeFromSelector(SSLChannelContext.java:164) ~[main/:?]
  1> 	at org.elasticsearch.nio.EventHandler.handleClose(EventHandler.java:208) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.EventHandler.postHandling(EventHandler.java:168) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:233) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:158) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:123) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_192]
  1> [2019-01-10T07:55:51,852][WARN ][o.e.t.TcpTransport       ] [node_sd1] exception caught on transport layer [TcpNioSocketChannel{localAddress=0.0.0.0/0.0.0.0:35230, remoteAddress=/127.0.0.1:56675}], closing connection
  1> javax.net.ssl.SSLException: Closed engine without completely sending the close alert message.
  1> 	at org.elasticsearch.xpack.security.transport.nio.SSLDriver.close(SSLDriver.java:165) ~[main/:?]
  1> 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:61) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.xpack.security.transport.nio.SSLChannelContext.closeFromSelector(SSLChannelContext.java:164) ~[main/:?]
  1> 	at org.elasticsearch.nio.EventHandler.handleClose(EventHandler.java:208) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.EventHandler.postHandling(EventHandler.java:168) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:233) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:158) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:123) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_192]
  1> [2019-01-10T07:55:51,870][WARN ][o.e.x.s.t.n.SecurityNioTransport] [node_sd1] connection TcpNioSocketChannel{localAddress=/127.0.0.1:35230, remoteAddress=/127.0.0.1:56682} closed during handshake
  1> [2019-01-10T07:55:51,911][WARN ][o.e.x.s.t.n.SecurityNioTransport] [node_sd1] connection TcpNioSocketChannel{localAddress=/127.0.0.1:35230, remoteAddress=/127.0.0.1:56680} closed during handshake
  1> [2019-01-10T07:55:51,911][WARN ][o.e.x.s.t.n.SecurityNioTransport] [node_sd1] connection TcpNioSocketChannel{localAddress=/127.0.0.1:35230, remoteAddress=/127.0.0.1:56695} closed during handshake
  1> [2019-01-10T07:55:51,911][WARN ][o.e.x.s.t.n.SecurityNioTransport] [node_sd1] connection TcpNioSocketChannel{localAddress=/127.0.0.1:35230, remoteAddress=/127.0.0.1:56681} closed during handshake

s1monw (Contributor) commented Jan 10, 2019

@tbrooks8 I see some SSL issues here; can this be related to #37246?

@original-brownbear original-brownbear self-assigned this Jan 10, 2019
Tim-Brooks (Contributor) commented Jan 10, 2019

> @tbrooks8 I see some SSL issues here; can this be related to #37246?

This build was run on commit eacc63b032cb0cb055c41830e2d2eb29e61be030 which appears to predate the merging of that PR.

I see a variety of exceptions there:

[2019-01-10T07:55:51,870][WARN ][o.e.x.s.t.n.SecurityNioTransport] [node_sd1] connection TcpNioSocketChannel{localAddress=/127.0.0.1:35230, remoteAddress=/127.0.0.1:56682} closed during handshake

This means that this transport (SecurityNioTransport) received a TLS CLOSE_NOTIFY alert from the other side while it was performing the TLS handshake.

This does not tell us much besides the fact that the other side wanted to close the connection before the handshake was complete (and it wanted it enough to send a close_notify).
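
For context, this is roughly how a received close_notify surfaces through the standard javax.net.ssl.SSLEngine API while a handshake is still in flight (a minimal hypothetical helper, not the actual SecurityNioTransport code):

import javax.net.ssl.SSLEngine;
import javax.net.ssl.SSLEngineResult;
import javax.net.ssl.SSLException;
import java.nio.ByteBuffer;

final class HandshakeCloseCheck {
    /**
     * Returns true if the peer sent close_notify before the TLS handshake
     * finished. Status.CLOSED after unwrap() means a close alert was consumed.
     */
    static boolean closedDuringHandshake(SSLEngine engine, ByteBuffer net, ByteBuffer app)
            throws SSLException {
        SSLEngineResult result = engine.unwrap(net, app);
        boolean handshaking =
                engine.getHandshakeStatus() != SSLEngineResult.HandshakeStatus.NOT_HANDSHAKING;
        return result.getStatus() == SSLEngineResult.Status.CLOSED && handshaking;
    }
}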

  1> [2019-01-10T07:55:51,852][WARN ][o.e.t.TcpTransport       ] [node_sd1] exception caught on transport layer [TcpNioSocketChannel{localAddress=0.0.0.0/0.0.0.0:35230, remoteAddress=/127.0.0.1:56675}], closing connection
  1> javax.net.ssl.SSLException: Closed engine without completely sending the close alert message.
  1> 	at org.elasticsearch.xpack.security.transport.nio.SSLDriver.close(SSLDriver.java:165) ~[main/:?]
  1> 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:61) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.xpack.security.transport.nio.SSLChannelContext.closeFromSelector(SSLChannelContext.java:164) ~[main/:?]
  1> 	at org.elasticsearch.nio.EventHandler.handleClose(EventHandler.java:208) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.EventHandler.postHandling(EventHandler.java:168) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:233) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:158) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:123) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_192]

This means that we are closing the channel before we could send the CLOSE_NOTIFY message (a sketch of the clean close path follows the list). This happens because:

  1. An IOException marks a channel for immediate close.
  2. We received a TCP FIN from the other side. Technically this could be considered a half-close and we could continue to send data, but that is not easily accessible through the Java APIs, and neither Netty nor the NIO transport will half-close once they have sent close_notify. Both versions wait to send and receive, and then perform a full close.
  3. Close timeout (implemented in a new PR, but not here yet).
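
For illustration, a clean close with these APIs looks roughly like the sketch below (assuming a bare SSLEngine and a blocking SocketChannel; the real non-blocking SSLDriver is more involved):

import javax.net.ssl.SSLEngine;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

final class GracefulTlsClose {
    /** Flushes our close_notify alert to the peer before closing the socket. */
    static void close(SSLEngine engine, SocketChannel channel, ByteBuffer net) throws IOException {
        engine.closeOutbound();                       // queue our close_notify
        while (!engine.isOutboundDone()) {
            net.clear();
            engine.wrap(ByteBuffer.allocate(0), net); // produce the alert bytes
            net.flip();
            while (net.hasRemaining()) {
                channel.write(net);                   // push them to the peer
            }
        }
        // Closing before isOutboundDone() is what produces the
        // "Closed engine without completely sending the close alert" warning.
        channel.close();
    }
}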
  1> [2019-01-10T07:55:51,830][WARN ][o.e.t.TcpTransport       ] [transport_client_node_sd1] send message failed [channel: Netty4TcpChannel{localAddress=0.0.0.0/0.0.0.0:56675, remoteAddress=null}]
  1> java.nio.channels.ClosedChannelException: null
  1> 	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[netty-transport-4.1.32.Final.jar:4.1.32.Final]

This is a Netty transport failing to send a message because the channel is closed. This test is a mix of nio and Netty transports communicating with each other.
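
As a rough sketch of where that warning comes from (a hypothetical wrapper, not the actual transport code): a write to an already-closed Netty channel completes its future exceptionally with ClosedChannelException.

import io.netty.channel.Channel;
import io.netty.channel.ChannelFuture;

final class NettySendSketch {
    static void send(Channel channel, Object message) {
        ChannelFuture future = channel.writeAndFlush(message);
        future.addListener(f -> {
            if (!f.isSuccess()) {
                // For a closed channel the cause is java.nio.channels.ClosedChannelException,
                // which the transport then logs as "send message failed".
                System.err.println("send message failed: " + f.cause());
            }
        });
    }
}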

Unfortunately I do not see an obvious underlying cause, just messages indicating that channels were closed, sometimes not in a clean manner. Importantly, it is possible that the transports using Netty are not fully sending and receiving their close_notifys; Netty swallows those exceptions (many implementations in the wild, including most web browsers, do not send and receive them). Some of these exceptions are kind of noisy. I need to think about whether these should all be warn-level exceptions.

original-brownbear (Member) commented Jan 14, 2019

This happened again in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/1244/consoleText, but what's interesting is that I see this in the logs:

  1> [2019-01-14T04:49:54,205][WARN ][o.e.t.TransportService   ] [node_s0] Received response for a request that has timed out, sent [79249ms] ago, timed out [49229ms] ago, action [internal:coordination/fault_detection/follower_check], node [{node_s1}{pU5Mqt6NTdGlRQeM2yPq1A}{t0m_nQqPSE29teP-uA0jTA}{127.0.0.1}{127.0.0.1:37747}], id [1736]
  1> [2019-01-14T04:49:54,205][WARN ][o.e.t.TransportService   ] [node_s1] Received response for a request that has timed out, sent [80247ms] ago, timed out [50228ms] ago, action [internal:coordination/fault_detection/leader_check], node [{node_s0}{pJFwRlFTTEWeDfWHMnR7hg}{77np0m1uT-aWSuIPpGUkJw}{127.0.0.1}{127.0.0.1:38101}], id [1448]
  1> [2019-01-14T04:49:54,206][WARN ][o.e.t.TransportService   ] [node_s0] Received response for a request that has timed out, sent [48229ms] ago, timed out [18211ms] ago, action [internal:coordination/fault_detection/follower_check], node [{node_s1}{pU5Mqt6NTdGlRQeM2yPq1A}{t0m_nQqPSE29teP-uA0jTA}{127.0.0.1}{127.0.0.1:37747}], id [1741]
  1> [2019-01-14T04:49:54,207][WARN ][o.e.t.TransportService   ] [node_s0] Received response for a request that has timed out, sent [52231ms] ago, timed out [37222ms] ago, action [cluster:monitor/nodes/stats[n]], node [{node_s1}{pU5Mqt6NTdGlRQeM2yPq1A}{t0m_nQqPSE29teP-uA0jTA}{127.0.0.1}{127.0.0.1:37747}], id [1738]
  1> [2019-01-14T04:49:54,207][WARN ][o.e.t.TransportService   ] [node_s1] Received response for a request that has timed out, sent [49228ms] ago, timed out [19211ms] ago, action [internal:coordination/fault_detection/leader_check], node [{node_s0}{pJFwRlFTTEWeDfWHMnR7hg}{77np0m1uT-aWSuIPpGUkJw}{127.0.0.1}{127.0.0.1:38101}], id [1449]

also:

  1> [2019-01-14T19:47:36,975][WARN ][o.e.n.Node               ] [node_t0] timed out while waiting for initial discovery state - timeout: 30s

Seems like the whole test cluster froze up for quite a while and then eventually came back to life.
The initial test run that failed is also relatively slow (though only off from local performance by a factor of ~3-4).

The above case also doesn't run into any SSLException; maybe the SSL disconnects we're seeing above are just a symptom of some timeout as well?

alpar-t (Contributor, Author) commented Jan 21, 2019

Occurrences seem to be down and localized to a specific worker:
non-public build stats

original-brownbear (Member) commented

@atorok yes, moving to 64GB of memory on the workers seems to have pretty much fixed this. I guess we can close here?

alpar-t (Contributor, Author) commented Jan 23, 2019

Infra changes seem to have helped a lot, but according to build stats we still see ~10 failures like this per day, which is still fairly high.

@henningandersen henningandersen removed their assignment Feb 15, 2019
@original-brownbear original-brownbear self-assigned this Feb 15, 2019
original-brownbear (Member) commented

@atorok yes, you're right; judging from the timings, this apparently has other, non-resource causes as well. See #37965 (comment) if interested :) I'm currently investigating this.

@jkakavas jkakavas removed the :Security/Authentication Logging in, Usernames/passwords, Realms (Native/LDAP/AD/SAML/PKI/etc) label Feb 18, 2019
original-brownbear (Member) commented

Fixed one (potentially not so special) case that leads to this failure in #39118.
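
The commits below make unsafe public methods in InternalTestCluster synchronized; schematically, that pattern looks like the sketch here (a hypothetical class, not the actual #39168 diff), so that concurrent test setup and teardown cannot race on shared cluster state:

import java.util.HashMap;
import java.util.Map;

final class NodeRegistrySketch {
    private final Map<String, String> nodes = new HashMap<>();

    // Before the fix this was a plain public method, so a concurrent reader
    // could observe the map mid-update.
    public synchronized void addNode(String name, String transportAddress) {
        nodes.put(name, transportAddress);
    }

    public synchronized String transportAddress(String name) {
        return nodes.get(name);
    }
}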

original-brownbear added a commit that referenced this issue Feb 21, 2019
* Remove unnecessary `synchronized` statements
* Make `Predicate`s constants where possible
* Cleanup some stream usage
* Make unsafe public methods `synchronized`
* Closes #37965
* Closes #37275
* Closes #37345
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Feb 21, 2019
* Remove unnecessary `synchronized` statements
* Make `Predicate`s constants where possible
* Cleanup some stream usage
* Make unsafe public methods `synchronized`
* Closes elastic#37965
* Closes elastic#37275
* Closes elastic#37345
original-brownbear added a commit that referenced this issue Feb 21, 2019

* Simplify and Fix Synchronization in InternalTestCluster (#39168)

* Remove unnecessary `synchronized` statements
* Make `Predicate`s constants where possible
* Cleanup some stream usage
* Make unsafe public methods `synchronized`
* Closes #37965
* Closes #37275
* Closes #37345
weizijun pushed a commit to weizijun/elasticsearch that referenced this issue Feb 22, 2019
* Remove unnecessary `synchronized` statements
* Make `Predicate`s constants where possible
* Cleanup some stream usage
* Make unsafe public methods `synchronized`
* Closes elastic#37965
* Closes elastic#37275
* Closes elastic#37345
weizijun pushed a commit to weizijun/elasticsearch that referenced this issue Feb 22, 2019
* Remove unnecessary `synchronized` statements
* Make `Predicate`s constants where possible
* Cleanup some stream usage
* Make unsafe public methods `synchronized`
* Closes elastic#37965
* Closes elastic#37275
* Closes elastic#37345
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Mar 13, 2019
* Remove unnecessary `synchronized` statements
* Make `Predicate`s constants where possible
* Cleanup some stream usage
* Make unsafe public methods `synchronized`
* Closes elastic#37965
* Closes elastic#37275
* Closes elastic#37345
original-brownbear added a commit that referenced this issue Mar 14, 2019

* Remove unnecessary `synchronized` statements
* Make `Predicate`s constants where possible
* Cleanup some stream usage
* Make unsafe public methods `synchronized`
* Closes #37965
* Closes #37275
* Closes #37345
@mark-vieira mark-vieira added the Team:Delivery Meta label for Delivery team label Nov 11, 2020