[CI] multiple tests started failing with NoNodeAvailableException #37275

Closed
alpar-t opened this issue Jan 9, 2019 · 13 comments · Fixed by #39168
Labels: :Delivery/Build (Build or test infrastructure) · Team:Delivery (Meta label for Delivery team) · >test-failure (Triaged test failures from CI) · v7.2.0 · v8.0.0-alpha1

Comments

alpar-t (Contributor) commented Jan 9, 2019

This seems to be triggered by the first test to run in the class.

Example build failure

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+matrix-java-periodic/ES_BUILD_JAVA=openjdk12,ES_RUNTIME_JAVA=java8,nodes=virtual&&linux/170/console

Reproduction line

does not reproduce locally

./gradlew :x-pack:plugin:security:unitTest \
  -Dtests.seed=7030014DD5B5A0EB \
  -Dtests.class=org.elasticsearch.xpack.security.authc.TokenAuthIntegTests \
  -Dtests.method="testRefreshAsDifferentUser" \
  -Dtests.security.manager=true \
  -Dtests.locale=th \
  -Dtests.timezone=Pacific/Apia \
  -Dcompiler.java=12 \
  -Druntime.java=8

Example relevant log:

NoNodeAvailableException[None of the configured nodes are available: [{#transport#-1}{tZDl3ofZSsybqfW_cz1-iw}{127.0.0.1}{127.0.0.1:35230}]]
	at __randomizedtesting.SeedInfo.seed([7030014DD5B5A0EB:E8D192FF96C2C519]:0)
	at org.elasticsearch.client.transport.TransportClientNodesService.ensureNodesAreAvailable(TransportClientNodesService.java:351)
	at org.elasticsearch.client.transport.TransportClientNodesService.execute(TransportClientNodesService.java:249)
	at org.elasticsearch.client.transport.TransportProxyClient.execute(TransportProxyClient.java:57)
	at org.elasticsearch.client.transport.TransportClient.doExecute(TransportClient.java:386)
	at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:393)
	at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:382)
	at org.elasticsearch.client.support.AbstractClient$IndicesAdmin.execute(AbstractClient.java:1217)
	at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:45)
	at org.elasticsearch.action.ActionRequestBuilder.get(ActionRequestBuilder.java:52)
	at org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertAcked(ElasticsearchAssertions.java:113)
	at org.elasticsearch.test.TestCluster.wipeIndices(TestCluster.java:142)
	at org.elasticsearch.test.TestCluster.wipe(TestCluster.java:79)
	at org.elasticsearch.test.ESIntegTestCase.lambda$beforeInternal$0(ESIntegTestCase.java:378)
	at com.carrotsearch.randomizedtesting.RandomizedContext.runWithPrivateRandomness(RandomizedContext.java:187)
	at com.carrotsearch.randomizedtesting.RandomizedContext.runWithPrivateRandomness(RandomizedContext.java:211)
	at org.elasticsearch.test.ESIntegTestCase.beforeInternal(ESIntegTestCase.java:386)
	at org.elasticsearch.test.ESIntegTestCase.setupTestCluster(ESIntegTestCase.java:2217)
	at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:972)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
	at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
	at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at java.lang.Thread.run(Thread.java:748)

Frequency

Up to 5 times per day.

@alpar-t alpar-t added >test-failure Triaged test failures from CI v7.0.0 :Security/Authentication Logging in, Usernames/passwords, Realms (Native/LDAP/AD/SAML/PKI/etc) v6.7.0 labels Jan 9, 2019
elasticmachine (Collaborator) commented

Pinging @elastic/es-security

@alpar-t alpar-t changed the title [CI] TokenAuthIntegTests fails with NoNodeAvailableException [CI] multiple tests started failing with NoNodeAvailableException Jan 9, 2019
@alpar-t alpar-t closed this as completed Jan 9, 2019
@alpar-t alpar-t reopened this Jan 9, 2019
@alpar-t alpar-t added :Delivery/Build Build or test infrastructure :Core/Infra/Core Core issues without another label labels Jan 9, 2019
elasticmachine (Collaborator) commented

Pinging @elastic/es-core-infra

@alpar-t alpar-t removed the :Core/Infra/Core Core issues without another label label Jan 9, 2019
ywelsch (Contributor) commented Jan 10, 2019

The interesting log lines here seem to indicate a problem with establishing a connection. @tbrooks8 do you have an idea what might be going on here?

1> [2019-01-10T07:55:41,225][INFO ][o.e.x.s.a.l.LoggingAuditTrail] [node_sd1] action="cluster:monitor/nodes/liveness" event.action="access_granted" event.type="transport" host.ip="127.0.0.1" origin.address="127.0.0.1:56657" origin.type="transport" request.id="uog1J730SHSoWN1AV5tU-g" request.name="LivenessRequest" user.name="test_trans_client_user" user.realm="file" user.roles="["transport_client","user"]"
  1> [2019-01-10T07:55:51,830][WARN ][o.e.t.TcpTransport       ] [transport_client_node_sd1] send message failed [channel: Netty4TcpChannel{localAddress=0.0.0.0/0.0.0.0:56675, remoteAddress=null}]
  1> java.nio.channels.ClosedChannelException: null
  1> 	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[netty-transport-4.1.32.Final.jar:4.1.32.Final]
  1> [2019-01-10T07:55:51,846][WARN ][o.e.t.TcpTransport       ] [node_sd1] exception caught on transport layer [TcpNioSocketChannel{localAddress=0.0.0.0/0.0.0.0:35230, remoteAddress=/127.0.0.1:56674}], closing connection
  1> javax.net.ssl.SSLException: Closed engine without completely sending the close alert message.
  1> 	at org.elasticsearch.xpack.security.transport.nio.SSLDriver.close(SSLDriver.java:165) ~[main/:?]
  1> 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:61) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.xpack.security.transport.nio.SSLChannelContext.closeFromSelector(SSLChannelContext.java:164) ~[main/:?]
  1> 	at org.elasticsearch.nio.EventHandler.handleClose(EventHandler.java:208) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.EventHandler.postHandling(EventHandler.java:168) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:233) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:158) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:123) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_192]
  1> [2019-01-10T07:55:51,848][WARN ][o.e.t.TcpTransport       ] [node_sd1] exception caught on transport layer [TcpNioSocketChannel{localAddress=0.0.0.0/0.0.0.0:35230, remoteAddress=/127.0.0.1:56678}], closing connection
  1> javax.net.ssl.SSLException: Closed engine without completely sending the close alert message.
  1> 	at org.elasticsearch.xpack.security.transport.nio.SSLDriver.close(SSLDriver.java:165) ~[main/:?]
  1> 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:61) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.xpack.security.transport.nio.SSLChannelContext.closeFromSelector(SSLChannelContext.java:164) ~[main/:?]
  1> 	at org.elasticsearch.nio.EventHandler.handleClose(EventHandler.java:208) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.EventHandler.postHandling(EventHandler.java:168) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:233) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:158) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:123) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_192]
  1> [2019-01-10T07:55:51,848][WARN ][o.e.t.TcpTransport       ] [node_sd1] exception caught on transport layer [TcpNioSocketChannel{localAddress=0.0.0.0/0.0.0.0:35230, remoteAddress=/127.0.0.1:56676}], closing connection
  1> javax.net.ssl.SSLException: Closed engine without completely sending the close alert message.
  1> 	at org.elasticsearch.xpack.security.transport.nio.SSLDriver.close(SSLDriver.java:165) ~[main/:?]
  1> 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:61) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.xpack.security.transport.nio.SSLChannelContext.closeFromSelector(SSLChannelContext.java:164) ~[main/:?]
  1> 	at org.elasticsearch.nio.EventHandler.handleClose(EventHandler.java:208) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.EventHandler.postHandling(EventHandler.java:168) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:233) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:158) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:123) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_192]
  1> [2019-01-10T07:55:51,849][WARN ][o.e.t.TcpTransport       ] [node_sd1] exception caught on transport layer [TcpNioSocketChannel{localAddress=0.0.0.0/0.0.0.0:35230, remoteAddress=/127.0.0.1:56672}], closing connection
  1> javax.net.ssl.SSLException: Closed engine without completely sending the close alert message.
  1> 	at org.elasticsearch.xpack.security.transport.nio.SSLDriver.close(SSLDriver.java:165) ~[main/:?]
  1> 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:61) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.xpack.security.transport.nio.SSLChannelContext.closeFromSelector(SSLChannelContext.java:164) ~[main/:?]
  1> 	at org.elasticsearch.nio.EventHandler.handleClose(EventHandler.java:208) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.EventHandler.postHandling(EventHandler.java:168) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:233) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:158) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:123) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_192]
  1> [2019-01-10T07:55:51,849][WARN ][o.e.t.TcpTransport       ] [node_sd1] exception caught on transport layer [TcpNioSocketChannel{localAddress=0.0.0.0/0.0.0.0:35230, remoteAddress=/127.0.0.1:56696}], closing connection
  1> javax.net.ssl.SSLException: Closed engine without completely sending the close alert message.
  1> 	at org.elasticsearch.xpack.security.transport.nio.SSLDriver.close(SSLDriver.java:165) ~[main/:?]
  1> 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:61) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.xpack.security.transport.nio.SSLChannelContext.closeFromSelector(SSLChannelContext.java:164) ~[main/:?]
  1> 	at org.elasticsearch.nio.EventHandler.handleClose(EventHandler.java:208) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.EventHandler.postHandling(EventHandler.java:168) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:233) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:158) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:123) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_192]
  1> [2019-01-10T07:55:51,852][WARN ][o.e.t.TcpTransport       ] [node_sd1] exception caught on transport layer [TcpNioSocketChannel{localAddress=0.0.0.0/0.0.0.0:35230, remoteAddress=/127.0.0.1:56675}], closing connection
  1> javax.net.ssl.SSLException: Closed engine without completely sending the close alert message.
  1> 	at org.elasticsearch.xpack.security.transport.nio.SSLDriver.close(SSLDriver.java:165) ~[main/:?]
  1> 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:61) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.xpack.security.transport.nio.SSLChannelContext.closeFromSelector(SSLChannelContext.java:164) ~[main/:?]
  1> 	at org.elasticsearch.nio.EventHandler.handleClose(EventHandler.java:208) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.EventHandler.postHandling(EventHandler.java:168) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:233) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:158) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:123) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_192]
  1> [2019-01-10T07:55:51,870][WARN ][o.e.x.s.t.n.SecurityNioTransport] [node_sd1] connection TcpNioSocketChannel{localAddress=/127.0.0.1:35230, remoteAddress=/127.0.0.1:56682} closed during handshake
  1> [2019-01-10T07:55:51,911][WARN ][o.e.x.s.t.n.SecurityNioTransport] [node_sd1] connection TcpNioSocketChannel{localAddress=/127.0.0.1:35230, remoteAddress=/127.0.0.1:56680} closed during handshake
  1> [2019-01-10T07:55:51,911][WARN ][o.e.x.s.t.n.SecurityNioTransport] [node_sd1] connection TcpNioSocketChannel{localAddress=/127.0.0.1:35230, remoteAddress=/127.0.0.1:56695} closed during handshake
  1> [2019-01-10T07:55:51,911][WARN ][o.e.x.s.t.n.SecurityNioTransport] [node_sd1] connection TcpNioSocketChannel{localAddress=/127.0.0.1:35230, remoteAddress=/127.0.0.1:56681} closed during handshake

s1monw (Contributor) commented Jan 10, 2019

@tbrooks8 I see some SSL issues here; can this be related to #37246?

@original-brownbear original-brownbear self-assigned this Jan 10, 2019
Tim-Brooks (Contributor) commented Jan 10, 2019

> @tbrooks8 I see some SSL issues here; can this be related to #37246?

This build was run on commit eacc63b032cb0cb055c41830e2d2eb29e61be030 which appears to predate the merging of that PR.

I see a variety of exceptions there:

[2019-01-10T07:55:51,870][WARN ][o.e.x.s.t.n.SecurityNioTransport] [node_sd1] connection TcpNioSocketChannel{localAddress=/127.0.0.1:35230, remoteAddress=/127.0.0.1:56682} closed during handshake

This means that this transport (SecurityNioTransport) received a TLS CLOSE_NOTIFY alert from the other side while it was performing the TLS handshake.

This does not tell us much besides the fact that the other side wanted to close the connection before the handshake was complete (and it wanted it enough to send a close_notify).
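
For context, this is roughly how a received close_notify surfaces through the standard javax.net.ssl.SSLEngine API while a handshake is still in flight (a minimal hypothetical helper, not the actual SecurityNioTransport code):

import javax.net.ssl.SSLEngine;
import javax.net.ssl.SSLEngineResult;
import javax.net.ssl.SSLException;
import java.nio.ByteBuffer;

final class HandshakeCloseCheck {
    /**
     * Returns true if the peer sent close_notify before the TLS handshake
     * finished. Status.CLOSED after unwrap() means a close alert was consumed.
     */
    static boolean closedDuringHandshake(SSLEngine engine, ByteBuffer net, ByteBuffer app)
            throws SSLException {
        SSLEngineResult result = engine.unwrap(net, app);
        boolean handshaking =
                engine.getHandshakeStatus() != SSLEngineResult.HandshakeStatus.NOT_HANDSHAKING;
        return result.getStatus() == SSLEngineResult.Status.CLOSED && handshaking;
    }
}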

  1> [2019-01-10T07:55:51,852][WARN ][o.e.t.TcpTransport       ] [node_sd1] exception caught on transport layer [TcpNioSocketChannel{localAddress=0.0.0.0/0.0.0.0:35230, remoteAddress=/127.0.0.1:56675}], closing connection
  1> javax.net.ssl.SSLException: Closed engine without completely sending the close alert message.
  1> 	at org.elasticsearch.xpack.security.transport.nio.SSLDriver.close(SSLDriver.java:165) ~[main/:?]
  1> 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:61) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.xpack.security.transport.nio.SSLChannelContext.closeFromSelector(SSLChannelContext.java:164) ~[main/:?]
  1> 	at org.elasticsearch.nio.EventHandler.handleClose(EventHandler.java:208) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.EventHandler.postHandling(EventHandler.java:168) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:233) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:158) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:123) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_192]

This means that we are closing the channel before we could send the CLOSE_NOTIFY message (a sketch of the clean close path follows the list). This happens because:

  1. An IOException marks a channel for immediate close.
  2. We received a TCP FIN from the other side. Technically this could be considered a half-close and we could continue to send data, but that is not easily accessible through the Java APIs, and neither Netty nor the NIO transport will half-close once they have sent close_notify. Both versions wait to send and receive, and then perform a full close.
  3. Close timeout (implemented in a new PR, but not here yet).
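
For illustration, a clean close with these APIs looks roughly like the sketch below (assuming a bare SSLEngine and a blocking SocketChannel; the real non-blocking SSLDriver is more involved):

import javax.net.ssl.SSLEngine;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

final class GracefulTlsClose {
    /** Flushes our close_notify alert to the peer before closing the socket. */
    static void close(SSLEngine engine, SocketChannel channel, ByteBuffer net) throws IOException {
        engine.closeOutbound();                       // queue our close_notify
        while (!engine.isOutboundDone()) {
            net.clear();
            engine.wrap(ByteBuffer.allocate(0), net); // produce the alert bytes
            net.flip();
            while (net.hasRemaining()) {
                channel.write(net);                   // push them to the peer
            }
        }
        // Closing before isOutboundDone() is what produces the
        // "Closed engine without completely sending the close alert" warning.
        channel.close();
    }
}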
  1> [2019-01-10T07:55:51,830][WARN ][o.e.t.TcpTransport       ] [transport_client_node_sd1] send message failed [channel: Netty4TcpChannel{localAddress=0.0.0.0/0.0.0.0:56675, remoteAddress=null}]
  1> java.nio.channels.ClosedChannelException: null
  1> 	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[netty-transport-4.1.32.Final.jar:4.1.32.Final]

This is a Netty transport failing to send a message because the channel is closed. This test is a mix of nio and Netty transports communicating with each other.
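
As a rough sketch of where that warning comes from (a hypothetical wrapper, not the actual transport code): a write to an already-closed Netty channel completes its future exceptionally with ClosedChannelException.

import io.netty.channel.Channel;
import io.netty.channel.ChannelFuture;

final class NettySendSketch {
    static void send(Channel channel, Object message) {
        ChannelFuture future = channel.writeAndFlush(message);
        future.addListener(f -> {
            if (!f.isSuccess()) {
                // For a closed channel the cause is java.nio.channels.ClosedChannelException,
                // which the transport then logs as "send message failed".
                System.err.println("send message failed: " + f.cause());
            }
        });
    }
}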

Unfortunately I do not see an obvious underlying cause, just messages indicating that channels were closed, sometimes not in a clean manner. Importantly, it is possible that the transports using Netty are not fully sending and receiving their close_notifys; Netty swallows those exceptions (many implementations in the wild, including most web browsers, do not send and receive them). Some of these exceptions are kind of noisy. I need to think about whether these should all be warn-level exceptions.

original-brownbear (Member) commented Jan 14, 2019

This happened again in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/1244/consoleText, but what's interesting is that I see this in the logs:

  1> [2019-01-14T04:49:54,205][WARN ][o.e.t.TransportService   ] [node_s0] Received response for a request that has timed out, sent [79249ms] ago, timed out [49229ms] ago, action [internal:coordination/fault_detection/follower_check], node [{node_s1}{pU5Mqt6NTdGlRQeM2yPq1A}{t0m_nQqPSE29teP-uA0jTA}{127.0.0.1}{127.0.0.1:37747}], id [1736]
  1> [2019-01-14T04:49:54,205][WARN ][o.e.t.TransportService   ] [node_s1] Received response for a request that has timed out, sent [80247ms] ago, timed out [50228ms] ago, action [internal:coordination/fault_detection/leader_check], node [{node_s0}{pJFwRlFTTEWeDfWHMnR7hg}{77np0m1uT-aWSuIPpGUkJw}{127.0.0.1}{127.0.0.1:38101}], id [1448]
  1> [2019-01-14T04:49:54,206][WARN ][o.e.t.TransportService   ] [node_s0] Received response for a request that has timed out, sent [48229ms] ago, timed out [18211ms] ago, action [internal:coordination/fault_detection/follower_check], node [{node_s1}{pU5Mqt6NTdGlRQeM2yPq1A}{t0m_nQqPSE29teP-uA0jTA}{127.0.0.1}{127.0.0.1:37747}], id [1741]
  1> [2019-01-14T04:49:54,207][WARN ][o.e.t.TransportService   ] [node_s0] Received response for a request that has timed out, sent [52231ms] ago, timed out [37222ms] ago, action [cluster:monitor/nodes/stats[n]], node [{node_s1}{pU5Mqt6NTdGlRQeM2yPq1A}{t0m_nQqPSE29teP-uA0jTA}{127.0.0.1}{127.0.0.1:37747}], id [1738]
  1> [2019-01-14T04:49:54,207][WARN ][o.e.t.TransportService   ] [node_s1] Received response for a request that has timed out, sent [49228ms] ago, timed out [19211ms] ago, action [internal:coordination/fault_detection/leader_check], node [{node_s0}{pJFwRlFTTEWeDfWHMnR7hg}{77np0m1uT-aWSuIPpGUkJw}{127.0.0.1}{127.0.0.1:38101}], id [1449]

also:

  1> [2019-01-14T19:47:36,975][WARN ][o.e.n.Node               ] [node_t0] timed out while waiting for initial discovery state - timeout: 30s

Seems like the whole test cluster froze up for quite a while and then eventually came back to life.
The initial test run that failed is also relatively slow (though only off from local performance by a factor of ~3-4).

The above case also doesn't run into any SSLException; maybe the SSL disconnects we're seeing above are just a symptom of some timeout as well?

alpar-t (Contributor, Author) commented Jan 21, 2019

Occurrences seem to be down and localized to a specific worker:
non-public build stats

original-brownbear (Member) commented

@atorok yes, moving to 64GB of memory on the workers seems to have pretty much fixed this. I guess we can close here?

alpar-t (Contributor, Author) commented Jan 23, 2019

Infra changes seem to have helped a lot, but according to build stats we still see ~10 failures like this per day, which is still fairly high.

@henningandersen henningandersen removed their assignment Feb 15, 2019
@original-brownbear original-brownbear self-assigned this Feb 15, 2019
original-brownbear (Member) commented

@atorok yes, you're right; judging from the timings, this apparently has other, non-resource causes as well. See #37965 (comment) if interested :) I'm currently investigating this.

@jkakavas jkakavas removed the :Security/Authentication Logging in, Usernames/passwords, Realms (Native/LDAP/AD/SAML/PKI/etc) label Feb 18, 2019
original-brownbear (Member) commented

Fixed one (potentially not so special) case that leads to this failure in #39118.
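
The commits below make unsafe public methods in InternalTestCluster synchronized; schematically, that pattern looks like the sketch here (a hypothetical class, not the actual #39168 diff), so that concurrent test setup and teardown cannot race on shared cluster state:

import java.util.HashMap;
import java.util.Map;

final class NodeRegistrySketch {
    private final Map<String, String> nodes = new HashMap<>();

    // Before the fix this was a plain public method, so a concurrent reader
    // could observe the map mid-update.
    public synchronized void addNode(String name, String transportAddress) {
        nodes.put(name, transportAddress);
    }

    public synchronized String transportAddress(String name) {
        return nodes.get(name);
    }
}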

original-brownbear added a commit that referenced this issue Feb 21, 2019
* Remove unnecessary `synchronized` statements
* Make `Predicate`s constants where possible
* Cleanup some stream usage
* Make unsafe public methods `synchronized`
* Closes #37965
* Closes #37275
* Closes #37345
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Feb 21, 2019
* Remove unnecessary `synchronized` statements
* Make `Predicate`s constants where possible
* Cleanup some stream usage
* Make unsafe public methods `synchronized`
* Closes elastic#37965
* Closes elastic#37275
* Closes elastic#37345
original-brownbear added a commit that referenced this issue Feb 21, 2019

* Simplify and Fix Synchronization in InternalTestCluster (#39168)

* Remove unnecessary `synchronized` statements
* Make `Predicate`s constants where possible
* Cleanup some stream usage
* Make unsafe public methods `synchronized`
* Closes #37965
* Closes #37275
* Closes #37345
weizijun pushed a commit to weizijun/elasticsearch that referenced this issue Feb 22, 2019
* Remove unnecessary `synchronized` statements
* Make `Predicate`s constants where possible
* Cleanup some stream usage
* Make unsafe public methods `synchronized`
* Closes elastic#37965
* Closes elastic#37275
* Closes elastic#37345
weizijun pushed a commit to weizijun/elasticsearch that referenced this issue Feb 22, 2019
* Remove unnecessary `synchronized` statements
* Make `Predicate`s constants where possible
* Cleanup some stream usage
* Make unsafe public methods `synchronized`
* Closes elastic#37965
* Closes elastic#37275
* Closes elastic#37345
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Mar 13, 2019
* Remove unnecessary `synchronized` statements
* Make `Predicate`s constants where possible
* Cleanup some stream usage
* Make unsafe public methods `synchronized`
* Closes elastic#37965
* Closes elastic#37275
* Closes elastic#37345
original-brownbear added a commit that referenced this issue Mar 14, 2019

* Remove unnecessary `synchronized` statements
* Make `Predicate`s constants where possible
* Cleanup some stream usage
* Make unsafe public methods `synchronized`
* Closes #37965
* Closes #37275
* Closes #37345
@mark-vieira mark-vieira added the Team:Delivery Meta label for Delivery team label Nov 11, 2020