Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] NodeTests.testAwaitCloseTimeoutsOnNonInterruptibleTask failure #44435

Closed
matriv opened this issue Jul 16, 2019 · 3 comments · Fixed by #49202
Closed

[CI] NodeTests.testAwaitCloseTimeoutsOnNonInterruptibleTask failure #44435

matriv opened this issue Jul 16, 2019 · 3 comments · Fixed by #49202
Assignees
Labels
:Core/Infra/Core Core issues without another label >test-failure Triaged test failures from CI

Comments

@matriv
Copy link
Contributor

matriv commented Jul 16, 2019

CI link: https://scans.gradle.com/s/ioo3aduuax56w/console-log?task=:server:test Failed for 7.x

Cannot reproduce locally:

./gradlew :server:test --tests "org.elasticsearch.node.NodeTests.testAwaitCloseTimeoutsOnNonInterruptibleTask" \
  -Dtests.seed=4DD0445A73A34598 \
  -Dtests.security.manager=true \
  -Dtests.locale=fi-FI \
  -Dtests.timezone=America/Los_Angeles \
  -Dcompiler.java=12 \
  -Druntime.java=8

Stacktrace:


2> heinäkuuta 16, 2019 7:16:00 AP. com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
--
2> WARNING: Uncaught exception in thread: Thread[elasticsearch[mock_ node][generic][T#4],5,TGRP-NodeTests]
2> java.lang.IllegalStateException: threadcontext is already closed
2> 	at __randomizedtesting.SeedInfo.seed([4DD0445A73A34598]:0)
2> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextThreadLocal.ensureOpen(ThreadContext.java:673)
2> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextThreadLocal.get(ThreadContext.java:666)
2> 	at org.elasticsearch.common.util.concurrent.ThreadContext.newStoredContext(ThreadContext.java:206)
2> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.<init>(ThreadContext.java:693)
2> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.<init>(ThreadContext.java:688)
2> 	at org.elasticsearch.common.util.concurrent.ThreadContext.preserveContext(ThreadContext.java:383)
2> 	at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.wrapRunnable(EsThreadPoolExecutor.java:161)
2> 	at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:82)
2> 	at org.elasticsearch.transport.InboundHandler.handleException(InboundHandler.java:244)
2> 	at org.elasticsearch.transport.InboundHandler.access$000(InboundHandler.java:43)
2> 	at org.elasticsearch.transport.InboundHandler$1.onFailure(InboundHandler.java:219)
2> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onFailure(ThreadContext.java:758)
2> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:39)
2> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
2> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
2> 	at java.lang.Thread.run(Thread.java:748)

and many occurrences of the following:

...
1> [2019-07-16T07:15:58,829][WARN ][o.e.d.HandshakingTransportAddressConnector] [mock_ node] handshake failed for [connectToRemoteMasterNode[[::1]:9302]]
--
1> java.lang.IllegalStateException: handshake failed, mismatched cluster name [Cluster [single-node-cluster-TEST_WORKER_VM=[38]-CLUSTER_SEED=[-8120033806674142137]-HASH=[12800ACF49D]]] - {[::1]:9302}{otZFAwAAQACZ0eRAAAAAAA}{0:0:0:0:0:0:0:1}{[::1]:9302}
1> 	at org.elasticsearch.transport.TransportService$5.onResponse(TransportService.java:461) [main/:?]
1> 	at org.elasticsearch.transport.TransportService$5.onResponse(TransportService.java:456) [main/:?]
1> 	at org.elasticsearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:54) [main/:?]
1> 	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1146) [main/:?]
1> 	at org.elasticsearch.transport.InboundHandler$1.doRun(InboundHandler.java:224) [main/:?]
1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773) [main/:?]
1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
1> [2019-07-16T07:15:58,830][WARN ][o.e.d.HandshakingTransportAddressConnector] [mock_ node] handshake failed for [connectToRemoteMasterNode[127.0.0.1:9302]]
1> java.lang.IllegalStateException: handshake failed, mismatched cluster name [Cluster [single-node-cluster-TEST_WORKER_VM=[38]-CLUSTER_SEED=[-8120033806674142137]-HASH=[12800ACF49D]]] - {127.0.0.1:9302}{otZFAwAAQACZ0eRAAAAAAA}{127.0.0.1}{127.0.0.1:9302}
1> 	at org.elasticsearch.transport.TransportService$5.onResponse(TransportService.java:461) [main/:?]
1> 	at org.elasticsearch.transport.TransportService$5.onResponse(TransportService.java:456) [main/:?]
1> 	at org.elasticsearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:54) [main/:?]
1> 	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1146) [main/:?]
1> 	at org.elasticsearch.transport.InboundHandler$1.doRun(InboundHandler.java:224) [main/:?]
1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773) [main/:?]
1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
...

Not muting, 1st occurrence ever.

Full logs can be found here

@matriv matriv added :Delivery/Build Build or test infrastructure >test-failure Triaged test failures from CI labels Jul 16, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra

@original-brownbear original-brownbear self-assigned this Jul 17, 2019
@original-brownbear
Copy link
Member

Looks like this is probably also tackled in #43249 -> unassigning myself

@original-brownbear original-brownbear removed their assignment Jul 23, 2019
@alpar-t alpar-t added :Core/Infra/Core Core issues without another label and removed :Delivery/Build Build or test infrastructure labels Sep 5, 2019
@jaymode jaymode self-assigned this Nov 15, 2019
jaymode added a commit that referenced this issue Nov 22, 2019
The NodeTests class contains tests that check behavior when shutting
down a node. This involves starting a node, performing some operation,
stopping the node, and then awaiting the close of the node. Part of
closing a node is the termination of the node's ThreadPool. ThreadPool
termination semantics can be deceiving. The ThreadPool#terminate method
takes a timeout value and the first oddity is that the terminate method
can take two times the timeout value before returning. Internally this
method acts on the ExecutorService instances that are held by the
ThreadPool. First, an orderly shutdown is attempted and pending tasks
are allowed to execute while waiting for the timeout value. If any of
the ExecutorService instances have not terminated, a call is made to
attempt to stop all active tasks (usually using interrupts) and then
waits for up to the timeout value a second time for the termination of
the ExecutorService instances. This means that if use a large value
when waiting for a node to close, we may not attempt to interrupt any
threads that are in a blocking call before the test times out.

In order to avoid causing these tests to time out, this change reduces
the timeout passed to Node#awaitClose to 10 seconds from 1 day. This
will allow blocked threads to be interrupted before the test suite
fails due to the timeout.

Closes #44256
Closes #42350
Closes #44435
jaymode added a commit that referenced this issue Nov 22, 2019
The NodeTests class contains tests that check behavior when shutting
down a node. This involves starting a node, performing some operation,
stopping the node, and then awaiting the close of the node. Part of
closing a node is the termination of the node's ThreadPool. ThreadPool
termination semantics can be deceiving. The ThreadPool#terminate method
takes a timeout value and the first oddity is that the terminate method
can take two times the timeout value before returning. Internally this
method acts on the ExecutorService instances that are held by the
ThreadPool. First, an orderly shutdown is attempted and pending tasks
are allowed to execute while waiting for the timeout value. If any of
the ExecutorService instances have not terminated, a call is made to
attempt to stop all active tasks (usually using interrupts) and then
waits for up to the timeout value a second time for the termination of
the ExecutorService instances. This means that if use a large value
when waiting for a node to close, we may not attempt to interrupt any
threads that are in a blocking call before the test times out.

In order to avoid causing these tests to time out, this change reduces
the timeout passed to Node#awaitClose to 10 seconds from 1 day. This
will allow blocked threads to be interrupted before the test suite
fails due to the timeout.

Closes #44256
Closes #42350
Closes #44435
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Core/Infra/Core Core issues without another label >test-failure Triaged test failures from CI
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants