Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] BulkIntegrationIT.testBulkWithGlobalDefaults fails #39702

Closed
dimitris-athanasiou opened this issue Mar 5, 2019 · 3 comments
Closed

[CI] BulkIntegrationIT.testBulkWithGlobalDefaults fails #39702

dimitris-athanasiou opened this issue Mar 5, 2019 · 3 comments
Assignees
Labels
:Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. >test-failure Triaged test failures from CI

Comments

@dimitris-athanasiou
Copy link
Contributor

Build failure: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=opensuse/281/console

Reproduce with:

./gradlew :server:integTest -Dtests.seed=60039132A5F89649 -Dtests.class=org.elasticsearch.action.bulk.BulkIntegrationIT -Dtests.method="testBulkWithGlobalDefaults" -Dtests.security.manager=true -Dtests.locale=en-US -Dtests.timezone=MST -Dcompiler.java=11 -Druntime.java=8

Failure:

ERROR   67.5s J3 | BulkIntegrationIT.testBulkWithGlobalDefaults <<< FAILURES!
17:04:47    > Throwable #1: NoNodeAvailableException[None of the configured nodes are available: [{#transport#-1}{wNWptWZ-Q_Cj-KmTny_mbA}{127.0.0.1}{127.0.0.1:41774}]]
17:04:47    > 	at __randomizedtesting.SeedInfo.seed([60039132A5F89649:6AED42E242C62180]:0)
17:04:47    > 	at org.elasticsearch.client.transport.TransportClientNodesService.ensureNodesAreAvailable(TransportClientNodesService.java:352)
17:04:47   2> NOTE: leaving temporary files on disk at: /var/lib/jenkins/workspace/elastic+elasticsearch+master+multijob-unix-compatibility/os/opensuse/server/build/testrun/integTest/J3/temp/org.elasticsearch.action.bulk.BulkIntegrationIT_60039132A5F89649-001
17:04:47   2> NOTE: test params are: codec=Asserting(Lucene80): {_routing=PostingsFormat(name=Asserting), field1.keyword=PostingsFormat(name=Asserting), processed=PostingsFormat(name=Asserting), field1=PostingsFormat(name=Asserting), _field_names=PostingsFormat(name=Asserting), foo=PostingsFormat(name=Asserting), _id=PostingsFormat(name=Asserting), foo.keyword=PostingsFormat(name=Asserting)}, docValues:{__soft_deletes=DocValuesFormat(name=Lucene80), field1.keyword=DocValuesFormat(name=Lucene80), _seq_no=DocValuesFormat(name=Asserting), processed=DocValuesFormat(name=Lucene80), _tombstone=DocValuesFormat(name=Lucene80), _primary_term=DocValuesFormat(name=Asserting), _version=DocValuesFormat(name=Lucene80), foo.keyword=DocValuesFormat(name=Asserting)}, maxPointsInLeafNode=815, maxMBSortInHeap=7.958529825930695, sim=Asserting(org.apache.lucene.search.similarities.AssertingSimilarity@401b021c), locale=en-US, timezone=MST
17:04:47   2> NOTE: Linux 4.4.104-39-default amd64/Oracle Corporation 1.8.0_202 (64-bit)/cpus=16,threads=1,free=374384416,total=516947968
17:04:47   2> NOTE: All tests run in this JVM: [IndexPrimaryRelocationIT, MissingValueIT, ExtendedStatsIT, ExplainActionIT, AllocationIdIT, BucketScriptIT, IndicesOptionsIntegrationIT, HotThreadsIT, IndexingMasterFailoverIT, SettingsFilteringIT, DedicatedMasterGetFieldMappingIT, ExternalValuesMapperIntegrationIT, FilterIT, ScriptedMetricIT, BulkIntegrationIT]
17:04:47    > 	at org.elasticsearch.client.transport.TransportClientNodesService.execute(TransportClientNodesService.java:248)
17:04:47    > 	at org.elasticsearch.client.transport.TransportProxyClient.execute(TransportProxyClient.java:57)
17:04:47    > 	at org.elasticsearch.client.transport.TransportClient.doExecute(TransportClient.java:386)
17:04:47    > 	at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:394)
17:04:47    > 	at org.elasticsearch.client.FilterClient.doExecute(FilterClient.java:65)
17:04:47    > 	at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:394)
17:04:47    > 	at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:383)
17:04:47    > 	at org.elasticsearch.client.support.AbstractClient$ClusterAdmin.execute(AbstractClient.java:677)
17:04:47    > 	at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:45)
17:04:47    > 	at org.elasticsearch.test.ESIntegTestCase.afterInternal(ESIntegTestCase.java:553)
17:04:47    > 	at org.elasticsearch.test.ESIntegTestCase.cleanUpCluster(ESIntegTestCase.java:2165)
17:04:47    > 	at java.lang.Thread.run(Thread.java:748)
`

I'll leave this unmuted as Armin is actively looking into this.
@dimitris-athanasiou dimitris-athanasiou added >test-failure Triaged test failures from CI :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. labels Mar 5, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@original-brownbear
Copy link
Member

This looks like another problem of the category deadlock in IO call-back (#39168), I'm seeing the typical blocks in the logs:

  1> [2019-03-05T08:03:39,622][INFO ][o.e.c.r.a.AllocationService] [node_sm2] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[test][0]] ...]).
  1> [2019-03-05T08:04:41,713][WARN ][o.e.t.TransportService   ] [node_sm1] Received response for a request that has timed out, sent [54515ms] ago, timed out [44395ms] ago, action [internal:coordination/fault_detection/leader_check], node [{node_sm2}{aGpuf0WiQZWVzCoFwCgZyw}{TeLINThfQKaY45XXf3alcw}{127.0.0.1}{127.0.0.1:37125}], id [43]
  1> [2019-03-05T08:04:41,714][WARN ][o.e.t.TransportService   ] [node_sm2] Received response for a request that has timed out, sent [54313ms] ago, timed out [44324ms] ago, action [internal:coordination/fault_detection/follower_check], node [{node_sm1}

looking into.

@original-brownbear
Copy link
Member

I'm reversing my opinion from the above and I'd vote to close this because:

  1. It has only happened once, neither before nor after the reported build.
  2. It is not a bug in the category of dead-locking in a test transport callback since there aren't any here.
  3. I extensively used YourKit to find all the monitor use scenarios on the transport thread in this test and couldn't find a single suspicious one.

=> It isn't inconceivable that the VM simply froze up for 40s (or came close to that). What is especially interesting here is that the time between the two log statements around which the request timed out is 62.1s:

1> [2019-03-05T08:03:39,622][INFO ][o.e.c.r.a.AllocationService] [node_sm2] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[test][0]] ...]).
  1> [2019-03-05T08:04:41,713][WARN ][o.e.t.TransportService   ] [node_sm1] Received response for a request that has timed out, sent [54515ms] ago, timed out [44395ms] ago, action [internal:coordination/fault_detection/leader_check], node [{node_sm2}{aGpuf0WiQZWVzCoFwCgZyw}{TeLINThfQKaY45XXf3alcw}{127.0.0.1}{127.0.0.1:37125}], id [43]

which is somewhat unlikely to be the result of a locked transport thread (at least I don't see which of the involved request here would give us a ~60s timeout).

=> close here and reopen if it pops up again imo.

@ywelsch ywelsch closed this as completed Mar 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. >test-failure Triaged test failures from CI
Projects
None yet
Development

No branches or pull requests

4 participants