
Provide a better way to declare nodes/clusters/cluster formation during the build #30904

Closed
1 of 7 tasks
alpar-t opened this issue May 28, 2018 · 6 comments
Assignees
Labels
:Delivery/Build Build or test infrastructure >enhancement Team:Delivery Meta label for Delivery team

Comments

@alpar-t
Contributor

alpar-t commented May 28, 2018

Todo:

  • implement Version in Java so we can use it in cluster formation
  • rename to testClusters and TestClustersPlugin, ditching ClusterFormation
  • proof-of-concept plugin to check the integration points with Gradle and write an integration test
  • implement support for setting up a single-node cluster and actually starting and using it
  • restrict the type of tasks that can use the plugin by default (only configure task extensions on specific tasks)
  • start using the new cluster formation for REST integration tests (modules, plugins)
  • start using the new cluster formation for REST integration tests on x-pack

DSL Glimpse

plugins {
    id 'elasticsearch.clusterformation'
}

testClusters {
    myTestCluster {
        distribution = 'ZIP'
        version = '6.3.0'
    }
}

task user1 {
    useCluster testClusters.myTestCluster
    doLast {
        println "Cluster running @ ${elasticsearchNodes.myTestCluster.httpSocketURI}"
    }
}

task user2 {
    useCluster testClusters.myTestCluster
    doLast {
        println "Cluster running @ ${elasticsearchNodes.myTestCluster.httpSocketURI}"
    }
}

Produces this output:

> Task :syncClusterFormationArtifacts UP-TO-DATE

> Task :user1
Starting `myTestCluster`
Cluster running @ [::1]:37347
Not stopping `myTestCluster`, since node still has 1 claim(s)

> Task :user2
Cluster running @ [::1]:37347
Stopping `myTestCluster`, number of claims is 0

BUILD SUCCESSFUL in 10s
3 actionable tasks: 2 executed, 1 up-to-date

Initial Description

The current cluster formation has the following limitations:

  • no straightforward way to create additional clusters or define relationships between them
  • does not currently work with --parallel, and as such offers no parallelism (note that test.jvm doesn't help here; these tests always run in sequence)
  • complex tests like rolling upgrade are not readable at all, as they rely on relations between Gradle tasks that are really hard to follow

The main reason --parallel does not work is that Gradle's finalizedBy does not offer any guarantees about when the task will run. We use this for stopping clusters, but when running with --parallel Gradle defers it, so one can end up running 40+ ES nodes (512 MB * 40 ≈ 20 GB) before running out of memory, at which point the build starts to fail. There is no easy fix for this, other than setting up a bunch of mustRunAfter rules for the different tasks. Some tests run across clusters, upgrade and restart nodes, etc., so we can't make any assumptions about when the stop task is safe to run; we can't enforce a "stop after the test runner for this cluster completed" rule, because the test runners of other clusters might still need this cluster.

Even after doing some hacks to bring down the nodes sooner and not run out of memory, --parallel uncovered some missing ordering relations between tasks that were causing failures.
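For illustration, the problematic wiring described above looks roughly like this (task names are hypothetical; this is a sketch of the legacy pattern, not the actual build code):

```groovy
// Hypothetical sketch of the legacy cluster wiring. The stop task is
// attached via finalizedBy, so Gradle only guarantees it runs
// *eventually* -- with --parallel it can be deferred long enough that
// many ES nodes are running at once.
task startMyCluster {
    doLast { /* start an ES node */ }
}

task stopMyCluster {
    doLast { /* stop the node */ }
}

task integTest {
    dependsOn startMyCluster
    finalizedBy stopMyCluster   // no promptness guarantee under --parallel
    doLast { /* run tests against the cluster */ }
}
```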

From some limited testing, I estimate build time could be reduced by at least 30% by being able to run integ tests in parallel (based on running :qa:check on my 6 physical core CPU with 32GB ram).

From what I can see, this is the only thing preventing us from simply running builds with clean check --parallel without having to pick and choose what works in parallel and what doesn't.

I think we should create a cluster formation DSL that does not rely on Gradle tasks to perform its operations. We would still use Gradle to fetch and set up distributions, but everything else would be externalized. The DSL would provide configuration for the cluster and expose methods to alter its state (start/stop the cluster or individual nodes, change configuration, etc.).
There would be methods for high-level operations like starting and stopping the cluster and running tests, as well as lower-level operations that can manipulate at the node level.

No operation would be carried out by default; a task would have to be set up that calls these operations from the task action (or as doLast). We can also provide a task, with an option to control whether it's created, to cover the common setup of starting a cluster, running tests, and terminating.
Of course we would need a way to run tests outside of Gradle, but since we don't use its infrastructure to do that anyway, it shouldn't be that hard.
The custom DSL can make use of Gradle's NamedDomainObjectCollection so plugins can change defaults for different sections of the build when a new cluster is defined.
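As a rough sketch of that last point, a plugin could register a named container and hook into cluster creation to adjust defaults (class and type names here, such as TestClustersPlugin and ElasticsearchCluster, are invented for illustration):

```groovy
// Hypothetical sketch: expose a named container so build scripts can
// declare clusters by name, and other plugins can tweak defaults
// whenever a new cluster is defined.
class TestClustersPlugin implements Plugin<Project> {
    void apply(Project project) {
        // project.container returns a NamedDomainObjectContainer,
        // a NamedDomainObjectCollection that creates elements on demand
        def clusters = project.container(ElasticsearchCluster) { name ->
            new ElasticsearchCluster(name)
        }
        project.extensions.add('testClusters', clusters)
        // runs for every cluster, existing or added later
        clusters.all { cluster ->
            cluster.distribution = 'ZIP'   // a plugin-supplied default
        }
    }
}
```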

Related: #30874, #30903

@alpar-t alpar-t added >enhancement :Delivery/Build Build or test infrastructure labels May 28, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@alpar-t
Contributor Author

alpar-t commented Jun 28, 2018

Another possibly interesting use case of managing clusters in this way is that we could fingerprint the config (distro, plugins, config, etc.) and, instead of spinning up a cluster for each test, clean up and re-use running instances.
@rjernst @nik9000 do you see that as a possibility? Would the overlap be significant? How concerned are we about cross-talk between tests? We could have logging similar to tests-per-JVM; forcing the same grouping of tests per cluster for reproducibility will require more thinking, but it's not impossible to implement. I'm also not sure what the potential saving is, i.e. how much time is spent starting up and waiting for clusters and how many times we do so in a build.

@rjernst
Member

rjernst commented Jun 28, 2018

I don't think we have the exact same config often enough, nor do I think the potential savings would be worth the headache of non-reproducibility (i.e. if the cluster is altered in some way by a different test runner that has a side effect on later tests).

@davidkyle
Member

It is desirable to be able to configure individual nodes in a cluster with node-specific settings. For example, given a 3-node cluster, the ability to enable machine learning (node.ml: true) on a subset of those nodes would cover test cases that are not possible, or not easily achievable, at the moment because all nodes in the cluster share the same configuration. Ideally the config could be randomised (1, 2 or 3 of the nodes are ML nodes, the master node is not an ML node, etc.) and the specific configuration would be reproducible from the test seed. This would cover a particular class of bugs we have seen in ML and make intermittent test failures more reproducible.

I would imagine this ability is also useful for testing ingest nodes, dedicated masters, and data-only nodes. @atorok can you please consider this request when designing the DSL?
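One way the requested per-node overrides might look in the proposed DSL (purely hypothetical syntax; numberOfNodes, setting, and node() are invented names, not part of any implemented plugin):

```groovy
// Hypothetical per-node configuration sketch: cluster-level settings
// apply to every node, while a node(n) block overrides a single one.
testClusters {
    mixedCluster {
        numberOfNodes = 3
        setting 'node.ml', 'false'     // cluster-wide default
        node(0) {
            setting 'node.ml', 'true'  // only node 0 is an ML node
        }
    }
}
```

A random subset of nodes could then be selected from the test seed, keeping the layout reproducible.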

@alpar-t
Contributor Author

alpar-t commented Jul 17, 2018

@davidkyle I am considering this. I'm focusing on single-node clusters first, but I see a multi-node cluster as a composite, so the DSL could apply both at the cluster level for common config and at the node level for customization, in a similar manner. I think that will address the needs you describe. Randomization support in the build is a different topic, orthogonal to cluster formation. I can imagine other uses for it, and I think it would be useful to add support for it at some point.

@alpar-t
Contributor Author

alpar-t commented Aug 7, 2019

We now have the elasticsearch.testclusters plugin to achieve this.
There's no DSL to expose per-node configuration yet, but nodes can be configured individually in the internals; it's just a matter of exposing that.
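For reference, usage of the shipped plugin roughly mirrors the original glimpse above (a sketch only; exact property names and values vary across Elasticsearch versions, so check the build-tools documentation for the version in use):

```groovy
plugins {
    id 'elasticsearch.testclusters'
}

testClusters {
    myTestCluster {
        // version of the Elasticsearch distribution to run
        version = '7.3.0'
    }
}

task integTest {
    useCluster testClusters.myTestCluster
    doLast {
        // the cluster is started before this task and stopped once
        // the last claiming task has finished
        println "Cluster running @ ${testClusters.myTestCluster.httpSocketURI}"
    }
}
```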

@alpar-t alpar-t closed this as completed Aug 7, 2019
@mark-vieira mark-vieira added the Team:Delivery Meta label for Delivery team label Nov 11, 2020