
Provide a better way to declare nodes/clusters/cluster formation during the build #30904

Closed
1 of 7 tasks
alpar-t opened this issue May 28, 2018 · 6 comments
Assignees
Labels
:Delivery/Build Build or test infrastructure >enhancement Team:Delivery Meta label for Delivery team

Comments

@alpar-t
Contributor

alpar-t commented May 28, 2018

Todo:

  • implement Version in Java so we can use it in cluster formation
  • rename to testClusters and TestClustersPlugin, ditching ClusterFormation
  • proof-of-concept plugin to check the integration points with Gradle and write an integration test
  • implement support for setting up a single-node cluster and actually starting and using it
  • restrict the type of tasks that can use the plugin by default (only configure task extensions on specific tasks)
  • start using the new cluster formation for REST integration tests (modules, plugins)
  • start using the new cluster formation for REST integration tests on x-pack

DSL Glimpse

plugins {
    id 'elasticsearch.clusterformation'
}

testClusters {
    myTestCluster {
        distribution = 'ZIP'
        version = '6.3.0'
    }
}

task user1 {
    useCluster testClusters.myTestCluster
    doLast {
        println "Cluster running @ ${elasticsearchNodes.myTestCluster.httpSocketURI}"
    }
}

task user2 {
    useCluster testClusters.myTestCluster
    doLast {
        println "Cluster running @ ${elasticsearchNodes.myTestCluster.httpSocketURI}"
    }
}

Produces this output:

> Task :syncClusterFormationArtifacts UP-TO-DATE

> Task :user1
Starting `myTestCluster`
Cluster running @ [::1]:37347
Not stopping `myTestCluster`, since node still has 1 claim(s)

> Task :user2
Cluster running @ [::1]:37347
Stopping `myTestCluster`, number of claims is 0

BUILD SUCCESSFUL in 10s
3 actionable tasks: 2 executed, 1 up-to-date

Initial Description

The current cluster formation has the following limitations:

  • no straightforward way to create additional clusters or define relationships between them
  • does not currently work with --parallel, and as such offers no parallelism (note that test.jvm doesn't help here; these tests always run in sequence)
  • complex tests like rolling upgrade are not readable at all, as they rely on relations between Gradle tasks that are really hard to follow

The main reason --parallel does not work is that Gradle's finalizedBy does not offer any guarantees about when the task will run. We use this for stopping clusters, but when running with --parallel Gradle defers it, so one can end up running 40+ ES nodes (512 MB * 40 ≈ 20 GB) before running out of memory, at which point the build starts to fail. There is no easy fix for this, other than setting up a bunch of mustRunAfter rules for the different tasks. Some tests run across clusters, upgrade and restart nodes, etc., so we can't make any assumptions about when the stop task is safe to run; we can't enforce a "stop after the test runner for this cluster completed" rule, because the test runners of other clusters might still need this cluster.

Even after doing some hacks to bring down the nodes sooner and not run out of memory, --parallel uncovered some missing ordering relations between tasks that were causing failures.
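For illustration, the problematic wiring described above looks roughly like this (task names are hypothetical; this is a sketch of the legacy pattern, not the actual build code):

```groovy
// Hypothetical sketch of the legacy cluster wiring. The stop task is
// attached via finalizedBy, so Gradle only guarantees it runs
// *eventually* -- with --parallel it can be deferred long enough that
// many ES nodes are running at once.
task startMyCluster {
    doLast { /* start an ES node */ }
}

task stopMyCluster {
    doLast { /* stop the node */ }
}

task integTest {
    dependsOn startMyCluster
    finalizedBy stopMyCluster   // no promptness guarantee under --parallel
    doLast { /* run tests against the cluster */ }
}
```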

From some limited testing, I estimate build time could be reduced by at least 30% by being able to run integ tests in parallel (based on running :qa:check on my 6 physical core CPU with 32GB ram).

From what I can see, this is the only thing preventing us from simply running builds with clean check --parallel without having to pick and choose what works in parallel and what doesn't.

I think we should create a cluster formation DSL that does not rely on Gradle tasks to perform its operations. We would still use Gradle to fetch and set up distributions, but everything else would be externalized. The DSL would provide configuration for the cluster and expose methods to alter its state (start/stop the cluster or individual nodes, change configuration, etc.).
There would be methods for high-level operations like starting and stopping the cluster and running tests, as well as lower-level operations that can manipulate at the node level.

No operation would be carried out by default; a task would have to be set up that calls these operations from the task action (or as doLast). We can also provide a task, with an option to control whether it's created, to cover the common setup of starting a cluster, running tests, and terminating.
Of course we would need a way to run tests outside of Gradle, but since we don't use its infrastructure to do that anyway, it shouldn't be that hard.
The custom DSL can make use of Gradle's NamedDomainObjectCollection so plugins can change defaults for different sections of the build when a new cluster is defined.
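As a rough sketch of that last point, a plugin could register a named container and hook into cluster creation to adjust defaults (class and type names here, such as TestClustersPlugin and ElasticsearchCluster, are invented for illustration):

```groovy
// Hypothetical sketch: expose a named container so build scripts can
// declare clusters by name, and other plugins can tweak defaults
// whenever a new cluster is defined.
class TestClustersPlugin implements Plugin<Project> {
    void apply(Project project) {
        // project.container returns a NamedDomainObjectContainer,
        // a NamedDomainObjectCollection that creates elements on demand
        def clusters = project.container(ElasticsearchCluster) { name ->
            new ElasticsearchCluster(name)
        }
        project.extensions.add('testClusters', clusters)
        // runs for every cluster, existing or added later
        clusters.all { cluster ->
            cluster.distribution = 'ZIP'   // a plugin-supplied default
        }
    }
}
```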

Related: #30874, #30903

@alpar-t alpar-t added >enhancement :Delivery/Build Build or test infrastructure labels May 28, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@alpar-t
Contributor Author

alpar-t commented Jun 28, 2018

Another possibly interesting use case of managing clusters in this way is that we could fingerprint the config (distro, plugins, config, etc.) and, instead of spinning up a cluster for each test, clean up and re-use running instances.
@rjernst @nik9000 do you see that as a possibility? Would the overlap be significant? How concerned are we about cross-talk between tests? We could have logging similar to tests-per-JVM; forcing the same grouping of tests per cluster for reproducibility will require more thinking, but it's not impossible to implement. I'm also not sure what the potential saving is, i.e. how much time is spent starting up and waiting for clusters and how many times we do so in a build.

@rjernst
Member

rjernst commented Jun 28, 2018

I don't think we have the exact same config often enough, nor do I think the potential savings would be worth the headache of non-reproducibility (i.e. if the cluster is altered in some way by a different test runner that has a side effect on later tests).

@davidkyle
Member

It is desirable to be able to configure individual nodes in a cluster with node-specific settings. For example, given a 3-node cluster, the ability to enable machine learning (node.ml: true) on a subset of those nodes would cover test cases that are not possible, or not easily achievable, at the moment because all nodes in the cluster share the same configuration. Ideally the config could be randomised (1, 2 or 3 of the nodes are ML nodes, the master node is not an ML node, etc.) and the specific configuration would be reproducible from the test seed. This would cover a particular class of bugs we have seen in ML and make intermittent test failures more reproducible.

I would imagine this ability is also useful for testing ingest nodes, dedicated masters, and data-only nodes. @atorok can you please consider this request when designing the DSL?
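One way the requested per-node overrides might look in the proposed DSL (purely hypothetical syntax; numberOfNodes, setting, and node() are invented names, not part of any implemented plugin):

```groovy
// Hypothetical per-node configuration sketch: cluster-level settings
// apply to every node, while a node(n) block overrides a single one.
testClusters {
    mixedCluster {
        numberOfNodes = 3
        setting 'node.ml', 'false'     // cluster-wide default
        node(0) {
            setting 'node.ml', 'true'  // only node 0 is an ML node
        }
    }
}
```

A random subset of nodes could then be selected from the test seed, keeping the layout reproducible.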

@alpar-t
Contributor Author

alpar-t commented Jul 17, 2018

@davidkyle I am considering this. I'm focusing on single-node clusters first, but I see a multi-node cluster as a composite, so the DSL could apply both at the cluster level for common config and at the node level for customization, in a similar manner. I think that will address the needs you describe. Randomization support in the build is a different topic, orthogonal to cluster formation. I can imagine other uses for it, and I think it would be useful to add support for it at some point.

@alpar-t
Contributor Author

alpar-t commented Aug 7, 2019

We now have the elasticsearch.testclusters plugin to achieve this.
There's no DSL to expose per-node configuration yet, but nodes can be configured individually in the internals; it's just a matter of exposing that.
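For reference, usage of the shipped plugin roughly mirrors the original glimpse above (a sketch only; exact property names and values vary across Elasticsearch versions, so check the build-tools documentation for the version in use):

```groovy
plugins {
    id 'elasticsearch.testclusters'
}

testClusters {
    myTestCluster {
        // version of the Elasticsearch distribution to run
        version = '7.3.0'
    }
}

task integTest {
    useCluster testClusters.myTestCluster
    doLast {
        // the cluster is started before this task and stopped once
        // the last claiming task has finished
        println "Cluster running @ ${testClusters.myTestCluster.httpSocketURI}"
    }
}
```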

@alpar-t alpar-t closed this as completed Aug 7, 2019
@mark-vieira mark-vieira added the Team:Delivery Meta label for Delivery team label Nov 11, 2020