
[System test runner] Add more service deployers #89

Closed
ycombinator opened this issue Sep 9, 2020 · 22 comments

@ycombinator
Contributor

Follow up to #64.

Currently the system test runner only supports the Docker Compose service deployer; that is, it can only test packages whose services can be spun up using Docker Compose. We should add more service deployers to enable system testing of packages such as:

  • system (probably a no-op or minimal service deployer)
  • aws (probably some way to pass connection parameters and credentials via environment variables, and/or something that understands Terraform files)
  • kubernetes
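One way the runner could choose among such deployers is by the files a package provides. A minimal sketch in shell; the deploy directory layout, file names, and deployer names here are assumptions for illustration, not elastic-package behavior:

```shell
# Hypothetical dispatch: inspect a package's deploy directory and report
# which service deployer would handle it.
deployer_for() {
  dir="$1"
  if [ -f "$dir/docker-compose.yml" ]; then
    echo "docker-compose"   # the only deployer supported today
  elif [ -f "$dir/main.tf" ]; then
    echo "terraform"        # cloud packages: credentials via environment variables
  elif [ -d "$dir/k8s" ]; then
    echo "kubernetes"
  else
    echo "none"             # e.g. the system package may need no service at all
  fi
}
```

A "none" result would simply skip service setup, which may be all the system package needs.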

@mtojek
Contributor

mtojek commented Nov 19, 2020

For reference:

Testing on Kubernetes written by @ChrsMark : https://github.com/elastic/integrations/blob/master/testing/environments/kubernetes/README.md
(probably outdated now as there were many changes introduced to Fleet)

@mtojek
Contributor

mtojek commented Jan 5, 2021

We need to cover the following providers:

  • AWS
  • Azure
  • CloudFoundry
  • GCP
  • Kubernetes on GCP
  • OpenShift on GCP

(correct me if I missed any of them)

Notes:

  • I decided to split Kubernetes and OpenShift as I'm not sure if their configuration paths can be unified.
  • (priority) let's start with providers we should already support - AWS, Azure
  • I decided to run Kubernetes on GCP so as not to overload local CPU/memory resources, but we may need to figure out the Docker image distribution model (if we want to use and deploy custom images).

Other use cases:

  • As a developer I'd like to boot up and access a long-running Kubernetes stack, so I can interactively collect metrics and design Kibana dashboards

Technical observations:

Questions:

  1. Should the provider's stack be spawned similarly to the service's Docker-based stack, for the duration of test execution, or in a long-running manner like the Elastic stack?
  2. Should we provide an option of acquiring authorization data for the internal team?

@mtojek
Contributor

mtojek commented Jan 5, 2021

@kaiyan-sheng @narph @ChrsMark

Would you mind describing here use cases for AWS, Azure and Kubernetes? I'm looking forward to seeing how these cloud/infra providers can be used for testing integrations.

@ChrsMark
Member

ChrsMark commented Jan 7, 2021

Thanks for the ping @mtojek, I will try to provide a scenario, with inline comments/thoughts that would cover our k8s needs.

Vanilla Kubernetes

  1. Run elastic-package k8s up to bring up a k8s cluster. I don't think we should care where it is: on GKE, or locally on minikube or kind. Maybe it would be better to have it running on GKE for now, to avoid the extra step of installing minikube/kind (?). In this step all the required prerequisites should happen, like installing kube_state_metrics, from which the state_* metricsets will collect metrics.
  2. Run elastic-package test k8s (the syntax is abstract here for the sake of the example) so as to deploy the Agent on the running k8s cluster and enroll it in the Elastic Stack. To my mind the Elastic Stack should maybe be running on the same k8s cluster, so as to have easier networking configuration (similar to the approach mentioned in testing on k8s). After the test is completed the cluster is still up and the agents are still shipping metrics. To clean this up we need to run the next command to bring the whole cluster down.
  3. Run elastic-package k8s down, which will destroy the cluster, including the Elastic Stack and Agents.
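The three steps above could be sketched against kind. This is only a sketch: the cluster name, manifest paths, and the DRY_RUN wrapper are assumptions for illustration, not actual elastic-package behavior.

```shell
# With DRY_RUN=1 each command is printed instead of executed, so the flow
# can be inspected without a real cluster.
run() { if [ -n "${DRY_RUN:-}" ]; then echo "+ $*"; else "$@"; fi; }

k8s_scenario() {
  run kind create cluster --name ep-test            # 1. k8s up
  run kubectl apply -f kube-state-metrics/          #    prerequisite for state_* metricsets
  run kubectl apply -f elastic-agent-k8s.yml        # 2. deploy and enroll the Agent
  # ... system tests run here while the cluster stays up ...
  run kind delete cluster --name ep-test            # 3. k8s down
}
```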

Note: I think this scenario can be expanded to test other packages like istio and ingress-controller by adding them as extra flags in step 1.

OCP

  1. This step will be the same as for vanilla k8s; the only difference will be the installation step, where we need an Openshift installation. If we want a from-scratch installation here, we need to run the GCP installer, which takes ~40 mins to bring the cluster up. Not sure if this can be part of a CI job; maybe it can be a nightly job. Related to Add Openshift scenario for CI beats#17962. Ping me directly for more info ;).
  2. Same as vanilla k8s, but we will most probably need slightly different manifests because of OCP restrictions.
  3. Same as vanilla k8s, but use the GCP installer script to bring the cluster down.

Note 1: This is only for testing the k8s module, but it should be quite similar for testing Autodiscover.
Note 2: Running Agent on k8s is not yet completely decided. Progress/discussion around this happens in the k8s-agent WP, cc: @blakerouse

@kaiyan-sheng
Contributor

For AWS testing, we can use a Terraform script (or anything similar) per data stream/package to create AWS services for testing and clean up after testing. I think we have an AWS account for testing in the Beats Jenkins (@jsoriano knows more about this) and we can leverage it here.

For metrics: an example could be running elastic-package test ec2-metrics locally to apply the Terraform script and create an EC2 instance in AWS, wait a while until EC2 metrics are sent to CloudWatch, check the events collected by the ec2-metrics package, and delete the EC2 instance at the end.

For logs: we already have sample files to test the pipelines, but it would be good to have Terraform set up S3-SQS to test the inputs.

There are two use cases here: one is to run this in CI, and the other is for package developers to test locally. Because creating services can be costly, we should consider how frequently we should run elastic-package test ec2-metrics in CI.
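The metrics flow described above amounts to an apply/test/destroy loop. A sketch; the package path, the wait time, and the DRY_RUN wrapper are assumptions, and destroy runs even if the test step fails so resources are not leaked:

```shell
# With DRY_RUN=1 commands are printed rather than executed.
run() { if [ -n "${DRY_RUN:-}" ]; then echo "+ $*"; else "$@"; fi; }

ec2_scenario() {
  tf="packages/aws/data_stream/ec2_metrics/_dev/deploy/tf"  # assumed layout
  run terraform -chdir="$tf" init
  run terraform -chdir="$tf" apply -auto-approve      # create the EC2 instance
  run sleep 300                                       # let metrics reach CloudWatch
  run elastic-package test system || status=$?        # check collected events
  run terraform -chdir="$tf" destroy -auto-approve    # always delete the instance
  return "${status:-0}"
}
```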

@mtojek
Contributor

mtojek commented Jan 11, 2021

There are two use cases here: one is to run this in CI, and the other is for package developers to test locally. Because creating services can be costly, we should consider how frequently we should run elastic-package test ec2-metrics in CI.

With this PR, elastic/integrations#474, tests will be executed only if the relevant packages have changed (in this case the AWS integration) or if this is the master branch.

Regarding elastic-package test k8s and elastic-package test ec2-metrics, I think we need to come up with an open, flexible API, so that we don't have to modify the CLI every time we introduce a new platform, but this is something we'll research :) I admit I haven't looked at k8s as a separate stack, rather as a service under test that is alive for the duration of a test. Keeping it as a separate stack (like the Elastic stack) might actually simplify things.

@narph

narph commented Jan 11, 2021

For Azure we can look at something similar to the use case above. I previously worked on a POC using Pulumi, which authenticates the user, creates a storage account, fetches metrics, validates them, and then removes the entire deployment.
I hope it is of interest here: elastic/beats#21850.
Maybe something like elastic-package test azure storage could replace the entire process.
For Azure logs, more steps are required; for example, after creating the event hub we will have to populate it with some valid/invalid messages.
Not sure how much detail we should go into in this issue.

@mtojek
Contributor

mtojek commented Jan 14, 2021

I'm taking this issue.

@mtojek
Contributor

mtojek commented Jan 14, 2021

Thank you for all the feedback, folks! We had a sync-up with @ycombinator to discuss possible options.
Technically, we'll try to implement a generic Terraform-based test runner. We wouldn't like to include AWS/Azure/K8s references in the CLI; let's try to make it as generic as possible. The approach will be truly declarative, which is in line with the original principle (no programming language required).

Here is a list of action items to help us solve this issue.

Dev changes in package-spec:

  • Allow for data-stream level _dev/deploy definitions
    or
  • Mount extra files for the data stream at runtime (it may avoid building the image multiple times)

@ycombinator, I still have doubts about which path we should follow. If you have any preferences or see benefits in either of them, please feel free to share.

Changes in elastic-package:

Changes in integrations:

@ChrsMark
Member

Thanks for the heads-up @mtojek! Feel free to reach out to me if you have any questions about the k8s specifics, since it can be tricky with the different components we collect from, unlike other clouds where we define a single exposed endpoint.

@kaiyan-sheng
Contributor

With this PR, elastic/integrations#474, tests will be executed only if the relevant packages have changed (in this case the AWS integration) or if this is the master branch.

Great, thank you!

@ycombinator
Contributor Author

Thanks for the write up and breakdown of tasks, @mtojek. Very helpful!

Dev changes in package-spec:

  • Allow for data-stream level _dev/deploy definitions
    or
  • Mount extra files for the data stream at runtime (it may avoid building the image multiple times)

@ycombinator, I still have doubts about which path we should follow. If you have any preferences or see benefits in either of them, please feel free to share.

I recall discussing the first option (allow for data-stream level _dev/deploy definitions) in our meeting today, but not the second one (mount extra files for the data stream at runtime). Would you mind explaining some details about the second option? Thanks.

@mtojek
Contributor

mtojek commented Jan 14, 2021

(I came to this point from observing the Zeek integration.)

I can elaborate on this. Imagine we have an integration XYZ with data streams A, B, C, ..., Z. Every data stream is basically the same Docker image with a Terraform executor and its own set of static tf templates. The improvement is to use a single Docker image and simply mount (switch) the templates for the data stream test scenario. This way it will be faster than building a new Docker image per data stream.
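The mount switch could be as simple as computing a per-data-stream bind mount for one shared executor image; the directory layout, helper name, and image name here are hypothetical:

```shell
# Build the docker -v argument that swaps in one data stream's templates.
tf_mount_for() {
  echo "$1/data_stream/$2/_dev/deploy/tf:/workspace"
}

# Illustrative usage with an assumed shared executor image:
#   docker run --rm -v "$(tf_mount_for "$PWD" ec2_metrics)" \
#     -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY terraform-executor apply
```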

@ycombinator
Contributor Author

ycombinator commented Jan 14, 2021

I always assumed (but probably didn't make it explicit, sorry!) that there would be one shared/common TF executor Docker image used by the TF service deployer. The definition and maintenance of this image is the responsibility of elastic-package developers, as opposed to package developers.

The part that varies is the TF templates, whether those come from the package level ({package}/_dev/deploy/tf/...) or the data stream level ({package}/data_stream/{data stream}/_dev/deploy/tf/...). The definition and maintenance of these templates is the responsibility of package developers.

So I think we're on the same page?

@mtojek
Contributor

mtojek commented Jan 14, 2021

The part that varies is the TF templates, whether those come from the package level ({package}/_dev/deploy/tf/...) or the data stream level ({package}/data_stream/{data stream}/_dev/deploy/tf/...). The definition and maintenance of these templates is the responsibility of package developers.

I agree with the rest of your comment. Regarding the quoted paragraph: what is the best way of processing these TF templates (belonging to particular data streams)? Load them at runtime? Include them at build time (one image build per data stream)?

(I think we're on the same page, just confirming the implementation details :)

@ycombinator
Contributor Author

ycombinator commented Jan 14, 2021

Load them at runtime? Include them at build time (one image build per data stream)?

There is also a third option: include all of them at image build time (so you are not building one image per data stream) and then select the right data stream's templates at runtime.
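This third option could be sketched as an entrypoint step inside the shared image: all template sets are baked in, and an environment variable selects one at runtime. The /templates layout and the DATA_STREAM variable are assumptions for illustration:

```shell
# Copy the baked-in templates for the requested data stream into the
# Terraform working directory; fails fast if no data stream was selected.
select_templates() {
  src="$1/${DATA_STREAM:?DATA_STREAM must be set}"
  mkdir -p "$2" && cp -R "$src/." "$2/"
}

# Illustrative entrypoint usage:
#   select_templates /templates /workspace && terraform -chdir=/workspace apply
```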

At any rate, I don't know if there's an obvious answer to this one. I would suggest trying one of the options, probably the one you think is simplest to implement, see how well it performs and then iterate from there as necessary.

@jsoriano
Member

jsoriano commented Jan 15, 2021

+1 to implement this as a generic declarative Terraform-based runner 👍

Some comments in case they are helpful:

  • In Run kubernetes integration tests inside of a pod and use kind to setup a kubernetes cluster beats#17656, Blake extended mage goIntegTest for Metricbeat to be able to run tests in Kubernetes (with kind) apart from the usual docker compose. There it was also done in a generic way: one provider or the other was used depending on the available files. A similar approach could be followed here to continue supporting docker-compose, or if we want to support other providers in the future.
  • A use case I think can be powerful for some complex scenarios is the use of Kubernetes operators to provide testing scenarios. Operators make complex deployments easier. I have used https://strimzi.io/ in the past to deploy Kafka clusters with different authentication methods (something quite painful and error-prone to do from scratch), and https://kubecf.io/ to deploy Cloud Foundry clusters (the best method I have found so far to reliably deploy CF in an automated way).
    The usual approach for them is to create a namespace, install them there with helm, and use them with custom resources. In principle all of this is supported by Terraform (namespace, helm, and custom resources), but I would like to see it validated when this is implemented 🙂
  • Common code can be tricky with Terraform. This is more an implementation detail, but something to take into account, because it is going to cause problems at some point. You will find cases where sharing some Terraform code between scenarios will be useful (or even needed). For example, when defining the providers, the authentication methods can differ depending on whether the tests run locally or in CI (e.g. to use a private docker registry in CI), and can be shared between packages (e.g. multiple packages using the aws or kubernetes providers); maybe a way to solve this is not to include the provider definitions in the scenarios, and to generate them when running the tests depending on some config. Another example is shared code itself: it may be tempting to rely on some resource always being present in a cloud provider, but this can be unreliable and can make it difficult to support packages living in their own repositories. An example of this is the usual network configuration needed when deploying instances in a cloud provider; maybe elastic-package should always provide some base resources when specific providers are used, so scenarios can be simpler. Same with Kubernetes: a scenario could define some Kubernetes resources, but elastic-package would provide the cluster and the credentials.
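The point about not committing provider definitions to each scenario could be sketched as generating them at test time; the CI detection and the profile name below are assumptions:

```shell
# Emit a provider block that matches the environment the tests run in,
# so scenarios stay free of authentication details.
gen_providers() {
  if [ -n "${CI:-}" ]; then
    # CI: credentials are injected as environment variables
    printf 'provider "aws" {}\n'
  else
    # local: use a named profile from the developer machine
    printf 'provider "aws" {\n  profile = "elastic-package-dev"\n}\n'
  fi
}

# Illustrative usage, writing next to the scenario's own .tf files:
#   gen_providers > providers.tf
```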

@mtojek
Contributor

mtojek commented Jan 15, 2021

Thank you for sharing your thoughts, lots of tricky ideas ;) I like the idea of kops.

In elastic/beats#17656 Blake extended mage goIntegTest for Metricbeat to be able to run tests in Kubernetes (with kind) apart of the usual docker compose. There it was also done in a generic way, one provider or the other were used depending on the available files. A similar approach could be followed here to continue supporting docker-compose, or if we want to support other providers in the future.

Honestly, I think we're not there yet. First, the Elastic Agent needs to support autodiscovery and the Kubernetes runtime. Then we can think about potential integrations.
Keep in mind that we'd like to examine integrations, not the entire end-to-end flow. I would leave the verification of Elastic Agent functionality in different runtimes to the Agent or the e2e-tests.

@ChrsMark
Member

@mtojek @ycombinator FYI, for k8s package testing I'm using some mock APIs so as to make progress until we reach a more permanent solution. You can find more at elastic/integrations#569.

While working with these mocks I realize more and more the need for running against an actual k8s cluster, and more specifically for having the Agent deployed natively on the cluster. Without this, many things we need, like k8s tokens, certs, etc., will not be valid.

@ycombinator
Contributor Author

While working with these mocks I realize more and more the need for running against an actual k8s cluster, and more specifically for having the Agent deployed natively on the cluster. Without this, many things we need, like k8s tokens, certs, etc., will not be valid.

This is super valuable information. @mtojek and I have informally discussed the idea that for some service deployers it might make sense to deploy the Agent "alongside" the service; your findings seem to be along these lines, so this is very valuable feedback. Thank you!

@mtojek
Contributor

mtojek commented Feb 3, 2021

@kaiyan-sheng AWS integration can be tested now using the Terraform executor (sample here: https://github.com/elastic/integrations/tree/master/packages/aws/data_stream/ec2_metrics).

@narph this feature is written in a generic way. If you pass secrets for Azure and write some TF code, it's expected to work.

EDIT:

We just need to enable secrets on the Jenkins side, but that shouldn't be a big issue (unless we don't have them generated at all).

@mtojek
Contributor

mtojek commented Feb 18, 2021

Let me summarize:

We've delivered (and applied in Integrations):

  • a generic Terraform service deployer, which currently supports AWS and possibly other providers like Azure, GCP, etc., using environment variables to pass credentials
  • a Kubernetes service deployer, which uses kind and potentially additional resources (e.g. a custom application deployment)

@mtojek mtojek closed this as completed Feb 18, 2021