add agent for testing pod networking #1448

Merged
merged 2 commits into from
May 3, 2021

Contributor

@abhipth commented Apr 29, 2021

What type of PR is this?
Add the test agent. The test agent docker image provides testing utilities that will be called from the automation test suite to do test verification.

For instance, to test the networking setup of all pods on a host, the automation can call the agent, which uses the netlink package to inspect the networking configuration on that host.

Server Deployment and Pods

apiVersion: apps/v1
kind: Deployment
metadata:
  name: traffic-server
  labels:
    app: traffic-server
spec:
  replicas: 5
  selector:
    matchLabels:
      app: traffic-server
  template:
    metadata:
      labels:
        app: traffic-server
    spec:
      containers:
      - name: server
        image: REDACTED.dkr.ecr.us-west-2.amazonaws.com/amazon/amazon-k8s-cni/test/agent:v1.6.4-rc1-134-gfd1a17cf
        command: ['./traffic-server']
default       traffic-server-5c946cc7d-6w4jb   1/1     Running     0          16m     10.2.3.166     ip-10-2-126-13.us-west-2.compute.internal    <none>           <none>
default       traffic-server-5c946cc7d-fgdck   1/1     Running     0          16m     10.0.243.145   ip-10-0-64-125.us-west-2.compute.internal    <none>           <none>
default       traffic-server-5c946cc7d-kvvpm   1/1     Running     0          16m     10.1.184.128   ip-10-1-216-124.us-west-2.compute.internal   <none>           <none>
default       traffic-server-5c946cc7d-p4bd7   1/1     Running     0          16m     10.2.199.56    ip-10-2-126-13.us-west-2.compute.internal    <none>           <none>
default       traffic-server-5c946cc7d-rr6px   1/1     Running     0          16m     10.1.220.135   ip-10-1-216-124.us-west-2.compute.internal   <none>           <none>

Client Job and pods

apiVersion: batch/v1
kind: Job
metadata:
  name: traffic-client
spec:
  parallelism: 4
  template:
    spec:
      containers:
      - name: traffic-client
        image: REDACTED.dkr.ecr.us-west-2.amazonaws.com/amazon/amazon-k8s-cni/test/agent:v1.6.4-rc1-134-gfd1a17cf
        command: ['./traffic-client']
        args: ['-server-list-csv=10.2.3.166,10.0.243.145,10.1.184.128,10.2.199.56,10.1.220.135, 10.0.64.125', '-metric-aggregator-addr=http://10.0.154.81:8080/submit/metric/connectivity']
      restartPolicy: Never
  backoffLimit: 4
default       traffic-client-5rbn8             0/1     Completed   0          53s     10.2.65.187    ip-10-2-126-13.us-west-2.compute.internal    <none>           <none>
default       traffic-client-fktg9             0/1     Completed   0          53s     10.2.24.106    ip-10-2-126-13.us-west-2.compute.internal    <none>           <none>
default       traffic-client-nsb9d             0/1     Completed   0          53s     10.0.161.69    ip-10-0-64-125.us-west-2.compute.internal    <none>           <none>
default       traffic-client-wrb8f             0/1     Completed   0          53s     10.1.181.173   ip-10-1-216-124.us-west-2.compute.internal   <none>           <none>

Metric Server Pod

apiVersion: v1
kind: Pod
metadata:
  name: metric-server
  labels:
    app: metric-server
spec:
  containers:
  - image: REDACTED.dkr.ecr.us-west-2.amazonaws.com/amazon/amazon-k8s-cni/test/agent:v1.6.4-rc1-134-gfd1a17cf
    command: [./metric-server]
    imagePullPolicy: IfNotPresent
    name: metric-server
  restartPolicy: Always

default       metric-server                    1/1     Running     0          2m20s   10.0.154.81    ip-10-0-64-125.us-west-2.compute.internal    <none>           <none>

Output of the metric aggregator
No server is running on 10.0.64.125, so every client pod reports a failure for that IP.

[
  {
    "SuccessCount": 5,
    "FailureCount": 1,
    "SourcePod": "traffic-client-nsb9d",
    "Failures": [
      {
        "DestinationIP": " 10.0.64.125:2273",
        "FailureReason": "failed to connect to server dial tcp: lookup  10.0.64.125: no such host"
      }
    ]
  },
  {
    "SuccessCount": 5,
    "FailureCount": 1,
    "SourcePod": "traffic-client-wrb8f",
    "Failures": [
      {
        "DestinationIP": " 10.0.64.125:2273",
        "FailureReason": "failed to connect to server dial tcp: lookup  10.0.64.125: no such host"
      }
    ]
  },
  {
    "SuccessCount": 5,
    "FailureCount": 1,
    "SourcePod": "traffic-client-5rbn8",
    "Failures": [
      {
        "DestinationIP": " 10.0.64.125:2273",
        "FailureReason": "failed to connect to server dial tcp: lookup  10.0.64.125: no such host"
      }
    ]
  },
  {
    "SuccessCount": 5,
    "FailureCount": 1,
    "SourcePod": "traffic-client-fktg9",
    "Failures": [
      {
        "DestinationIP": " 10.0.64.125:2273",
        "FailureReason": "failed to connect to server dial tcp: lookup  10.0.64.125: no such host"
      }
    ]
  }
]
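
Note that the `FailureReason` above is a name lookup error rather than a refused or timed-out connection. The last entry of `-server-list-csv` in the client Job is written as `, 10.0.64.125`, and the `DestinationIP` in the output still carries that leading space, so the address no longer parses as an IP literal. Below is a minimal sketch of how the client could normalize the list before dialing. `parseServerList` is a hypothetical helper and not part of this PR; port 2273 is taken from the output above.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// parseServerList is a hypothetical helper (not from this PR). It splits a
// comma-separated list of server IPs, trims surrounding whitespace, and
// rejects any entry that is not a valid IP literal.
func parseServerList(csv string) ([]string, error) {
	var servers []string
	for _, field := range strings.Split(csv, ",") {
		addr := strings.TrimSpace(field)
		if addr == "" {
			continue // tolerate trailing commas and empty fields
		}
		if net.ParseIP(addr) == nil {
			return nil, fmt.Errorf("invalid server IP %q", addr)
		}
		servers = append(servers, addr)
	}
	return servers, nil
}

func main() {
	servers, err := parseServerList("10.2.3.166,10.0.243.145, 10.0.64.125")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, s := range servers {
		// net.JoinHostPort builds the host:port address passed to the dialer.
		fmt.Println(net.JoinHostPort(s, "2273"))
	}
}
```

With the space trimmed, the client would attempt a real TCP connection to that address. It would presumably still fail if no server listens there, but the reported reason would then be a connection error instead of a lookup error.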

The README.md has the complete description of the purpose of this change.

Which issue does this PR fix:
Automation Test

What does this PR do / Why do we need it:
Automation Test

If an issue # is not available please add repro steps and logs from IPAMD/CNI showing the issue:
NA

Testing done on this change:
Yes, tested locally.

Automation added to e2e:
Yes

Will this break upgrades or downgrades. Has updating a running cluster been tested?:
No

Does this change require updates to the CNI daemonset config files to work?:
No

Does this PR introduce any user-facing change?:
No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@@ -0,0 +1,71 @@
###Test Agent

Good to see this approach for testing 👍

secondaryRouteTableIndex := make(map[int]bool)

// For each Pod validate the Pod networking
for _, pod := range podNetworkingValidationInput.PodList {

Instead of passing the list of pods from the caller, consider running this as a daemon on every node. Each daemon can discover the pods running on its own node, run all validation checks for each pod, and report the results. That way the framework scales to a large number of nodes and pods.
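
A rough sketch of the per-node selection step, under stated assumptions: the `Pod` type and `podsOnNode` are hypothetical stand-ins, and `NODE_NAME` is assumed to be injected into the daemon's container (for example through the downward API). Against a real cluster, the pod field selector `spec.nodeName=<node>` could do this filtering on the API server side instead.

```go
package main

import (
	"fmt"
	"os"
)

// Pod is a hypothetical, trimmed-down stand-in for the pod objects the agent
// validates; it is not the type used in this PR.
type Pod struct {
	Name     string
	NodeName string
	IP       string
}

// podsOnNode returns only the pods scheduled on the given node.
func podsOnNode(all []Pod, nodeName string) []Pod {
	var local []Pod
	for _, p := range all {
		if p.NodeName == nodeName {
			local = append(local, p)
		}
	}
	return local
}

func main() {
	// NODE_NAME is an assumed environment variable naming the node the daemon runs on.
	nodeName := os.Getenv("NODE_NAME")

	// Sample data copied from the pod listing in the PR description.
	all := []Pod{
		{Name: "traffic-server-5c946cc7d-6w4jb", NodeName: "ip-10-2-126-13.us-west-2.compute.internal", IP: "10.2.3.166"},
		{Name: "traffic-server-5c946cc7d-fgdck", NodeName: "ip-10-0-64-125.us-west-2.compute.internal", IP: "10.0.243.145"},
		{Name: "traffic-server-5c946cc7d-p4bd7", NodeName: "ip-10-2-126-13.us-west-2.compute.internal", IP: "10.2.199.56"},
	}

	for _, p := range podsOnNode(all, nodeName) {
		// A real daemon would run the pod networking validation here.
		fmt.Printf("would validate %s (%s) on %s\n", p.Name, p.IP, p.NodeName)
	}
}
```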

log.Printf("validated route table for secondary ENI %d has right routes", index)
}

return validationErrors

It would be nice to have a struct that records the pod and the failure type(s) for each validation error. That will make debugging easier.


var connectivityMetric []input.TestStatus

// metric server stores metrics from test client and returns the aggregated metrics to the

It'd be nice to expose different types of metrics (failures, successes), add labels to each metric, and additionally expose them in Prometheus format. This will come in handy for long-running validation and some other use cases.

###Test Agent
The test agent contains multiple binaries that are used by Ginkgo Automation tests.

###List of Go Binaries in the Agent

`###List of Go Binaries in the Agent` should be `### List of Go Binaries in the Agent` (with a space after the hashes) so it renders correctly. The same applies to all the other headings.
