add agent for testing pod networking #1448

abhipth · 2021-04-29T17:34:51Z

What type of PR is this?
Add the test agent. The test agent Docker image provides testing utilities that the automation test suite calls to verify test results.

For instance, to verify the networking setup of all pods on a host, the agent uses the netlink package to inspect that host's network configuration.

Server Deployment and Pods

apiVersion: apps/v1
kind: Deployment
metadata:
  name: traffic-server
  labels:
    app: traffic-server
spec:
  replicas: 5
  selector:
    matchLabels:
      app: traffic-server
  template:
    metadata:
      labels:
        app: traffic-server
    spec:
      containers:
      - name: server
        image: REDACTED.dkr.ecr.us-west-2.amazonaws.com/amazon/amazon-k8s-cni/test/agent:v1.6.4-rc1-134-gfd1a17cf
        command: ['./traffic-server']

default       traffic-server-5c946cc7d-6w4jb   1/1     Running     0          16m     10.2.3.166     ip-10-2-126-13.us-west-2.compute.internal    <none>           <none>
default       traffic-server-5c946cc7d-fgdck   1/1     Running     0          16m     10.0.243.145   ip-10-0-64-125.us-west-2.compute.internal    <none>           <none>
default       traffic-server-5c946cc7d-kvvpm   1/1     Running     0          16m     10.1.184.128   ip-10-1-216-124.us-west-2.compute.internal   <none>           <none>
default       traffic-server-5c946cc7d-p4bd7   1/1     Running     0          16m     10.2.199.56    ip-10-2-126-13.us-west-2.compute.internal    <none>           <none>
default       traffic-server-5c946cc7d-rr6px   1/1     Running     0          16m     10.1.220.135   ip-10-1-216-124.us-west-2.compute.internal   <none>           <none>

Client Job and pods

apiVersion: batch/v1
kind: Job
metadata:
  name: traffic-client
spec:
  parallelism: 4
  template:
    spec:
      containers:
      - name: traffic-client
        image: REDACTED.dkr.ecr.us-west-2.amazonaws.com/amazon/amazon-k8s-cni/test/agent:v1.6.4-rc1-134-gfd1a17cf
        command: ['./traffic-client']
        args: ['-server-list-csv=10.2.3.166,10.0.243.145,10.1.184.128,10.2.199.56,10.1.220.135, 10.0.64.125', '-metric-aggregator-addr=http://10.0.154.81:8080/submit/metric/connectivity']
      restartPolicy: Never
  backoffLimit: 4

default       traffic-client-5rbn8             0/1     Completed   0          53s     10.2.65.187    ip-10-2-126-13.us-west-2.compute.internal    <none>           <none>
default       traffic-client-fktg9             0/1     Completed   0          53s     10.2.24.106    ip-10-2-126-13.us-west-2.compute.internal    <none>           <none>
default       traffic-client-nsb9d             0/1     Completed   0          53s     10.0.161.69    ip-10-0-64-125.us-west-2.compute.internal    <none>           <none>
default       traffic-client-wrb8f             0/1     Completed   0          53s     10.1.181.173   ip-10-1-216-124.us-west-2.compute.internal   <none>           <none>

Metric Server Pod

apiVersion: v1
kind: Pod
metadata:
  name: metric-server
  labels:
    app: metric-server
spec:
  containers:
  - image: REDACTED.dkr.ecr.us-west-2.amazonaws.com/amazon/amazon-k8s-cni/test/agent:v1.6.4-rc1-134-gfd1a17cf
    command: [./metric-server]
    imagePullPolicy: IfNotPresent
    name: metric-server
  restartPolicy: Always

default       metric-server                    1/1     Running     0          2m20s   10.0.154.81    ip-10-0-64-125.us-west-2.compute.internal    <none>           <none>

Output of the metric aggregator:
No server is running on 10.0.64.125, so every client pod reports a failure for that IP.

[
  {
    "SuccessCount": 5,
    "FailureCount": 1,
    "SourcePod": "traffic-client-nsb9d",
    "Failures": [
      {
        "DestinationIP": " 10.0.64.125:2273",
        "FailureReason": "failed to connect to server dial tcp: lookup  10.0.64.125: no such host"
      }
    ]
  },
  {
    "SuccessCount": 5,
    "FailureCount": 1,
    "SourcePod": "traffic-client-wrb8f",
    "Failures": [
      {
        "DestinationIP": " 10.0.64.125:2273",
        "FailureReason": "failed to connect to server dial tcp: lookup  10.0.64.125: no such host"
      }
    ]
  },
  {
    "SuccessCount": 5,
    "FailureCount": 1,
    "SourcePod": "traffic-client-5rbn8",
    "Failures": [
      {
        "DestinationIP": " 10.0.64.125:2273",
        "FailureReason": "failed to connect to server dial tcp: lookup  10.0.64.125: no such host"
      }
    ]
  },
  {
    "SuccessCount": 5,
    "FailureCount": 1,
    "SourcePod": "traffic-client-fktg9",
    "Failures": [
      {
        "DestinationIP": " 10.0.64.125:2273",
        "FailureReason": "failed to connect to server dial tcp: lookup  10.0.64.125: no such host"
      }
    ]
  }
]
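
Note that the `FailureReason` above shows a lookup of ` 10.0.64.125` with a leading space, which matches the space after the last comma in the `-server-list-csv` argument. A minimal sketch of trimming each entry before it is dialed, assuming a hypothetical `parseServerList` helper (this is not the agent's actual code):

```go
package main

import (
	"fmt"
	"strings"
)

// parseServerList is a hypothetical helper, not part of the agent.
// It splits a comma-separated server list, trims whitespace around
// each entry, and drops entries that are empty after trimming.
func parseServerList(csv string) []string {
	var servers []string
	for _, entry := range strings.Split(csv, ",") {
		if s := strings.TrimSpace(entry); s != "" {
			servers = append(servers, s)
		}
	}
	return servers
}

func main() {
	// The last entry carries a leading space, as in the Job args above.
	fmt.Printf("%q\n", parseServerList("10.2.3.166,10.0.243.145, 10.0.64.125"))
}
```

With trimmed entries, a failure for an address would reflect the connection attempt itself rather than a name lookup on a string that starts with a space.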

The README.md has the complete description of the purpose of this change.

Which issue does this PR fix:
Automation Test

What does this PR do / Why do we need it:
Automation Test

If an issue # is not available please add repro steps and logs from IPAMD/CNI showing the issue:
NA

Testing done on this change:
Yes, tested locally.

Automation added to e2e:
Yes

Will this break upgrades or downgrades. Has updating a running cluster been tested?:
No

Does this change require updates to the CNI daemonset config files to work?:
No

Does this PR introduce any user-facing change?:
No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

fawadkhaliq · 2021-04-30T17:22:23Z

test/agent/README.md

@@ -0,0 +1,71 @@
+###Test Agent


Good to see this approach for testing 👍

fawadkhaliq · 2021-04-30T17:24:40Z

test/agent/cmd/networking/tester/network.go

+	secondaryRouteTableIndex := make(map[int]bool)
+
+	// For each Pod validate the Pod networking
+	for _, pod := range podNetworkingValidationInput.PodList {


Instead of passing the list of pods from the caller, consider running this as a daemon on every node. Each daemon can discover the pods running on its own node, run all validation checks for each pod, and report the results. That way the framework scales to a large number of nodes and pods.

fawadkhaliq · 2021-04-30T17:26:56Z

test/agent/cmd/networking/tester/network.go

+		log.Printf("validated route table for secondary ENI %d has right routes", index)
+	}
+
+	return validationErrors


It would be nice to have a struct that records the pod and its failure type(s); that would make debugging easier.
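
One possible shape for such a struct, as a sketch only. The type names, field names, and failure kinds below are all hypothetical and do not come from the PR; the sketch assumes the function keeps returning a slice of errors, so the struct implements `error`:

```go
package main

import "fmt"

// FailureType is a hypothetical enumeration of validation failure kinds.
type FailureType string

const (
	FailureMissingRoute  FailureType = "MissingRoute"
	FailureMissingIPRule FailureType = "MissingIPRule"
	FailureVethNotFound  FailureType = "VethNotFound"
)

// PodValidationFailure (hypothetical) pairs a pod with every check
// that failed for it, so a report can be grouped per pod.
type PodValidationFailure struct {
	PodName  string
	PodIP    string
	Failures []FailureType
}

// Error lets the struct be collected in an existing []error.
func (f PodValidationFailure) Error() string {
	return fmt.Sprintf("pod %s (%s) failed validation: %v", f.PodName, f.PodIP, f.Failures)
}

func main() {
	var validationErrors []error
	validationErrors = append(validationErrors, PodValidationFailure{
		PodName:  "traffic-server-5c946cc7d-6w4jb",
		PodIP:    "10.2.3.166",
		Failures: []FailureType{FailureMissingRoute, FailureMissingIPRule},
	})
	for _, err := range validationErrors {
		fmt.Println(err)
	}
}
```

A caller that needs the structured fields could recover them with `errors.As` instead of parsing the message.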

fawadkhaliq · 2021-04-30T17:30:51Z

test/agent/cmd/metric-server/main.go

+
+var connectivityMetric []input.TestStatus
+
+// metric server stores metrics from test client and returns the aggregated metrics to the


It'd be nice to expose different types of metrics (failures and successes), add labels to each metric, and additionally expose them in Prometheus format. It will come in handy for long-running validation and some other use cases.
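
To make the suggestion concrete, here is a rough sketch of rendering per-pod results in the Prometheus text exposition format using only the standard library. The `TestStatus` fields mirror the JSON output shown in the PR description, but the metric name `connectivity_checks`, the label names, and the `renderPrometheus` function are assumptions for illustration; a real implementation would more likely use the Prometheus Go client library.

```go
package main

import (
	"fmt"
	"strings"
)

// TestStatus mirrors a subset of the fields seen in the aggregator's
// JSON output; the real input.TestStatus type may differ.
type TestStatus struct {
	SuccessCount int
	FailureCount int
	SourcePod    string
}

// renderPrometheus (hypothetical) emits one labeled sample per pod and
// result. %q is used for label values, which is adequate for pod names
// because they contain no characters that need special escaping.
func renderPrometheus(statuses []TestStatus) string {
	var b strings.Builder
	b.WriteString("# HELP connectivity_checks Connectivity checks per client pod and result.\n")
	b.WriteString("# TYPE connectivity_checks gauge\n")
	for _, s := range statuses {
		fmt.Fprintf(&b, "connectivity_checks{source_pod=%q,result=\"success\"} %d\n", s.SourcePod, s.SuccessCount)
		fmt.Fprintf(&b, "connectivity_checks{source_pod=%q,result=\"failure\"} %d\n", s.SourcePod, s.FailureCount)
	}
	return b.String()
}

func main() {
	fmt.Print(renderPrometheus([]TestStatus{
		{SuccessCount: 5, FailureCount: 1, SourcePod: "traffic-client-nsb9d"},
	}))
}
```

Splitting success and failure into a `result` label, rather than two metric names, keeps queries that aggregate across pods or results simple.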

fawadkhaliq · 2021-04-30T17:36:58Z

test/agent/README.md

+###Test Agent
+The test agent contains multiple binaries that are used by Ginkgo Automation tests. 
+
+###List of Go Binaries in the Agent


`###List of Go Binaries in the Agent` should be `### List of Go Binaries in the Agent` so it renders correctly. Same for all the other places.

fawadkhaliq approved these changes Apr 30, 2021

fawadkhaliq reviewed Apr 30, 2021

abhipth added 2 commits May 3, 2021 22:21

add agent for testing pod networking

26a633e

fix formatting in makefile and minor bugs

0c174c3

abhipth force-pushed the ip-rules branch from 3d60fd4 to 0c174c3 May 3, 2021 19:23

fawadkhaliq merged commit 6e58f7f into aws:master May 3, 2021

abhipth deleted the ip-rules branch May 5, 2021 05:55

This was referenced Jun 10, 2021

🥳 aws-vpc-cni v1.8.0 Automated Release! 🥑 aws/eks-charts#536

Closed

🥳 aws-vpc-cni v1.8.0 Automated Release! 🥑 aws/eks-charts#538

Merged
