
datadog chart makes bad decision on Service internalTrafficPolicy setting in K8s/EKS 1.22 #625

Closed
medavisjr opened this issue May 16, 2022 · 13 comments
Labels: bug, chart/datadog

Comments


medavisjr commented May 16, 2022

Describe what happened:

Running on EKS 1.22, the Datadog chart automatically sets the agent Service's internal traffic policy to Local, on the assumption that the feature gate for this is beta and enabled by default in K8s 1.22+. This is incorrect: the feature is still in alpha state in 1.22.

This causes all requests to the agent Service that the Helm chart creates to fail on K8s 1.22, unless the ServiceInternalTrafficPolicy feature gate is enabled by the K8s admin.

What's worse, this is impossible on EKS, which does not support alpha feature gates and provides no way to enable them manually.

Describe what you expected:

The chart should have correct version-detection logic and should not deploy the Datadog agent Service resource with spec.internalTrafficPolicy: Local when running on K8s 1.22. On this version of K8s, this should be opt-in, not opt-out.
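For anyone checking their own cluster, the policy the chart actually rendered can be read straight off the Service. A minimal sketch, assuming the default Service name and namespace used later in this issue; on an affected 1.22 cluster this should print Local:

~
❯ kubectl -n datadog get service datadog-agent -o jsonpath='{.spec.internalTrafficPolicy}'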

Steps to reproduce the issue:

  1. Be in an EKS cluster at version 1.22:
~
❯ kubectl version --short=true
Client Version: v1.23.4
Server Version: v1.22.6-eks-7d68063

~
❯ kubectl get nodes
NAME                     STATUS   ROLES    AGE    VERSION
<ip>.ec2.internal   Ready    <none>   146m   v1.22.6-eks-7d68063
<ip>.ec2.internal   Ready    <none>   37d    v1.22.6-eks-7d68063
<ip>.ec2.internal   Ready    <none>   146m   v1.22.6-eks-7d68063
<ip>.ec2.internal    Ready    <none>   38d    v1.22.6-eks-7d68063
<ip>.ec2.internal   Ready    <none>   8h     v1.22.6-eks-7d68063
<ip>.ec2.internal    Ready    <none>   38d    v1.22.6-eks-7d68063
<ip>.ec2.internal    Ready    <none>   146m   v1.22.6-eks-7d68063
<ip>.ec2.internal    Ready    <none>   146m   v1.22.6-eks-7d68063
<ip>.ec2.internal    Ready    <none>   38d    v1.22.6-eks-7d68063
<ip>.ec2.internal     Ready    <none>   137m   v1.22.6-eks-7d68063
  2. Deploy the Datadog Helm chart with no custom values for agents.localService.*. Full values file contents for the curious:
registry: public.ecr.aws/datadog

datadog:
  clusterName: <redacted>
  criSocketPath: /var/run/containerd/containerd.sock
  dogstatsd:
    port: 8125
    nonLocalTraffic: true
  apm:
    portEnabled: true
    port: 8126
  env:
  - name: DD_AUTOCONFIG_INCLUDE_FEATURES
    value: "containerd"
  logs:
    enabled: true
    containerCollectAll: true

clusterAgent:
  enabled: true
  rbac:
    create: true

agents:
  podSecurity:
    apparmor:
      enabled: false
  3. Try to open a TCP connection or send a UDP datagram from a Datadog agent pod (or any other pod) to the Datadog agent Service, and watch it fail:
~
❯ kubectl -n datadog exec -it datadog-agent-cluster-agent-749f4d6c5-wxmpc -- /bin/bash
root@datadog-agent-cluster-agent-749f4d6c5-wxmpc:/# telnet datadog-agent.datadog 8126
Trying 172.20.94.37...
telnet: connect to address 172.20.94.37: Connection timed out
  4. Manually edit the Datadog agent Service and change the spec.internalTrafficPolicy value from Local to Cluster (see the kubectl patch sketch after these steps):
~/code/tf-aws-re/sovrn-aws-core/core/us-east-1/eks/worker_nodes/spot_instance/re master
❯ kubectl -n datadog get service datadog-agent -o yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    meta.helm.sh/release-name: datadog-agent
    meta.helm.sh/release-namespace: datadog
  creationTimestamp: "2022-04-12T22:55:28Z"
  labels:
    app: datadog-agent
    app.kubernetes.io/instance: datadog-agent
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: datadog-agent
    app.kubernetes.io/version: "7"
    chart: datadog-2.33.4
    helm.sh/chart: datadog-2.33.4
    heritage: Helm
    release: datadog-agent
  name: datadog-agent
  namespace: datadog
  resourceVersion: "431532510"
  uid: dfe06d19-b496-493b-81b8-6e5f2b3bf85e
spec:
  clusterIP: 172.20.94.37
  clusterIPs:
  - 172.20.94.37
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: dogstatsd
    port: 8125
    protocol: UDP
    targetPort: 8125
  - name: traceport
    port: 8126
    protocol: TCP
    targetPort: 8126
  selector:
    app: datadog-agent
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
  5. Try the request again and see that it works:
~
❯ kubectl -n datadog exec -it datadog-agent-cluster-agent-749f4d6c5-wxmpc -- /bin/bash
root@datadog-agent-cluster-agent-749f4d6c5-wxmpc:/# telnet datadog-agent.datadog 8126
Trying 172.20.94.37...
Connected to datadog-agent.datadog.svc.cluster.local.
Escape character is '^]'.
^]
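The manual edit in step 4 can also be done with a one-line patch. This is a sketch only: because the Service is chart-managed, Helm will re-apply internalTrafficPolicy: Local on the next upgrade of the release.

~
# one-off workaround: flip the chart-managed Service back to Cluster routing
❯ kubectl -n datadog patch service datadog-agent --type merge -p '{"spec":{"internalTrafficPolicy":"Cluster"}}'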

Additional environment details (Operating System, Cloud provider, etc):

AWS EKS 1.22 on Bottlerocket w/ containerd

@clamoriniere (Collaborator)

Hi @rodalli ,

Maybe the doc is not up to date, but I checked the release notes again: the feature moved to beta thanks to this PR: kubernetes/kubernetes#103462

I'm guessing the issue may be something else. Could you confirm that a DaemonSet datadog-agent pod is running on the same node as the cluster-agent pod from which you ran the telnet command? Switching to Cluster means that any datadog-agent pod could have received the TCP message.
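(A quick way to check that co-location, as a sketch; the cluster-agent label and <node-name> below are assumptions based on the release name used in this issue:)

~
# where is the cluster-agent pod scheduled?
❯ kubectl -n datadog get pods -l app=datadog-agent-cluster-agent -o wide
# is there an agent pod on that same node? (replace <node-name> with the node from the previous output)
❯ kubectl -n datadog get pods -l app.kubernetes.io/component=agent -o wide --field-selector spec.nodeName=<node-name>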

clamoriniere added the chart/datadog label on May 16, 2022

medavisjr commented May 16, 2022

Yes, there is a DaemonSet running, and the Service has the correct endpoints on each node in the cluster.

~
❯ kubectl get daemonset -n datadog
NAME            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
datadog-agent   10        10        10      10           10          kubernetes.io/os=linux   33d

~
❯ kubectl get pods -n datadog -l app.kubernetes.io/component=agent
NAME                  READY   STATUS    RESTARTS   AGE
datadog-agent-2bxxs   3/3     Running   0          3h32m
datadog-agent-7sbvd   3/3     Running   0          9h
datadog-agent-9dfmr   3/3     Running   0          3h23m
datadog-agent-d76dl   3/3     Running   0          3h32m
datadog-agent-dw4w4   3/3     Running   0          4d21h
datadog-agent-jx2pn   3/3     Running   0          4d21h
datadog-agent-n2d82   3/3     Running   0          4d21h
datadog-agent-r6vqh   3/3     Running   0          3h32m
datadog-agent-t825v   3/3     Running   0          3h32m
datadog-agent-zdvs5   3/3     Running   0          4d21h

I know for sure the problem isn't that the pods are missing or misconfigured, as I've been troubleshooting this issue with Datadog support for over a week now.

Now I'm starting to wonder if for some reason AWS didn't get the memo on ServiceInternalTrafficPolicy going to beta status in 1.22. I'm trying to confirm whether or not it's enabled in our 1.22.6 cluster. More info to come.


medavisjr commented May 16, 2022

As far as I can tell, the feature is enabled by default in 1.22. Even though the doc I originally linked says otherwise, the feature gate doc for v1.22 states that it is in "beta" stage and enabled by default in 1.22. However, if somehow it's not, AWS EKS v1.22 doesn't enable it in its feature flags. Here's the relevant API server log line from my EKS 1.22 cluster:

2022-05-16T12:09:16.000-06:00 | I0516 18:09:16.937580 9 flags.go:59] FLAG: --feature-gates="CSIServiceAccountToken=true,ExternalKeyService=true,TTLAfterFinished=true"
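(For reference, assuming API server logging is enabled on the cluster, that flag line can be pulled from CloudWatch with something along these lines; the cluster name is a placeholder:)

~
# search the EKS control plane log group for the API server's --feature-gates flag
❯ aws logs filter-log-events --log-group-name "/aws/eks/<cluster-name>/cluster" --filter-pattern '"feature-gates"' --query 'events[].message' --output text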


medavisjr commented May 16, 2022

Regardless of what the default behavior is or isn't in EKS 1.22, it seems like bad design to force internalTrafficPolicy: Local on the agent Service. Yes, all things being equal, it is the better choice.

But if the setting causes issues in a cluster (as it appears to in mine), the chart doesn't provide a values setting to choose between Local and Cluster on the Service. Instead, it detects the Kubernetes version and makes the decision on its own, without giving the chart user any control over this behavior.

@clamoriniere (Collaborator)

We do have an option to disable the Service creation, but we don't want to use the Service with the Cluster option. I'll let you read this comment that I made in another issue, which explains why.

If the Local traffic policy is not available, the two other solutions are the hostPort or the UDS socket. Either way, it is very important to target the agent on the same node to get all the features working as expected.
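For anyone who needs those alternatives, a rough values sketch for the hostPort and UDS options looks like this (key names should be double-checked against the chart's values.yaml for the chart version in use):

datadog:
  dogstatsd:
    useHostPort: true        # expose DogStatsD on the node's host port (8125)
    useSocketVolume: true    # and/or mount the DogStatsD UDS socket into application pods
  apm:
    portEnabled: true        # expose the trace agent on the node's host port (8126)
    socketEnabled: true      # and/or use the APM UDS socket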

@medavisjr (Author)

Gotcha, so this is actually a hard requirement. That makes sense.

Back to the drawing board on why this doesn't seem to be working as expected in my EKS 1.22 cluster, I suppose.

@clamoriniere (Collaborator)

Unfortunately, yes.

Could you please contact our support to better track the issue and have someone try to reproduce the problem on EKS 1.22? 🙇

clamoriniere added the bug label on May 16, 2022
@anthonyralston

@rodalli Were you able to address this issue on EKS 1.22 in the end?

@medavisjr (Author)

> @rodalli Were you able to address this issue on EKS 1.22 in the end?

No, the Datadog Support team and I were not able to figure out the issue. I actually have a support case open with AWS now. There's no definitive answer yet, but it seems like it might have something to do with self-managed nodes vs. EKS managed node groups (where internalTrafficPolicy: Local seems to work fine).

@adrianmoisey

We seem to be having similar problems to this issue.
It seems as though internalTrafficPolicy: Local sometimes doesn't work as expected.
If we roll out a change to our Datadog DaemonSet, we end up with missing metrics.

We believe that we're hitting a bug in Kubernetes that causes it not to delete the conntrack entry for traffic on a stale connection.
It seems like this bug exists in various versions of Kubernetes, and their issue tracker doesn't give a clear indication of where it has and hasn't been fixed.
Here are some of the useful issues we've found:

We can also reliably reproduce this issue in all of our Kubernetes clusters, on both AWS and GKE.

I have opened an issue with Datadog support too.
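In case it helps anyone debugging the stale-connection theory, the conntrack entries for the agent Service can be inspected, and cleared, from an affected node. A sketch only, assuming conntrack-tools is installed on the node; <service-cluster-ip> is the agent Service's ClusterIP:

~
# list tracked connections destined for the agent Service ClusterIP
❯ conntrack -L -d <service-cluster-ip>
# delete stale entries so new traffic re-resolves to a live endpoint
❯ conntrack -D -d <service-cluster-ip>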


jukie commented Jun 5, 2024

@adrianmoisey did you happen to find anything? I think I'm hitting the same issue

@adrianmoisey

Yup, I think this bug is fixed in Kubernetes 1.29 with this PR: kubernetes/kubernetes#119394

Datadog made this change too soon, and should have made it configurable.

@vboulineau (Contributor)

> Yup, I think this bug is fixed in Kubernetes 1.29 with this PR: kubernetes/kubernetes#119394
>
> Datadog made this change too soon, and should have made it configurable.

The existence of the Service is not configurable because it's harmless; it then depends on the sender whether to use it or not. When senders are configured through our admission controller, you can use clusterAgent.admissionController.configMode (https://github.com/DataDog/helm-charts/blob/main/charts/datadog/values.yaml#L1086) to choose hostip or socket.
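For example, a values snippet along these lines (a sketch; check the linked values.yaml for the exact accepted values, which include "hostip", "socket" and "service"):

clusterAgent:
  admissionController:
    enabled: true
    configMode: socket   # inject the UDS socket configuration instead of the Service DNS name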
