
datadog chart makes bad decision on Service internalTrafficPolicy setting in K8s/EKS 1.22 #625

Closed
medavisjr opened this issue May 16, 2022 · 13 comments
Labels: bug, chart/datadog

Comments


medavisjr commented May 16, 2022

Describe what happened:

Running on EKS 1.22, the Datadog chart automatically sets the agent Service's internal traffic policy to Local, on the assumption that the feature gate for this is beta and enabled by default in K8s 1.22+. This is incorrect: the feature is still in alpha state in 1.22.

This causes all requests to the agent Service that the Helm chart creates to fail on K8s 1.22, unless the ServiceInternalTrafficPolicy feature gate is enabled by the K8s admin.

What's worse, this is impossible on EKS, which does not support alpha feature gates and provides no way to enable them manually.

Describe what you expected:

The chart should have correct version-detection logic and should not deploy the Datadog agent Service resource with spec.internalTrafficPolicy: Local when running on K8s 1.22. On this version of K8s, this should be opt-in, not opt-out.
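For anyone checking their own cluster, the policy the chart actually rendered can be read straight off the Service. A minimal sketch, assuming the default Service name and namespace used later in this issue; on an affected 1.22 cluster this should print Local:

~
❯ kubectl -n datadog get service datadog-agent -o jsonpath='{.spec.internalTrafficPolicy}'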

Steps to reproduce the issue:

  1. Be in an EKS cluster at version 1.22:
~
❯ kubectl version --short=true
Client Version: v1.23.4
Server Version: v1.22.6-eks-7d68063

~
❯ kubectl get nodes
NAME                     STATUS   ROLES    AGE    VERSION
<ip>.ec2.internal   Ready    <none>   146m   v1.22.6-eks-7d68063
<ip>.ec2.internal   Ready    <none>   37d    v1.22.6-eks-7d68063
<ip>.ec2.internal   Ready    <none>   146m   v1.22.6-eks-7d68063
<ip>.ec2.internal    Ready    <none>   38d    v1.22.6-eks-7d68063
<ip>.ec2.internal   Ready    <none>   8h     v1.22.6-eks-7d68063
<ip>.ec2.internal    Ready    <none>   38d    v1.22.6-eks-7d68063
<ip>.ec2.internal    Ready    <none>   146m   v1.22.6-eks-7d68063
<ip>.ec2.internal    Ready    <none>   146m   v1.22.6-eks-7d68063
<ip>.ec2.internal    Ready    <none>   38d    v1.22.6-eks-7d68063
<ip>.ec2.internal     Ready    <none>   137m   v1.22.6-eks-7d68063
  2. Deploy the Datadog Helm chart with no custom values for agents.localService.*. Full values file contents for the curious:
registry: public.ecr.aws/datadog

datadog:
  clusterName: <redacted>
  criSocketPath: /var/run/containerd/containerd.sock
  dogstatsd:
    port: 8125
    nonLocalTraffic: true
  apm:
    portEnabled: true
    port: 8126
  env:
  - name: DD_AUTOCONFIG_INCLUDE_FEATURES
    value: "containerd"
  logs:
    enabled: true
    containerCollectAll: true

clusterAgent:
  enabled: true
  rbac:
    create: true

agents:
  podSecurity:
    apparmor:
      enabled: false
  3. Try to open a TCP connection or send a UDP datagram from a Datadog agent pod (or any other pod) to the Datadog agent Service, and watch it fail:
~
❯ kubectl -n datadog exec -it datadog-agent-cluster-agent-749f4d6c5-wxmpc -- /bin/bash
root@datadog-agent-cluster-agent-749f4d6c5-wxmpc:/# telnet datadog-agent.datadog 8126
Trying 172.20.94.37...
telnet: connect to address 172.20.94.37: Connection timed out
  4. Manually edit the Datadog agent Service and change the spec.internalTrafficPolicy value from Local to Cluster (see the kubectl patch sketch after these steps):
~/code/tf-aws-re/sovrn-aws-core/core/us-east-1/eks/worker_nodes/spot_instance/re master
❯ kubectl -n datadog get service datadog-agent -o yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    meta.helm.sh/release-name: datadog-agent
    meta.helm.sh/release-namespace: datadog
  creationTimestamp: "2022-04-12T22:55:28Z"
  labels:
    app: datadog-agent
    app.kubernetes.io/instance: datadog-agent
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: datadog-agent
    app.kubernetes.io/version: "7"
    chart: datadog-2.33.4
    helm.sh/chart: datadog-2.33.4
    heritage: Helm
    release: datadog-agent
  name: datadog-agent
  namespace: datadog
  resourceVersion: "431532510"
  uid: dfe06d19-b496-493b-81b8-6e5f2b3bf85e
spec:
  clusterIP: 172.20.94.37
  clusterIPs:
  - 172.20.94.37
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: dogstatsd
    port: 8125
    protocol: UDP
    targetPort: 8125
  - name: traceport
    port: 8126
    protocol: TCP
    targetPort: 8126
  selector:
    app: datadog-agent
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
  5. Try the request again and see that it works:
~
❯ kubectl -n datadog exec -it datadog-agent-cluster-agent-749f4d6c5-wxmpc -- /bin/bash
root@datadog-agent-cluster-agent-749f4d6c5-wxmpc:/# telnet datadog-agent.datadog 8126
Trying 172.20.94.37...
Connected to datadog-agent.datadog.svc.cluster.local.
Escape character is '^]'.
^]
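The manual edit in step 4 can also be done with a one-line patch. This is a sketch only: because the Service is chart-managed, Helm will re-apply internalTrafficPolicy: Local on the next upgrade of the release.

~
# one-off workaround: flip the chart-managed Service back to Cluster routing
❯ kubectl -n datadog patch service datadog-agent --type merge -p '{"spec":{"internalTrafficPolicy":"Cluster"}}'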

Additional environment details (Operating System, Cloud provider, etc):

AWS EKS 1.22 on Bottlerocket w/ containerd

@clamoriniere (Collaborator)

Hi @rodalli ,

Maybe the doc is not up to date, but I checked the release notes again: the feature moved to beta thanks to this PR: kubernetes/kubernetes#103462

I'm guessing the issue may be something else. Could you confirm that a DaemonSet datadog-agent pod is running on the same node as the cluster-agent pod from which you ran the telnet command? Switching to Cluster means that any datadog-agent pod could have received the TCP message.
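(A quick way to check that co-location, as a sketch; the cluster-agent label and <node-name> below are assumptions based on the release name used in this issue:)

~
# where is the cluster-agent pod scheduled?
❯ kubectl -n datadog get pods -l app=datadog-agent-cluster-agent -o wide
# is there an agent pod on that same node? (replace <node-name> with the node from the previous output)
❯ kubectl -n datadog get pods -l app.kubernetes.io/component=agent -o wide --field-selector spec.nodeName=<node-name>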

clamoriniere added the chart/datadog label on May 16, 2022

medavisjr commented May 16, 2022

Yes, there is a DaemonSet running, and the Service has the correct endpoints on each node in the cluster.

~
❯ kubectl get daemonset -n datadog
NAME            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
datadog-agent   10        10        10      10           10          kubernetes.io/os=linux   33d

~
❯ kubectl get pods -n datadog -l app.kubernetes.io/component=agent
NAME                  READY   STATUS    RESTARTS   AGE
datadog-agent-2bxxs   3/3     Running   0          3h32m
datadog-agent-7sbvd   3/3     Running   0          9h
datadog-agent-9dfmr   3/3     Running   0          3h23m
datadog-agent-d76dl   3/3     Running   0          3h32m
datadog-agent-dw4w4   3/3     Running   0          4d21h
datadog-agent-jx2pn   3/3     Running   0          4d21h
datadog-agent-n2d82   3/3     Running   0          4d21h
datadog-agent-r6vqh   3/3     Running   0          3h32m
datadog-agent-t825v   3/3     Running   0          3h32m
datadog-agent-zdvs5   3/3     Running   0          4d21h

I know for sure the problem isn't that the pods are missing or misconfigured, as I've been troubleshooting this issue with Datadog support for over a week now.

Now I'm starting to wonder if for some reason AWS didn't get the memo on ServiceInternalTrafficPolicy going to beta status in 1.22. I'm trying to confirm whether or not it's enabled in our 1.22.6 cluster. More info to come.


medavisjr commented May 16, 2022

As far as I can tell, the feature is enabled by default in 1.22. Even though the doc I originally linked says otherwise, the feature gate doc for v1.22 states that it is in "beta" stage and enabled by default in 1.22. However, if somehow it's not, AWS EKS v1.22 doesn't enable it in its feature flags. Here's the relevant API server log line from my EKS 1.22 cluster:

2022-05-16T12:09:16.000-06:00 | I0516 18:09:16.937580 9 flags.go:59] FLAG: --feature-gates="CSIServiceAccountToken=true,ExternalKeyService=true,TTLAfterFinished=true"
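(For reference, assuming API server logging is enabled on the cluster, that flag line can be pulled from CloudWatch with something along these lines; the cluster name is a placeholder:)

~
# search the EKS control plane log group for the API server's --feature-gates flag
❯ aws logs filter-log-events --log-group-name "/aws/eks/<cluster-name>/cluster" --filter-pattern '"feature-gates"' --query 'events[].message' --output text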


medavisjr commented May 16, 2022

Regardless of what the default behavior is or isn't in EKS 1.22, it seems like bad design to force internalTrafficPolicy: Local on the agent Service. Yes, all things being equal, it is the better choice.

But if the setting causes issues in a cluster (as it appears to in mine), the chart doesn't provide a values setting to choose between Local and Cluster on the Service. Instead, it detects the Kubernetes version and makes the decision on its own, without giving the chart user any control over this behavior.

@clamoriniere (Collaborator)

We do have an option to disable the Service creation, but we don't want to use the Service with the Cluster option. I'll let you read this comment that I made in another issue, which explains why.

If the Local traffic policy is not available, the two other solutions are the hostPort or the UDS socket. Either way, it is very important to target the agent on the same node to get all the features working as expected.
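For anyone who needs those alternatives, a rough values sketch for the hostPort and UDS options looks like this (key names should be double-checked against the chart's values.yaml for the chart version in use):

datadog:
  dogstatsd:
    useHostPort: true        # expose DogStatsD on the node's host port (8125)
    useSocketVolume: true    # and/or mount the DogStatsD UDS socket into application pods
  apm:
    portEnabled: true        # expose the trace agent on the node's host port (8126)
    socketEnabled: true      # and/or use the APM UDS socket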

@medavisjr (Author)

Gotcha, so this is actually a hard requirement. That makes sense.

Back to the drawing board on why this doesn't seem to be working as expected in my EKS 1.22 cluster, I suppose.

@clamoriniere (Collaborator)

Unfortunately, yes.

Could you please contact our support to better track the issue and have someone try to reproduce the problem on EKS 1.22? 🙇

clamoriniere added the bug label on May 16, 2022
@anthonyralston

@rodalli Were you able to address this issue on EKS 1.22 in the end?

@medavisjr (Author)

> @rodalli Were you able to address this issue on EKS 1.22 in the end?

No, the Datadog Support team and I were not able to figure out the issue. I actually have a support case open with AWS now. There's no definitive answer yet, but it seems like it might have something to do with self-managed nodes vs. EKS managed node groups (where internalTrafficPolicy: Local seems to work fine).

@adrianmoisey

We seem to be having similar problems to this issue.
It seems as though internalTrafficPolicy: Local sometimes doesn't work as expected.
If we roll out a change to our Datadog DaemonSet, we end up with missing metrics.

We believe that we're hitting a bug in Kubernetes that causes it not to delete the conntrack entry for traffic on a stale connection.
It seems like this bug exists in various versions of Kubernetes, and their issue tracker doesn't give a clear indication of where it has and hasn't been fixed.
Here are some of the useful issues we've found:

We can also reliably reproduce this issue in all of our Kubernetes clusters, on both AWS and GKE.

I have opened an issue with Datadog support too.
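In case it helps anyone debugging the stale-connection theory, the conntrack entries for the agent Service can be inspected, and cleared, from an affected node. A sketch only, assuming conntrack-tools is installed on the node; <service-cluster-ip> is the agent Service's ClusterIP:

~
# list tracked connections destined for the agent Service ClusterIP
❯ conntrack -L -d <service-cluster-ip>
# delete stale entries so new traffic re-resolves to a live endpoint
❯ conntrack -D -d <service-cluster-ip>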


jukie commented Jun 5, 2024

@adrianmoisey did you happen to find anything? I think I'm hitting the same issue

@adrianmoisey

Yup, I think this bug is fixed in Kubernetes 1.29 with this PR: kubernetes/kubernetes#119394

Datadog made this change too soon, and should have made it configurable.

@vboulineau (Contributor)

> Yup, I think this bug is fixed in Kubernetes 1.29 with this PR: kubernetes/kubernetes#119394
>
> Datadog made this change too soon, and should have made it configurable.

The existence of the Service is not configurable because it's harmless; it then depends on the sender whether to use it or not. When senders are configured through our admission controller, you can use clusterAgent.admissionController.configMode (https://github.com/DataDog/helm-charts/blob/main/charts/datadog/values.yaml#L1086) to choose hostip or socket.
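For example, a values snippet along these lines (a sketch; check the linked values.yaml for the exact accepted values, which include "hostip", "socket" and "service"):

clusterAgent:
  admissionController:
    enabled: true
    configMode: socket   # inject the UDS socket configuration instead of the Service DNS name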
