revise Datadog trace sampling configuration #10151

dgoffredo · 2023-06-29T16:24:25Z

Datadog customers have begun to report that trace sampling is not behaving as expected when using ingress-nginx.

With other Datadog integrations, the default sampling behavior is to consult the Datadog Agent for a sample rate, which changes dynamically. This way, trace volume can be centrally controlled. A customer may optionally specify a fixed sampling rate; but if they don't, the default behavior is to let the Datadog Agent figure it out.
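
As a rough illustration of that default (this is not code from any Datadog library), the choice between a fixed rate and the Agent-provided rate could be sketched as below. The function `effectiveSampleRate` and both of its parameters are hypothetical names, and the sketch ignores everything else a real tracer does when making a sampling decision.

```go
package main

import "fmt"

// effectiveSampleRate is a hypothetical helper, not part of any Datadog API.
// configured is nil when the user did not set a fixed rate. agentRate stands
// in for the most recent rate reported by the Datadog Agent.
func effectiveSampleRate(configured *float64, agentRate float64) float64 {
	if configured != nil {
		// An explicit, user-chosen rate always wins.
		return *configured
	}
	// No explicit rate: defer to whatever the Agent last reported.
	return agentRate
}

func main() {
	fixed := 0.42
	fmt.Println(effectiveSampleRate(nil, 0.1))    // falls back to the Agent's rate
	fmt.Println(effectiveSampleRate(&fixed, 0.1)) // uses the fixed rate
}
```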

I made a change in Datadog's library last March that changed the meaning of sample_rate in the library's configuration. sample_rate corresponds to DatadogSampleRate in ingress-nginx. Previously, sample_rate was ignored by Datadog's library. This was a bug, but not a severe one, because the concept of "sampling rules" had since superseded what sample_rate used to configure. My change in March repurposed sample_rate to mean "append a sampling rule that matches all traces."

What I overlooked was the fact that ingress-nginx still uses sample_rate, and that it always specifies a value for it in /etc/nginx/opentracing.json, defaulting to 1.0.

This means that Datadog customers, since my change, have no way to say "use the rates calculated by the Datadog Agent." They can set DatadogSampleRate, and if they don't, they still get 1.0 instead of the desired default behavior.

The changes that I propose in this PR should have been proposed last March, but I didn't then notice this interaction.

These changes remove the DatadogPrioritySampling flag (which has not done anything for quite a long time), and change the type of DatadogSampleRate from float32 to *float32. This way, the default value is nil rather than 1.0, and we can detect this when constructing /etc/nginx/opentracing.json.
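
To show why the pointer type matters, here is a minimal sketch (not the code in this PR) of reading an optional rate from ConfigMap-style data. The function name `sampleRateFromConfig` is hypothetical; the key `datadog-sample-rate` is the one used in the testing steps further down. With a plain float32 defaulting to 1.0, "unset" and "explicitly 1.0" would be indistinguishable.

```go
package main

import (
	"fmt"
	"strconv"
)

// sampleRateFromConfig is a hypothetical helper. It returns nil when the key
// is absent, and a pointer to the parsed value when the key is present.
func sampleRateFromConfig(data map[string]string) (*float32, error) {
	raw, ok := data["datadog-sample-rate"]
	if !ok {
		// Key not set: nil signals "let the Datadog Agent decide".
		return nil, nil
	}
	v, err := strconv.ParseFloat(raw, 32)
	if err != nil {
		return nil, fmt.Errorf("invalid datadog-sample-rate %q: %w", raw, err)
	}
	rate := float32(v)
	return &rate, nil
}

func main() {
	unset, _ := sampleRateFromConfig(map[string]string{})
	explicit, _ := sampleRateFromConfig(map[string]string{"datadog-sample-rate": "1.0"})
	// The two cases are now distinguishable even though the explicit value is 1.0.
	fmt.Println(unset == nil, explicit != nil)
}
```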

Conditionally including "sample_rate" in the generated JSON required me to rearrange the code that produces /etc/nginx/opentracing.json. Previously, the file content was chosen from one of multiple text/template templates. Such templates cannot, as far as I know, express conditionally included text based on the value of a pointer. Instead, I use encoding/json in a dedicated function to generate the Datadog JSON.
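
The idea can be sketched as follows. This is an illustration under stated assumptions rather than the function added in this PR: the type `datadogConfig` and the function `datadogJSON` are hypothetical names, and the JSON keys are the ones that appear in the /etc/nginx/opentracing.json output shown in the testing section. A nil pointer tagged with `omitempty` is left out of the encoded object.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// datadogConfig is a hypothetical struct mirroring the keys seen in
// /etc/nginx/opentracing.json. SampleRate is a pointer so nil means "unset".
type datadogConfig struct {
	Service               string   `json:"service"`
	AgentHost             string   `json:"agent_host"`
	AgentPort             int      `json:"agent_port"`
	Environment           string   `json:"environment"`
	OperationNameOverride string   `json:"operation_name_override"`
	SampleRate            *float32 `json:"sample_rate,omitempty"`
}

// datadogJSON encodes the configuration. "sample_rate" appears only when
// SampleRate is non-nil.
func datadogJSON(cfg datadogConfig) (string, error) {
	b, err := json.Marshal(cfg)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	cfg := datadogConfig{
		Service:               "nginx",
		AgentHost:             "172.18.0.2",
		AgentPort:             8126,
		Environment:           "prod",
		OperationNameOverride: "nginx.handle",
	}
	withoutRate, _ := datadogJSON(cfg)
	fmt.Println(withoutRate)

	rate := float32(0.42)
	cfg.SampleRate = &rate
	withRate, _ := datadogJSON(cfg)
	fmt.Println(withRate)
}
```

Because `omitempty` tests the pointer rather than the value it points to, an explicit rate of 0 would still be written out, which keeps "sample nothing" distinct from "unset".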

This will change the default sampling behavior of the Datadog integration, which is something that I'd like to mention in ingress-nginx's release notes should these changes be merged in their current form.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
CVE Report (Scanner found CVE and adding report)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation only

Which issue/s this PR fixes

The issue was not with ingress-nginx itself, but with a behavior change in ingress-nginx brought on by a change in dd-opentracing-cpp.

The concern was raised in Datadog's support channels.

How Has This Been Tested?

Manual integration testing involved a few Datadog-specific files:

agent.dockerfile: Dockerfile for the mock Datadog Agent.
agent.js: Node.js HTTP server used as a mock Datadog Agent.
agent.yaml: Kubernetes DaemonSet resource that uses the image built by agent.dockerfile.
httpbin.yaml: Example HTTP Service and a corresponding Ingress.

Run make dev-env.

Apply the files above (this involves loading the built Docker image into the cluster, similarly to what is done in make dev-env).

Now requests made to the host's port 80 will flow through the NGINX ingress to httpbin.

Edit the ingress controller's Deployment to expose the node's IP address. That's where the mock Datadog Agent will be listening (because it's a DaemonSet, there's an instance on each node):

# ...
        env:
        - name: HOST_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
# ...

Edit the ingress controller's ConfigMap to enable Datadog tracing:

apiVersion: v1
data:
  datadog-collector-host: $HOST_IP
  enable-opentracing: "true"
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx

Verify that httpbin's /headers endpoint shows Datadog tracing propagation headers:

ubuntu@dgoffredo-devbox:~$ curl 'http://localhost/headers'
{
  "headers": {
    "Accept": "*/*", 
    "Host": "localhost", 
    "User-Agent": "curl/7.81.0", 
    "X-Datadog-Parent-Id": "4176797220720185106", 
    "X-Datadog-Sampling-Priority": "1", 
    "X-Datadog-Tags": "_dd.p.dm=-0", 
    "X-Datadog-Trace-Id": "356006255528867060", 
    "X-Forwarded-Host": "localhost", 
    "X-Forwarded-Scheme": "http", 
    "X-Scheme": "http"
  }
}

Verify that the default /etc/nginx/opentracing.json omits "sample_rate":

ubuntu@dgoffredo-devbox:~$ kubectl -n ingress-nginx get pods
NAME                                       READY   STATUS      RESTARTS   AGE
ingress-nginx-admission-create-nk8q9       0/1     Completed   0          2d17h
ingress-nginx-admission-patch-p8qk7        0/1     Completed   0          2d17h
ingress-nginx-controller-56fc94fb8-zbkdg   1/1     Running     0          22h
ubuntu@dgoffredo-devbox:~$ kubectl -n ingress-nginx exec -it ingress-nginx-controller-56fc94fb8-zbkdg -- cat /etc/nginx/opentracing.json | jq
{
  "agent_host": "172.18.0.2",
  "agent_port": 8126,
  "environment": "prod",
  "operation_name_override": "nginx.handle",
  "service": "nginx"
}

Edit the ingress controller's ConfigMap to specify an explicit sampling rate:

apiVersion: v1
kind: ConfigMap
data:
  datadog-collector-host: $HOST_IP
  datadog-sample-rate: "0.42"
  enable-opentracing: "true"

Send another request to httpbin, and check the log output of the mock Datadog Agent. Verify that the tagged sampling rate (_dd.rule_psr) is as configured.

ubuntu@dgoffredo-devbox:~$ kubectl -n datadog logs --follow dd-trace-agent-tnsgj
[
  [
    {
      "name": "nginx.handle",
      "service": "nginx",
      "resource": "/",
      "type": "web",
      "start": 1688055351332243000,
      "duration": 2471860,
      "meta": {
        "http.url": "http://localhost/headers",
        "upstream.name": "upstream_balancer",
        "http.method": "GET",
        "http.status_code": "200",
        "http.host": "localhost",
        "peer.address": "172.18.0.1:50500",
        "nginx.worker_pid": "96",
        "http.status_line": "200 OK",
        "component": "nginx",
        "upstream.address": "10.244.0.8:80",
        "env": "prod",
        "operation": "/"
      },
      "metrics": {},
      "span_id": 1493341588522438100,
      "trace_id": 2107002113598878000,
      "parent_id": 2107002113598878000,
      "error": 0
    },
    {
      "name": "nginx.handle",
      "service": "nginx",
      "resource": "/",
      "type": "web",
      "start": 1688055351332218400,
      "duration": 2515420,
      "meta": {
        "http.url": "http://localhost/headers",
        "upstream.name": "upstream_balancer",
        "http.method": "GET",
        "nginx.worker_pid": "96",
        "_dd.p.dm": "-3",
        "component": "nginx",
        "http.status_line": "200 OK",
        "http.host": "localhost",
        "peer.address": "172.18.0.1:50500",
        "http.status_code": "200",
        "upstream.address": "10.244.0.8:80",
        "env": "prod",
        "operation": "/"
      },
      "metrics": {
        "_dd.rule_psr": 0.42,
        "_sampling_priority_v1": -1
      },
      "span_id": 2107002113598878000,
      "trace_id": 2107002113598878000,
      "parent_id": 0,
      "error": 0
    }
  ]
]

Note that it's 0.42, as expected.
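
I did this check by eye, but it could be automated. A minimal sketch, assuming the mock Agent logs the payload as a JSON array of traces, each an array of spans with a "metrics" object as in the log above. The type `span` and the function `rulePSR` are hypothetical names, not part of the test files listed earlier.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// span keeps only the field needed for this check. Other span fields in the
// payload are ignored by encoding/json.
type span struct {
	Metrics map[string]float64 `json:"metrics"`
}

// rulePSR returns the first "_dd.rule_psr" metric found in the payload and
// whether any span carried one.
func rulePSR(payload []byte) (float64, bool, error) {
	var traces [][]span
	if err := json.Unmarshal(payload, &traces); err != nil {
		return 0, false, err
	}
	for _, trace := range traces {
		for _, s := range trace {
			if rate, ok := s.Metrics["_dd.rule_psr"]; ok {
				return rate, true, nil
			}
		}
	}
	return 0, false, nil
}

func main() {
	// A trimmed-down payload shaped like the mock Agent's log output.
	payload := []byte(`[[{"name":"nginx.handle","metrics":{}},
		{"name":"nginx.handle","metrics":{"_dd.rule_psr":0.42,"_sampling_priority_v1":-1}}]]`)
	rate, found, err := rulePSR(payload)
	fmt.Println(rate, found, err)
}
```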

Checklist:

My change requires a change to the documentation.
I have updated the documentation accordingly.
I've read the CONTRIBUTION guide
I have added unit and/or e2e tests to cover my changes.
All new and existing tests passed.

netlify · 2023-06-29T16:24:29Z

✅ Deploy Preview for kubernetes-ingress-nginx canceled.

Name	Link
🔨 Latest commit	`7712ef2`
🔍 Latest deploy log	https://app.netlify.com/sites/kubernetes-ingress-nginx/deploys/649de8e56a2b53000758f287

k8s-ci-robot · 2023-06-29T16:24:33Z

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · 2023-06-29T16:24:34Z

Hi @dgoffredo. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

tao12345666333

/ok-to-test

Thanks for your contributions.

rikatz · 2023-07-06T23:42:11Z

Error apparently was due to a github action problem, triggering here

rikatz · 2023-07-06T23:45:00Z

/lgtm
/approve
Thanks!

k8s-ci-robot · 2023-07-06T23:45:11Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoffredo, rikatz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [rikatz]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

dgoffredo · 2023-07-07T17:33:23Z

Thanks, @rikatz!

strongjz · 2023-07-20T17:41:42Z

/cherry-pick release-1.8

k8s-infra-cherrypick-robot · 2023-07-20T17:42:19Z

@strongjz: new pull request created: #10224

In response to this:

/cherry-pick release-1.8

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

dgoffredo added 3 commits June 27, 2023 19:25

datadog: sample_rate omitted by default

5f44dc7

config: use *float32 with nil instead of float32 with sentinel value

14d8bb0

change some names

ab780b7

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 29, 2023

k8s-ci-robot requested review from cpanato and tao12345666333 June 29, 2023 16:24

k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jun 29, 2023

k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Jun 29, 2023

k8s-ci-robot added needs-priority size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 29, 2023

dgoffredo changed the title ~~David.goffredo/datadog revise sampling~~ revise Datadog trace sampling configuration Jun 29, 2023

dgoffredo mentioned this pull request Jun 29, 2023

datadog: sample_rate omitted by default dgoffredo/ingress-nginx#1

Closed

gofmt -s -w internal/ingress/controller/nginx.go

7712ef2

tao12345666333 reviewed Jun 30, 2023

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 30, 2023

tao12345666333 mentioned this pull request Jun 30, 2023

Deprecate Jaeger and Opentracing in favor of OpenTelemetry #8687

Closed

k8s-ci-robot assigned rikatz Jul 6, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 6, 2023

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 6, 2023

k8s-ci-robot merged commit 6d55e1f into kubernetes:main Jul 6, 2023

k8s-infra-cherrypick-robot mentioned this pull request Jul 20, 2023

[release-1.8] revise Datadog trace sampling configuration #10224

Merged
