UDP TX drops every 10min when rate exceeds 15-20Kbps #2338

Closed
liggetm opened this issue Apr 12, 2018 · 7 comments
Comments

@liggetm

liggetm commented Apr 12, 2018

NGINX Ingress controller version:
0.12.0
(from quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.12.0)

Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.4", GitCommit:"7243c69eb523aa4377bce883e7c0dd76b84709a1", GitTreeState:"clean", BuildDate:"2017-03-07T23:53:09Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.2", GitCommit:"269f928217957e7126dc87e6adfa82242bfe5b1e", GitTreeState:"clean", BuildDate:"2017-07-03T15:31:10Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration:
    Bare metal (on an HP Elitedesk 800 G3, i7, 32GB, 250GB SSD)

  • OS (e.g. from /etc/os-release):
    CentOS Atomic Host 1803

# atomic host status
State: idle
Deployments:
* centos-atomic-host:centos-atomic-host/7/x86_64/standard
                   Version: 7.1803 (2018-04-03 12:35:38)
                    Commit: cbb9dbf9c8697e9254f481fff8f399d6808cecbed0fa6cc24e659d2f50e05a3e
              GPGSignature: Valid signature by 64E3E7558572B59A319452AAF17E745691BA8335
# cat /etc/redhat-release
CentOS Linux release 7.4.1708 (Core)

What happened:
UDP TX traffic from the ingress controller drops every 10 minutes for approximately 90 seconds, even though UDP RX traffic remains constant. This appears to happen at UDP rates above 15-20 Kbps.

What you expected to happen:
No drops in traffic and no significant difference between RX and TX rates when using UDP, regardless of rate.

How to reproduce it (as minimally and precisely as possible):
Configure a valid upstream UDP host, then configure the ingress controller with a hostPort that points to the upstream host via a Kubernetes service. My config snippets:

apiVersion: v1
kind: ConfigMap
metadata:
  name: my-ingress-udp
data:
  "50000": "default/my-udp-svc:50000"

---

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: my-ingress-udp
spec:
  template:
    metadata:
      labels:
        my-app: my-ingress-udp
    spec:
      containers:
      - image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.12.0
        name: nginx-ingress-lb
        readinessProbe:
          httpGet:
            path: /healthz
            port: 10254
            scheme: HTTP
        livenessProbe:
          httpGet:
            path: /healthz
            port: 10254
            scheme: HTTP
          initialDelaySeconds: 10
          timeoutSeconds: 1
        # use downward API
        env:
          - name: POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: POD_NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
        ports:
        - containerPort: 50000
          hostPort: 50000
          protocol: UDP
        args:
        - /nginx-ingress-controller
        - --default-backend-service=$(POD_NAMESPACE)/my-backend-svc
        - --udp-services-configmap=$(POD_NAMESPACE)/my-ingress-udp

---

apiVersion: v1
kind: Service
metadata:
  name: my-udp-svc
spec:
  ports:
  - port: 50000
    name: telemetry
    protocol: UDP
    targetPort: telemetry
  selector:
    my-app: my-udp

---

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: my-udp
spec:
  replicas: 1
  template:
    metadata:
      labels:
        my-app: my-udp
    spec:
      containers:
      - name: my-udp
        image: registry:5000/my-udp-server:latest
        ports:
        - containerPort: 50000
          name: telemetry
          protocol: UDP

---

apiVersion: v1
kind: Service
metadata:
  name: my-backend-svc
spec:
  ports:
  - port: 80
    targetPort: 8080
  selector:
    my-app: my-ui

---

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: my-ui
spec:
  replicas: 1
  template:
    metadata:
      labels:
        my-app: my-ui
    spec:
      containers:
      - name: my-ui
        image: registry:5000/my-ui:latest
        ports:
        - containerPort: 8080
          protocol: TCP

Anything else we need to know:
I've uploaded an image from Grafana showing the TX drops over the course of an hour: https://imagebin.ca/v/3y7wU7cilhwF

@aledbf
Member

aledbf commented Apr 12, 2018

@liggetm please check the pod logs, searching for "reloading" and comparing the timestamps against the drops in traffic, to see if that's the issue.
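
For reference, something like the following should surface any reload events in the controller logs (the label selector matches the DaemonSet above; <ingress pod> is whatever name kubectl reports):

# list the controller pods, then search their logs for config reloads
kubectl get pods -l my-app=my-ingress-udp
kubectl logs <ingress pod> | grep -i reloading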

@liggetm
Author

liggetm commented Apr 12, 2018

@aledbf thanks for coming back to me. I didn't see any reloading at the same timestamps, but I did see alerts when I increased the logging verbosity.

16384 worker_connections are not enough while connecting to upstream, udp client

Looking at the documentation, it appears the default is 16384 worker_connections per worker process (with 1 worker process per CPU). I'll try increasing worker_connections, but I don't fully understand what it means in relation to UDP, given that it's connectionless.

@aledbf
Member

aledbf commented Apr 13, 2018

@liggetm please check the generated nginx.conf for the value of worker_rlimit_nofile.
You can do this with kubectl exec <ingress pod> cat /etc/nginx/nginx.conf

You can adjust worker_connections through the controller's configuration ConfigMap by setting max-worker-connections: XX.
This value cannot be higher than worker_rlimit_nofile.
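
For example, assuming the controller reads its configuration from a ConfigMap passed via --configmap (that flag isn't shown in the DaemonSet above, so the name below is only an illustration):

apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-configuration   # must match the name given to --configmap
data:
  max-worker-connections: "65536"   # keep at or below worker_rlimit_nofile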

@liggetm
Author

liggetm commented Apr 17, 2018

Thanks @aledbf - my config shows worker_rlimit_nofile 201874; - after setting max-worker-connections: 65536 the traffic issue appears to be resolved.
When you say that max-worker-connections cannot exceed worker_rlimit_nofile, do you mean the total number of worker connections (i.e. worker_processes * worker_connections)?

@aledbf
Member

aledbf commented Apr 17, 2018

do you mean the total number of worker connections (ie worker_processes * worker_connections)?

Yes.

Edit: worker_rlimit_nofile is per worker process

@aledbf
Member

aledbf commented Apr 17, 2018

Can we close this?

@liggetm
Author

liggetm commented Apr 17, 2018

Yes, thanks @aledbf for all your help!

liggetm closed this as completed Apr 17, 2018