
No logs available #112

Closed
Resisty opened this issue Mar 24, 2017 · 23 comments
Comments

@Resisty commented Mar 24, 2017

I have kube-state-metrics running as a deployment via ansible on my clusters:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: kube-state-metrics
spec:
  replicas: 4
  template:
    metadata:
      labels:
        app: kube-state-metrics
    spec:
      containers:
      - name: kube-state-metrics
        image: gcr.io/google_containers/kube-state-metrics:v0.3.0
        ports:
        - name: metrics
          containerPort: 8080
        resources:
          requests:
            memory: {{ kube_state_mem_req }}
            cpu: 100m
          limits:
            memory: {{ kube_state_mem_lim }}
            cpu: 200m

I've had to bump kube_state_mem_(req|lim) to 800Mi to keep the pods running; they had started getting OOMKilled and going into CrashLoopBackOff.

I'd like to know why, but the containers are basically inscrutable: there's no way to shell in, and docker logs shows nothing.

It'd be great if there were more information about what's going on, please and thanks!

@andrewhowdencom

Depending on your hosts, you could use something like strace to introspect that process from the host namespace (I'm like, 99% sure all processes are visible from the host PID namespace).
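For example, a rough sketch of that approach on a Docker-based node (the container ID is a placeholder you would look up with docker ps, and flags may vary by distro):

# Find the container's PID as seen from the host PID namespace
$ PID=$(docker inspect --format '{{.State.Pid}}' <kube-state-metrics-container-id>)

# Attach strace from the host, following child threads, with timestamps
$ sudo strace -f -tt -p "$PID"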

@feelobot

you could also use csysdig but I agree there should be logs

@brancz (Member) commented Mar 28, 2017

kube-state-metrics OOMing is likely due to the size of your cluster. It builds an in-memory cache of all objects in Kubernetes, so the larger your Kubernetes cluster, the larger your kube-state-metrics memory limit should be. Unfortunately we have not benchmarked this extensively enough to come up with a formula for how much memory a given number of objects needs, so I recommend running it on a large node without a limit, observing the memory usage, and setting the limit with a reasonable margin.
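For example (a sketch only: the monitoring namespace, the app=kube-state-metrics label, and the availability of kubectl top via Heapster/metrics-server are assumptions here), you could gauge how many objects the cache has to hold and watch real memory usage before settling on a limit:

# Rough counts of the objects the in-memory cache has to hold
$ kubectl get pods --all-namespaces --no-headers | wc -l
$ kubectl get deployments,daemonsets,replicasets --all-namespaces --no-headers | wc -l

# Observe actual memory usage while the pod runs without a limit
$ kubectl top pod -n monitoring -l app=kube-state-metrics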

I'm not sure I understand what kind of logs you are expecting when an application OOMs and is killed by the supervisor.

Which version are you using? Because there should be at least some logs.

@Resisty (Author) commented Mar 31, 2017

Hi @brancz,

We're running image gcr.io/google_containers/kube-state-metrics:v0.3.0. As for what kind of logs I'm expecting: literally anything.

@brancz (Member) commented Mar 31, 2017

There were actually some changes regarding logging in the latest release; the glog library was not properly configured before. Can you upgrade to the latest release, v0.4.1?
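A minimal sketch of that upgrade, assuming the deployment and container names from the manifest above:

$ kubectl set image deployment/kube-state-metrics \
    kube-state-metrics=gcr.io/google_containers/kube-state-metrics:v0.4.1
$ kubectl rollout status deployment/kube-state-metrics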

@andyxning (Member)

@brancz

It builds an in memory cache of all objects in kubernetes, so the larger your kubernetes instance, the larger your kube-state-metrics memory limit should be.

Does this mainly refer to the client-go cache?

@brancz (Member) commented Apr 5, 2017

@andyxning yes, the informers/informer-framework more specifically.

@brancz (Member) commented Apr 5, 2017

Does that solve your issue @Resisty ?

@gianrubio (Contributor)

Same issue here, v0.4.1

My deployment file comes from prometheus-operator

$ dmesg
....
[1164660.539073] Memory cgroup out of memory: Kill process 3957 (kube-state-metr) score 2256 or sacrifice child
[1164660.545571] Killed process 3957 (kube-state-metr) total-vm:78820kB, anon-rss:49484kB, file-rss:17916kB, shmem-rss:0kB
[1164961.936488] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[1164968.003317] kube-state-metr invoked oom-killer: gfp_mask=0x24000c0(GFP_KERNEL), nodemask=0, order=0, oom_score_adj=999
[1164968.010199] kube-state-metr cpuset=c7524ef49ac3935401240a79c28b0a6df741d15eedb1d318ef185c283901d842 mems_allowed=0
[1164968.016745] CPU: 3 PID: 9143 Comm: kube-state-metr Not tainted 4.9.9-coreos-r1 #1
[1164968.019067] Hardware name: Xen HVM domU, BIOS 4.2.amazon 11/11/2016
[1164968.019067]  ffffb97e440c3c50 ffffffffb431a933 ffffb97e440c3d78 ffff9615c1ec8000
[1164968.019067]  ffffb97e440c3cc8 ffffffffb420209e 0000000000000000 00000000000003e7
[1164968.019067]  ffff96154b4f6800 ffff96154b4f6800 0000000000000000 0000000000000001
[1164968.019067] Call Trace:
[1164968.019067]  [<ffffffffb431a933>] dump_stack+0x63/0x90
[1164968.019067]  [<ffffffffb420209e>] dump_header+0x7d/0x203
[1164968.019067]  [<ffffffffb418474c>] oom_kill_process+0x21c/0x3f0
[1164968.019067]  [<ffffffffb4184c1d>] out_of_memory+0x11d/0x4b0
[1164968.019067]  [<ffffffffb41f685b>] mem_cgroup_out_of_memory+0x4b/0x80
[1164968.019067]  [<ffffffffb41fc6d9>] mem_cgroup_oom_synchronize+0x2f9/0x320
[1164968.019067]  [<ffffffffb41f7390>] ? high_work_func+0x20/0x20
[1164968.019067]  [<ffffffffb4184fe6>] pagefault_out_of_memory+0x36/0x80
[1164968.019067]  [<ffffffffb40682bc>] mm_fault_error+0x8c/0x190
[1164968.019067]  [<ffffffffb4068b6f>] __do_page_fault+0x44f/0x4b0
[1164968.019067]  [<ffffffffb4068bf2>] do_page_fault+0x22/0x30
[1164968.019067]  [<ffffffffb45cfdb8>] page_fault+0x28/0x30
[1164968.079458] Task in /docker/c7524ef49ac3935401240a79c28b0a6df741d15eedb1d318ef185c283901d842 killed as a result of limit of /docker/c7524ef49ac3935401240a79c28b0a6df741d15eedb1d318ef185c283901d842
[1164968.090099] memory: usage 51200kB, limit 51200kB, failcnt 42
[1164968.093438] memory+swap: usage 51200kB, limit 9007199254740988kB, failcnt 0
[1164968.098076] kmem: usage 720kB, limit 9007199254740988kB, failcnt 0
$ kubectl get pods -n monitoring -o wide -l   app=kube-state-metrics
NAME                                  READY     STATUS             RESTARTS   AGE       IP              NODE
kube-state-metrics-1039847806-tzp7t   0/1       CrashLoopBackOff   15         59m       *    ip-*.eu-west-1.compute.internal
kube-state-metrics-1039847806-v3sn9   0/1       CrashLoopBackOff   15         59m
$ kubectl logs -n monitoring  -f kube-state-metrics-1039847806-tzp7t -p -f
I0410 14:16:05.550772       1 main.go:139] Using default collectors
I0410 14:16:05.551113       1 main.go:186] service account token present: true
I0410 14:16:05.551124       1 main.go:187] service host: https://**:443
I0410 14:16:05.551606       1 main.go:213] Testing communication with server
I0410 14:16:05.740980       1 main.go:218] Communication with server successful
I0410 14:16:05.741145       1 main.go:263] Active collectors: pods,nodes,resourcequotas,replicasets,daemonsets,deployments
I0410 14:16:05.741157       1 main.go:227] Starting metrics server: :8080

How did I fix it?

By changing the memory limit from 50Mi to 100Mi.

@chlunde (Contributor) commented Aug 22, 2017

@brancz I also had to increase the memory to 80 MiB on a 6-node OpenShift cluster. Perhaps the memory settings in the deployment should be bumped?

- --memory=30Mi
- --extra-memory=2Mi

limits:
  memory: 30Mi
requests:
  cpu: 100m
  memory: 30Mi

This does not match the README, which says:

Resource usage changes with the size of the cluster. As a general rule, you should allocate:

  • 200MiB memory
  • 0.1 cores

For clusters of more than 100 nodes, allocate at least:

  • 2MiB memory per node

Also, having tight default settings will become a problem whenever more object types are added without the requirements being updated (see the sketch below).
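For reference, a sketch of applying the README-recommended baseline with kubectl (the deployment name and the monitoring namespace are assumptions, setting the limit equal to the request is just one choice, and per-node headroom still has to be added on top for clusters over 100 nodes):

$ kubectl set resources deployment/kube-state-metrics -n monitoring \
    --requests=cpu=100m,memory=200Mi \
    --limits=cpu=100m,memory=200Mi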

@brancz (Member) commented Aug 23, 2017

I'm ok with bumping the request and limit. What do you think would be an appropriate start value then?

@matthiasr commented Aug 23, 2017

The ones we recommend in the README?

@brancz (Member) commented Aug 23, 2017

Yes I don't recall why we didn't do that in the first place.

@matthiasr

We came up with those independently of #200 and didn't backport them. I'm working on it; it's an easy change.

@smparkes commented Sep 6, 2017

We're running on a pretty small cluster (50 pods), and kube-state-metrics OOMs when I ask it to collect pods, even with the memory limit raised all the way up to 2G. It's fine if I only collect services (still trying other collectors). (This is with both 0.5.0 and 1.0.1.)

Sort of wonder how in blazes I would even begin to try to trace this ...

@smparkes commented Sep 6, 2017

Hrm ... quick follow-up: I did dump the goroutine list when the process was at low and high memory. The number of goroutines appears to be growing. But we've also been tracking anomalous latency in the API server (trying to get Google to look at that, since we don't run it ourselves (GKE)). Maybe overlapping requests because of the delays?

@brancz (Member) commented Sep 6, 2017

We don't today, but it's probably time to add pprof endpoints so we can do proper profiling to see what's happening.
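Once such endpoints exist, profiling would follow the standard Go workflow; the paths below assume the usual net/http/pprof handlers registered on the metrics port, which was not yet the case at the time of this comment:

# Interactive heap profile (assumes /debug/pprof is served on the metrics port)
$ go tool pprof http://<pod-ip>:8080/debug/pprof/heap

# Plain-text dump of all goroutine stacks
$ curl 'http://<pod-ip>:8080/debug/pprof/goroutine?debug=2'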

@andyxning (Member)

kube-state-metrics OOMs when I ask it to collect pods, even with the memory limit raised all the way up to 2G. It's fine if I only collect services (still trying other collectors). (This is with both 0.5.0 and 1.0.1.)

@smparkes a few questions to make sure I understand:

  • With only 50 pods to collect, does memory usage grow up to 2G?
  • Have you tried both 0.5.0 and 1.0.1?
  • How many service objects do you have?
  • What scrape interval is configured in Prometheus?

Maybe overlapping requests because of delays?

This problem is possibly related to client-go, because kube-state-metrics itself stores nothing in memory; the caching happens in client-go.

I did dump the goroutine list when the process was at low and high mem. The number of goroutines appears to growing.

Did you dump the goroutine list with pprof?
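(For the record, one way to get a goroutine dump without pprof endpoints is to send SIGQUIT to the Go process from the host: the runtime prints the goroutine stacks to stderr and then exits, and that output ends up in the container logs. A sketch, with the container ID as a placeholder:)

$ PID=$(docker inspect --format '{{.State.Pid}}' <kube-state-metrics-container-id>)
$ sudo kill -QUIT "$PID"
$ docker logs <kube-state-metrics-container-id> | tail -n 200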

@andyxning (Member)

Agreed with @brancz, we need to add pprof for debugging.

@smparkes commented Sep 9, 2017

We resolved the issue. We had 50 live pods, but thousands of errored Job pods (we're still not quite sure how those all accumulated / didn't get GC'd); one way to count them is sketched below.

So the API latency might not have been the root cause (though I still wonder whether the service issues concurrent requests when some requests take longer than the polling interval ... it's not clear whether the service should be hardened against that).

It would be a "nice to have" to log info on the progress of collection to help debug things like this, but the root cause was us ... and us not having monitoring on this, which is ironic :-)

Thanks guys!
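A sketch of how to spot that kind of buildup (the status.phase field selector assumes a reasonably recent kubectl and apiserver):

# Count completed/failed Job pods that were never cleaned up
$ kubectl get pods --all-namespaces --no-headers --field-selector=status.phase=Failed | wc -l
$ kubectl get pods --all-namespaces --no-headers --field-selector=status.phase=Succeeded | wc -l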

@andyxning (Member)

@brancz Does Prometheus scrape synchronously, i.e., does it wait for the previous scrape to finish before starting the next one?

@caesarxuchao Does client-go sync with the apiserver synchronously, i.e., does it wait for the previous sync to finish before starting the next one?

@andyxning (Member) commented Sep 10, 2017

@smparkes Actually the scrape logic is very simple, and it works synchronously. IMO, logging how many resource objects were analyzed would, to some degree, make debugging easier.

I have also made a PR to add pprof.

@andyxning (Member)

@smparkes I've added a PR that logs the number of collected resource objects. Ref #254.
