
Prometheus ServiceMonitor failing to scrape operator metrics served through kube-rbac-proxy HTTPS 8443 port #4764

Closed
slopezz opened this issue Apr 14, 2021 · 8 comments
Labels
kind/documentation Categorizes issue or PR as related to documentation. language/ansible Issue is related to an Ansible operator project lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. triage/needs-information Indicates an issue needs more information in order to work on it.
Comments

@slopezz

slopezz commented Apr 14, 2021

Bug Report

I'm using operator-sdk 1.5.0 and I'm trying to gather operator metrics without success.

What did you do?

Deployed a default operator-sdk v1.5.0 project with Prometheus metrics enabled at the kustomize config level (config/default/kustomization.yaml). I'm using kube-rbac-proxy:v0.5.0 because of issue #4684, but I don't think that affects this.
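
(For context, "enabled at the kustomize config level" just means uncommenting the prometheus entry in config/default/kustomization.yaml; roughly, as a sketch of the scaffolded file:)

# config/default/kustomization.yaml (excerpt)
bases:
- ../crd
- ../rbac
- ../manager
# [PROMETHEUS] To enable prometheus monitor, uncomment all sections with 'PROMETHEUS'.
- ../prometheus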

  • The expected controller-manager Deployment is created, with kube-rbac-proxy serving metrics on port 8443 (I copy/pasted only the relevant parts of the deployed YAML):
kind: Deployment
apiVersion: apps/v1
metadata:
  name: prometheus-exporter-operator-controller-manager
  namespace: prometheus-exporter
spec:
  template:
    spec:
      containers:
        - name: kube-rbac-proxy
          image: 'gcr.io/kubebuilder/kube-rbac-proxy:v0.5.0'
          args:
            - '--secure-listen-address=0.0.0.0:8443'
            - '--upstream=http://127.0.0.1:8080/'
            - '--logtostderr=true'
            - '--v=10'
          ports:
            - name: https
              containerPort: 8443
              protocol: TCP
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
        - name: manager
...
          env:
            - name: ANSIBLE_GATHERING
              value: explicit
            - name: WATCH_NAMESPACE
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: 'metadata.annotations[''olm.targetNamespaces'']'
          imagePullPolicy: IfNotPresent
          terminationMessagePolicy: File
          image: 'quay.io/3scale/prometheus-exporter-operator:v0.3.0'
          args:
            - '--metrics-addr=127.0.0.1:8080'
            - '--enable-leader-election'
            - '--leader-election-id=prometheus-exporter-operator'
...   
  • The expected metrics Service is created:
kind: Service
apiVersion: v1
metadata:
  name: prometheus-exporter-operator-controller-manager-metrics-service
  namespace: prometheus-exporter
  labels:
    control-plane: controller-manager
    operators.coreos.com/prometheus-exporter-operator.prometheus-exporter: ''
spec:
  ports:
    - name: https
      protocol: TCP
      port: 8443
      targetPort: https
  selector:
    control-plane: controller-manager
  clusterIP: 172.30.117.225
  type: ClusterIP
  sessionAffinity: None
  • The expected ServiceMonitor is created:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prometheus-exporter-operator-controller-manager-metrics-monitor
  namespace: prometheus-exporter
  labels:
    control-plane: controller-manager
spec:
  endpoints:
    - path: /metrics
      port: https
  selector:
    matchLabels:
      control-plane: controller-manager

What did you expect to see?

The ServiceMonitor successfully scrapes the operator metrics (and so the metric up=1).

What did you see instead? Under which circumstances?

The ServiceMonitor fails to scrape (metric up=0):

up{container="kube-rbac-proxy",endpoint="https",instance="10.129.2.246:8443",job="prometheus-exporter-operator-controller-manager-metrics-service",namespace="prometheus-exporter",pod="prometheus-exporter-operator-controller-manager-669f6fbdcc2jbm7",prometheus="openshift-user-workload-monitoring/user-workload",service="prometheus-exporter-operator-controller-manager-metrics-service"} | 0

Environment

Operator type:

/language ansible

Kubernetes cluster type: Openshift v4.6

$ operator-sdk version

operator-sdk version: "v1.5.0", commit: "98f30d59ade2d911a7a8c76f0169a7de0dec37a0", kubernetes version: "1.19.4", go version: "go1.15.5", GOOS: "linux", GOARCH: "amd64"

$ kubectl version

Client Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.13-dispatcher", GitCommit:"fd22db44e150011eccc8729db223945384460143", GitTreeState:"clean", BuildDate:"2020-07-24T07:27:52Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.0+bd9e442", GitCommit:"bd9e4421804c212e6ac7ee074050096f08dda543", GitTreeState:"clean", BuildDate:"2021-02-11T23:05:38Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}

Possible Solution

N/A

Additional context

If I exec into the manager container of the controller-manager pod, I can check the metrics served on the manager's protected port 8080 (only reachable on 127.0.0.1):

$ kubectl exec -it prometheus-exporter-operator-controller-manager-669f6fbdcc2jbm7 -c manager -- /bin/bash

bash-4.4$ curl 127.0.0.1:8080/metrics
# HELP ansible_operator_build_info Build information for the ansible-operator binary
# TYPE ansible_operator_build_info gauge
ansible_operator_build_info{commit="98f30d59ade2d911a7a8c76f0169a7de0dec37a0",version="v1.4.0+git"} 1
# HELP ansible_operator_reconcile_result Gauge of reconciles and their results.
# TYPE ansible_operator_reconcile_result gauge
ansible_operator_reconcile_result{GVK="monitoring.3scale.net/v1alpha1, Kind=PrometheusExporter",result="succeeded"} 6
# HELP ansible_operator_reconciles How long in seconds a reconcile takes.
# TYPE ansible_operator_reconciles histogram
ansible_operator_reconciles_bucket{GVK="monitoring.3scale.net/v1alpha1, Kind=PrometheusExporter",le="0.005"} 6
...

However, if I try to access the port published through kube-rbac-proxy (8443), it fails with both the HTTP and HTTPS schemes, which I guess is what Prometheus does with the deployed ServiceMonitor, and why the scrape fails:

bash-4.4$ curl 127.0.0.1:8443/metrics
Client sent an HTTP request to an HTTPS server.


bash-4.4$ curl https://127.0.0.1:8443/metrics
curl: (60) SSL certificate problem: self signed certificate in certificate chain
More details here: https://curl.haxx.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.

Maybe there are 2 problems?

  • The current ServiceMonitor tries to scrape using the HTTP scheme, but the port is serving HTTPS?
  • In addition, even if the ServiceMonitor used the HTTPS scheme (which it currently doesn't), the certificate is self-signed, so maybe Prometheus would refuse it anyway? (A quick verification sketch follows below.)
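
For reference, a minimal way to check both points at once from inside the pod (a sketch; the pod's own ServiceAccount token is used here, and that ServiceAccount must be allowed to get the /metrics nonResourceURL, e.g. be bound to the scaffolded metrics-reader ClusterRole):

bash-4.4$ TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
# -k skips verification of the proxy's self-signed certificate; the Authorization
# header is needed because kube-rbac-proxy authenticates and authorizes each request
bash-4.4$ curl -k -H "Authorization: Bearer ${TOKEN}" https://127.0.0.1:8443/metrics
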
@openshift-ci-robot openshift-ci-robot added the language/ansible Issue is related to an Ansible operator project label Apr 14, 2021
@criscola

criscola commented Apr 15, 2021

Hello, I had the exact same problem and it gave me a lot of headaches. Try the following:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prometheus-exporter-operator-controller-manager-metrics-monitor
  namespace: prometheus-exporter
  labels:
    control-plane: controller-manager
spec:
  endpoints:
    - path: /metrics
      port: https
      scheme: https
      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      tlsConfig:
        insecureSkipVerify: true
  selector:
    matchLabels:
      control-plane: controller-manager

@slopezz
Author

slopezz commented Apr 15, 2021

Thanks for posting that solution @criscola; your suggestion makes total sense.

I have applied that change and deployed it:

$ git diff
diff --git a/config/prometheus/monitor.yaml b/config/prometheus/monitor.yaml
index 1b44d4f..a5bd8b1 100644
--- a/config/prometheus/monitor.yaml
+++ b/config/prometheus/monitor.yaml
@@ -11,6 +11,10 @@ spec:
   endpoints:
     - path: /metrics
       port: https
+      scheme: https
+      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
+      tlsConfig:
+        insecureSkipVerify: true
   selector:
     matchLabels:
       control-plane: controller-manager



$ make deploy 
cd config/manager && /home/slopez/bin/kustomize edit set image controller=quay.io/3scale/prometheus-exporter-operator:v0.3.0
/home/slopez/bin/kustomize build config/manual | kubectl apply -f -
namespace/prometheus-exporter-operator-system created
customresourcedefinition.apiextensions.k8s.io/prometheusexporters.monitoring.3scale.net created
serviceaccount/prometheus-exporter-operator-controller-manager created
role.rbac.authorization.k8s.io/prometheus-exporter-operator-leader-election-role created
role.rbac.authorization.k8s.io/prometheus-exporter-operator-manager-role created
clusterrole.rbac.authorization.k8s.io/prometheus-exporter-operator-metrics-reader created
clusterrole.rbac.authorization.k8s.io/prometheus-exporter-operator-proxy-role created
rolebinding.rbac.authorization.k8s.io/prometheus-exporter-operator-leader-election-rolebinding created
rolebinding.rbac.authorization.k8s.io/prometheus-exporter-operator-manager-rolebinding created
clusterrolebinding.rbac.authorization.k8s.io/prometheus-exporter-operator-proxy-rolebinding created
service/prometheus-exporter-operator-controller-manager-metrics-service created
deployment.apps/prometheus-exporter-operator-controller-manager created
servicemonitor.monitoring.coreos.com/prometheus-exporter-operator-controller-manager-metrics-monitor created

But now Prometheus does not scrape the operator at all. Before, I had up=0 because the target was down, but now there is nothing, as if Prometheus were ignoring that ServiceMonitor after applying those 3 changes.

It might be caused by a totally unrelated problem with the monitoring stack I'm using, which is the OpenShift user-workload-monitoring stack (let's say, the official way of monitoring user workloads on OpenShift).

If I get into a Prometheus pod from the OpenShift user-workload-monitoring stack (for example the config-reloader container) and execute the curl request that Prometheus should make given the configured ServiceMonitor, it works fine and I can get the metrics:

$ oc project openshift-user-workload-monitoring
Now using project "openshift-user-workload-monitoring" on server "https://api.....net:6443".

$ oc get pods
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-849fdfdcb5-ktqjd   2/2     Running   0          29d
prometheus-user-workload-0             4/4     Running   1          29d
prometheus-user-workload-1             4/4     Running   1          2d22h
thanos-ruler-user-workload-0           3/3     Running   3          29d
thanos-ruler-user-workload-1           3/3     Running   3          29d


$ kubectl exec -it prometheus-user-workload-1 -c config-reloader -- /bin/bash

bash-4.4$ cat /var/run/secrets/kubernetes.io/serviceaccount/token
qy......xAba
 
bash-4.4$ curl --insecure https://prometheus-exporter-operator-controller-manager-metrics-service.prometheus-exporter-operator-system.svc.cluster.local:8443/metrics -H "Authorization: Bearer qy......xAba"
# HELP ansible_operator_build_info Build information for the ansible-operator binary
# TYPE ansible_operator_build_info gauge
ansible_operator_build_info{commit="98f30d59ade2d911a7a8c76f0169a7de0dec37a0",version="v1.4.0+git"} 1
# HELP controller_runtime_active_workers Number of currently used workers per controller
# TYPE controller_runtime_active_workers gauge
controller_runtime_active_workers{controller="prometheusexporter-controller"} 0
....

However, it seems that Prometheus is ignoring the ServiceMonitor because it has the bearerTokenFile field defined. If I remove that field from the ServiceMonitor, the Prometheus scrape fails (as expected), but at least I can see that Prometheus is trying to scrape it and reports up=0; once I add bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token there is nothing.
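
One way to see why Prometheus silently drops a ServiceMonitor is to check the prometheus-operator logs rather than Prometheus itself (a sketch, assuming the UWM Deployment and container are both named prometheus-operator):

$ oc -n openshift-user-workload-monitoring logs deployment/prometheus-operator -c prometheus-operator | grep -i servicemonitor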

@slopezz
Author

slopezz commented Apr 16, 2021

I have checked that the new operator-sdk v1.6.1 (released yesterday) already implements the suggested ServiceMonitor modifications from #4680:

+      scheme: https
+      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
+      tlsConfig:
+        insecureSkipVerify: true

And I have found that my particular problem is caused by the OpenShift User Workload Monitoring stack: if I look at the prometheus-operator pod logs, I can see the following warning:

level=warn ts=2021-04-15T14:55:59.541363474Z caller=operator.go:1636 component=prometheusoperator msg="skipping servicemonitor" error="it accesses file system via bearer token file which Prometheus specification prohibits" servicemonitor=prometheus-exporter-operator-system/prometheus-exporter-operator-controller-manager-metrics-monitor namespace=openshift-user-workload-monitoring prometheus=user-workload 

So the ServiceMonitor with the bearerTokenFile field is ignored (skipped) because the Prometheus configuration prohibits it.

After discussing it with the OpenShift monitoring team: it is skipped because of arbitraryFSAccessThroughSMs, which is configured to deny arbitrary file-system access in order to limit potential security issues (so that scraped targets cannot get access to the Prometheus service account's token).
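
For reference, this is roughly what that setting looks like on the Prometheus custom resource (a sketch based on the prometheus-operator API; the exact UWM configuration may differ):

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: user-workload
spec:
  # when deny is true, ServiceMonitors that reference files on the Prometheus
  # pod filesystem (such as bearerTokenFile) are skipped by the operator
  arbitraryFSAccessThroughSMs:
    deny: true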

They suggested I maybe use bearerTokenSecret instead of bearerTokenFile, but after doing a couple of tests myself (and without being an expert on the matter), I see a couple of issues:

  • On one side, bearerTokenSecret requires the name of a Secret, and the Secrets holding ServiceAccount tokens have random object names (so there is no way to know the name of the Secret before creating the ServiceAccount, and operator-sdk requires predictable object names for the scaffolding).
  • In addition, operators normally use Roles (not ClusterRoles), and it seems the required permission (nonResourceURLs: - /metrics, verbs: - get) can only be added to a ClusterRole (see the sketch below).
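
For reference on the second point, the permission in question is the one already scaffolded in the metrics-reader ClusterRole created by make deploy above (shown here as a sketch):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-exporter-operator-metrics-reader
rules:
  # nonResourceURLs can only be granted in a ClusterRole, not in a namespaced Role
  - nonResourceURLs:
      - /metrics
    verbs:
      - get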

In addition, the OpenShift monitoring team told me that we should bear in mind that using bearer tokens for metrics authentication puts additional load on the API server, and that they are looking at replacing this with client TLS auth in the future (it's being discussed in openshift/enhancements#701).

For the moment I will just remove the proxy in front of the operator (to be able to access the operator metrics without any problem using the OpenShift UWM).

So from my point of view the issue can be closed now (there is no problem with operator-sdk), but I will let the operator-sdk team decide what to do, because the current ServiceMonitor definition won't work on OpenShift User Workload Monitoring (the official monitoring stack from OpenShift). And maybe I'm missing something: can you think of a way to authenticate to the metrics endpoint that doesn't require a ClusterRole or access to a generated Secret?
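
For completeness, removing the proxy means roughly: set the manager's --metrics-addr to 0.0.0.0:8080 instead of 127.0.0.1:8080, drop the kube-rbac-proxy sidecar, and point the Service and ServiceMonitor at the plain HTTP port. A minimal sketch (names reuse the ones above; this is not the scaffolded default):

apiVersion: v1
kind: Service
metadata:
  name: prometheus-exporter-operator-controller-manager-metrics-service
spec:
  ports:
    - name: http
      port: 8080
      targetPort: 8080
  selector:
    control-plane: controller-manager
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prometheus-exporter-operator-controller-manager-metrics-monitor
spec:
  endpoints:
    - path: /metrics
      port: http
  selector:
    matchLabels:
      control-plane: controller-manager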

@kensipe kensipe added kind/documentation Categorizes issue or PR as related to documentation. triage/needs-information Indicates an issue needs more information in order to work on it. labels Apr 19, 2021
@kensipe kensipe added this to the Backlog milestone Apr 19, 2021
@camilamacedo86
Contributor

Just to share for whoever is able to check this out and help here: the PR kubernetes-sigs/kubebuilder#2065 changes the related scaffolds, so it might be worth checking this against the latest scaffold as well.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 18, 2021
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 18, 2021
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this as completed Sep 17, 2021
@openshift-ci

openshift-ci bot commented Sep 17, 2021

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

redhatHameed added a commit to redhatHameed/dbaas-operator that referenced this issue Oct 4, 2021
redhatHameed added a commit to redhatHameed/dbaas-operator that referenced this issue Oct 6, 2021
joelddiaz pushed a commit to mondoohq/mondoo-operator that referenced this issue Mar 18, 2022
Using a ServiceMonitor with the bearerTokenFile parameter set causes the
ServiceMonitor to be rejected by the OpenShift user monitoring stack (
operator-framework/operator-sdk#4764 ).

As there is nothing sensitive in the mondoo-operator metrics, just
expose them directly to allow metrics to work under the built-in
OpenShift user metrics monitoring stack.

Add the ability to set some labels on the ServiceMonitor to allow a
functional metrics collection with an out-of-the-box prometheus deployed
as configured in
https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
.

Change the kustomize generation so that the kube-rbac-proxy sidecar
container is no longer defined. It really only exists to protect
metrics. Introduce new Service to expose the new metrics ports. Patch
the default Deployment to expose the metrics port. A side benefit of
this is that you don't need to specify the container name when
displaying logs for mondoo-operator as there is now only a single
container.

Signed-off-by: Joel Diaz <[email protected]>
chris-rock pushed a commit to mondoohq/mondoo-operator that referenced this issue Mar 19, 2022
* Expose metrics for prometheus
* Added Status

Signed-off-by: Harsha <[email protected]>

* migrate to using new MondooOperatorConfig for metrics

Rather than put the metrics config into the MondooAuditConfig (which is
really for configuring monitoring-specific settings), create a new
MondooOperatorConfig CRD which is cluster-scoped which can be used to
configure operator-wide behavior of the mondoo-operator.

In a cluster with multiple MondooAuditConfigs, it makes no sense to have
one resource with metrics.enabled = true and a different one with
metrics.enabled = false. So just allow a single MondooOperatorConfig to
hold the cluster-wide metrics configuration for the mondoo-operator.

Take the existing ServiceMonitor handling code and call it from the new
mondoooperatorconfig controller.

Extend the MondooOperatorConfig status to hold a list of conditions, and
use this to communicate status for when metrics is enabled, but we
couldn't find Prometheus installed on the cluster.

The conditions handling is written so that a Condition only appears
initially if the Condition.Status is set to True. This means that if you
enable metrics, and Prometheus is found, there will be no
Condition[].Type = PrometheusMissing with .Status = False. Only when
Prometheus is missing will the condition be populated, and of course if
Prometheus transitions from Missing to Found, then the Condition will be
updated to show .Type = PrometheusMissing .Status = False.

Signed-off-by: Joel Diaz <[email protected]>

* move to http metrics

Using a ServiceMonitor with the bearerTokenFile parameter set causes the
ServiceMonitor to be rejected by the OpenShift user monitoring stack (
operator-framework/operator-sdk#4764 ).

As there is nothing sensitive in the mondoo-operator metrics, just
expose them directly to allow metrics to work under the built-in
OpenShift user metrics monitoring stack.

Add the ability to set some labels on the ServiceMonitor to allow a
functional metrics collection with an out-of-the-box prometheus deployed
as configured in
https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
.

Change the kustomize generation so that the kube-rbac-proxy sidecar
container is no longer defined. It really only exists to protect
metrics. Introduce new Service to expose the new metrics ports. Patch
the default Deployment to expose the metrics port. A side benefit of
this is that you don't need to specify the container name when
displaying logs for mondoo-operator as there is now only a single
container.

Signed-off-by: Joel Diaz <[email protected]>

Co-authored-by: Joel Diaz <[email protected]>
redhatHameed added a commit to redhatHameed/dbaas-operator that referenced this issue Apr 11, 2022
redhatHameed added a commit to redhatHameed/dbaas-operator that referenced this issue Apr 12, 2022