Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prometheus workload loader #266

Open
wants to merge 26 commits into
base: main
Choose a base branch
from
Open

Conversation

LeaveMyYard
Copy link
Contributor

@LeaveMyYard LeaveMyYard commented Apr 22, 2024

  • Implemented ability to select an alternative workload loader
  • Implemented Prometheus workload loader
  • Refactored Kube API workload loader: moved separate logic (for each Kind) to a separate class
  • List clusters on centralized prometheus
  • Scanning multiple clusters in prometheus mode?
  • Test running from robusta UI
  • Test prometheus mode on centralized prometheus
  • A lot of testing
  • Fix tests (mocks are now broken)

Testing issues found:

  • For some reason auto-discovery does not work
  • Prometheus mode does not gather HPA data
  • Something is now broken with HPAKey

@LeaveMyYard LeaveMyYard self-assigned this Apr 22, 2024
@LeaveMyYard LeaveMyYard mentioned this pull request Apr 26, 2024
3 tasks
@LeaveMyYard LeaveMyYard requested a review from aantn April 30, 2024 15:56
@LeaveMyYard LeaveMyYard marked this pull request as ready for review April 30, 2024 15:56
@deutschj
Copy link

When testing using the prometheus mode on a centralized Prometheus (VictoriaMetrics), I encountered the following errors.
I specified the labels --prometheus-label and --prometheus-cluster-label according to the docs:

python krr.py simple -p https://vmauth.xxx.com --prometheus-auth-header "<redacted>" --prometheus-label cluster -l k8s0-q --mode prometheus

image

To me it seems that the query includes the wrong label here, and should be avg by(cluster) instead of using the cluster name.

The name is misleading. Before this change we had both --prometheus-cluster-label and --prometheus-label which referred to very different things, leading to a bug in the code (to be fixed in the next commit).

We still support -l and have added support for  "--prometheus-cluster-value" which is what `-l` really represents.
@@ -430,7 +430,7 @@ If your Prometheus monitors multiple clusters we require the label you defined f
For example, if your cluster has the Prometheus label `cluster: "my-cluster-name"`, then run this command:

```sh
krr.py simple --prometheus-label cluster -l my-cluster-name
krr.py simple --prometheus-cluster-key cluster -l my-cluster-name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aantn I think this requires a change on the robusta-runner as well
see here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arikalon1 it should be fine. I'm not deprecating --prometheus-label, just adding another option with a name that makes more sense.

I did deprecate --prometheus-cluster-label but the runner doesn't pass that by default.

@@ -138,15 +138,16 @@ def run_strategy(
),
prometheus_cluster_label: Optional[str] = typer.Option(
None,
"--prometheus-cluster-label",
"--prometheus-cluster-value",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this one I did change but if I understand the runner code correctly, it uses -l which is still supported and equivalent to this flag. So no change needed.

@aantn
Copy link
Contributor

aantn commented Jun 14, 2024

Hi @deutschj,
You're correct of course! I've pushed a fix. Can you please test?

@aantn
Copy link
Contributor

aantn commented Jun 14, 2024

@deutschj please note that I changed the name of the flags a little -- see --help if necessary.

@deutschj
Copy link

@aantn Sure, thanks a lot for working on the bugfix!

Whett testing with the options --prometheus-cluster-key and -l I encountered the following error:

krr git:(prometheus-workload-loader) python krr.py simple --prometheus-url https://vmauth.xxx.com --prometheus-auth-header "<redacted>" --mode prometheus --prometheus-cluster-key cluster -l k8s0-q

Running Robusta's KRR (Kubernetes Resource Recommender) 1.8.2-dev
Using strategy: Simple
Using formatter: table
A newer version of KRR is available: v1.11.0

[08:22:56] INFO     Connecting using Prometheus, will load the kubeconfig.
           INFO     Using Prometheus at https://vmauth.xxx.com
           INFO     Prometheus found
           INFO     Prometheus connected successfully            
[08:22:57] INFO     Clusters available: k8s0-q, <cluster1>, <cluster2>, [...]
           CRITICAL Cannot scan multiple clusters for this prometheus, Rerun with the flag `-c <cluster>` where <cluster> is one of ['k8s0-q', '<cluster1>', '<cluster2>', [...]]

However, when providing only the -l option without the --prometheus-cluster-key, KRR starts to generate recommendations for the selected cluster. When generating, the following warnings are displayed for every existing workload:

WARNING  Prometheus returned no PercentileCPULoader metrics for StatefulSet xxx
WARNING  Prometheus returned no MaxMemoryLoader metrics for StatefulSet xxx
WARNING  Prometheus returned no CPUAmountLoader metrics for StatefulSet xxx
WARNING  Prometheus returned no MemoryAmountLoader metrics for StatefulSet xxx
INFO     Calculated recommendations for StatefulSet xxx (using 4 metrics)

And the table with the generated resource recommendations in the end is empty though.
So looks like now getting the existing workloads in the cluster works correctly, but getting the corresponding metrics from the centralized Prometheus doesn't yet.

@aantn
Copy link
Contributor

aantn commented Jun 28, 2024

@deutschj does it work if you provide --prometheus-cluster-key, -l, and -c with -c set to the same value as -l?

@deutschj
Copy link

deutschj commented Jul 2, 2024

@aantn Unfortunately not, no - this yields the same error as above, CRITICAL Cannot scan multiple clusters for this prometheus, Rerun with the flag '-c <cluster>' where <cluster> is one of [...]

@Pionerd
Copy link

Pionerd commented Jul 8, 2024

Would love to see this merged, using KRR from this feature branch for some time now

@aantn
Copy link
Contributor

aantn commented Jul 8, 2024

Hey, we're planning to get it merged, but no exact ETA yet.

@wad-hongsumin
Copy link

hello. Are there any plans for this branch to be merged?

@arikalon1
Copy link
Contributor

hey @wad-hongsumin

We do plan to merge it, hopefully soon

@wad-hongsumin
Copy link

@arikalon1
thank you. I'm hoping this PR gets merged quickly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants