
Support metrics-based workload discovery #59

Open
lujiajing1126 opened this issue Jun 1, 2023 · 10 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@lujiajing1126

Is your feature request related to a problem? Please describe.

In some cases, we want to run the KRR program locally, but for security reasons the API server of the Kubernetes cluster cannot be accessed from outside the cluster.

So we can use Prometheus-based workload discovery instead, provided kube-state-metrics is installed.

Describe the solution you'd like

We can do the workload-based discovery with the following steps (a sketch tying the steps together follows the list):

  1. List Deployments together with their ReplicaSets:
replicasets = await self.metrics_loader.loader.query(
    "count by (namespace, owner_name, replicaset) (kube_replicaset_owner{"
    f'namespace=~"{ns}", '
    'owner_kind="Deployment"})'
)
  2. List Pods from a group of ReplicaSets:
# owner_name is a regex of the ReplicaSet names from step (1)
pods = await self.metrics_loader.loader.query(
    "count by (owner_name, replicaset, pod) (kube_pod_owner{"
    f'namespace="{namespace}", '
    f'owner_name=~"{owner_name}", '
    'owner_kind="ReplicaSet"})'
)
  3. List containers from the Pods obtained in step (2):
containers = await self.metrics_loader.loader.query(
    "count by (container) (kube_pod_container_info{"
    f'namespace="{namespace}", '
    f'pod=~"{pod_selector}"'
    "})"
)
  4. Build K8sObjectData for the containers obtained in step (3):
async def __build_from_owner(self, namespace: str, app_name: str, containers: List[str], pod_names: List[str]) -> List[K8sObjectData]:
    return [
        K8sObjectData(
            cluster=None,
            namespace=namespace,
            name=app_name,
            kind="Deployment",
            container=container_name,
            allocations=await self.__parse_allocation(namespace, "|".join(pod_names), container_name),  # parse resource allocations for these pods
            pods=[PodData(name=pod_name, deleted=False) for pod_name in pod_names],  # list the backing pods
        )
        for container_name in containers
    ]
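For reference, here is a rough sketch of how the four steps above could be chained together in a single discovery routine. It assumes the same loader class context as the snippets above; the __group_replicasets helper and the exact shape of the query results (a list of series, each carrying a "metric" label dict) are illustrative assumptions, not the actual KRR code:

```python
from typing import List

async def __discover_deployments(self, ns: str) -> List[K8sObjectData]:
    # Step 1: map Deployments to their ReplicaSets via kube_replicaset_owner.
    replicasets = await self.metrics_loader.loader.query(
        "count by (namespace, owner_name, replicaset) (kube_replicaset_owner{"
        f'namespace=~"{ns}", '
        'owner_kind="Deployment"})'
    )

    objects: List[K8sObjectData] = []
    # Hypothetical helper: group the returned series by (namespace, Deployment name)
    # and collect the ReplicaSet names belonging to each Deployment.
    for namespace, deployment, rs_names in self.__group_replicasets(replicasets):
        # Step 2: resolve the Pods owned by those ReplicaSets via kube_pod_owner.
        owner_name = "|".join(rs_names)
        pods = await self.metrics_loader.loader.query(
            "count by (owner_name, replicaset, pod) (kube_pod_owner{"
            f'namespace="{namespace}", '
            f'owner_name=~"{owner_name}", '
            'owner_kind="ReplicaSet"})'
        )
        pod_names = [series["metric"]["pod"] for series in pods]

        # Step 3: list the containers running in those Pods via kube_pod_container_info.
        pod_selector = "|".join(pod_names)
        containers = await self.metrics_loader.loader.query(
            "count by (container) (kube_pod_container_info{"
            f'namespace="{namespace}", '
            f'pod=~"{pod_selector}"'
            "})"
        )
        container_names = [series["metric"]["container"] for series in containers]

        # Step 4: build one K8sObjectData entry per container of the Deployment.
        objects += await self.__build_from_owner(namespace, deployment, container_names, pod_names)

    return objects
```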


@danielhass

In our team we would love to see this option as well. We have played around with the newly introduced -l option, which is currently unreleased and only available on the main branch (introduced via #70). However, we noticed that workload discovery still seems to be done via the Kubernetes API in this case.

Our setup contains a central monitoring cluster where we would like to use krr to produce recommendations for the different clusters connected to this central monitoring solution. Metric-based workload discovery would allow us to avoid requiring Kubernetes API access to every downstream cluster that reports metrics to the central solution. As shown above, the workload information should be extractable from the stored metrics with reasonable effort.

@lujiajing1126
Author

lujiajing1126 commented Jun 20, 2023

> In our team we would love to see this option as well. We have played around with the newly introduced -l option, which is currently unreleased and only available on the main branch (introduced via #70). However, we noticed that workload discovery still seems to be done via the Kubernetes API in this case.
>
> Our setup contains a central monitoring cluster where we would like to use krr to produce recommendations for the different clusters connected to this central monitoring solution. Metric-based workload discovery would allow us to avoid requiring Kubernetes API access to every downstream cluster that reports metrics to the central solution. As shown above, the workload information should be extractable from the stored metrics with reasonable effort.

You can probably try my fork branch, https://github.com/lujiajing1126/krr/tree/support-prom-discovery.

We've already run this branch several times to provide resource recommendations for our production cluster (you may need some tweaks to the formatter).

Just

$ python krr.py simple -n <namespace> -p <prometheus-url>

is enough.

@danielhass

Hey @lujiajing1126, thank you for your work and for pointing us to your fork. The results so far look very promising. We were able to generate recommendations running against our central monitoring cluster without problems so far.

Are you planning to contribute your changes back to krr? If you need any input or further testing that we can help with, feel free to ping me!

@lujiajing1126
Author

> Hey @lujiajing1126, thank you for your work and for pointing us to your fork. The results so far look very promising. We were able to generate recommendations running against our central monitoring cluster without problems so far.
>
> Are you planning to contribute your changes back to krr? If you need any input or further testing that we can help with, feel free to ping me!

Thanks for your feedback! Sure, I am glad to contribute this feature back to the upstream. Since the branch contains several patches besides the solution to this issue, I would like to split it into several pull requests.

But so far I have not received any code review or comments. It seems the maintainers are not actively considering contributions from the community.

@aantn
Contributor

aantn commented Jun 21, 2023

@lujiajing1126 hey, we appreciate the contribution and will review soon.

I'm sorry about the delay. We're a small team, so it sometimes takes more time than we would like.

@LeaveMyYard
Contributor

@lujiajing1126 Hey! I really like your idea and approach, and I am currently refactoring the way KRR works with Prometheus.

But there is one thing that came to my mind about removing the usage of the cluster API: we have an auto-discovery feature that makes life easier for lots of people by not making them search for the Prometheus link. But I think we can leave that; it will just not work when you don't have access, requiring you to provide the URI.

But there is one more thing I am thinking about: would it be possible to get data about CRDs using your approach? WDYT? (#65)

@LeaveMyYard
Contributor

> In our team we would love to see this option as well. We have played around with the newly introduced -l option, which is currently unreleased and only available on the main branch (introduced via #70). However, we noticed that workload discovery still seems to be done via the Kubernetes API in this case.
>
> Our setup contains a central monitoring cluster where we would like to use krr to produce recommendations for the different clusters connected to this central monitoring solution. Metric-based workload discovery would allow us to avoid requiring Kubernetes API access to every downstream cluster that reports metrics to the central solution. As shown above, the workload information should be extractable from the stored metrics with reasonable effort.

@Avi-Robusta, you are working with centralized monitoring; could you also join the discussion?

@LeaveMyYard added the enhancement and help wanted labels on Jun 23, 2023
@lujiajing1126
Author

lujiajing1126 commented Jun 23, 2023

> But there is one thing that came to my mind about removing the usage of the cluster API: we have an auto-discovery feature that makes life easier for lots of people by not making them search for the Prometheus link. But I think we can leave that; it will just not work when you don't have access, requiring you to provide the URI.

Yes, I think so. We can generally provide two ways of workload discovery:

  1. Use the Kubernetes API server if provided, and automatically discover an in-cluster Prometheus-compatible metrics server.
  2. Fall back to fully Prometheus-based discovery if only --prometheus-url is given (a rough sketch of this fallback follows below).
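
A minimal sketch of that selection logic, purely as an illustration (the function name and return values here are hypothetical, not from KRR):

```python
from typing import Optional


def choose_discovery_method(kube_api_reachable: bool, prometheus_url: Optional[str]) -> str:
    # Prefer the Kubernetes API server whenever a cluster connection is available,
    # keeping today's auto-discovery behaviour intact.
    if kube_api_reachable:
        return "api-server"
    # Otherwise fall back to pure Prometheus-based discovery, which requires an explicit URL.
    if prometheus_url:
        return "prometheus"
    raise RuntimeError("Neither Kubernetes API access nor --prometheus-url is available")
```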

> But there is one more thing I am thinking about: would it be possible to get data about CRDs using your approach? WDYT? (#65)

I am not familiar with ArgoCD; we need to do some investigation. But I think this pattern is prevalent: advanced Kubernetes workloads need their own controllers, e.g. OpenKruise.

And I would like to contribute my branch back to the upstream (I probably need some more time to polish the code; I guess I could finish next week).

Again, thanks to your team for the hard work in bringing us this great product!

@LeaveMyYard
Contributor

> Yes, I think so. We can generally provide two ways of workload discovery:
>
> 1. Use the Kubernetes API server if provided, and automatically discover an in-cluster Prometheus-compatible metrics server.
> 2. Fall back to fully Prometheus-based discovery if only --prometheus-url is given.

We agree that making it an optional behaviour is a good choice, maybe by adding a --discovery-method flag with "api-server" as the default and "prometheus" as an option. With that approach we will not have to remove the current implementation, which works fine for almost everyone.
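
For illustration, a rough sketch of what such a flag could look like if the CLI is built with typer; the enum, command, and option names here are assumptions for the sake of the example, not the actual KRR code:

```python
import enum
from typing import Optional

import typer


class DiscoveryMethod(str, enum.Enum):
    api_server = "api-server"
    prometheus = "prometheus"


app = typer.Typer()


@app.command()
def simple(
    namespace: str = typer.Option(..., "--namespace", "-n"),
    prometheus_url: Optional[str] = typer.Option(None, "--prometheus-url", "-p"),
    discovery_method: DiscoveryMethod = typer.Option(
        DiscoveryMethod.api_server,
        "--discovery-method",
        help="Discover workloads via the Kubernetes API server (default) or via Prometheus metrics.",
    ),
):
    if discovery_method == DiscoveryMethod.prometheus:
        ...  # run the Prometheus-only workload discovery described in this issue
    else:
        ...  # keep the current Kubernetes API based discovery
```

With such a flag, the command shown earlier in this thread would become something like `python krr.py simple -n <namespace> -p <prometheus-url> --discovery-method prometheus`.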

@lujiajing1126 mentioned this issue on Jul 3, 2023
@aantn
Contributor

aantn commented May 7, 2024

Hey all, just to update: we have an updated PR with a new mode where KRR can operate based only on Prometheus data, so this will be supported in the near future. Please help by testing the PR!

#266
