Propagate metrics from the drivers #293
Comments
A couple of questions:
Another option is to install the Prometheus push gateway and let drivers push metrics directly to Prometheus.
We don't have container metrics since we use libcontainer. So yes, for now only the metrics exposed by the driver.
To be honest, I don't remember. Need to check the SDK for this. Probably not yet, because we had no way to access those metrics.
This is exactly what I wanted to avoid here, because instead of running a single driver container we would have to maintain a few more, e.g. Prometheus/proxy/whatever. I think it should be possible to aggregate metrics on bblfshd directly from each driver's HTTP endpoint.
Again, I would like to avoid it. This requires pushing more configs to drivers, and drivers will need to communicate with external services, which is not ideal.
HTTP, as Prometheus does. The format returned by the endpoint is text-based and simple; I believe there is a public package in Prometheus to parse and modify metrics in this format. So we can read from the HTTP endpoint of each driver, add a driver ID label to each metric, and expose the merged result.
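A minimal sketch of that aggregation idea, assuming each driver serves the standard Prometheus text format over HTTP and using the public expfmt parser from prometheus/common; the URL, the driver ID, and the function name are made up for illustration:

```go
package main

import (
	"io"
	"net/http"
	"os"

	dto "github.com/prometheus/client_model/go"
	"github.com/prometheus/common/expfmt"
)

// scrapeAndLabel reads one driver's metrics, tags every sample with a
// driver ID label so samples from different instances don't collide, and
// re-encodes the result in the same text format.
func scrapeAndLabel(metricsURL, driverID string, w io.Writer) error {
	resp, err := http.Get(metricsURL)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		return err
	}

	labelName := "driver_id"
	for _, mf := range families {
		for _, m := range mf.Metric {
			m.Label = append(m.Label, &dto.LabelPair{Name: &labelName, Value: &driverID})
		}
		if _, err := expfmt.MetricFamilyToText(w, mf); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// bblfshd would loop over all running drivers and concatenate the output
	// on its own metrics endpoint; a single driver is shown here.
	_ = scrapeAndLabel("http://localhost:2112/metrics", "driver-1", os.Stdout)
}
```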
Sure, but it's exactly the purpose of bblfshd - to manage and monitor drivers during their lifetime. And metrics play an important part in monitoring, so it makes sense to embed this functionality in bblfshd. Or give up on
So, why don't we expose Prometheus on each driver? When you install a new driver, it will get a random available port on bblfshd which will be mapped to the container's (driver's) :2112 (Prometheus).
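For reference, the driver side of this proposal is tiny; a sketch assuming the standard client_golang promhttp handler and the :2112 port mentioned above:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Serve the default Prometheus registry inside the driver container;
	// bblfshd (or Docker) maps a random host port to this fixed one.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```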
It would be ideal, I guess, but we may create/kill drivers at a very high rate, so updating a set of Prometheus data sources may end up being harder than aggregating. Also, it may be hard to set up because bblfshd will need to access Prometheus and change its configuration. I don't think Infra will like it. And aggregation is not as hard as you may think. It should not be harder than reading a bunch of JSONs, merging them and serving them on a separate endpoint, with "JSON" replaced by "Prometheus text format". To simplify things we may not even expose the results to the Prometheus lib and instead make a separate HTTP endpoint. For example, bblfshd will serve its own metrics on
I think adding a new driver does not happen very often. And anyway, registering a new node in prom. shouldn't be a big deal. Moreover, if prom. tries to hit a non-existing driver it should auto-drop this node.
I'm inclined to aim for the solution that requires the least additional coupling. If we look at the set of all running drivers for a given language as a single job, regardless of which daemon is running them, I think we get more or less the same thing as if we exported them individually, right? Unlike logs, where identity matters, I think we don't really care which shard produced a particular sample, as long as we know which language (and possibly which driver version) it came from. (And even then, the instance could generate a UUID at startup to use as a throwaway instance ID if we for some reason needed to know that.)
@kuba-- It happens regularly because of the scaling policy in bblfshd. Each second it may kill or run another driver instance. And if you have N clients <= the max number of drivers, one client will always lag a bit and bblfshd may kill/start a driver each policy tick. I've tried to prevent this in a few previous PRs, but I think it may still happen under some circumstances. I agree with @creachadair that the less coupling, the better. The solution I mentioned (aggregation) does exactly this - from the perspective of Prom there is only a single daemon, so no additional configs are required. And bblfshd concatenates all the metrics from all the drivers to present them to Prom with different driver IDs. In fact, if we were running only a single language instance per daemon, we could have just concatenated the metrics - the Prom text format allows that. But we can't do this when there are multiple instances, since metrics will collide. Or so I thought initially. But @creachadair made a good point: we can just generate IDs randomly on the driver's side and let bblfshd concatenate metrics, instead of parsing, assigning IDs and re-encoding them back. This should work well, as far as I know.
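A sketch of the throwaway-ID idea from the two comments above: the driver generates a random ID at startup and attaches it as a constant label to everything it registers, so bblfshd can concatenate the text output of several instances of the same driver without any parsing. The metric name and the google/uuid dependency are illustrative assumptions:

```go
package main

import (
	"net/http"

	"github.com/google/uuid"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Random per-process ID; it only needs to be unique among live instances.
	instanceID := uuid.New().String()

	reg := prometheus.NewRegistry()
	requests := prometheus.NewCounter(prometheus.CounterOpts{
		Name:        "driver_requests_total",
		Help:        "Total requests handled by this driver instance.",
		ConstLabels: prometheus.Labels{"instance_id": instanceID},
	})
	reg.MustRegister(requests)

	// Expose the labelled metrics; bblfshd can simply concatenate this
	// output across drivers since the label keeps the series distinct.
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	_ = http.ListenAndServe(":2112", nil)
}
```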
How about the following design:
That design seems mostly reasonable, but I don't understand the purpose of separately aggregating the driver metrics locally in the daemon. If the drivers were going to export a Prometheus instance anyway, we could have the global Prometheus poll them directly. I realize that, the way it's set up right now, that would not work, because of how we run the drivers in libcontainer. I think we mostly agreed on committing to Docker to manage our containers, though; and in that case we would no longer need a separate aggregation path, since we could export the driver metric services directly. Was that the main motivation for the separate aggregation step?
Partly yes (if I understood @dennwc's concerns correctly). On the other hand, we create new drivers and kill old ones. Drivers run in containers and have to expose some random port for Prometheus on the host.
Agreed on switching to Docker. The main reason I insisted on aggregating metrics from drivers in the daemon is that it's the only way to expose metrics without additional configs. So if we can switch to Docker, this removes any concerns on my side - it's much easier to let Prometheus pull all metrics from all containers (including drivers) and be done with it.
Currently, bblfshd exposes only its own metrics. Ideally, we should expose metrics from containers as well. Since we are not directly exposing any ports from driver containers, we will have to aggregate metrics manually from bblfshd and expose them as well.