
Propagate metrics from the drivers #293

Open
dennwc opened this issue May 15, 2019 · 11 comments

Comments

@dennwc
Member

dennwc commented May 15, 2019

Currently, bblfshd exposes only its own metrics. Ideally, we should expose metrics from the driver containers as well. Since we do not directly expose any ports from driver containers, we will have to aggregate their metrics in bblfshd manually and expose them from there.

@kuba-- kuba-- self-assigned this Sep 20, 2019
@kuba--
Member

kuba-- commented Sep 20, 2019

A couple of questions:

  • do we want to expose pure container metrics or mainly driver metrics?
  • do drivers have/collect any Prometheus metrics?
  • if we want to propagate metrics from drivers without exposing any new port (and without, for instance, installing a push gateway), maybe we can go with the following approach (see the sketch after this list):
    We would have another socket file on a mounted volume (like the GRPCSocket = "rpc.sock" on the driver, but this one would be metrics.sock, a pure UDS). Each driver would have its own Prometheus endpoint served on this socket (instead of aggregating anything on bblfshd), while bblfshd would listen on an extra port (like we already do with http.ListenAndServe(*metrics.address, promhttp.Handler())). This extra port would be served by a proxy between bblfshd and the drivers, which forwards each driver's metrics.sock. In other words, when Prometheus pulls bblfshd metrics it would also pull metrics from the extra proxy server, and the proxy server would pull/copy metrics from every driver's metrics.sock file.
    WDYT?
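A rough sketch of that proxy idea, assuming hypothetical socket paths and a made-up /drivers/<id>/metrics route (not actual bblfshd code). It relies on overriding the HTTP transport's dialer to talk HTTP over a Unix domain socket:

    package main

    import (
    	"context"
    	"fmt"
    	"io"
    	"net"
    	"net/http"
    	"path/filepath"
    )

    // driverSockets would come from the daemon's driver registry;
    // the paths here are hypothetical placeholders.
    var driverSockets = map[string]string{
    	"python-1": "/var/lib/bblfshd/drivers/python-1/metrics.sock",
    	"go-1":     "/var/lib/bblfshd/drivers/go-1/metrics.sock",
    }

    // metricsOverUDS fetches /metrics from a driver's Unix domain socket
    // by overriding the HTTP transport's dialer.
    func metricsOverUDS(sock string) (io.ReadCloser, error) {
    	client := &http.Client{
    		Transport: &http.Transport{
    			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
    				return (&net.Dialer{}).DialContext(ctx, "unix", sock)
    			},
    		},
    	}
    	// The host part is ignored; the dialer above always connects to the socket.
    	resp, err := client.Get("http://driver/metrics")
    	if err != nil {
    		return nil, err
    	}
    	return resp.Body, nil
    }

    func main() {
    	// Proxy endpoint: /drivers/<id>/metrics relays that driver's metrics.sock.
    	http.HandleFunc("/drivers/", func(w http.ResponseWriter, r *http.Request) {
    		id := filepath.Base(filepath.Dir(r.URL.Path)) // crude /drivers/<id>/metrics parsing
    		sock, ok := driverSockets[id]
    		if !ok {
    			http.NotFound(w, r)
    			return
    		}
    		body, err := metricsOverUDS(sock)
    		if err != nil {
    			http.Error(w, err.Error(), http.StatusBadGateway)
    			return
    		}
    		defer body.Close()
    		io.Copy(w, body) // plain copy of the Prometheus text format
    	})
    	fmt.Println(http.ListenAndServe(":2113", nil))
    }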

Another option is to install the Prometheus push gateway and let drivers push metrics directly to Prometheus (see the sketch below).
I don't see any reason why bblfshd needs to aggregate any data from all drivers. Over what protocol?
Drivers work at their own pace, so I don't know if it makes sense to utilize bblfshd. I hope a simple io.Copy between sockets may work for us.
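For reference, a minimal sketch of the Pushgateway alternative using the client_golang push package; the gateway address, job name, grouping label, and the metric itself are placeholders for illustration:

    package main

    import (
    	"log"
    	"time"

    	"github.com/prometheus/client_golang/prometheus"
    	"github.com/prometheus/client_golang/prometheus/push"
    )

    func main() {
    	parsed := prometheus.NewCounter(prometheus.CounterOpts{
    		Name: "driver_files_parsed_total", // hypothetical driver metric
    		Help: "Number of files parsed by this driver instance.",
    	})
    	prometheus.MustRegister(parsed)

    	// Periodically push everything in the default registry to a Pushgateway.
    	// The address and grouping labels are made up.
    	pusher := push.New("http://pushgateway:9091", "bblfsh_driver").
    		Grouping("language", "python").
    		Gatherer(prometheus.DefaultGatherer)

    	for range time.Tick(15 * time.Second) {
    		parsed.Inc()
    		if err := pusher.Push(); err != nil {
    			log.Println("push failed:", err)
    		}
    	}
    }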

@kuba-- kuba-- removed their assignment Sep 20, 2019
@dennwc
Member Author

dennwc commented Sep 20, 2019

do we want to expose pure container metrics or mainly driver metrics?

We don't have container metrics since we use libcontainer. We'll need to implement them on our own, if needed.

So yes, for now only metrics exposed by the driver.

do drivers have/collect any Prometheus metrics?

To be honest, I don't remember; I need to check the SDK for this. Probably not yet, because we had no way to access those metrics.

Each driver would have its own Prometheus endpoint served on this socket

This is exactly what I wanted to avoid here, because instead of running a single driver container we would have to maintain a few more, e.g. prometheus/proxy/whatever. I think it should be possible to aggregate metrics on bblfshd directly from each driver's HTTP endpoint.

let drivers push metrics directly to Prometheus

Again, I would like to avoid it. This requires pushing more configs to drivers, and drivers will need to communicate with external services, which is not ideal.

Over what protocol?

HTTP, as Prometheus does. The format returned by the endpoint is text-based and simple, and I believe there is a public package in Prometheus to parse and modify metrics in this format. So we can read from the HTTP endpoint of each driver, add a container=xxx label to each metric, and re-expose them via bblfshd's own Prometheus endpoint (by registering those metrics in the prometheus package?).
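The public package in question is likely github.com/prometheus/common/expfmt. A hedged sketch of reading a driver's text-format metrics, attaching a container label, and writing them back out (the sample input and container ID are placeholders):

    package main

    import (
    	"io"
    	"os"
    	"strings"

    	"github.com/golang/protobuf/proto"
    	dto "github.com/prometheus/client_model/go"
    	"github.com/prometheus/common/expfmt"
    )

    // relabel parses Prometheus text-format metrics from r, adds a
    // container=<id> label to every sample, and writes them to w.
    func relabel(w io.Writer, r io.Reader, containerID string) error {
    	var parser expfmt.TextParser
    	families, err := parser.TextToMetricFamilies(r)
    	if err != nil {
    		return err
    	}
    	for _, mf := range families {
    		for _, m := range mf.Metric {
    			m.Label = append(m.Label, &dto.LabelPair{
    				Name:  proto.String("container"),
    				Value: proto.String(containerID),
    			})
    		}
    		if _, err := expfmt.MetricFamilyToText(w, mf); err != nil {
    			return err
    		}
    	}
    	return nil
    }

    func main() {
    	const sample = "# TYPE requests_total counter\nrequests_total 42\n"
    	_ = relabel(os.Stdout, strings.NewReader(sample), "driver-abc123")
    }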

Drivers work at their own pace, so I don't know if it makes sense to utilize bblfshd.

Sure, but that's exactly the purpose of bblfshd: to manage and monitor drivers during their lifetime. Metrics play an important part in that monitoring, so it makes sense to embed this functionality in bblfshd.

Or give up on libcontainer, use Docker and let Prometheus gather metrics from all containers automatically.

@kuba--
Member

kuba-- commented Sep 20, 2019

So, why don't we expose Prometheus on each driver? When you install a new driver, it gets a random available port on bblfshd which is mapped to the container's (driver's) :2112 (Prometheus).
bblfshd will also register this endpoint in the main, infra Prometheus.
No aggregation needed, no need to switch to Docker, etc. Kind of a transparent solution.
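For illustration only: grabbing a random free host port is just a matter of listening on port 0; how that port would then be forwarded to the container's :2112 depends on the container runtime and is only hinted at here.

    package main

    import (
    	"fmt"
    	"net"
    )

    func main() {
    	// Ask the kernel for any free port on the host.
    	ln, err := net.Listen("tcp", "127.0.0.1:0")
    	if err != nil {
    		panic(err)
    	}
    	defer ln.Close()

    	port := ln.Addr().(*net.TCPAddr).Port
    	// This is the port the daemon would register with Prometheus and
    	// forward (via the container runtime) to the driver's :2112.
    	fmt.Printf("host port %d -> container :2112\n", port)
    }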

@dennwc
Member Author

dennwc commented Sep 20, 2019

It would be ideal, I guess, but we may create/kill drivers at a very high rate, so updating a set of Prometheus data sources may end up being harder than aggregating. Also, it may be hard to set up because bblfshd will need to access Prometheus and change its configuration. I don't think Infra will like it.

And aggregation is not as hard as you may think. It should not be harder than reading a bunch of JSONs, merging those and serving on a separate endpoint. With "JSON" replaced by "Prometheus text format".

To simplify things we may even not expose the results to the Prometheus lib and instead serve a separate HTTP endpoint (see the sketch below). For example, bblfshd will serve its own metrics on /metrics (via promhttp.Handler()) and additionally serve driver metrics on /metrics-drivers (by serving a raw buffer of concatenated metrics in the Prometheus format).
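A minimal sketch of that split, assuming a hypothetical gatherDriverMetrics helper that returns each driver's raw text-format output (how it is fetched is left out):

    package main

    import (
    	"net/http"

    	"github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // gatherDriverMetrics is a hypothetical helper that would fetch the raw
    // Prometheus text output from every running driver (e.g. over their sockets).
    func gatherDriverMetrics() [][]byte {
    	return nil // placeholder
    }

    func main() {
    	// bblfshd's own metrics, served the usual way.
    	http.Handle("/metrics", promhttp.Handler())

    	// Driver metrics: just concatenate the raw buffers, no re-parsing.
    	http.HandleFunc("/metrics-drivers", func(w http.ResponseWriter, r *http.Request) {
    		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
    		for _, buf := range gatherDriverMetrics() {
    			w.Write(buf)
    		}
    	})

    	http.ListenAndServe(":2112", nil)
    }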

@kuba--
Member

kuba-- commented Sep 22, 2019

I think adding new drivers does not happen very often. Also, registering a new node in Prometheus should not be a big deal; moreover, if Prometheus tries to hit a non-existing driver, it should auto-drop that node.
Anyway, this solution sounds more elegant compared to aggregation-proxy functionality in bblfshd.

@creachadair
Contributor

I'm inclined to aim for the solution that requires the least additional coupling. If we look at the set of all running drivers for a given language as a single job, regardless of which daemon is running them, I think we get more or less the same thing as if we export them individually, right? Unlike logs, where identity matters, I think we don't really care which shard produced a particular sample, as long as we know which language (and possibly which driver version) it came from. (And even then, the instance could generate a UUID at startup to use as a throwaway instance ID if we for some reason needed to know that.)

@dennwc
Member Author

dennwc commented Sep 25, 2019

I think adding new drivers does not happen very often

@kuba-- It happens regularly because of the scaling policy in bblfshd. Each second it may kill or run another driver instance. And if you have N clients <= the max number of drivers, one client will always lag a bit and bblfshd may kill/start a driver each policy tick. I've tried to prevent this in a few previous PRs, but I think it may still happen under some circumstances.

I agree with @creachadair that the less coupling, the better. The solution I mentioned (aggregation) does exactly this: from the perspective of Prom there is only a single daemon, so no additional configs are required. And bblfshd concatenates all the metrics from all the drivers and presents them to Prom with different driver IDs.

In fact, if we were running only a single language instance per daemon, we could have just concatenated the metrics; the Prom text format allows that. But we can't do this when there are multiple instances, since the metrics will collide. Or so I thought initially.

But @creachadair made a good point: we can just generate IDs randomly on the driver's side and let bblfshd concatenate the metrics, instead of parsing them, assigning IDs and re-encoding them back. This should work well, as far as I can tell.
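On the driver side, one way to get that throwaway instance ID is to wrap the registerer with a constant label generated at startup. A sketch assuming github.com/google/uuid (any random-ID source would do) and a made-up driver metric:

    package main

    import (
    	"net/http"

    	"github.com/google/uuid"
    	"github.com/prometheus/client_golang/prometheus"
    	"github.com/prometheus/client_golang/prometheus/promhttp"
    )

    func main() {
    	// Generate a throwaway instance ID at startup so that metrics from
    	// several instances of the same driver don't collide when concatenated.
    	reg := prometheus.NewRegistry()
    	wrapped := prometheus.WrapRegistererWith(
    		prometheus.Labels{"driver_instance": uuid.New().String()},
    		reg,
    	)

    	parsed := prometheus.NewCounter(prometheus.CounterOpts{
    		Name: "driver_files_parsed_total", // hypothetical driver metric
    		Help: "Number of files parsed by this driver instance.",
    	})
    	wrapped.MustRegister(parsed)

    	// bblfshd (or a proxy) can then scrape and concatenate this output as-is.
    	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
    	http.ListenAndServe(":2112", nil)
    }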

@kuba-- kuba-- self-assigned this Oct 3, 2019
@kuba--
Member

kuba-- commented Oct 3, 2019

How about the following design:

bblfshd runs a local Prometheus instance and the global Prometheus pulls metrics from all bblfshd instances.
Next to the Prometheus server on bblfshd (it can be in the same process), we would also run some kind of Prometheus aggregator (for instance this one: https://github.com/peterbourgon/prometheus-aggregator) and a background goroutine as a coordinator.
The coordinator would query each driver's Prometheus endpoint, get the metrics and send them to the aggregator (a typical io.Copy over the network); see the sketch below.
The Prometheus aggregator is connected to the local Prometheus, so data will be automatically aggregated and pulled by the local Prometheus.
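A hedged sketch of that coordinator loop. The driver endpoints and aggregator address are placeholders, and the aggregator's actual ingest protocol is not modeled here; this only illustrates the scrape-and-forward (io.Copy over the network) part:

    package main

    import (
    	"io"
    	"log"
    	"net"
    	"net/http"
    	"time"
    )

    // driverEndpoints and aggregatorAddr are placeholders; in bblfshd they would
    // come from the driver registry and the aggregator's configuration.
    var (
    	driverEndpoints = []string{"http://127.0.0.1:40001/metrics", "http://127.0.0.1:40002/metrics"}
    	aggregatorAddr  = "127.0.0.1:8192"
    )

    // coordinate periodically scrapes each driver and streams the raw metrics
    // into the aggregator's TCP ingest endpoint.
    func coordinate() {
    	for range time.Tick(15 * time.Second) {
    		for _, url := range driverEndpoints {
    			resp, err := http.Get(url)
    			if err != nil {
    				log.Println("scrape failed:", err)
    				continue
    			}
    			conn, err := net.Dial("tcp", aggregatorAddr)
    			if err == nil {
    				io.Copy(conn, resp.Body)
    				conn.Close()
    			}
    			resp.Body.Close()
    		}
    	}
    }

    func main() {
    	go coordinate()
    	select {} // the daemon's own work would happen here
    }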

@creachadair
Contributor

creachadair commented Oct 3, 2019

That design seems mostly reasonable, but I don't understand the purpose of separately aggregating the driver metrics locally in the daemon. If the drivers were going to export a prometheus instance anyway, we could have the global prometheus poll them directly.

I realize that, the way it's set up right now, that would not work because of how we run the drivers in libcontainer. I think we mostly agreed on committing to Docker to manage our containers, though; and in that case we would no longer need a separate aggregation path, since we could export the driver metric services directly. Was that the main motivation for the separate aggregation step?

@kuba--
Member

kuba-- commented Oct 3, 2019

Partly, yes (if I understood @dennwc's concerns correctly). On the other hand, we create new drivers and kill old ones. Drivers run in containers and have to expose some random port for Prometheus on the host.
So, every time we kill/create a driver we have to de/register its local Prometheus endpoint in the global one (to let it pull metrics). I believe the Prometheus client can do this dynamically for a running container.
Anyway, I personally prefer to switch to Docker and just pass configuration to the running containers, instead of aggregating anything on the daemon. But if that's not an option, then we can think about the proxy aggregator.

@kuba-- kuba-- removed their assignment Oct 4, 2019
@dennwc
Member Author

dennwc commented Oct 4, 2019

Agreed on switching to Docker. The main reason I insisted on aggregating metrics from drivers in the daemon is that it's the only way to expose metrics without additional configs.

So if we can switch to Docker, that removes any concerns on my side: it's much easier to let Prometheus pull all metrics from all containers (including drivers) and be done with it.
