Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add health status endpoint for monitoring #185

Merged
merged 2 commits into from
May 17, 2022

Conversation

MalKeshar
Copy link

Allow other monitoring software(like zabbix and so on) to check graphite-clickhouse health via standalone health check endpoint.
Similiar to https://prometheus.io/docs/prometheus/latest/management_api/#health-check

@@ -198,6 +198,10 @@ func main() {
}
w.Write(b)
})
mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion here it doesn't make much sense. I always used /render?query=some.metrics.definitely.does.not.exist&from=-1min or similar for health check. It checks the metric exists and does an actual query to storage.

This handler looks more like /alive, it doesn't check for the database behind or so

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, that "alive" is preferable option. Will change it. Still think, that standalone point for alive check is needed.
Because "/render?query=some.metrics.definitely.does.not.exist&from=-1min" is functional check and this not the same as "basic health check", usually used for monitoring daemon health itself.

With this check if backend dies, we got two alerts instead of one. If I need overall pipeline check, I will use query to carbonapi

Copy link
Author

@MalKeshar MalKeshar May 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Felixoid, напишу лучше на русском, на английском мне такие тексты сложновато писать. Путаница между мониторингом на живость самого сервиса и функциональных проверок на мой взгляд обычно приводит к тому, что при выходе одного сервиса случается шторм алертов. Вот у нас есть цепочка сервисов
clickhouse->graphite-clichouse->carbonapi. Если мы везде добавляем функциональные проверки вместо простых проверок на живость отдельных сервисов, то при выходе из строя clickhouse у нас придёт три алерта вместо одного. Если у нас более сложные зависимости с большим количеством сервисов, то это превращается в ад.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Оно, в целом, совершенно валидно при прочих равных. Но от того же балансера, к примеру, по этой ручке я бы отстреливать не стал, потому что она проверяет живость сервера, но не его работу. Для балансера нужна функциональная ручка.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Оно, в целом, совершенно валидно при прочих равных. Но от того же балансера, к примеру, по этой ручке я бы отстреливать не стал, потому что она проверяет живость сервера, но не его работу. Для балансера нужна функциональная ручка.

Да, мой основной тезис не про то, что функциональных проверок не должно быть. Для разных сценариев нужны и те и другие. Спасибо за принятый пул-реквест!

@MalKeshar
Copy link
Author

MalKeshar commented May 17, 2022

@Felixoid, I made it because since version 0.13.2 our previous health check no longer working:

# curl -I 'http://xxx:9090/render/?target=test&from=0&until=0'
HTTP/1.1 400 Bad Request
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
X-Gch-Request-Id: 4d5b40c0b0c4f1abad4932787ea2303a
Date: Tue, 17 May 2022 14:25:32 GMT
Content-Length: 126

In the logs error:
"format is not supported, supported formats: carbonapi_v3_pb, pickle, protobuf (aka carbonapi_v2_pb)

Was found via
#118

@Felixoid
Copy link
Collaborator

Yep, you can add format=pickle, should work

@MalKeshar
Copy link
Author

MalKeshar commented May 17, 2022

Yep, you can add format=pickle, should work

I not sure it is good idea to rely on thing, which can break in any time.

@Felixoid
Copy link
Collaborator

Felixoid commented May 17, 2022

I don't think one would remove an introduced format. It's one of the default graphite data exchange formats, and graphite-clickhouse must support it in favor of compatibility with graphite-web ¯\_(ツ)_/¯

upd
the query /render/?target=some.metric&from=0&until=0 is sent to a frontend, so either graphite-web or carbonapi. It works there and checks the whole chain direct to storage. For graphite-clickhouse additional format=X should be used

@Felixoid Felixoid merged commit 1bdb4a9 into go-graphite:master May 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants