Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add liveness health check support #116

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

stanhu
Copy link
Collaborator

@stanhu stanhu commented Mar 8, 2021

When MailRoom is run in Kubernetes, we have found occasions where
MailRoom appears to have attempted to stop running, but Net::IMAP is
stuck waiting for threads (ruby/net-imap#14).

This commit adds an HTTP liveness checker to enable detection of a
terminated MailRoom pod.

@tpitale
Copy link
Owner

tpitale commented Mar 8, 2021

Could this be accomplished if we added something that responded to SIGINFO or something like that?

@stanhu
Copy link
Collaborator Author

stanhu commented Mar 8, 2021

I think SIGINFO is a BSD construct, so this would only be supported in macOS or FreeBSD. Using signals in general isn't the most cross-platform friendly way of doing monitoring.

@tpitale
Copy link
Owner

tpitale commented Mar 8, 2021

It doesn't need to be SIGINFO specifically. Linux supports signals generally. And I think there are a lot of other issues with mailroom that would prevent running on windows anyway.

@tpitale
Copy link
Owner

tpitale commented Mar 8, 2021

In general, I'm apprehensive about adding webrick and a web service into the mix.

I'd rather have another repo/project that could provide a web interface of some sort, that was able to query mailroom through some other means 🤔

@tpitale
Copy link
Owner

tpitale commented Mar 8, 2021

Would it be preferable instead to have mail_room report out, like a heartbeat? I feel like that would require less code, generally be safer.

@stanhu
Copy link
Collaborator Author

stanhu commented Mar 8, 2021

In this case, I think that is more complex because now you need to have a separate process that determines whether the process is alive. HTTP liveness probes are a common practice in Kubernetes: https://www.magalix.com/blog/kubernetes-and-containers-best-practices-health-probes

As for push vs pull for metrics, Prometheus has written extensively why they prefer a pull model for monitoring, particular for detecting a downed service:

  1. https://prometheus.io/docs/introduction/faq/#why-do-you-pull-rather-than-push
  2. https://prometheus.io/blog/2016/07/23/pull-does-not-scale-or-does-it/?utm_source=thenewstack&utm_medium=website&utm_campaign=platform.

@stanhu stanhu force-pushed the sh-add-liveness-health-check-upstream branch from 0e19593 to d96292d Compare March 15, 2021 21:02
@tpitale
Copy link
Owner

tpitale commented Dec 14, 2021

Okay. Re-reviewing this.

If we're going to do it, I'd like to change a few things.

  1. I'd like to be more explicit that this is an HTTP health check, so that if we add other kinds of health checks down the line, this won't have to move in the configuration or naming
  2. I'd like the default not to be nil, but a NoopHealthCheck, been trying to keep from using nil configuration and the &. pattern

I can leave more specific comments in the code, if that is helpful. Sorry I didn't get back to this for months and months (new baby).

@stanhu stanhu force-pushed the sh-add-liveness-health-check-upstream branch 7 times, most recently from c939eaa to cec63fd Compare December 15, 2021 06:02
@stanhu
Copy link
Collaborator Author

stanhu commented Dec 15, 2021

@tpitale Congrats on your new arrival! I've updated this pull request; let me know what you think.

When MailRoom is run in Kubernetes, we have found occasions where
MailRoom appears to have attempted to stop running, but `Net::IMAP` is
stuck waiting for threads (ruby/net-imap#14).

This commit adds an HTTP liveness checker to enable detection of a
terminated MailRoom pod.
By default, the `NopHealthCheck` will be used.

Make it explicit that we have an HTTP health check in case we need to
support different health checks down the line.
Also ensure the health check is never nil.
@stanhu stanhu force-pushed the sh-add-liveness-health-check-upstream branch from cec63fd to abb3d6d Compare May 12, 2023 19:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants