[Enhancement] Added HealthCheck #33
base: main
Conversation
Updated to add Healthcheck to the container
Added Sponsor Message
Removed Extra Blank Line
i do not think that's a good way of doing it... if the status goes to offline, the zerotier client itself will try to recover, so no need to fail a healthcheck and restart the container. i also do not think it's always a best practice, e.g. look at some official docker containers, they do not have a healthcheck defined by default
The reason for this is that currently, if other containers are dependent on this one, they have a problem starting because the IP & ports are not yet allotted. Checking this ensures that this container has started up, so the other containers depending on it can be started. How do you want me to implement this? I can change the implementation. The reason I checked for being "online" is that the docker image is not of much use when it is "offline", since most people use it for remote connection (VPN) and internet is required for that. But I am open to suggestions for other implementations. I do feel building a healthcheck is necessary.
I hope nobody minds if I chime in here with a few observations. I apologise in advance for the length. As a general principle, I support the idea of health-check scripts. I think they add to the overall user experience of Docker containers. The typical scenario where a health-check status is very handy is:
Over on IOTstack where I spend a fair bit of time, I've added health checking to a bunch of containers. In some cases (InfluxDB 1.8 and Grafana) I've done it via a So, I definitely "get" the idea of doing this. I have also experienced push-back when either I or some third party have pointed to the IOTstack implementation in the hope it, or something like it, would be adopted by the upstream maintainer. I also agree with Lukas about many official containers not having health-checking by default. All the containers where I've added the feature in IOTstack qualify as "official" and "mainstream". Bottom line: I think this seems like a good idea from a usability perspective so please see this as a "+1" for your proposal. To be clear, I'm supporting the idea of a health-check script. I haven't done any investigation into what As to the specifics of
implicitly assumes that the
That way, null or anything unexpected (save for embedded quotes) will still evaluate as a string for comparison purposes. I've also added a space between Now I'd like to focus on what it is that you are actually trying to achieve. That's something you haven't really explained. It's possible that, if I understood your actual use-case, I'd immediately see the practical application. In the absence of that knowledge, I can only talk in generalities. In general, health-check scripts are slow. You've specified
So, while I'm convinced that health status is a valuable user diagnostic aid, I'm not at all persuaded that it's going to prove valuable as an automation aid. I hope that makes sense. I also suspect that's a point Lukas was making. More generally, I think you are arguing with one of the key design assumptions that underpins TCP/IP:
That might sound simple and self-evident but it has far-reaching consequences. ZeroTier, whether it's running in a container or otherwise, is a (logical) "device" that sits between a client and a server. Its core job is to forward packets. Irrespective of the reason, any inability to forward a packet means the packet gets dropped. That's something the client and server need to handle. It's an end-station responsibility, not something that should be delegated. Sure, it might be convenient to make the assumption that the nearby ZeroTier container going unhealthy covers a multitude of sins (no DNS resolution, no viable local route to the Internet, ZeroTier Cloud being offline) but it sure doesn't get all possible reasons why a client and server can't talk to each other, such as whatever physical device is running the ZeroTier service at the other end not being available, or device present but container not running, or device present and container running but simply not having a viable path to the Internet because the router at the other end is down. It's a non-delegable client and server responsibility. Before I make the next point, I'd like to thank you. Until I read your PR and went to the Docker documentation to double-check my understanding, I was not aware that the Until today I thought that a Even so, I don't think it's going to be all that robust. I've done some experimentation. I didn't want to either interfere with my existing ZeroTier network or set up a separate ZeroTier network just for experimentation so I used Mosquitto and Node-RED as proxies for ZeroTier and "some other container", respectively. I augmented Node-RED's service definition with:
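(A generic illustration rather than the exact definition used in the test: with Compose's long-form depends_on, Node-RED is only started once Mosquitto reports itself healthy. Service names are assumed.)

```yaml
services:
  nodered:
    # start Node-RED only after its dependency passes its health check
    depends_on:
      mosquitto:
        condition: service_healthy
```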
For its part, Mosquitto has an IOTstack health-check script which uses Tests:
Changes as proposed by @Paraphraser
Actually, the shell behaves differently on different distributions (e.g. some have ash instead of bash); nevertheless, I have updated the code as per your recommendation. Thanks. 🙏 Many thanks for your detailed test. 👍 Now for some of your specific statements/questions:
Agreed. But no harm in having a health-check in the middleware also.
For my use case, YES. To better describe this: I use software which takes its IP from ZeroTier. If ZeroTier is down, the IP is not allotted, the software container fails, and I have to manually restart the software container while keeping the ZT container as is. This happens every time the docker compose stack is run without health checks. Now, once ZT has an IP, then even if the network is dropped I do not need to restart the software container (since your rule of the client and server handling dropped packets applies); hence docker is free to restart the ZT container only when it is unhealthy. So even if the software container is not restarted (like Node-RED in your case), it does not matter for my use-case scenario. (Though I still think Secondly, at times there are certain network hangs (like a router restart) on the backend which sometimes 'hang' the ZeroTier client (or the ZT client takes a long time to recover). Though this happens mostly on consumer-grade hardware (like my home-lab, since server-grade h/w is beyond what I can afford), having a health-check is again useful. Lastly, once you move to container orchestration (e.g. Kubernetes) you would appreciate how important it is to build healthchecks into docker containers. Although another counter-point is that for containers without health checks, one can also be added directly through docker compose.
Well, there will always be a debate on 'slowness' vs 'performance'. If you have faster healthchecks, they will also take CPU cycles, possibly degrading performance. Besides, what is specified in the Dockerfile is only the default values. They can be changed at runtime by the user anytime, either by using
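(For example, the Docker CLI's run-time health flags can override whatever the Dockerfile declares. The values below are illustrative only, and the ZeroTier-specific run options such as capabilities and devices are omitted for brevity.)

```sh
# override the image's default health-check parameters at run time
# /healthcheck.sh is the assumed location of the script added by this PR
docker run -d --name zerotier \
  --health-cmd "/healthcheck.sh" \
  --health-interval 60s \
  --health-timeout 10s \
  --health-retries 3 \
  zyclonite/zerotier
```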
I already explained my use-case where it is also helping me as an automation aid. Regarding the implementation of the healthcheck, please read my earlier post wherein I have clearly stated that I am open to a better implementation of the healthcheck if someone can find and propose the same. 🙏
if I'm reading it right, content of healthcheck.sh could be reduced to sth along the lines of
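(A guess at the shape of such a one-liner: it assumes zerotier-cli's -j JSON output and an online field in the info document, so treat the field name as an assumption rather than a given.)

```sh
#!/bin/sh
# exit 0 (healthy) when the node reports itself online, 1 otherwise; requires jq
zerotier-cli -j info | jq -e '.online == true' > /dev/null
```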
(with an 'apk add jq' prerequisite). Not saying it's better, just different.
this one is from the official container, maybe worth looking into? i did often run into a situation where zerotier-cli status returns online but no network interface was created... this might be worth checking as well
This essentially does the same thing (which is checking the status as "online"), albeit using a different program (installed via apk). I personally feel there is no point in adding another package (jq, which is essentially a JSON parser) if we can do without it. The leaner the container and the fewer programs installed in it, the better (as a general rule), so I think we should only install necessary packages.
That is true. The way
Definitely!
Though one question still bugs me... In your approach, what if the user has specified multiple networks, and some of them are down but not all of them? In that case what do we do? Should the container be 'healthy' or 'unhealthy'? Which approach would you like to go by?
I started to write this reply last night (I'm at UTC+10), intending to finish it this morning, but I see things have moved on a bit.
For some reason, I did some more thinking about your test for ONLINE and it led me here. Unless I'm mis-reading that, the In reading the documentation, it seems that TUNNELED means that the client accepts the default route override. I set up another container to run as a client, joined it to my existing network, then enabled I'm not sure what to make of all that. However, if it were me, I'd assume a status of TUNNELED could be taken at face value and that it would imply the client was still ONLINE. In that case, I might express the test like this:
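(A reconstruction in that spirit rather than the exact test proposed: it assumes the status word is the fifth field of the zerotier-cli status output and follows Docker's 0 = healthy / 1 = unhealthy convention.)

```sh
#!/bin/sh
# accept TUNNELED as well as ONLINE
STATUS="$(zerotier-cli status | awk '{print $5}')"
case "$STATUS" in
  ONLINE|TUNNELED) exit 0 ;;
  *)               exit 1 ;;
esac
```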
BUT please put a pin in all that and study this. First, the normal situation:
In words:
Now let's make a mess by leaving the ZeroTier network:
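(Illustrative commands only; the container name and network ID are placeholders, not values from the test above.)

```sh
# leave the joined network from inside the running container, then re-check the node status
docker exec zerotier zerotier-cli leave <network-id>
docker exec zerotier zerotier-cli status
```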
What's the story?
In words:
I'm now wondering if ONLINE means what we think it does. I think this was the point Lukas was making.
I didn't actually find the behaviour all that surprising. It seemed to me that it was behaving exactly as documented. Another test I've done in the meantime is to force a change to the Mosquitto service definition (adding a nonsense environment variable). The subsequent "up" recreated both containers, starting them in the order Mosquitto, then Node-RED when Mosquitto went healthy. And when I say "recreated" I mean that. The Node-RED container was not simply restarted. Conversely, if I merely restart Mosquitto, the Node-RED container merely restarts too (ie it's not a container re-creation).
Intriguing. I have found ZeroTier to be absolutely rock solid. I never have to touch it, no matter what else happens. I'm running the ZeroTier-router container in close to Topology 4. My local and remote ZeroTier routers (the A and F in the diagram) are on Raspberry Pis. Other devices (Macs and iOS; the B, E and G in the diagram) are running the standard ZeroTier clients (ie not in Docker containers) just as they come from ZeroTier, Inc. The biggest problem I have ever had with ZeroTier is documented here and it was peculiar to the GUI widget controls in macOS where the condition was triggered by upgrading to a new Mac. On the other hand, having visited ZeroTier Central a few times in the last 48 hours and been peppered with "upgrade to paid" popups, I'm starting to wonder whether ZeroTier has launched itself onto the enshittification curve?
I have no experience with this so I'll take your word for it.
If you are able to replicate the example I gave you above (where leaving the ZeroTier network still saw the service reporting ONLINE), you might perhaps try something like this:
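(Purely a guess at the kind of alternative probe that might be meant: checking per-network status rather than the node-level ONLINE flag. The status field name is an assumption and jq would be required.)

```sh
#!/bin/sh
# healthy only if every joined network reports a status of "OK"
zerotier-cli -j listnetworks | jq -e 'all(.[]; .status == "OK")' > /dev/null
```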
This is more-or-less where I got to last night. Please keep reading...
The zerotier-router container supports But how about something like this?
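(A sketch of the idea rather than the actual proposal: count routing-table entries bound to zt* interfaces and compare the count with an environment variable. The variable name ZEROTIER_HEALTHCHECK_ROUTES and its default are hypothetical, and the ip utility is assumed to be present in the image.)

```sh
#!/bin/sh
# sketch only: compare the number of routes via ZeroTier interfaces against an expected minimum
EXPECTED="${ZEROTIER_HEALTHCHECK_ROUTES:-1}"
ACTUAL="$(ip route show | grep -c ' dev zt')"
# exit 0 (healthy) when at least the expected number of routes exist, otherwise 1 (unhealthy)
[ "$ACTUAL" -ge "$EXPECTED" ]
```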
If all you care about is a single network created by the client when it has connectivity with the ZeroTier Cloud, the default of 1 will suffice, the health-check will work, and nobody will need to add that variable to their service definition. In other words, you'll have full backwards compatibility. On the other hand, if you typically propagate managed routes, enable full tunnelling, and so on, what constitutes "normal" will be a larger number of entries in the routing table. Then you can add the variable to your service definition and tune it to the correct value. How does that sound?
If it were me, I'd probably use
Two lines is good for debugging because you can stuff an Like you, I'm not saying better, just different. 😎
Here's a proof-of-concept for you to consider if you think counting routes is a reasonable approach. Setup:
Tests:
However, as before, none of that affected the Node-RED container but if you reckon going "unhealthy" will trigger events in a Kubernetes environment then, great! Two situations where Node-RED is affected are:
But then there are situations where a docker-compose command affects the ZeroTier container without affecting the Node-RED container:
for the record...
I know docker-compose v2.29.2 has been released on GitHub. It hasn't made it into the
Yes. What matters IMO is that it sources structured data instead of scraping stdout, and tries to avoid making assumptions about what flavour/version of /bin/sh you happen to get. IMO that offsets the price of adding jq. Seeing a 'jq .status' line in the script provides more hints than 'awk '{print $5}'' (should zerotier-cli change its output format and someone needs to go in and make the script work again), and the JSON interface exposes more details, so you might be able to build smarter/more accurate checks on top of it (if ever needed). Not major features, but still good practice. Of course both versions achieve the same result, today.
@hoppke
I absolutely agree with this. However, we still might need to change the script in case zerotier plans to change the name/label & value pairs itself. I think we should first decide on the method (i.e. under which conditions the node should be considered 'healthy' and in which cases we need to mark it 'unhealthy').
@Paraphraser
I think it's up to @Paraphraser, @hoppke & @zyclonite to decide what we finally want to do and I'll try to build up the script accordingly. We can also give the user the option to check the client (client status) or the network (network status). How does the above sound? Personally, I see the container itself as the client, not the network, so I had earlier done a healthcheck based on the client, not the network. But I am glad that you guys went into the details and we now have so many good suggestions from @hoppke, @Paraphraser & @zyclonite. Waiting to hear from you guys on what you think should be done.
I have been doing some more research. Please study this:
So, what we have is the client reporting itself OFFLINE, yet the expected routes are present, and the container is reporting itself to be "healthy". How I contrived this was by adding some netfilter rules to block port 9993. The situation is actually a bit weird. I've blocked UDP port 9993 as both a source port and destination port, at both the pre-routing and post-routing hooks, and for both IP and the bridge. I know the filters are working because the associated counters are incrementing. Yet a tcpdump shows the host (a Proxmox VE Debian guest) is still both receiving and transmitting some packets on UDP port 9993. I don't understand how this is happening and, for now, I'm assuming it is some artifact of the filter rules that Docker (and ZeroTier) add to the tables, and those are getting in first. Nevertheless, this has all turned out to be beneficial, precisely because it demonstrates that it is possible for the container to be OFFLINE yet still capable of both adding routes to the routing table and forwarding packets across the ZeroTier Cloud. With that in mind:
Tests:
Which brings me to:
Well, you can't get routes without the associated network interface so I think checking the expected number of routes probably covers the "no interface" condition. But I do agree with the inverse of your proposal because I've just demonstrated "offline" but still forwarding packets, hence the revisions above.
Well, save for being able to nominate a specific network, that's what this revised proposal is doing. Ultimately, you can only build so much functionality into a "health check". That's not because you can't write complicated code into the script - you can. It's because scripts only return a binary result (healthy or unhealthy). Just thinking "out loud", if I had a problem such as you describe where a ZeroTier client joined multiple networks, and I wanted different "reactions" depending on which network went down, I'd move the goal-posts a bit. As well as having the script return healthy or unhealthy, I'd add the Mosquitto clients to the container and publish detailed status. Then, the dependent container (the "Node-RED" in my example) could subscribe to the topic(s) and take appropriate action.
Out of curiosity, would it be possible to configure node-red to check connectivity to certain "landmarks" (hosts/services) on the meshed network and automate directly around that? E.g. I've a box somewhere that monitors the local ISP's reliability not by fetching WAN/LAN statuses from the router/modem (even though it offers snmp), but by periodically pinging the gateway and a known external "evergreen" (like 8.8.8.8). It should be possible for a node-red appliance to pick up "host/network unreachable" events in some generic way without getting vendor locked-in by ZT/wireguard/citrix/...
I think we might be heading away from the subject-matter (a health-check script for ZeroTier) but I'll try to answer. I'm not immediately sure what you mean. If you mean:
Yes. Install one of these:
The former needs an external trigger (eg "inject" node) while the latter can self-time (eg every 60 seconds). In both cases, they return "false" in the payload if the ping fails. You feed the output to a "switch" or "filter" node and then handle the situation as you see fit. If you meant something like:
No. Or, more precisely, even if it was possible, you shouldn't. That's what routing protocols are for. There's no need to reinvent that particular wheel. If you meant something like:
Yes. Indirectly. Try running this on the host:
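(A guess at the sort of command intended here: iproute2's route monitor, which prints a line whenever the kernel routing table changes.)

```sh
ip monitor route
```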
Then, from another terminal window on the same host, do something that causes a routing table change, such as downing and upping the ZeroTier container. You'll get a bunch of status messages. You can forward those status changes as MQTT payloads like this:
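(Again a sketch rather than the original snippet: each routing-table change becomes one MQTT message. The broker host and topic are placeholders; mosquitto_pub's -l option publishes one message per line read from stdin.)

```sh
ip monitor route | mosquitto_pub -h mqtt.example.com -t zerotier/route-changes -l
```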
In theory the You just write a script around that which you fire up at boot time with an Does that help? If you want to go further, please join the IOTstack Discord and ask me there.
i like the discussion a lot, it highlights all the different aspects. just some thoughts on the main topic of having a healthcheck in the first place... what are we really trying to solve?
I meant to take the proverbial step back. ZT can tell you that it thinks it's "online". Is that sufficient to say things are "good"? If there's more involved (e.g. firewall rules, DNS resolution, auth etc.), then maybe ZT can never deliver all the info needed to support automated decisions, and a different source would be better. I can check if ZT thinks it's up, but I have ZT in place for a reason - to set up connections across the tunnel. So maybe a simple ping test across the tunnel can tell more than ZT "status" ever could?
@zyclonite Well, my view remains unchanged. A health-check is a nice-to-have and, ideally, all containers should report their health. But I've never seen it as more than a human-readable value which turns up in The expanded Even the short-syntax form isn't all that useful. It has always seemed to me that, if you say "Zigbee2MQTT depends on Mosquitto", then taking Mosquitto down should have the side-effect of also taking Zigbee2MQTT down first. That takedown order does happen if you down the whole stack but not if you just down a dependency. The notion of a dependent container taking some programmatic action in response to whatever health state its dependency happens to be in seems to me to be just plain weird. I also reckon my testing in the earlier parts of this PR shows the (docker-compose) implementation is a bit half-baked, at least in its current form. Basically, in order to be useful, a dependent container needs to know a lot more about the internal state of its dependency than a binary health indicator will ever be able to communicate. I also agree with "zerotier might recover on its own". Until I started folding, spindling and otherwise mutilating ZeroTier's environment while testing for the purposes of this PR, I've never yet seen it go wrong. Plus, as soon as I unfolded, de-spindled or healed whatever mutilation I had put in place, ZeroTier always seemed to recover all on its own. No restarts needed. So +1 to the idea of reporting the health status but -1 to relying on that to automate anything. ex1: concur but "unhealthy" will probably send you to the log where (hopefully) you'll find a clue. ex2: definitely concur. ex3: I see this the same as ex1. ex4: abso-fraggin-lutely!
You could make that argument about any container with a health-check script. That someone might misuse a feature is no argument against providing the feature, so long as it has at least one beneficial use.
Not sure what you're getting at. If you mean an external script could do all this (eg fire off an MQTT message if status goes OFFLINE, or if expected routes go too low), I agree. An external script also has the advantage of sensing "container not running". That last test is something I already do on the remote Pi in my network, including keeping track of re-tries and replacing
I'd see that as the price of greater robustness but if you'd prefer the And speaking of which (extra packages) adding
and initialise
so all containers derive from that. Heckofalot better than hacks like mapping. @hoppke: to be honest, I rarely bother with middlemen. I always want to talk to the dude in charge. In terms of this PR and your question, I see a health-check status as a middleman. Or merely hearsay evidence if you want to put it in lawyerspeak. Suppose you have a ZT network of some arbitrarily-complex design involving multiple sites, multiple ZT network IDs, multiple ZT clients, and so on. Ultimately, the game is "can A talk to B?" but you can usually find proxies where A being able to talk to C implies it must also be able to talk to B, so you can reduce your monitoring complexity a fair bit. In general I find that something as simple as a You'll note the pattern here: sure, automate problem discovery but leave resolution to the human. There's nothing worse than chasing your tail because some automated process has a bee in its bonnet and is fighting your every move. Although pings have their place, they are pretty low-level in terms of the IP stack. I have often found that devices will respond to pings even though the higher levels are frozen. A cronjob triggering an MQTT message and that message arriving tells you a fair bit more, with a much lower false-positive rate. But that's just my 2¢. Your network, your rules.
Agreed that restart cannot solve the problem.
Agreed
Agreed
Agreed once more. My 2 cents on this issue: principally, the intention of a health-check is never to solve the problem automatically. The basic intent is to detect the problem. (Maybe that's why it's called health-check instead of health-resolve. ;) I am just kidding, please don't be offended; I just had to write the line coz I found it very funny, without any ill-intention towards you.) It is not necessary that you restart the container using the health-check itself; you may want to just mark the status for other scripts to take over and solve the problem, or to notify the user. It actually depends on different use-case scenarios, so there is no one answer to it, e.g. you may want to call a web-hook on the container going unhealthy, etc. Tell you what: I'll re-do this PR next week and include advanced options for the health-check. The health-check shall be disabled by default in the Dockerfile so everything works normally as before. I'll just include my health-check script in the image (it should not be more than a few kB). Those who want to run health-checks can do so using runtime variables or the compose file. For the rest, it will be the same as before. How does that sound?
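(For illustration: an opt-in health check of that kind would typically be switched on from the compose file. The script path and image name below are assumptions.)

```yaml
services:
  zerotier:
    image: zyclonite/zerotier
    healthcheck:
      test: ["CMD", "/healthcheck.sh"]   # assumed location of the script shipped in the image
      interval: 60s
      timeout: 10s
      retries: 3
      start_period: 30s
```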
I have marked it as draft for now.
@hoppke Pinging to check is a brilliant idea! But I feel that is what the
Healthcheck has become a standard feature for deployments (especially in Kubernetes).
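(For context, a generic illustration rather than anything from this PR: in Kubernetes the same kind of script is usually wired up as an exec liveness probe; the script path is an assumption.)

```yaml
livenessProbe:
  exec:
    command: ["/healthcheck.sh"]   # assumed script path inside the container image
  initialDelaySeconds: 30
  periodSeconds: 60
  failureThreshold: 3
```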