Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

$online - unexpected connection lost #24

Closed
timpur opened this issue Jun 14, 2017 · 38 comments
Closed

$online - unexpected connection lost #24

timpur opened this issue Jun 14, 2017 · 38 comments

Comments

@timpur
Copy link
Contributor

timpur commented Jun 14, 2017

Wondering if its better to use the last will to post information about unexpected connection lost

Currently disconnecting correctly and unexpected are the same = false at $online

Wondering if its better to use a number, 1 = online, 0 = offline and -1 = unexpected connection lost ??

Tell me what you think.
Tim.

@marvinroger
Copy link
Member

Good point. What do you guys @Kwave @ThomDietrich @euphi think about this?

I'd prefer an explicit text instead of code numbers. $online would not be a binary state anymore, so maybe we should change the attribute to $state or $status (don't know what's the most correct in english). The values could therefore be online, offline (clean) or lost (unexpected disconnection).

@euphi
Copy link
Member

euphi commented Aug 1, 2017

$state or $status is good (I'm unsure which one is better, both are ok).

For me, offline is too generic for a clean disconnect, so maybe disconnected is better.
There may also be a state sleep for deepsleep. (However, this is a little bit problematic, because the state will stay in sleep if the device cannot reconnect for other reasons.

Also, you could send a state after MQTT connection:

  • reconnect if only MQTT connection and/or WiFi was lost
  • reset if device was resetted
  • poweron if there was a power failure
  • wakeup after deep sleep (On ESP8266 this seems to be possible from API using the struct resetInfo but it seems this is somehow buggy?)

Just after this, Homie shall send the online state. This allows to see a reset/reboot reason by subscribing to $state.

@timpur
Copy link
Contributor Author

timpur commented Aug 2, 2017

I like these ideas. Maybe have both?
State is simply on or off ?
Status is more complex, Connected , disconnected, lost, sleep, etc.
??
This way we have best of both worlds, do not need to change the standard, but also adding more information about the devices current status?

In dress to @euphi

reconnect if only MQTT connection and/or WiFi was lost
reset if device was resetted
poweron if there was a power failure
wakeup after deep sleep (On ESP8266 this seems to be possible from API using the struct resetInfo but it seems this is somehow buggy?)

Not sure about these. These seem more like events? Maybe we could add events topic? but this is difficult because some of these events occur when the device is disconnected? Reason to split these up is, State and Status should always be set to the current state or status, so that when ever you subscribe to these topics you receive the current status or state of the device no mater when you subscribe to these topics. An Events topic would describe the the events that occur, but not necessarily the event its currently in (maybe make this not retained pub?). If this make sense.

Anyways thats my thoughts. Tell me what you think.

@euphi
Copy link
Member

euphi commented Aug 2, 2017

A simple $state and a more detailed $status are not possible, because the "unexpected loss" logic is based on the "last will". MQTT allows for only one last will.

@euphi
Copy link
Member

euphi commented Aug 2, 2017

The "reconnect" events are useful to see the reason of a lost connection.

The advantage of having a own state for this, is that it you can see also short connection losses (that does not trigger the last will) by subscribing to $online.

When you think about integration with home automaton systems, it is really useful to have one property to see reboots of a connected device.

@timpur
Copy link
Contributor Author

timpur commented Aug 2, 2017

Valid point about last will. It will have to be on one topic.
I agree with your ideas. Just not sure if there should be combined? I think state and events should be kept separate. Reason for this is when an event occurs and publish the message, then do you also publish over with the current status. (publish reconnecting then connected ?). Also we could also use this events topic for custom publishing of events, and thus do we worry about publishing over these with the current status?

That's why i think they should be separate.

Thus Status and Events is my conclusion but its not up to me its up to the community.

@gorootde
Copy link
Contributor

gorootde commented Aug 2, 2017

If the goal is only to differentiate between "clean" disconnects, and losses due to broken connections: I would use marvin's versions: online, offline, disconnected published on the existing $online topic.
The additional state info suggested by euphi (reconnect,reset,poweron, wakeup) are very device specific. As the Homie convention is device independent, I think these should not be a part of it. (e.g. What if my device has no Wifi and has no possibility to detect why it was reset?)

@euphi
Copy link
Member

euphi commented Aug 2, 2017

I think it is very important that if a device reconnects it is made sure that it sends on the $online topic a value other than online before sending online again.

@ThomDietrich
Copy link
Collaborator

ThomDietrich commented Aug 2, 2017

Just to say something: I'm undecided 😄
The task of the current $online topic with it's special property to be the LWT topic is to inform an interacting device/application if the device with it's furthermore published topics is available and in an operating and "interactable" state.
If everyone agrees with this definition, the $online topic should transport a clear "yes, interactable" or "no, not interactable" message.

@timpur @euphi could you please specify in which constellation a communication partner would need a more specific info than "Device not available for interaction"?

I'm not trying to silence your ambitions. The idea of an extended LWT-linked topic or the addition of an additional one might be meaningful but we need to find a use case for that!

@euphi
Copy link
Member

euphi commented Aug 3, 2017

I have a rule in openhab that sends me a telegram message if a device is offline.
This rule is only active for the deployed devices, so I also want to know if they restart, because this is a sure sign for a problem.

Currently the rule is based on detecting a negative step in uptime, which is somehow unreliable for restart of openhab or in case of overflow and does not work for a short loss of Wifi connection.

Especially a short loss of Wifi connection was not detected at all. This was quite problematic, because it happened to my irrigation controller and it seems that filling the water tanks to 100% and leaving a ladder (metal) close to the controller just before leaving for holidays worsened the Wifi connection, so sometimes it was lost for more than one day...
I now learned that signal quality of 8% seems to be the minimum for connection, so I set up an alert rule in grafana (influxdb) that notifies me if a device as connection quality of < 12%.

(BTW: I have 5 devices running permanently, two of them for more than 1 year, the others for several months, with different versions of Homie-2.0.0 and there was no unexpected reset yet. One is running for more than a year, the other had some resets due to improvement of installation (wiring etc.), in one case the circuit breaker of the 230V supply tripped. In this case the alert rule was also helpful to detect this problem that also affected the fridge, dish washer and oven...)

@marvinroger
Copy link
Member

That's a very valid use case. If we rename $online to $state, we keep a offline LWT message, and we send boot when the device is "booting" (and that's actually a state, not even an event), ready when the device is ready, and disconnected when it is cleanly disconnected. It would solve this, right?

@timpur
Copy link
Contributor Author

timpur commented Aug 7, 2017

Got my approval. Just not sure about the state names.

Boot (first thing it does when connected is send boot). Finishes booting and sends ready. Then disconnected when clean and lost when not clean disconnect?

Also don't think you have to get rid of $online could have both? Up to you.

@euphi
Copy link
Member

euphi commented Aug 7, 2017

Most important now is to come to a decision to use only one topic with fine granulated state or simple states and an extra topic for detailed information.

States should be generic, but we can use Homie-ESP8266 as example what is technically possible.

Thus said, states would be:

  • online - connected to MQTT (and system running OK)
  • connect - MQTT connected, but not all mandatory topics sent yet
    • This could be substated into:
    • boot: Restart after Reboot
    • reconnect: MQTT connection was lost and is available again
  • (off_)disconnected - clean disconnect
    • This could be substated into:
    • off_update: Reboot for SW update (details like OTA, Web-Interface, Flash via Serial are implementation-dependent and shall not be send on generic topic). This may include a reboot after configuration was updated only.
    • off_sleep: (Deep) Sleep state (details are device/implementation specific)
    • off_power: Device powered down (Note: This includes the case that just the Software running homie was shut down, e.g. on Raspberry PI).
  • offline or off_lost- unexpected disconnect (based on MQTT LWT)
  • on_failure - Connected to MQTT (so similar to "online"), but major internal error, e.g. important sensor not available.
    To discuss: Is this useful?

Personally I prefer to have all this states above on one $state topic. "connect" is not necessary then.
"disconnected" may still be useful if there is another reason for disconnect than "update", "sleep" or "poweroff".

BTW: It is intended that the fully connected states start with "on" and the fully disconnected states start with "off". This helps to have simple pattern matching rules.

@timpur
Copy link
Contributor Author

timpur commented Aug 9, 2017

all i wanted was to know when the device was 'lost' :P, but having more clarity certainly helps and i think its a good idea. That said im not sure what is the right choice in all the options.

I think both @euphi and @marvinroger have the right ideas, so trust you guys to work this out (the fine details) :P.

@bertmelis
Copy link

I wouldn't over-engineer this and take 2 seperate topics. One mandatory LWT-topic with only 3 possibilities and one implementation-specific with as many possibilities as needed. So it is very clear if the device is available or not. If you want to know more: check out the (for example) $implementation/event topic.

As you can only have one testament, you cannot distinguish between powerloss and crash for example. So you'll lose information anyhow.

PS1 How would you send "booting"? When the device is booting I would think there's no MQTT-connection yet.

PS2 Keep in mind that MQTT allows for delivering messages ordered but it is not by default guaranteed. Although Homie will probably work with the different states that change quickly, it is in theory undefined.

@ThomDietrich
Copy link
Collaborator

ThomDietrich commented Aug 9, 2017

@euphi I'm with @bertmelis and @marvinroger on this one. I'm not sure if we need all these disconnect substates. Also the "reconnect" state is not a state but an event. But without the intention to disregard all your states, could you name those you are interested in specifically and describe why the convention needs to unify them across clients and controllers?

@marvinroger @timpur trick question: If we define the LWT message "offline" and additionally publish a "disconnect" message in certain cases, is it okay to still receive the LWT message? I'd need to check but looking at the implementation that might become tricky. At least the python paho-mqtt package as I know it doesn't offer the option to reset or delete the LWT message while connected.

Edit: It seems to be possible, still the method could get tricky to implement and we should consider that while specifying $state. @marvinroger I trust you have more low-level experience with that.

@marvinroger
Copy link
Member

@bertmelis I meant "booting the Homie announce" more than an hardware boot.

@ThomDietrich discarding the LWT is indeed only possible by cleanly disconnecting, and for MQTT, this means sending a DISCONNECT packet, waiting until TCP ACK, and then closing the connection. This is not tricky; all spec-compliant MQTT clients should support this. Even the low-power ESP8266 supports this with the async-mqtt-client client.

@euphi
Copy link
Member

euphi commented Aug 9, 2017

Minimum states that are necessary (Textual description):

  • clean disconnect
  • unexpected disconnect (LWT)
  • connected and device attributes and property announcement sent
  • connected and device attributes and property announcement not yet complete

The last one ist necessary to make auto-discovery reliable (e.g for openhab).
Without this state, the controller cannot ensure that it has received all attributes/properties yet.

This could be especially problematic with bad network connection.

@bertmelis
Copy link

To make sure all your requirements are met, you also have to be sure the broker sends all the messages in order. With QOS1, this is only the case when the maximum number of in-flight messages is only 1, which is not by default on mosquitto.

@euphi
Copy link
Member

euphi commented Aug 9, 2017

Yes. However, the homie-implementation can wait for the messages to be acknowledged, before sending the next message. This can be achieved with a simple state machine.

The attributes and property announcement can be send at any order, so there would be only 3 phases necessary: "fresh connect" [send $state -> "connect"], "announcement" [send all device attributes and property announcement in arbitrary order], and "connected" [send $state -> "online"].

@marvinroger marvinroger added this to the v2.1.0 milestone Aug 11, 2017
@marvinroger
Copy link
Member

So, do we agree with that:

  • clean disconnect: disconnected
  • unexpected disconnect (LWT): offline
  • connected and device attributes and property announcement not yet complete: init (boot might indeed be confusing)
  • connected and device attributes and property announcement sent: ready

@euphi I think you forget one thing: what if the controller connects after the device announces itself? The ready $state might indeed be received by the controller before the other messages. So there are two solutions:

  • Either require every required attributes before considering a device is discovered. Therefore, the controller can determine whether all messages are received. But this is problematic for optional attributes, we can't know if its set or not. A solution to this would be to require a $stats attribute set to what stats are actually defined ($stats -> cputemp,cpuload)

  • As @ThomDietrich said, you can consider some safe defaults and as you receive the new attributes, you can improve your knowledge about the device

@euphi
Copy link
Member

euphi commented Aug 11, 2017

The value of the device statistics are not part of the initial announcement yet. So we need a mandatory $stats topic anyhow to auto-discover the $stats attributes. :-)

Then, only $implementation and the property attribute $unit is optional.
For $implementation its up to the implementation to make it auto-discoverable.
$unit then should be made mandatory with an empty string as allowed value.

@marvinroger
Copy link
Member

I was thinking exactly the same thing about $unit. 😉

Let's do that!

@bertmelis
Copy link

bertmelis commented Aug 11, 2017

I don't know if an empty string is going to work as everything is sent as retained. An empty payload means the topic will be deleted by the broker. Maybe you should use none.
Or am I missing something?

Edit: my comment is not valid, since we're sending with QOS1 hence an empty message will be retained.

@marvinroger
Copy link
Member

@bertmelis no, you're right... I thought empty strings were 1 byte long (null terminator), but I was wrong, the null terminator is not included in the payload. Therefore, the message would not be retained. Let's go for none.

@euphi
Copy link
Member

euphi commented Aug 11, 2017

none, NULL, null, void, or NIL? :-)

@marvinroger
Copy link
Member

Done: 0003daa

We only need to agree with #24 (comment) and we should be good 😇

@euphi
Copy link
Member

euphi commented Aug 11, 2017

I could live with these 4 states only and their propose names (ready is better than online, init better than boot). Maybe offline could be renamed to lost?

However, I still think an extra state for deep sleep would be useful.
The difference between disconnected and sleeping is, that disconnected is a planned shutdown (e.g. closing the Homie-Program), while sleeping is a state where the device is supposed to reconnect frequently.

Regarding an alert state: I'm used from my "domain" at work that device have some self-diagnosting abilities. The simplest variant to do this, is to tell the controller in a standardized way, that there is some major problem. Having this in the convention is much easier than to have an implementation specific solution.

@marvinroger
Copy link
Member

Then:

  • init when connected
  • ready when fully announced
  • disconnected when cleanly disconnected
  • sleeping when cleanly disconnected for deep sleep
  • lost when "badly" disconnected

How would you implement your alert state?

@euphi
Copy link
Member

euphi commented Aug 12, 2017

alert would have the meaning "homie is running fine and is connected, but the device is not able to perform its main task", e.g. Sensor is providing no data.
So setting the alert-state is up to the "Application" that makes use of Homie.

The actual homie-implementation (e.g. homie-esp8266) should offer an API call to set the alert-state.
Everything else, e.g. provide diagnosstic messages is implementation specific and out of scope of the homie-convention.

I'm aware that this is an extra feature, but I think a standard for this is quite useful for integration. It helps monitoring your devices.

For example you can easily shown them as "yellow" an a Homie "Dashboard": green OK, red lost, blue sleeping, white disconnected, orange: alert.

@marvinroger
Copy link
Member

Alright:

  • init when connected
  • ready when fully announced
  • disconnected when cleanly disconnected
  • sleeping when cleanly disconnected for deep sleep
  • lost when "badly" disconnected
  • alert when something's wrong

Is everyone ok?

@timpur
Copy link
Contributor Author

timpur commented Aug 13, 2017

I like :)

@ThomDietrich
Copy link
Collaborator

ThomDietrich commented Aug 15, 2017

Hey, I'm back from the weekend and like what I'm reading 👍

I also agree that the "alert" and "sleeping" states are meaningful and can be useful for certain cases.

Regarding the discovery process: I'd still strongly vote to follow the idea of

you can consider some safe defaults and as you receive the new attributes, you can improve your knowledge about the device

Despite a few fundamental topics, all other topics should be optional and hence provided with a default in the convention. A discovering controller (like the openHAB binding) has to make sure to catch all retained messages before further processing the devices/nodes/properties by applying a simple delay/timeout mechanism. That should be easy and other services work similarly. @Kwave what do you think?

Back to topic: "I like :)"

@timpur
Copy link
Contributor Author

timpur commented Aug 27, 2017

@marvinroger
Plans moving forward ?
Have we concluded on the new structure?
Implemented in beta 2 ?

@euphi
Copy link
Member

euphi commented Aug 28, 2017

A reference implementation is part of the implementation, so I created an issue at homieiot/homie-esp8266#390.

@marvinroger
Copy link
Member

marvinroger commented Sep 21, 2017

Indeed, let's move forward.

Can we guys read the whole thing at https://github.com/marvinroger/homie/tree/redesign and write your final comments/thoughts here?

@timpur
Copy link
Contributor Author

timpur commented Sep 24, 2017

New standard looks amazing and solves so many of my issues that i solves with just having my own standard like also adding a "units" attribute.

(TODO Clarify device behaviour in extra section for $online (or $state) topic.)

Not sure how to comment on this though ?

@marvinroger
Copy link
Member

Closed by #50

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants