EMS-ESP 3.6.0 crashes after hours/days - no response from ems-esp #1264

Closed
yazend opened this issue Aug 16, 2023 · 205 comments
Labels
bug Something isn't working
Milestone

Comments

@yazend

yazend commented Aug 16, 2023

I updated to dev.3.7.0 as a test.
After 1-3 hours the ems-esp is no longer reachable: no response, no PuTTY, no ping.
The problem is reproducible; if I switch back to dev 3.6.17, this behavior is gone. Is this problem known?

BR Alex

@proddy
Contributor

proddy commented Aug 16, 2023

Strange, haven't seen this behaviour or any other reports. Did you build yourself or take the .bin from the GH release page? Also any other info on your setup? If it's memory you can try capturing the data and seeing what happens over time.
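
One way to read such captured data is to fit a straight line through the free-memory samples and look at the slope. The sketch below is only an illustration: the `HeapSample` type, the function name, and the sample values in the usage note are made up for this example and are not part of EMS-ESP.

```cpp
#include <vector>

// One reading of free memory, taken at a known time after boot.
struct HeapSample {
    double hours;    // time since boot, in hours
    double free_kb;  // free memory at that time, in KB
};

// Least-squares slope of free memory over time, in KB per hour.
// Returns 0 if there are fewer than two samples or no spread in time.
double heap_trend_kb_per_hour(const std::vector<HeapSample>& samples) {
    if (samples.size() < 2) {
        return 0.0;
    }
    const double n = static_cast<double>(samples.size());
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    for (const auto& s : samples) {
        sx += s.hours;
        sy += s.free_kb;
        sxx += s.hours * s.hours;
        sxy += s.hours * s.free_kb;
    }
    const double denom = n * sxx - sx * sx;
    if (denom == 0.0) {
        return 0.0;  // all samples taken at the same time
    }
    return (n * sxy - sx * sy) / denom;
}
```

A clearly negative slope over many hours (for example, readings of 105, 100 and 95 KB one hour apart) would hint at memory being consumed steadily, while a slope near zero would point away from a leak.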

@yazend
Author

yazend commented Aug 17, 2023

Proddy,
I'm using the ems-esp with WiFi and LAN. I downloaded the firmware via the UI, so I assume the .bin is the same as on GitHub.
I just reinstalled 3.0.7.dev0 and set the syslog to all. As soon as the device stops responding again, I will send the logs.

See the difference in sysinfo "free RAM":

Sysninfo_V3.0.6dev17.JPG --> free RAM 105KB/61KB

Sysninfo_V3.0.7dev0.JPG --> free RAM 107KB/59KB

@yazend
Author

yazend commented Aug 17, 2023

The ems-esp has now frozen again. If you need further information, let me know.

no response anymore...
grafik

log.txt

@giovanne123

For information:
I also had a freeze today, 17.08.2023 (~13:45); the ems-esp was not responding anymore. I noticed it a few minutes ago.
I had to cut the power to get it running again.

I'm on 3.6.0 (release .bin from GH, installed on 15.08.2023 at 19:40), so it ran for ~2 days until the freeze.

3.5.x: I never had such a freeze in recent years with previous versions like 3.5.x
3.6.x: Started with 3.6.x/dev some days ago:

  • 3.6.0-dev_14 installed on 08.08.23 without issue until
  • 3.6.0-dev_17 installed on 10.08.23 without issue until
  • 3.6.0 installed on 15.08.23 --> freeze today

3.7.x: Haven't installed 3.7.x/dev so far

Will observe if it will occur again...

image

@proddy
Contributor

proddy commented Aug 17, 2023

Thanks for reporting. I'll compare the old 3.6.0-dev17 with the latest 3.6.0 to see what is different.

@proddy
Contributor

proddy commented Aug 18, 2023

Update: I can't see any noticeable changes in the code between 3.6.0_dev.17 and either 3.7.0-dev-0 or the main 3.6.0.

Are there any specific actions you are performing when it 'freezes'? For example, is the Web UI open and logging everything to the System Log window? Or does it crash when there is no Web UI or telnet session present?

@giovanne123

On my side there was no activity in the WebUI or telnet.
I only look at it occasionally via HA (HA via MQTT from ems-esp). Nothing special was being done when the freeze occurred.
It has been running since yesterday:
image

Will observe how it behaves the next days and if there will be a freeze again...

@vmonkey
Contributor

vmonkey commented Aug 19, 2023

Hi, I have the very same issue with 3.6.0.

@proddy
Contributor

proddy commented Aug 19, 2023

Damn. Is there anything particular about your setups? For example, are you using the API a lot (e.g. via iobroker), exporting data from the web, keeping the Web UI open to watch the web logs, or logging with syslog? I need to track down why it's not working for you guys.

In the meantime can you try out these two builds:

EMS-ESP-3_6_0-dev_17-ESP32_4M.zip

EMS-ESP-3_6_0-dev_18-ESP32_4M.zip

@proddy proddy added this to the v3.6.0 milestone Aug 19, 2023
@proddy proddy added the bug Something isn't working label Aug 19, 2023
@proddy proddy changed the title from "no response from ems-esp" to "EMS-ESP 3.6.0 crashes after hours/days - no response from ems-esp" Aug 19, 2023
@vmonkey
Contributor

vmonkey commented Aug 19, 2023

@proddy Actually, nothing really special. I am just using it in Home Assistant via MQTT.

I would be really happy to help, but I will not have physical access to the device for a few days. If the issue has not been found by then, I will get back to this on Thursday.

Thanks a lot!

@giovanne123

giovanne123 commented Aug 19, 2023

> In the meantime can you try out these two builds:
>
> EMS-ESP-3_6_0-dev_17-ESP32_4M.zip
>
> EMS-ESP-3_6_0-dev_18-ESP32_4M.zip

Since the last restart, 3.6.0 has been running without a freeze for 2 days now...
I will wait until tomorrow before flashing one of your linked builds, to see whether 3.6.0 stays free of freezes until then...
What has changed in the linked builds? (There was already a dev17 before.)

Edit 21.08.2023:
This morning at 8:30, after 3 days, 3.6.0 crashed again.

Edit 23.08.2023 ~11:30:
3.6.0 froze again after 2 days.
Now, at 12:45, installed dev18...

Edit 25.08.2023 ~6:50:
dev18 froze after ~2 days; restarted the device, again with dev18...

Edit 29.08.2023 15:30:
Installed 3.7.0-dev.1 because it was pointed out below; will check... (dev18 has not frozen since the last comment)

Edit 29.08.2023 18:45:
Reverted to dev18 because I have some other trouble with 3.7.0-dev.1 that is not related to the issue being analysed here (thermostat data and maybe other data are not published/sent to MQTT, e.g. the damped outdoor temperature was missing in HA; no time to investigate at the moment, so I am going back to dev18 for a running system).

@proddy
Contributor

proddy commented Aug 20, 2023

There should be no difference between the dev-17, dev-18 and main 3.6.0 other than some updated web libraries, but that shouldn't cause a crash. Which is strange.

@yazend
Author

yazend commented Aug 20, 2023

> There should be no difference between the dev-17, dev-18 and main 3.6.0 other than some updated web libraries, but that shouldn't cause a crash. Which is strange.

I updated a second device across the street: dev-17 works, while dev-18 as well as 3.7.0-dev-0 freeze the system over time.
These are two different devices: mine uses WiFi and LAN, the other one at my parents' uses WiFi only.

@proddy
Contributor

proddy commented Aug 20, 2023

> > There should be no difference between the dev-17, dev-18 and main 3.6.0 other than some updated web libraries, but that shouldn't cause a crash. Which is strange.
>
> I updated a second device across the street: dev-17 works, while dev-18 as well as 3.7.0-dev-0 freeze the system over time. These are two different devices: mine uses WiFi and LAN, the other one at my parents' uses WiFi only.

Ok! Thanks for testing this, it's super helpful. So something sneaked in between dev-17 and dev-18. I'll compare the source for those two again and try to figure out what is causing this.

@kebabuschi

Same issue on my device, too. ESP32, wifi. Updated from 3.5.x latest official to 3.6.0 official. I have 8 digital thermometers attached via one wire. I only use the Web GUI, for OTA updates or customizations. Apart from that I use Home Assistant via MQTT.

@JokerGermany

JokerGermany commented Aug 20, 2023

Same issue after a little more than 25 hours. It was running fine with EMS-ESP-3_6_0-dev_12-ESP32.bin. I will try the new dev and, if that isn't working, go back to 3_6_0 dev.
(It occasionally crashed with the dev version too, but it recovered after some time.)

ESP32 with LAN, but deactivated (to save memory); using WiFi instead, and HA
grafik

€dit: This time it recovered after being unavailable for less than 1 hour, and then crashed again...
€dit2: Can't get it back to life at the moment... It only shows a white screen when I try to reach the web interface. But it looks like some sort of connection is still working, because Firefox keeps trying to connect.

@proddy
Contributor

proddy commented Aug 20, 2023

Sorry @JokerGermany for screwing up your system. There is something fishy with the 3.6.0 main build but I'm not sure what it is. Could you try one of the dev17 or dev18 builds from the earlier posts?

@JokerGermany

JokerGermany commented Aug 20, 2023

> Sorry @JokerGermany for screwing up your system. There is something fishy with the 3.6.0 main build but I'm not sure what it is. Could you try one of the dev17 or dev18 builds from the earlier posts?

Will do when I get it back to life. At the moment the web interface isn't reachable at all.
Will try cutting the power.

€dit:
Okay, downgraded to EMS-ESP-3_6_0-dev_18-ESP32_4M.zip

€dit2:
Looks like the same "white screen interface" with dev 18, but I don't know whether 3.6.0 saved something that makes dev 18 crash, too.
Will now try EMS-ESP-3_6_0-dev_17-ESP32_4M.zip

€dit3:
After a power off/on, dev 18 is working now; I will test it before trying dev 17.
MQTT looks better now. After the downgrade it was connected, but the queue showed 90 and was orange...

@yazend
Author

yazend commented Aug 21, 2023

Proddy,
I agree with you, this is very strange behavior.
I changed the MQTT settings, and since then it has been running without problems, whether 3.0.6-dev-17, dev-18 or 3.0.7-dev-0.
All of them run without freezing since I changed the settings.

MQTT settings with freeze:
User: mqtt
Password: mqtt
I have now changed them according to the Home Assistant MQTT information.

MQTT settings without freeze:
User: homeassistant
Password: hvxrdatj6t3r8oggg3689jhrcgs268kjv (random PW)
As a side effect, the Home Assistant information disappeared and the MQTT sensor name changed to none.

I did the same on my parents' device; both have been running since I changed the MQTT settings.
I don't know if this helps to find the difference in your code, but someone can use it as a workaround.

@hlavki

hlavki commented Aug 21, 2023

I have the same issue using HomeAssistant via MQTT on Wifi and Buderus Logamax plus GB192i

@JokerGermany

JokerGermany commented Aug 22, 2023

Feedback:
dev18 has been working for me for 1 day and 18 hours.
Had an outage yesterday for 20 minutes, but I had this with the other dev version before, too.
grafik
Such outages are acceptable for me as long as they are as short as this one and ems-esp recovers on its own.

@vmonkey
Contributor

vmonkey commented Aug 24, 2023

@proddy Before leaving for vacation, I reverted to dev18. It has been stable since then (5 days), so I believe the issue has been introduced afterwards.

@giovanne123

@proddy, for me dev18 also froze after ~2 days (#1264 (comment))

@yazend
Author

yazend commented Aug 25, 2023

Proddy,
just for your information...
since I changed the mqtt settings,
3.7.0dev0 has been running for 7 days and 17 hours

Screenshot_20230825_070740_Samsung Internet

@JokerGermany

> Feedback: dev18 has been working for me for 1 day and 18 hours. Had an outage yesterday for 20 minutes, but I had this with the other dev version before, too. Such outages are acceptable for me as long as they are as short as this one and ems-esp recovers on its own.

It crashed 6 hours ago; I restarted it to see if it was a fluke.

> Proddy, just for your information... since I changed the mqtt settings

What exactly did you change?

@yazend
Author

yazend commented Aug 26, 2023

I changed the MQTT username and password; see six posts above.

@MichaelDvP
Contributor

I think the freezes are memory related. If MQTT disconnects, the queue fills up and consumes memory until a minimum is reached.

  1. This
    #define EMC_MIN_FREE_MEMORY 16384
    is too small; I think in v3.5 we had ~40k as the minimum free memory before queued messages were deleted.
  2. We queue messages while MQTT is disconnected
    #define EMC_ALLOW_NOT_CONNECTED_PUBLISH 1
    In v3.5, publishing/queueing was stopped on disconnect.

Maybe it is better to change both values.
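
To make the effect of the two settings concrete, here is a minimal sketch of a publish queue governed by a memory floor and a "queue while disconnected" switch. It is not the MQTT library's implementation: `QueuePolicy`, `PublishQueue`, and the simplified memory accounting (dropping a message frees exactly its payload size) are assumptions made for illustration, and the free-heap value is passed in by the caller rather than read from the device.

```cpp
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical policy knobs, loosely named after the two macros quoted above.
struct QueuePolicy {
    std::size_t min_free_memory;       // memory floor in bytes
    bool allow_not_connected_publish;  // queue messages while disconnected?
};

class PublishQueue {
  public:
    explicit PublishQueue(QueuePolicy policy) : policy_(policy) {}

    // Tries to queue a message; returns false if it was rejected.
    // free_heap is the currently free memory in bytes (simulated input).
    bool enqueue(const std::string& payload, bool connected, std::size_t free_heap) {
        if (!connected && !policy_.allow_not_connected_publish) {
            return false;  // do not buffer anything while offline
        }
        // Below the floor: drop the oldest messages to win memory back.
        while (free_heap < policy_.min_free_memory && !queue_.empty()) {
            free_heap += queue_.front().size();  // simplified accounting
            queue_.pop_front();
            ++dropped_;
        }
        if (free_heap < policy_.min_free_memory) {
            return false;  // still too little memory, reject the new message
        }
        queue_.push_back(payload);
        return true;
    }

    std::size_t size() const { return queue_.size(); }
    std::size_t dropped() const { return dropped_; }

  private:
    QueuePolicy policy_;
    std::deque<std::string> queue_;
    std::size_t dropped_ = 0;
};
```

With a floor of 16384 bytes and queueing allowed while disconnected, the queue only starts shedding messages when very little memory is left; a higher floor, or refusing to queue while offline, leaves more headroom for the rest of the firmware.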

@proddy
Contributor

proddy commented Oct 1, 2023

releasing 3.6.2 - please report back if these freeze type scenarios return

@proddy proddy closed this as completed Oct 1, 2023
@JokerGermany

JokerGermany commented Oct 1, 2023

3.6.2 dev2 has not responded for 1 hour (web GUI and MQTT).
Uptime was 008+22:45:30.994
[11 screenshots attached]

@JokerGermany

restarted it and updated to 3.6.2

@Roger954

Roger954 commented Oct 1, 2023

I also have issues: my S32 (standard WiFi, v2.0) crashed a few times with 3.6.0 in the last week. I'm using it with Home Assistant, all on the latest software releases (2023.9.3, Mosquitto MQTT 6.3.1), connected to a Nefit/Bosch heat pump.
Just upgraded from 3.6.1 to 3.6.2. I switched the wifi between 3 available access points. When switching to the most distant one (-85dBm), the S32 crashed: no LED light. Removing power resolved that. Now unable to repeat this. Hope this helps.

@proddy proddy reopened this Oct 1, 2023
@gwilford

gwilford commented Oct 1, 2023

I was running 3.6.2-dev.2 for over 9 days with hardcoded BSSID without a crash or reboot (a recent record) until this morning when our mesh WiFi system performed a maintenance reboot. The ESP then hung until I power cycled it.

However (and this must be a separate issue), when it came back up it had reverted to 3.6.1. I just uploaded 3.6.2, verified it was running 3.6.2 and rebooted and now I'm back at 3.6.1 again...?

@MichaelDvP
Contributor

> However (and this must be a separate issue), when it came back up it had reverted to 3.6.1. I just uploaded 3.6.2, verified it was running 3.6.2 and rebooted and now I'm back at 3.6.1 again...?

Did you use the official 3.6.2? I've seen this before in one of my first tests for storing power entities in NVS. I think ota_data was corrupted and it always boots to the first partition. But in later versions I can't reproduce it (I think it was an nvsRead/Write before nvsBegin in the first version). I have now tested with the official 3.6.2 and still cannot reproduce it; all reboots go to the right partition.
Please open a new issue if you can reproduce it with the latest build.

@gwilford

gwilford commented Oct 2, 2023

> > However (and this must be a separate issue), when it came back up it had reverted to 3.6.1. I just uploaded 3.6.2, verified it was running 3.6.2 and rebooted and now I'm back at 3.6.1 again...?
>
> Did you use the official 3.6.2? I've seen this before in one of my first tests for storing power entities in NVS. I think ota_data was corrupted and it always boots to the first partition. But in later versions I can't reproduce it (I think it was an nvsRead/Write before nvsBegin in the first version). I have now tested with the official 3.6.2 and still cannot reproduce it; all reboots go to the right partition. Please open a new issue if you can reproduce it with the latest build.

Yes, it was the official 3.6.2. I tried something different this morning: I flashed 3.6.2 while running 3.6.2. This made the change to 3.6.2 persist through a manual reboot. Previously, I had flashed 3.6.2 (the devs) after it had rebooted back to 3.6.1...

@MichaelDvP
Contributor

> flashed 3.6.2 while running 3.6.2. This made the change to 3.6.2 persist

It seems so. The ESP has two partitions, app0 and app1, and one of them is active. An update is written to the other partition, and then that partition is set active. If you have the same software in both partitions, you will not see a fallback to the wrong partition, because both contain the same software. Try flashing 3.6.3-dev0; it is the same as 3.6.2 but with a different version number. Then see if it persists across a reset.
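
The two-partition behaviour described here can be modelled in a few lines. This is a toy model only, not ESP-IDF code: `OtaModel`, its members, and the rule that an unreadable boot selection falls back to app0 are assumptions chosen to mirror the explanation in this thread.

```cpp
#include <array>
#include <string>

// Toy model of a two-slot layout: index 0 is app0, index 1 is app1.
struct OtaModel {
    std::array<std::string, 2> slots{};  // firmware version stored in each slot
    int active = 0;                      // slot currently selected for boot

    // An update is written to the inactive slot, which then becomes active.
    void update(const std::string& version) {
        const int target = 1 - active;
        slots[target] = version;
        active = target;
    }

    // Reboot and return the version that runs afterwards. If the boot
    // selection cannot be read, fall back to app0 (assumption, see lead-in).
    const std::string& reboot(bool selection_valid) {
        if (!selection_valid) {
            active = 0;
        }
        return slots[active];
    }
};
```

Starting with 3.6.1 in app0, one update to 3.6.2 followed by a reboot with a bad selection lands back on 3.6.1. Flashing 3.6.2 a second time puts it into both slots, so the same bad selection no longer shows up as a version change, which is why a build with a different version number is the more telling test.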

@gwilford

gwilford commented Oct 2, 2023

> > flashed 3.6.2 while running 3.6.2. This made the change to 3.6.2 persist
>
> Try to flash 3.6.3-dev0, it is the same as 3.6.2 but different version number and see if it is reset persistant.

Flashed 3.6.3-dev.0 and it persisted after a system restart so I believe the boot partition numbering issue has gone away for me.

@proddy
Contributor

proddy commented Oct 7, 2023

3.6.2 isn't responding after uptime: 004+21:21:13.901

grafik

Check your router to see if EMS-ESP is on your WiFi network. I think EMS-ESP is running but has weak network connectivity. You can also look at the LEDs on EMS-ESP: flashing = bad.

@MichaelDvP
Contributor

My main esp32 is now:

    "version": "3.6.2-dev.2",
    "uptime": "016+17:32:10.141",
    "free mem": 150,
    "max alloc": 71,
    "MQTT publishes": 759749,
    "MQTT publish fails": 2,
    "temperature sensor reads": 2023527,
    "temperature sensor fails": 2,
    "bus telegrams received (rx)": 2930571,
    "bus reads (tx)": 393867,
    "bus incomplete telegrams": 56,
    "bus reads (tx)": 393867,
    "bus writes (tx)": 99,

I don't think it is something in software.

@JokerGermany You have a lot of rx fails, which points to a weak power supply or an EMC problem. Check with a different supply.
It is a custom board, so check the connections for bad contacts and cold solder joints.
Is the BSSID set now? We had a case in the past with a mesh where one of the APs had a different password.
You use IPv6 and a fixed IP: are all settings IPv6 and/or FQDNs? If you use an FQDN, is the DNS server always reachable?
Try setting ems-esp to the standard settings; maybe one of the options does not work well in this combination.
It seems to me that you are the only one left with these "freezes".

@Th0maz

Th0maz commented Oct 7, 2023

image
... after 12 days :(
Screenshot from yesterday:
Screenshot_2023-10-06-19-11-07-03_40deb401b9ffe8e1df2f1cc5ba480b12~2

My router says the esp is still connected to wifi, although it does not respond to ping.
Interface is S32 from bbqkees.
LED is on, not blinking.
Wifi connection is good (green).
Same IP is assigned every time by Fritzbox DHCP, not static.

I will see if I can get some additional information from the mqtt telegrams written to database and external syslog tomorrow.

@proddy
Contributor

proddy commented Oct 8, 2023

I'll create a new issue for this. This issue is about the EMS-ESP crashing which has been fixed (was a mqtt memory issue on network loss)

@vmonkey
Contributor

vmonkey commented Oct 9, 2023

@proddy I am not 100% convinced it is fixed; Unfortunately, I had another crash on 3.6.2 after many days of running fine. After restarting, I set BSSID to the nearest AP. If I ever observe another crash, I will let you know about it in the other issue.

Many thanks for your help and also excellent project!

@proddy
Contributor

proddy commented Oct 9, 2023

I categorize an EMS-ESP crash as a shutdown with a stack dump followed by an automatic restart, usually related to out of memory. I think in your case it just becomes unresponsive.

@proddy proddy reopened this Oct 9, 2023
@proddy
Contributor

proddy commented Oct 10, 2023

I don't think it's crashing, but rather EMS-ESP is going into an infinite re-connect loop after the MQTT connection was dropped. There's another issue for this. #1321
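
For readers unfamiliar with the failure mode: a reconnect loop that retries immediately and forever can starve everything else on the device. A common mitigation, sketched below, is to space the attempts out with a capped exponential backoff. This is a generic illustration, not what EMS-ESP does; the class name and the example delays are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical reconnect scheduler: the delay doubles after every failed
// attempt, up to a fixed ceiling, and resets once a connection succeeds.
class ReconnectBackoff {
  public:
    ReconnectBackoff(std::uint32_t base_ms, std::uint32_t max_ms)
        : base_ms_(base_ms), max_ms_(max_ms) {}

    // Delay to wait before the next connection attempt.
    std::uint32_t next_delay_ms() {
        // Cap the shift so the 64-bit intermediate cannot overflow.
        const std::uint32_t shift = std::min<std::uint32_t>(failures_, 16);
        const std::uint64_t delay = static_cast<std::uint64_t>(base_ms_) << shift;
        ++failures_;
        return static_cast<std::uint32_t>(std::min<std::uint64_t>(delay, max_ms_));
    }

    // Call after a successful connect to start over from the base delay.
    void on_connected() { failures_ = 0; }

  private:
    std::uint32_t base_ms_;
    std::uint32_t max_ms_;
    std::uint32_t failures_ = 0;
};
```

With a base of 1000 ms and a ceiling of 30000 ms, successive failures wait 1 s, 2 s, 4 s, 8 s, 16 s and then 30 s each, leaving the main loop time to service the bus and the web interface between attempts.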

@proddy proddy closed this as completed Oct 10, 2023
@MichaelDvP
Contributor

@JokerGermany

> It uses the power from the ems.

The EMS bus does not provide enough power for an ESP32; you have to use the service jack or an external supply. Low power could explain the high rx fails and the unstable network.

> I suppose the led was flashing and therefore it was removed...
> (LED is deactivated in the GUI)

hide-LED only disables the permanent LED, not the blinking.

@tp1de
Contributor

tp1de commented Oct 10, 2023

I had another WiFi disconnect now as well.

During my 2 weeks vacation period I installed the special version from @MichaelDvP 3_6_1-dev_0f-ESP32_S3 with energy entities but without the changes for Wifi. For a period of 16 days this version was stable.
(I had instabilities before with the latest 3.6.2 dev versions as well)

After return I installed the latest version 3.6.3-dev.1 for the ESP32-S3.
Now after 3 1/2 days Wifi disconnected and mqtt stopped sending data and api was not reachable anymore.
The blue LED was on (not blinking) but my router shows the disconnect.
WiFi was all the time very strong around -30 dbm and no memory leak was seen in HA statistics.

The router log shows 3 disconnects within 10 seconds but no new connect. Afterwards nothing anymore.
Just after hard reset the new connect happens and is shown in the router log.

It looks like there were 2 reconnect attempts after the first connection loss, but without success:
the 1st after 5 seconds and the 2nd after 10 seconds.
The reconnect attempt was to another mesh access point with -40 dBm RSSI; I did not configure a BSSID.

The open questions are: 1. why was the connection lost? and 2. why is the reconnect not working?

I can see a high number of disconnect/reconnect events in my router's log for most devices.
The ems-esp gateway stayed connected and was absolutely stable with 3.6.1-dev.0f from Michael. Before that, with the latest 3.6.2-dev, I had a lot of reconnects (after 5 seconds) but some connection losses too. --> log file

wlanlog.txt

@MichaelDvP
Contributor

Have you tried setting the BSSID? Running a network scan and clicking the AP sets it automatically.
I have written up my theory in #1321; it's getting confusing to have #1264, #1295, #1321 and #1324 all about the same issue.
@proddy could we close all except one and discuss in one topic? I'm now sure it's a mesh issue.
@tp1de in the first session the APs are playing ping-pong: the ems-esp disconnects from one and connects to the (temporarily) stronger one, and vice versa. There the ems-esp initiates the disconnect, logged as "WLAN-Gerät hat sich abgemeldet" ("WLAN device has signed off"). In the second session the mesh disconnects it from all APs, logged as "WLAN-Gerät wurde abgemeldet" ("WLAN device was signed off"), and it seems the ems-esp does not know it is disconnected and still shows as connected. I don't know why, but it does not happen with a single AP; it's mesh specific and initiated by the mesh, not by the ESP. In my test build I try to reconnect on every possible message the ESP framework gives, but I'm not sure it catches this case.
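
One generic way to cover the case where the device still believes it is connected is to stop trusting the reported link state alone and also watch for actual traffic. The sketch below illustrates that idea with a simple watchdog; it is not the test build mentioned above, and the class name, the timeout, and the notion of "traffic" (for example a successful MQTT publish) are assumptions.

```cpp
#include <cstdint>

// Hypothetical link watchdog: requests a reconnect when the link is reported
// down, or when it is reported up but nothing has got through for too long.
class LinkWatchdog {
  public:
    explicit LinkWatchdog(std::uint32_t timeout_ms) : timeout_ms_(timeout_ms) {}

    // Record proof of life, e.g. an acknowledged publish or a received packet.
    void on_traffic(std::uint32_t now_ms) { last_ok_ms_ = now_ms; }

    // True when a forced reconnect should be attempted.
    bool should_reconnect(std::uint32_t now_ms, bool reported_connected) const {
        if (!reported_connected) {
            return true;
        }
        // Unsigned subtraction stays correct when the millisecond counter wraps.
        return (now_ms - last_ok_ms_) > timeout_ms_;
    }

  private:
    std::uint32_t timeout_ms_;
    std::uint32_t last_ok_ms_ = 0;
};
```

The timeout has to be longer than the quietest normal interval between messages, otherwise the watchdog would force reconnects on a healthy link.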

@JokerGermany

JokerGermany commented Oct 10, 2023

> @JokerGermany
>
> > It uses the power from the ems.
>
> The EMS bus does not provide enough power for an ESP32; you have to use the service jack or an external supply. Low power could explain the high rx fails and the unstable network.
>
> > I suppose the led was flashing and therefore it was removed...
> > (LED is deactivated in the GUI)
>
> hide-LED only disables the permanent LED, not the blinking.

Yes, I am using the service jack.

Please forget my report; I will change my messages. I can't tell exactly whether someone tore it down...

Sorry for the false report.

@tp1de
Contributor

tp1de commented Oct 10, 2023

@MichaelDvP Mesh steering is not active for ems-esp. But there are dynamic bandwidth changes from 40 to 20 MHz, which might initiate a reconnect.
Nevertheless, the former version was stable and the current one is not.
When the local AP changes its bandwidth, it becomes unavailable for a short time. It makes sense for ems-esp to then connect to the next strongest AP. But losing the WiFi connection to every AP and not noticing it is not correct. How many reconnect attempts are configured, and how long is the reconnect interval?
I tried setting the BSSID some weeks ago, but I had the same issue.

Shall we continue in #1321? And shall I try one of the builds?

@Bingo2023
Contributor

2023-10-10_19h55_38
I want to give some positive feedback: on my side it has been running stably for 16 days (and counting); before that it was crashing every <6h.
