8266: WLED keeps rebooting after 0.14.1 update. #3685

Open
1 task done
Trevo525 opened this issue Jan 14, 2024 · 147 comments
Labels
bug
major: This is a non-trivial major feature and will take some time to implement
needs investigation: The bug has not yet been reproduced by me. Analysis or more details are needed.

Comments

@Trevo525

What happened?

I have two instances of WLED running on two separate ESP-12F (I believe they are 8266 based?) modules. To be specific, it's this module (not the esp32, obviously). They are wired with different types of LEDs. One is with a WS2812B LED Strip and the other is a more generic LED string that has R|G|B|12V as the inputs, as opposed to 5V|Data|Ground that the first has. I'm not sure that will make a difference. But, I included it as it might be important to note. I just got them both running a week or two ago with WLED 0.14.0 and added them to Home Assistant. Everything worked as expected, I have been using presets and playing with the effects and colors on both. I even have a

However, I updated to 0.14.1 today, and the ESP connected to the generic LED strip started turning off whenever I changed the color. It goes dark for a split second, and then the light switches back to the default orange color. I kept testing and it kept happening. Then I noticed that right after this happens, the web interface is unresponsive for a moment. This leads me to believe the device is restarting.

I have been able to fix this for now by going to the update section and installing 0.14.0 again. If I can give any assistance in tracking this down, feel free to reach out; I will put 0.14.1 back on it if there are logs or anything else I can provide.

To Reproduce Bug

Update to 0.14.1
Press almost any button in the interface.

Expected Behavior

I would have expected it not to crash.

Install Method

Binary from WLED.me

What version of WLED?

WLED 0.14.1

Which microcontroller/board are you seeing the problem on?

ESP8266

Relevant log/trace output

No response

Anything else?

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@Trevo525 Trevo525 added the bug label Jan 14, 2024
@AKHwyJunkie

In the FWIW department, I'm also seeing this same behavior in Athom bulbs as well. (I'm using the recommended ESP02 image, happens across all bulb models.) In case it helps, I noticed this issue started in 0.14.1-B3 and did not occur in 0.14.1-B2, at least in my case. I figured this might have been related to the JSON buffer lock issue, but it looks like not. I can trigger it by changing profiles, either via the web interface or via Home Assistant. I don't believe it's configuration related as I tried a full factory reset in B3.

@chertvl

chertvl commented Jan 15, 2024

Same with 8266.
Continuously goes to Unavailable

Screenshot_20240115-064150_Home Assistant

@AngusMcT

Have the same problem. Just updated through Home Assistant, and have the same symptoms as OP.

@blazoncek
Collaborator

Please remove the Home Assistant integration and see if the problems persist.
If they don't, you may want to upgrade to ESP32 or get a special build without various features to get more free RAM on ESP8266.

BTW one way to see if WLED restarted is in Info dialog, Uptime field.

@dosipod

dosipod commented Jan 15, 2024

I do not use ESP8266 (4MB, 2MB or 1MB) in a production setup, but I do have a lot of them around to replicate such issues. If cfg.json and presets.json are provided, we could do so.

I flashed two ESP8266 4MB units within the first hour of the 0.14.1 release and have kept them running with debug bins. I did not notice anything strange, nor have I seen a disconnection/reboot/crash in the log.

As of 1 hour ago I have added one of them to HA with a simple automation (it only sends an alert if the unit is on/off), and I can see the unit disconnecting from WiFi (ping is lost), but I could not get it to behave the same way consistently.

I blame the HA integration but cannot confirm.

@blazoncek
Collaborator

@chertvl down-voting will not help resolving the issue.

@Doyle4

Doyle4 commented Jan 15, 2024

Running fine on ESP32 S2 mini, will test on a esp8266 device later when I can.

@chertvl

chertvl commented Jan 15, 2024

> @chertvl down-voting will not help resolving the issue.

Never mind. I already downgraded to 0.14.0, and that works perfectly.

As for what does "not help resolving the issue", it is:

  • Advising people to swap the electronic component of the device, without considering that this is a finished factory device where that is impossible.
  • Blaming the integration, which was the first thing I deactivated during debugging.
  • Advising against usermods, when there are none anyway. In my case this is a regular clean 0.14.1 that was updated via HA, and HA cannot update firmware with usermods, if I'm not mistaken.

I now have more time to describe the symptoms.
After updating an 8266-based device using HA from version 0.14.0 to 0.14.1:

  • The WLED web page takes forever to load, sometimes some elements will be drawn, but very rarely, most often the error is err_connection_refused.
  • APIs do not work, including HA integration.
  • The device visibly reboots every few minutes and cannot start up normally. It is running out of something, maybe memory.
  • The router reports that the device is connected, the uptime is stable, there are no reconnections.

@mxilievski

Same here, updated 3 8266-based devices. They can’t be accessed via Web.

@Doyle4

Doyle4 commented Jan 16, 2024

How many LEDs are you guys using? I flashed a couple of ESP8266s from B3 to the released 0.14.1; with no more than 100 LEDs they are working fine. BUT I don't use HA at all, so I can't help on that side, sorry.

@photobix

Same problem on 4 instances, with between 80 and 278 LEDs on WEMOS D1 Mini (8266).
Even OTA updates no longer work reliably; I had to flash 3 instances via USB. Apparently, the update runs into a timeout.

@WarC0zes

Same problem on Atom Matrix. I use Home Assistant and a RESTful command.
Since updating to version 0.14.1, I receive this error.

Logger: homeassistant.components.rest_command
Source: components/rest_command/__init__.py:166
Integration: RESTful Command ([documentation](https://www.home-assistant.io/integrations/rest_command), [issues](https://github.com/home-assistant/core/issues?q=is%3Aissue+is%3Aopen+label%3A%22integration%3A+rest_command%22))
First occurred: 06:19:45 (13 occurrences)
Last logged: 10:33:48

Client error. Url: http://192.168.1.xx/json/state. Error: Server disconnected

I reverted to version 0.14.0 and I no longer have errors.

@mxilievski

> Same problem on Atom Matrix.I use home assistant and a RESTful command. Since updating to version 0.14.1, I receive this error.
>
> Logger: homeassistant.components.rest_command
> Source: components/rest_command/__init__.py:166
> Integration: RESTful Command ([documentation](https://www.home-assistant.io/integrations/rest_command), [issues](https://github.com/home-assistant/core/issues?q=is%3Aissue+is%3Aopen+label%3A%22integration%3A+rest_command%22))
> First occurred: 06:19:45 (13 occurrences)
> Last logged: 10:33:48
>
> Client error. Url: http://192.168.1.xx/json/state. Error: Server disconnected
>
> I reverted to version 0.14.0 and I no longer have errors.

How did you revert?

@WarC0zes

WarC0zes commented Jan 16, 2024

> > Same problem on Atom Matrix.I use home assistant and a RESTful command. Since updating to version 0.14.1, I receive this error.
> >
> > Logger: homeassistant.components.rest_command
> > Source: components/rest_command/__init__.py:166
> > Integration: RESTful Command ([documentation](https://www.home-assistant.io/integrations/rest_command), [issues](https://github.com/home-assistant/core/issues?q=is%3Aissue+is%3Aopen+label%3A%22integration%3A+rest_command%22))
> > First occurred: 06:19:45 (13 occurrences)
> > Last logged: 10:33:48
> >
> > Client error. Url: http://192.168.1.xx/json/state. Error: Server disconnected
> >
> > I reverted to version 0.14.0 and I no longer have errors.
>
> How did you revert?

I downloaded the 0.14.0 firmware (.bin).
Then connect to the ESP through the browser.
Go to Settings / Security & Update and click on Manual OTA Update.
wled update
Select the firmware file and update.

@softhack007
Collaborator

softhack007 commented Jan 16, 2024

> I now have more time to describe the symptoms. After updating an 8266-based device using HA from version 0.14.0 to 0.14.1:
>
>   • The WLED web page takes forever to load, sometimes some elements will be drawn, but very rarely, most often the error is err_connection_refused.
>   • APIs do not work, including HA integration.
>   • It can be seen that the device reboots every few minutes, and could not turn on normally. He's missing something, maybe memory.
>   • The router reports that the device is connected, the uptime is stable, there are no reconnections.

@blazoncek a few thoughts on commonalities in user reports

  • It seems to only affect 8266 ("Running fine on ESP32 S2")
  • the only real change for 0.14.1 is the modified locking mechanism for WebSocket API
  • some people said that problems disappeared with -DWLED_DISABLE_WEBSOCKETS
  • some problems include WDT reset (watchdog = potential infinite loop)
  • also web responses are sometimes affected ("takes ages")

We have to remember that WS responses are not running in Arduino context; on ESP32 they run inside the async_tcp task. I'm not sure how it's implemented on 8266.

I think there are a few dangerous lines in the code to lock the JSON buffer

while (jsonBufferLock && millis()-now < 1000) delay(1); // wait for a second for buffer lock

  • delay() does work on esp32, however is dangerous on 8266 when not in arduino context
  • on 8266, millis() does not advance outside of arduino context

@chertvl @WarC0zes @Doyle4 if my understanding is right, it could help if you comment out the line I quoted, and replace it with

    if (jsonBufferLock) return false;

It's a temporary hack and not a proper solution, but it should help to understand if using delay() and millis() on 8266 is the problem. If this hack helps, then I'll take some time in the next few days to implement a proper solution for requestJSONBufferLock() without busy-waiting.
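
For illustration, here is a minimal sketch of the "return immediately instead of waiting" idea, written as plain C++ so it compiles off-device. `JsonBufferGuard`, `tryLock`, `release` and `secondCallerIsRejected` are hypothetical names, not WLED's actual `requestJSONBufferLock()` code, and the sketch ignores interrupt safety.

```cpp
#include <cstdint>

// Hypothetical stand-in for a global lock flag: 0 means free,
// any other value identifies the current owner.
struct JsonBufferGuard {
    std::uint8_t owner = 0;

    // Take the buffer if it is free; never wait or call delay().
    constexpr bool tryLock(std::uint8_t ownerId) {
        if (owner != 0) return false;  // busy: caller must give up or retry later
        owner = ownerId;
        return true;
    }

    constexpr void release() { owner = 0; }
};

// Walks through the intended behaviour: a second caller is turned away
// while the buffer is held and succeeds once it has been released.
constexpr bool secondCallerIsRejected() {
    JsonBufferGuard guard{};
    const bool first = guard.tryLock(1);
    const bool second = guard.tryLock(2);
    guard.release();
    const bool third = guard.tryLock(2);
    return first && !second && third;
}
```

A caller that gets `false` has to drop the request or retry it from the main loop, which is the trade-off of not busy-waiting.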

@softhack007
Collaborator

softhack007 commented Jan 16, 2024

🔺 On a different topic that goes to all who commented and contribute to this thread:

Please stop this thumbs-up thumbs-down BS. We are trying to analyse a problem and need you as users who must help us.
It does not really help if you just express fuzzy feelings with thumbs.

image

image

We are trying to do engineering work here, not to entertain fans in the roman circus.

  • In case you want to add your few cents, please write a sentence in English, following basic rules of grammar.
  • If someone wants to say that he cannot even disable HA integration for a test, please write that.
  • a written "same here, too" is a lot easier to understand, instead of giving a thumbs-up to "same here".

I'm really tired of playing guessing games with emoji.

Use words, instead of throwing tags onto the wall. please.

@softhack007 softhack007 changed the title from "WLED keeps rebooting after 0.14.1 update." to "8266: WLED keeps rebooting after 0.14.1 update." on Jan 16, 2024
@asolochek

I noticed this same behavior on my athom rgbw controller which is paired to home assistant.

After upgrading earlier in the afternoon everything seemed fine, but when I went to turn my lights off I noticed the wled controller wasn't responding. I tried a few times to turn them off via home assistant, and somehow got it stuck in a reboot loop that caused the leds to blink off every 30 seconds or so.

I was able to stop this by turning them off via the web UI and reverted to 0.14.0 and it's working again.

@chertvl

chertvl commented Jan 16, 2024

> @chertvl @WarC0zes @Doyle4 if my understanding is right, it could help if you comment out the line I quoted, and replace it

Thanks for the detailed explanation.
I tried to compile the firmware for the first time using the instructions at
https://kno.wled.ge/advanced/compiling-wled/

I followed your steps, commented out the required line, and added a new one. It seemed like I did everything right, but, unfortunately, it didn’t help.
The web interface still cannot load properly, or does not load at all. Sometimes it’s possible to view the status via JSON. The physical button control on the board works.
The behavior has not changed.
P.S. The HA integration was disabled before all of this.

Below are some screenshots:

image
image
image
image
image
image

@chertvl

chertvl commented Jan 16, 2024

> unfortunately, it didn’t help.

It may have gotten worse.
Now I do not have enough time to update the firmware via OTA, browser gives err_connection_refused.
Last time I miraculously succeeded, but now I don’t.

Unfortunately, my device doesn't have a UART, and I don't have one at home either. So continue the tests without me until I find a UART to restore the device...
Thanks for understanding.

@softhack007
Collaborator

softhack007 commented Jan 16, 2024

> Now I do not have enough time to update the firmware via OTA, browser gives err_connection_refused.
> So continue the tests without me until I find a UART to restore the device...
> Thanks for understanding.

Thanks for helping as much as you could 🥇 and sorry about making it worse for you.

About the UART: if gpio 1 and 3 are accessible on your board, then a standard "USB-to-TTL" adapter is all you need. Like this one that's using a CH340G:
https://amzn.eu/d/fZChiyZ

... or this one that's specifically made for "ESP-01S"
https://amzn.eu/d/2CEAFUb

You'll also find them for cheap on ali.

@softhack007 softhack007 added the "needs investigation" and "major" labels Jan 16, 2024
@blazoncek
Collaborator

>   • the only real change for 0.14.1 is the modified locking mechanism for WebSocket API

There were more changes than this. And it is not for websockets but for HTTP requests.
Foremost we added PIO_FRAMEWORK_ARDUINO_MMU_CACHE16_IRAM48 to circumvent full IRAM condition. This may cause slowness in non LED display functions.
Mode blending was introduced in 0.14.1-a1. It can use a lot of memory and CPU on its own.

IMO, and my own testing confirmed it, the new locking mechanism only improved stability and reduced memory corruption.

>   • some people said that problems disappeared with -DWLED_DISABLE_WEBSOCKETS

WebSockets need plenty of heap. Constantly. Disabling them can only improve things, at the expense of a stale UI.

>   • some problems include WDT reset (watchdog = potential infinite loop)

I've seen WDT resets in non-WLED code. How to avoid them? I have no clue.
The Async* stuff (web server, TCP and UDP) is interrupt driven on ESP8266.

>   • also web responses are sometimes affected ("takes ages")

This may be attributed to a more susceptible WiFi code in newer Arduino core we use with 0.14 (I've posted my own experience in another issue detailing the resolution).

All in all, IMO if you want to run 0.14.x on ESP8266 you need to make a few compromises. Why? Because with only 16kB of RAM available (after boot) it can get crowded rather quickly in the heap.

I am going to post the ESP8266 configuration I use on my ESP01 devices, of which I have plenty in daily use. Unfortunately that configuration may not work for some people, as it strips quite a few features out, but it produces a reliable and working ESP8266 environment.

[env:esp01_4m]
extends = env:esp01_1m_full
board_build.filesystem = littlefs
board_build.ldscript = ${common.ldscript_4m1m}
board_build.f_cpu = 160000000L
build_flags = ${common.build_flags_esp8266}
  -DPIO_FRAMEWORK_ARDUINO_MMU_CACHE16_IRAM48
  -D LED_BUILTIN=2
  -D WLED_DISABLE_ALEXA
  -D WLED_DISABLE_HUESYNC
  -D WLED_DISABLE_LOXONE
  -D WLED_DISABLE_ADALIGHT
  -D WLED_DISABLE_MQTT
  -D WLED_DISABLE_2D
  -D WLED_DISABLE_PXMAGIC
  -D WLED_USE_UNREAL_MATH
  -D WLED_MAX_BUSSES=2
  -D LEDPIN=2
  -D USERMOD_PIRSWITCH
  -D PIR_SENSOR_PIN=3
  -D PIR_SENSOR_OFF_SEC=60
  -UWLED_USE_MY_CONFIG

My ESP01s use 4MB flash so they can be updated OTA.

If we explore the possibility of swapping the ESP8266 (in Wemos D1 mini format) for an alternative (cheap) device (which I also did), I would recommend the Lolin ESP32-S2 D1 mini with 4MB flash and 2MB PSRAM. I've also posted build environments for that elsewhere, but the stock WLED doesn't differ much.

And for clarification, I will not pursue resolving this issue any further, since the ESP8266 just does not have enough resources to smoothly run everything 0.14 offers. If anybody insists on running a fully built 0.14 with an external system like Home Assistant, Alexa or Hue and MQTT, I would urge them to reconsider and build a special version with other features stripped away.

@softhack007
Collaborator

softhack007 commented Jan 16, 2024

@blazoncek thanks for your thoughts, and I completely forgot about "Mode blending" and other additions that really increase RAM and CPU needs.

It seems my idea about requestJSONBufferLock() did not improve it. So agreed, it could be a general issue with low RAM. Even when users see free RAM, it might be heavily fragmented - I've seen examples where the largest available block was less than 10% of total free space.
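
To make the "largest block vs. total free space" comparison concrete, here is a small sketch of that ratio as plain C++. The function names and the 10% threshold are assumptions for illustration; on a device the two inputs would presumably come from the ESP8266 Arduino core's `ESP.getFreeHeap()` and `ESP.getMaxFreeBlockSize()`, and the core's own fragmentation metric is computed differently.

```cpp
#include <cstdint>

// Size of the largest free block as a percentage of all free heap.
// A low value means plenty of free bytes but no room for a big allocation.
constexpr std::uint32_t largestBlockPercent(std::uint32_t freeBytes,
                                            std::uint32_t largestBlock) {
    return freeBytes == 0 ? 0 : (largestBlock * 100u) / freeBytes;
}

// Hypothetical rule of thumb based on the "less than 10%" observation.
constexpr bool isHeavilyFragmented(std::uint32_t freeBytes,
                                   std::uint32_t largestBlock) {
    return largestBlockPercent(freeBytes, largestBlock) < 10u;
}
```

With 16384 bytes free and a largest block of 1536 bytes the ratio is 9%, so a 2 KB allocation would fail even though the heap looks mostly empty.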

I guess we need serial monitor logs from debug builds to find out if something can be done to improve 8266 performance - or maybe nothing can be done, and we'll soon declare 8266 as "half-dead" 😉 aka deprecated....

Edit: a few more "disable" flags to try out:

  • -D WLED_DISABLE_ESPNOW
  • -D WLED_DISABLE_WEBSOCKETS
  • -D WLED_DISABLE_MODE_BLEND

.... and a simple one: go to LEDs settings, uncheck "Use global LED buffer"

@blazoncek
Collaborator

Regarding WDT resets: I have received a word from @willmmiles (whom I consider one of the most technically skilled developers that touched WLED code) that he has traced WDT resets into NeoPixelBus code consuming too much time bitbanging data out.

If you are not using GPIO1 or GPIO2 or GPIO3 for digital led output then CPU has to keep feeding LEDs. This in turn reduces performance for everything else.

If you use PWM LEDs make sure you only use GPIO4 or GPIO12 or GPIO14 or GPIO15 (as specified by Espressif technical documentation, https://www.espressif.com/sites/default/files/documentation/esp8266-technical_reference_en.pdf). Do not forget PWM signal requires NMI to be driven, hence uses CPU.

@willmmiles
Collaborator

willmmiles commented Jan 17, 2024

> Regarding WDT resets: I have received a word from @willmmiles (whom I consider one of the most technically skilled developers that touched WLED code) that he has traced WDT resets into NeoPixelBus code consuming too much time bitbanging data out.

My test case here is a single strip of 110 WS2812Bs, using a 0_15 branch derived build. Bit-banging for this many LEDs can take several milliseconds with interrupts disabled, which I believe can overflow some of the wifi hardware queues, depending on the amount of traffic on the network. I'm working on hacking some of the interrupt tolerance ideas from FastLED in to NeoPixelBus to see if I can mitigate it.
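
As a rough check on "several milliseconds", here is the arithmetic as a sketch, assuming the usual WS2812B data rate of 800 kHz (1.25 µs per bit) and 24 bits per LED. The latch gap between frames and any RGBW fourth channel are left out, and the names are made up for this example.

```cpp
#include <cstdint>

constexpr std::uint64_t kBitsPerLed = 24;    // 8 bits each for G, R and B
constexpr std::uint64_t kBitTimeNs  = 1250;  // one bit at 800 kHz

// Time needed to clock out one full frame, in microseconds.
constexpr std::uint64_t frameTimeUs(std::uint64_t ledCount) {
    return ledCount * kBitsPerLed * kBitTimeNs / 1000;
}
```

For the 110-LED strip this gives 110 × 24 × 1.25 µs = 3300 µs, so interrupts stay off for about 3.3 ms per frame when the data is bit-banged.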

If a setup has more LEDs on a bit-banging pin, or a busier network, it might trip problems sooner. Sometimes this might manifest as hard reboots like I'm seeing; it's also possible it manifests as a wifi disconnect. (I'm actually rather surprised I haven't seen that in my testing, to be honest).

I will try a 0.14.1 build tonight and see if it behaves differently for me than the 0_15 development branch. It's quite possible this is a different issue than the one I've been chasing.

@afflux

afflux commented Jan 18, 2024

> Regarding WDT resets: I have received a word from @willmmiles (whom I consider one of the most technically skilled developers that touched WLED code) that he has traced WDT resets into NeoPixelBus code consuming too much time bitbanging data out.

FWIW, I'm seeing occasional resets on 8266 with 0.14.1 and use LPD8806, so no bitbanging involved. (But it's way rarer than what people are reporting here, I have 48h uptime right now)

@blazoncek
Collaborator

> use LPD8806, so no bitbanging involved

How do you know it is not? If you are using GPIO13 & GPIO14 then yes, it uses HW to accelerate output; otherwise you are using SW (CPU) to drive clock and data.

@Scope666

Scope666 commented Sep 10, 2024

> @asolochek I'd be interested to know if #3 gives you the same problems as #1 - that'd point to @softhack007 's theory about IRAM.

I'm the opposite though, build 1 ran 25 hours before crashing, and build 3 is past 2 days, but build 2 crashed after 58 minutes.

Build 3:
image

@willmmiles
Collaborator

Unfortunately my 0.14.4 debug build running your config+presets hasn't crashed yet. :( If it doesn't give me something to look at by tomorrow, I'll try getting closer to the release code.

I do observe that HA's polling of presets puts a surprisingly large amount of strain on the micro - it blocks the update loop for much longer than I'd've expected. I guess this was enough of a problem that presets are now cached in systems with a large amount of memory. I might take a look later and see if there's something I can do to improve the situation for the rest of us -- at the very least to not glitch the FX rendering so much.

@asolochek

I'm at 1 day, 1 hour on build #3. Seems to be working pretty well.

@Scope666

Build 3 still going here, 3 days 16 hours. Maybe it's the Neopixel thing...

@softhack007
Collaborator

softhack007 commented Sep 11, 2024

Hi @willmmiles

> I do observe that HA's polling of presets puts a surprisingly large amount of strain on the micro - it blocks the update loop for much longer than I'd've expected. I guess this was enough of a problem that presets are now cached in systems with a large amount of memory. I might take a look later and see if there's something I can do to improve the situation for the rest of us -- at the very least to not glitch the FX rendering so much.

I think this is a long-existing problem with the HA integration. It also happens on my test system (ESP32, 24x32 matrix on 2 output pins). As far as I understood, the presets.json file is read directly from flash and then sent to HA. The presets file can grow to over 90 KB if you have many presets. It looks like interrupts are disabled while reading from flash, which leads to glitches: if the WS2812 transmission is interrupted for more than ~100 microseconds, the LEDs treat the pause as a reset, and the frame glitches.

I think we should talk to HA developers on how to improve on the problem. If they poll the presets file frequently, maybe we can find a way to tell them "nothing changed, keep the last presets you received". Internally we already have a timestamp indicating when presets were written. Another option could be to store a hash (md5 or similar) in memory, and use this to tell HA "same file you already have".
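
A minimal sketch of the "store a hash and compare" option, as plain C++. It swaps in 32-bit FNV-1a for md5 simply because it fits in a few lines; `fnv1a` and `presetsUnchanged` are hypothetical helpers, and nothing here reflects how WLED or the HA integration actually exchange presets.

```cpp
#include <cstddef>
#include <cstdint>

// 32-bit FNV-1a over a byte range; cheap enough to recompute
// whenever the presets file is written.
constexpr std::uint32_t fnv1a(const char* data, std::size_t len) {
    std::uint32_t hash = 2166136261u;  // FNV offset basis
    for (std::size_t i = 0; i < len; ++i) {
        hash ^= static_cast<std::uint8_t>(data[i]);
        hash *= 16777619u;             // FNV prime
    }
    return hash;
}

// True when the tag sent by the client still matches the tag of the
// stored presets, so the file body does not need to be sent again.
constexpr bool presetsUnchanged(std::uint32_t clientTag,
                                std::uint32_t currentTag) {
    return clientTag == currentTag;
}
```

The device would keep only the 4-byte tag in RAM and answer "same file you already have" whenever the client presents a matching value.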

@veilofsecurity

veilofsecurity commented Sep 11, 2024

Are you sure HA is reading presets at any interval? I have not looked at the code but when I add, rename or remove a preset in WLED I have to reload the integration in HA to get it to pick it up. It might do daily or something but I don't think it's doing it at any high frequency.

It definitely does poll entities like status and currently used preset but I don't think the whole preset list is read often.

Edit to add: You mention a matrix config. If you have a ton of segments configured, HA definitely does poll those for status so that could contribute to load.

@Scope666

Can confirm you have to reload the integration to pick up changes to the presets list.

@willmmiles
Collaborator

> As far as I understood, the presets.json file is directly read from flash then sent to HA. The presets file can grow over 90kb if you have many presets. It looks like interrupts are disabled while reading from flash, which leads to glitches.

This is what confuses me, and where I think there might be some room for improvement. The TCP stack can queue only a small amount of data at a time - ~1k on ESP8266, ~5k on ESP32; so there's no need to read more than that at once; so how is it that we're locking up the CPU for so long? Either we're getting interrupted and yield()ing back to system context so often that we don't make progress in the main loop, or there's something really inefficient in the access pattern (like the FS is reading the whole file to seek for each transport buffer fill). Either way, I'm pretty sure we could find a mitigation, and knowing why could offer some opportunities for improving system-wide robustness.

> If they poll the presets file frequently, maybe we can find a way to tell them "nothing changed, keep the last presets you received". Internally we already have a timestamp indicating when presets were written.

AsyncWebServer already has some support for HTTP "If-Modified-Since" and "Etag" in the AsyncStaticWebHandler class - we could plug it in and see if HA recognizes it. We might need to persist the timestamp somewhere, though.

> Are you sure HA is reading presets at any interval? I have not looked at the code but when I add, rename or remove a preset in WLED I have to reload the integration in HA to get it to pick it up. It might do daily or something but I don't think it's doing it at any high frequency.

> Can confirm you have to reload the integration to pick up changes to the presets list.

The web server debug prints definitely show a periodic poll. I also noticed that the integration doesn't update, so there's definitely a bug somewhere if it's causing all this work for no benefit!

@Scope666

Scope666 commented Sep 12, 2024

Just a reminder that 0.13.x and 0.14.0 are ROCK solid. The instability happened right after that point. Here's one of mine still on 0.14.0:

image

@veilofsecurity

> The web server debug prints definitely show a periodic poll.

A poll of the preset list specifically? I know it polls for segments, current preset/playlist selection, brightness, effect, firmware version, etc. All of these correspond to entities with a state. The reload is only necessary for preset and playlist changes which populate a dropdown so we can see what the available options are in the HA UI.

@willmmiles
Collaborator

> > The web server debug prints definitely show a periodic poll.
>
> A poll of the preset list specifically? I know it polls for segments, current preset/playlist selection, brightness, effect, firmware version, etc. All of these correspond to entities with a state. The reload is only necessary for preset and playlist changes which populate a dropdown so we can see what the available options are in the HA UI.

Yes. It's rolled in to the underlying WLED Python API library maintained by the integration author. HomeAssistant calls WLED.update(), which in turn polls both the state and presets. It seems like HA doesn't actually do anything with the preset update (yet??), but it gets fetched every time.

Unfortunately it also looks like we'd need to add caching support to that library -- the aiohttp client doesn't support it internally, but the good news is that aiohttp-client-cache is meant to be a drop-in replacement that should add the feature we want.

@asolochek

Build 3 was running for 4 or 5 days for me, but I just saw my lights change unexpectedly, checked, and the uptime is now 20 seconds, so it appears to have an issue as well.

@willmmiles
Collaborator

I have yet to log a crash, unfortunately, even with a straight 0.14.4+debug. I'm moving on to a release-type build, in the hopes it'll at least give me something to look at.

@Scope666

Scope666 commented Sep 15, 2024

My build 3 is STILL going ... hmmm....
image

@willmmiles
Collaborator

Following up: I have still yet to log any reboots on my test board, even with completely clean builds and the reference downloads. :(

Next up I'm going to see about building one of the "known bad" builds with some extra debug flags, in the hopes that someone can maybe collect the serial output? If not, I'll try to rig up something to stash crash dumps on the filesystem.

@asolochek

> Following up: I have still yet to log any reboots on my test board, even with completely clean builds and the reference downloads. :(
>
> Next up I'm going to see about building one of the "known bad" builds with some extra debug flags, in the hopes that someone can maybe collect the serial output? If not, I'll try to rig up something to stash crash dumps on the filesystem.

I suppose I could hook up a laptop to mine to log serial, but it would be a lot more convenient if there was a way to log over the network. My wled device is up in the crawlspace.

@willmmiles
Collaborator

Unfortunately there isn't a way to send the crash dump over the network -- the crash dump print code runs early in the boot process, long before the network is set up -- and more importantly, before the stack is overwritten in any significant way. I think it might be possible to intercept the dump and stash it in RAM where it can be recovered later in the boot sequence, though. It would still have to be promptly stored somewhere else (like the flash filesystem), as it'd consume a substantial amount of RAM.

@Scope666

Scope666 commented Sep 20, 2024

Well, for what it's worth, build 3 is STILL up, so at least on my board, I think the crashes are due to the new NeoPixelBus version.

image

@Trevo525
Author

I just replaced one of my WLED units, and the one that was replaced is one that I couldn't upgrade because of this issue. If someone can write clear instructions on how to use it to dump the necessary logs, I can give it a try.

@Scope666

So I went downstairs for a snack and noticed the test unit on when it was supposed to be off. It crashed (build 3) after 17 days.

Going to try build 4 now.

@willmmiles
Collaborator

At this point I'm thinking none of these test builds are actually "stable" - whatever the issue is, it's lurking at a lower level; it's more that some builds/environments/configs trigger it faster than others.

I've put together some code to hook the crash handler and save the stack trace in the flash for later recovery. I've also thrown in some task and interrupt tracing logic I'd written while debugging the PWM-related crashes earlier this year. If a crash is logged, on the next boot the software will write a 'dump.txt' file with the trace to the local filesystem. The file can be retrieved with the /edit interface. Once a crash dump is logged, the system won't save another one until the 'dump.txt' file is deleted -- so even if it's crashing a lot, it won't wear out the flash.
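
The "write once, then wait until the file is deleted" rule can be sketched as a single boot-time decision. This is only an illustration in plain C++ with hypothetical names (`BootState`, `shouldWriteDump`); it is not the code from the test build.

```cpp
// What the firmware would know early in the boot sequence (hypothetical fields).
struct BootState {
    bool crashTraceSaved;  // a trace from the previous run is waiting in flash
    bool dumpFileExists;   // an earlier 'dump.txt' has not been deleted yet
};

// Write a new 'dump.txt' only if there is a trace to save and no older
// dump is still present, so repeated crashes do not keep rewriting flash.
constexpr bool shouldWriteDump(BootState state) {
    return state.crashTraceSaved && !state.dumpFileExists;
}
```

Deleting 'dump.txt' through the /edit interface is what re-arms the capture in this scheme.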

Unfortunately this build will not catch "hard watchdog" type crashes; the Arduino core logic for debugging those doesn't have a user code hook, and I haven't got to pulling it in and modifying it yet. If that's what's going on, it'll still dump stack to the serial port, but it won't leave a file behind.

WLED_0.15.0-b5_ESP02_test.bin.gz

Lastly: this build is also based on the latest 0.15 tip, which has some other improvements that might improve stability beyond the previous builds -- though also some new logic.

@Scope666

Scope666 commented Sep 29, 2024

@willmmiles I've installed your test build. If it crashes and it creates one, I'll share the dump.txt here.

Thanks!!!

PS ... my other 3 units that are running 0.14.0 have been up since the last power failure ... 31 days and counting, so it's something that changed after that point.

@softhack007
Collaborator

> I've put together some code to hook the crash handler and save the stack trace in the flash for later recovery.

@willmmiles cool, sounds like something we really should have in WLED. Do you know if the dump.txt trick would also work on esp32? Some people would kill for such a feature ;-)

> Unfortunately this build will not catch "hard watchdog" type crashes

Well at least you can detect the restart reason on the next boot, so that watchdog aborts would not go completely unnoticed. Example https://github.com/MoonModules/WLED/blob/63ff7205d61c4bdf7e9b952e392222e46b93e1d6/wled00/wled.cpp#L575-L577
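As a rough illustration of checking the restart reason on the next boot, the sketch below classifies a numeric reset code. This is not the linked WLED code. The numeric values are assumed to match the `rst_reason` enum in the ESP8266 SDK headers and should be verified against the SDK in use; `ResetReason`, `describeReset`, and `isAbnormalReset` are hypothetical names.

```cpp
#include <cstdint>
#include <string>

// Assumed to mirror the ESP8266 SDK's rst_reason values; verify before use.
enum class ResetReason : uint32_t {
    PowerOn          = 0,
    HardwareWatchdog = 1,
    Exception        = 2,
    SoftwareWatchdog = 3,
    SoftRestart      = 4,
    DeepSleepWake    = 5,
    ExternalReset    = 6
};

// Human-readable label for a raw reset code.
std::string describeReset(uint32_t code) {
    switch (static_cast<ResetReason>(code)) {
        case ResetReason::PowerOn:          return "power on";
        case ResetReason::HardwareWatchdog: return "hardware watchdog";
        case ResetReason::Exception:        return "exception";
        case ResetReason::SoftwareWatchdog: return "software watchdog";
        case ResetReason::SoftRestart:      return "software restart";
        case ResetReason::DeepSleepWake:    return "deep sleep wake";
        case ResetReason::ExternalReset:    return "external reset";
        default:                            return "unknown";
    }
}

// True for the reset causes that indicate a crash rather than a planned restart.
bool isAbnormalReset(uint32_t code) {
    const ResetReason r = static_cast<ResetReason>(code);
    return r == ResetReason::HardwareWatchdog ||
           r == ResetReason::Exception ||
           r == ResetReason::SoftwareWatchdog;
}
```

On a device the code would come from the platform's reset-info call. Recording `isAbnormalReset` somewhere persistent would at least show that a hardware watchdog abort happened, even when no stack trace file is left behind.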

@willmmiles
Collaborator

> > I've put together some code to hook the crash handler and save the stack trace in the flash for later recovery.
>
> @willmmiles cool, sounds like something we really should have in WLED. Do you know if the dump.txt trick would also work on esp32? Some people would kill for such a feature ;-)

I hadn't done any research on ESP32 yet. It looks like core dumps to flash are already a feature of ESP-IDF; we'd just need to figure out how to turn them on and supply a partition for them to reside in. (For the ESP8266 code I cheated and used the OTA space.)

https://docs.espressif.com/projects/esp-idf/en/stable/esp32/api-guides/core_dump.html
https://www.reddit.com/r/esp32/comments/pmefci/esp32_coredump_to_flash_with_arduino_and/
https://community.platformio.org/t/platformio-support-for-esp32-coredump/25141

> > Unfortunately this build will not catch "hard watchdog" type crashes
>
> Well at least you can detect the restart reason on the next boot, so that watchdog aborts would not go completely unnoticed. Example https://github.com/MoonModules/WLED/blob/63ff7205d61c4bdf7e9b952e392222e46b93e1d6/wled00/wled.cpp#L575-L577

Oh yeah, I've got that in my debug build too. It only goes to the serial port though. I've got the HWDT stack traces enabled in this build too, but they also only go to the serial port. I do think it's possible to upgrade the HWDT debugging logic to stash the trace elsewhere, but it's a bit more work to integrate than the convenient callback hook the Arduino core folks left for the other crash cases.

@kenni

kenni commented Sep 30, 2024

> WLED_0.15.0-b5_ESP02_test.bin.gz
>
> Lastly: this build is also based on the latest 0.15 tip, which has some other improvements that might improve stability beyond the previous builds -- though also some new logic.

Thanks @willmmiles. I’ve updated one of my Athom LS-4P devices with the new test firmware, but it seems like basic WLED functionality is broken - I can’t even switch colors. Any idea on what is going on?

I updated from 0.14.0 and tried re-flashing and restarting a couple of times with no luck. Downgrading to 0.14.0 restores all functionality.

@willmmiles
Collaborator

> Thanks @willmmiles. I’ve updated one of my Athom LS-4P devices with the new test firmware, but it seems like basic WLED functionality is broken - I can’t even switch colors. Any idea on what is going on?

@kenni Thanks for giving it a try! It sounds like the index page isn't loading completely, so elements are missing and the JavaScript code fails.

Can you try connecting with a desktop web browser, ideally with the "developer tools" enabled in the network panel? The index page should be 44679 bytes in size. Also please look in /edit for a dump.txt. (You can check /edit even with the old firmware, the filesystem persists across versions).
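The completeness check requested here can be expressed as two conditions: the number of bytes transferred matches the advertised length, and the decoded page ends with its closing tag. A small sketch follows, with hypothetical helper names `endsWithHtmlClose` and `looksComplete`. It assumes the caller already has the transferred byte count and the decoded page text; if the page is served compressed, the byte count applies to the transferred bytes while the tag check applies to the decoded text.

```cpp
#include <cstddef>
#include <string>

// True if the text, ignoring trailing whitespace, ends with "</html>".
// The comparison is case-sensitive.
bool endsWithHtmlClose(const std::string& body) {
    const std::string tag = "</html>";
    const std::size_t last = body.find_last_not_of(" \t\r\n");
    if (last == std::string::npos) return false;  // empty or whitespace only
    const std::size_t len = last + 1;
    return len >= tag.size() &&
           body.compare(len - tag.size(), tag.size(), tag) == 0;
}

// Combines the byte-count check with the closing-tag check.
bool looksComplete(std::size_t transferredBytes,
                   std::size_t advertisedBytes,
                   const std::string& decodedBody) {
    return transferredBytes == advertisedBytes && endsWithHtmlClose(decodedBody);
}
```

Passing both checks only shows that the transfer was not truncated. It says nothing about whether the page's scripts then run correctly against the device's configuration.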

@kenni

kenni commented Oct 1, 2024

@willmmiles The index page seems to be complete to me. The HTTP response header advertises a Content-Length of 44679, as you expected. Chrome reports the transferred size as "45kB", and the content of the file ends with "</html>" on the last line. So it seems complete.

When I access /edit there are only two files available: cfg.json and presets.json.

EDIT: A factory reset fixes the JavaScript issue, so my old configuration apparently isn't compatible with the new version. Restoring the configuration file on the new firmware reintroduces the JavaScript error. Downgrading the firmware to 0.14.0 and restoring the configuration file works perfectly.

EDIT 2: The cause of the configuration error seems to be the LED data GPIO assignment. The correct pin for my controller is GPIO1, and this works in 0.14.0, but selecting that pin is not allowed in the GUI in 0.15.0-b5.

EDIT 3: Ahh, seems like the stock 0.15.0-b5 doesn't reserve GPIO1... @willmmiles , are you perhaps using GPIO1 for debugging or something else in your build? Any chance you could generate a build where GPIO1 (and GPIO12 for relay) are unused? I can't physically move any wires, as I'm using a factory-made Athom LS-4P all-in-one controller.

@willmmiles
Collaborator

> EDIT 3: Ahh, seems like the stock 0.15.0-b5 doesn't reserve GPIO1... @willmmiles , are you perhaps using GPIO1 for debugging or something else in your build? Any chance you could generate a build where GPIO1 (and GPIO12 for relay) are unused? I can't physically move any wires, as I'm using a factory-made Athom LS-4P all-in-one controller.

I haven't made any changes to the default pin settings for the esp8266_2m build. The test build is based on the 0_15 tip, which includes a significant change to the bus and pin management past the -b5 tag; I'll review the logic and see if I can find out why pin 1 is disallowed.

@willmmiles
Collaborator

> I'll review the logic and see if I can find out why pin 1 is disallowed.

Ah, it's not a new thing at all - this build has debug messages enabled, which are sent to the serial port, for which pin 1 is the transmit pin; so WLED reserves it for that purpose.
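The effect can be pictured with a toy pin registry: when serial debug output is enabled, the transmit pin is reserved up front, so a later request to use it for LED data is refused. This is only an illustration in portable C++. `PinRegistry` and `kSerialTxPin` are hypothetical names and this is not WLED's actual pin manager.

```cpp
#include <cstdint>
#include <set>

constexpr uint8_t kSerialTxPin = 1;  // GPIO1, the serial transmit pin

// Hypothetical registry that tracks which GPIOs may still be handed out.
class PinRegistry {
public:
    explicit PinRegistry(bool serialDebugEnabled) {
        // A debug build keeps the serial TX pin for log output.
        if (serialDebugEnabled) reserved_.insert(kSerialTxPin);
    }

    // Returns true if the pin was free and is now allocated to the caller.
    bool allocate(uint8_t gpio) {
        if (reserved_.count(gpio) != 0 || inUse_.count(gpio) != 0) return false;
        inUse_.insert(gpio);
        return true;
    }

    bool isReserved(uint8_t gpio) const { return reserved_.count(gpio) != 0; }

private:
    std::set<uint8_t> reserved_;  // pins held back by the firmware itself
    std::set<uint8_t> inUse_;     // pins already given to a bus, relay, etc.
};
```

In this model `PinRegistry(true).allocate(1)` fails while `PinRegistry(false).allocate(1)` succeeds, which matches the behaviour seen here: a configuration with LED data on GPIO1 works on a regular build but is rejected by a build with debug messages enabled.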

Unfortunately we don't yet have a good solution for collecting regular debug logs internally for post-mortem storage. The new code here handles only stack traces, and even then they'll also be echoed out the serial port by the Arduino platform code. I hate to say it but your hardware might just not be suitable for software debugging with this build. :( Sorry!

@kenni

kenni commented Oct 2, 2024

> Ah, it's not a new thing at all - this build has debug messages enabled, which are sent to the serial port, for which pin 1 is the transmit pin; so WLED reserves it for that purpose.

Ok, that was also my conclusion after coming across a comment in the source code mentioning GPIO1 for serial communication when doing a debug build.

> Unfortunately we don't yet have a good solution for collecting regular debug logs internally for post-mortem storage. The new code here handles only stack traces, and even then they'll also be echoed out the serial port by the Arduino platform code. I hate to say it but your hardware might just not be suitable for software debugging with this build. :( Sorry!

That's too bad. I'll just cross my fingers that someone else has suitable hardware and will be able to test your builds. I would love to get the esp8266 back in a working state with the latest WLED versions. Thanks for all of your time and willingness to fix this :)
