WiFi.status shows connected when connection is lost #5912

Closed
PurpleAir opened this issue Mar 25, 2019 · 33 comments

@PurpleAir
Contributor

WiFi.status() returns WL_CONNECTED after connection is lost when using WiFi.setAutoReconnect( false );

The included MCVE produces the following output:

SDK:3.0.0-dev(c0f7b44)/Core:2.5.0=20500000/lwIP:STABLE-2_1_2_RELEASE/glue:1.1/BearSSL:6778687

mode : sta(a0:20:a6:0a:a6:96) + softAP(a2:20:a6:0a:a6:96)
add if0
Looking for WiFi ....wlstatus:WL_IDLE_STATUS=0
WiFi.localIP():(IP unset)
.scandone
scandone
state: 0 -> 2 (b0)
state: 2 -> 3 (0)
state: 3 -> 5 (10)
add 0
aid 1
cnt 

connected with TEST, channel 3
dhcp client start...
ip:192.168.43.39,mask:255.255.255.0,gw:192.168.43.45
 connected to TEST
wlstatus:WL_CONNECTED=3
WiFi.localIP():192.168.43.39
wlstatus:WL_CONNECTED=3
WiFi.localIP():192.168.43.39
state: 5 -> 2 (3a0)
rm 0
wlstatus:WL_CONNECTED=3
WiFi.localIP():(IP unset)
wlstatus:WL_CONNECTED=3
WiFi.localIP():(IP unset)
wlstatus:WL_CONNECTED=3
WiFi.localIP():(IP unset)
wlstatus:WL_CONNECTED=3
WiFi.localIP():(IP unset)
wlstatus:WL_CONNECTED=3

The AP was turned off shortly after the connection was established. The log shows the IP being lost (WiFi.localIP() becomes unset), yet WiFi.status() keeps returning WL_CONNECTED. This only happens when WiFi.setAutoReconnect( false ) is used.

#include <ESP8266WiFi.h>
#include <ESP8266WiFiMulti.h>

ESP8266WiFiMulti wifiMulti;
boolean connectionWasAlive = true;
// Human-readable names, indexed by the value returned from WiFi.status().
const char* statuses[] =  { "WL_IDLE_STATUS=0", "WL_NO_SSID_AVAIL=1", "WL_SCAN_COMPLETED=2", "WL_CONNECTED=3", "WL_CONNECT_FAILED=4", "WL_CONNECTION_LOST=5", "WL_DISCONNECTED=6"};

void setup()
{
  Serial.begin(115200);
  Serial.println();

  // The problem only shows up with auto-reconnect disabled.
  WiFi.setAutoReconnect( false );

  wifiMulti.addAP("TEST", "12345678");
}

unsigned long timestamp = millis();
void monitorWiFi()
{
  // Every 2 seconds, print the reported status and the local IP.
  if (millis() - timestamp > 2000) {
    timestamp = millis();
    Serial.print("wlstatus:");
    Serial.println(statuses[WiFi.status()]);
    Serial.print("WiFi.localIP():");
    Serial.println(WiFi.localIP().toString());
  }
  // Keep trying to connect while wifiMulti reports anything but WL_CONNECTED.
  if (wifiMulti.run() != WL_CONNECTED)
  {
    if (connectionWasAlive == true)
    {
      connectionWasAlive = false;
      Serial.print("Looking for WiFi ");
    }
    Serial.print(".");
    delay(500);
  }
  else if (connectionWasAlive == false)
  {
    connectionWasAlive = true;
    Serial.printf(" connected to %s\n", WiFi.SSID().c_str());
  }
}

void loop()
{
  monitorWiFi();
}
@PurpleAir
Contributor Author

As a workaround: the WiFiEventStationModeDisconnected event does still fire. Calling WiFi.disconnect() from its handler, even though the link is already down, makes WiFi.status() report the correct state again.

void onStationModeDisconnectedEvent(const WiFiEventStationModeDisconnected& evt) {
  if (WiFi.status() == WL_CONNECTED) {
    WiFi.disconnect();
  } else {
    Serial.println("         WiFi disconnected...");
  }
}

@d-a-v
Collaborator

d-a-v commented Mar 25, 2019

That's a nice workaround for core release 2.5.0.
This bug has disappeared in the git master version (soon to be release 2.5.1).
It is also not reproducible with fw 2.2.x (SDK:2.2.2-dev(c0eb301)), which is available for testing with the generic esp8266 board.

edit: If a release is needed, version 2.4.2 does not have this bug (it uses fw 2.2.1).

@TD-er
Contributor

TD-er commented Mar 25, 2019

Could this also be a nice work-around for core 2.4.x issues?

@d-a-v
Collaborator

d-a-v commented Mar 26, 2019

@TD-er core-2.4.2 uses fw 2.2.1, the same as current master (by default).
I tried yesterday: when I rebooted my AP, the connection was shown as lost.
Do you see different behaviour?

@TD-er
Contributor

TD-er commented Mar 26, 2019

Well I have not tried this code myself.
But I have noticed a few times (even running core 2.4.0 a while ago) that the reported wifi status is not always reflecting the true status.

So I was hoping this would be some help also, since I have really no clue how to truly detect the connection status.
It is not always happening, but it may happen that there is no connection and the wifi status still reports it is connected.
This then often leads to something waiting for data that will never arrive or starting to initiate a connection. Both end up in a hardware watchdog reset.
I use timeouts where possible, but still see a lot of HW watchdogs happen.

TD-er added a commit to TD-er/ESPEasy that referenced this issue Mar 26, 2019
@PurpleAir
Contributor Author

> Well I have not tried this code myself.
> But I have noticed a few times (even running core 2.4.0 a while ago) that the reported wifi status is not always reflecting the true status.
>
> So I was hoping this would be some help also, since I have really no clue how to truly detect the connection status.
> It is not always happening, but it may happen that there is no connection and the wifi status still reports it is connected.
> This then often leads to something waiting for data that will never arrive or starting to initiate a connection. Both end up in a hardware watchdog reset.
> I use timeouts where possible, but still see a lot of HW watchdogs happen.

I would be hesitant to implement a change like the one in your commit. It is a patch that does not solve the root cause. In my experience, WDT resets are caused by other kinds of problems. One of the most useful tools is the recently introduced ESP.getHeapFragmentation(). It shows the state of memory better than the free-memory figure from ESP.getFreeHeap() alone. You may have fragmentation problems. Also, if you are using char arrays, look out for overflows; those cause a lot of WDT resets.

@TD-er
Contributor

TD-er commented Mar 26, 2019

In the builds with core 2.6.0 I do have the heap fragmentation shown in the sysinfo page and also have been running some tests to see if the fragmentation was an issue.
But these WDT resets do still happen, also on test nodes that hardly do any intensive String manipulation or other stuff regarding memory allocation. (not big parts at least)
These show a very strong correlation with wifi reception quality and the WDT resets happen more often when a node is performing more network IO (thus more chance of running into these issues).

I know it isn't going to solve the root cause of these WDT resets, but I have already spent hundreds of hours trying to get a grip on them, so I am willing to try anything.
I was not sure yet if I was going to merge that commit into the main branch. For now that PR has been built and is running on a few nodes as a test.
These WDT resets are bugging me since I think August last year and some of them were definitely related to some bugs in our code or simply running out of resources. So that's just a number of possible reasons for crash reports, but the ones that now still occur do have a very strong correlation with wifi stability, power management, active pings to the node and those kind of things.

I will now have a good look at parts that handle strings to see if there's something that may use char arrays since calling .length() or something like that may indeed take much longer than expected.

The inaccuracy of the wifi connected status is also bugging me in other ways.
When trying to send something, I do perform checks on the connection state. So if this state is incorrect, it will cause significant delays in code execution of other parts, which will cause communication to sensors to be out of sync in need of re-init etc.
That's happening on a node I have in my car (offline logging data), running ESP82xx Core 2.6.0-dev, NONOS SDK 2.2.2-dev(c0eb301), LWIP: 2.1.2 PUYA support
Build Time: Mar 20 2019 23:30:57

That SDK should not have this specific issue, right?

@d-a-v
Collaborator

d-a-v commented Mar 26, 2019

> That SDK should not have this specific issue, right?

I guess not (sdk 2.2.2-dev has not been extensively used though)

@neu-rah

neu-rah commented Mar 28, 2019

I'm using 2.5.0, and WiFiEventStationModeDisconnected does not fire unless I generate some network activity. The web server sits behind a lost connection waiting for requests, with WiFi state 3.

The moment the esp8266 makes a network request, such as getting NTP time, the event fires.

As a check, I keep pinging the router and, bang, the event fires as soon as the router drops the connection. However, the ping is too intrusive and causes too much instability, especially when using AP mode.

Please advise.

I'm using this kind of ping to maintain network activity. It works as a test and might shed some light on the origin of the problem.
Is there a better way to get WiFi events firing properly for web servers? Unlike sensors, they generate no network traffic unless requested.

extern "C" {
  #include <ping.h>
}

void tick_back(void *opt, void *resp) {
  constexpr int ticks_tolerance=3;//how many fails in a row will we tolerate (or something like that)
  static volatile int ticks_ok=0;
  ping_resp* ping_resp = reinterpret_cast<struct ping_resp*>(resp);
  if (ping_resp->ping_err==-1) {
    if (ticks_ok>0) ticks_ok--;//failed ping: count down, but never below zero
  }
  else if (ticks_ok<ticks_tolerance) ticks_ok++;//good ping: count up to the tolerance
  wifiConnected&=ticks_ok>0;//update my own state (global bool; &= can only clear it, something else must set it)
}

//network tick, not using delay stuff and sending only one packet
void nw_tick(IPAddress dest) {
  static ping_option tick_options;
  memset(&tick_options, 0, sizeof(struct ping_option));
  tick_options.count = 1;
  tick_options.coarse_time = 0;
  tick_options.ip = dest;
  tick_options.sent_function = NULL;
  tick_options.recv_function = reinterpret_cast<ping_recv_function>(&tick_back);
  ping_start(&tick_options);
}
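
The tolerance counter inside `tick_back` above can be separated from the SDK ping API, which makes it easier to reason about and to test on a desktop. The following is only a sketch of that idea: `PingHealth` is a hypothetical name, and the choice to recompute the flag on every result (so it can recover when pings succeed again) is an assumption about what is wanted, not something taken from the original code.

```cpp
// Saturating success counter: each good ping moves the score up (at most to
// `tolerance`), each failed ping moves it down (at least to 0). The link is
// considered healthy while the score is above zero, so a fully "charged"
// counter tolerates `tolerance` consecutive failures before reporting a loss.
class PingHealth {
public:
    explicit PingHealth(int tolerance) : tolerance_(tolerance) {}

    // Feed one ping outcome; returns the updated health flag.
    bool report(bool pingSucceeded) {
        if (pingSucceeded) {
            if (score_ < tolerance_) ++score_;
        } else if (score_ > 0) {
            --score_;
        }
        return healthy();
    }

    bool healthy() const { return score_ > 0; }

private:
    int tolerance_;
    int score_ = 0;  // starts unhealthy until the first successful ping
};
```

In the ping receive callback this would be fed with `ping_err != -1` as the success value, keeping the callback itself down to a single call.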

@d-a-v
Collaborator

d-a-v commented Mar 28, 2019

@neu-rah See above

@neu-rah

neu-rah commented Mar 28, 2019

@d-a-v Yes, I saw it, but I'm already using the WiFi events. I changed to the latest core version:
using PlatformIO, with platformio.ini set to:
platform = https://github.com/platformio/platform-espressif8266.git (I guess that makes it the latest)

and the problem persists. I recover fine from a lost connection as soon as I get WiFiEventStationModeDisconnected. My problem is that it only fires when I make a network request (ping or NTP); until then the connection stays in state 3 and no loss is reported (the event does not fire).

My esp8266 is primarily a web server and does not send network requests (unless set manually to do so).

Thanks for your reply 👍 I'm open to any suggestions.

@neu-rah

neu-rah commented Mar 28, 2019

This ping approach is working for me. The instability was due to a bug: I was requesting (and debug printing) it free-wheeling in loop(). Now that I have set a proper 500 ms interval, the esp8266 seems stable and the events fire as soon as the connection is lost.

I still miss a WL_CONNECTING state in WiFi.status(). I can keep track of that myself, of course, but it would be nice to have.

A look at the esp8266/Arduino code gave me no clue on how to solve this, so I guess the cause lies deeper.

Hope it helps.

@TD-er
Contributor

TD-er commented Mar 31, 2019

@neu-rah Just to be sure, you use that internal ping trick at a 500 msec interval?
And what do you usually ping? The gateway?

@neu-rah

neu-rah commented Mar 31, 2019

@TD-er yes, I'm pinging the gateway. Not sure what you mean by "internal", I'm using the pasted code to ping the gateway and yes 500ms, experimented with other values, but this one seems enough and esp8266 seems stable.

@TD-er
Contributor

TD-er commented Mar 31, 2019

Most topics about this just suggested to let some host ping the node, which makes it "miraculously" work more stable.
You just let the node itself ping to somewhere.
So it was just to be sure that you were not just letting the nodes ping each other, but that letting a node perform the ping itself was also helping.
Now that it has become clearer (at least to me) that the internal wifi status is simply not updated unless there is some kind of network activity, it also makes sense that this should indeed help to increase wifi connection stability.

@d-a-v
Collaborator

d-a-v commented Mar 31, 2019

@TD-er What are your findings? #2330 (comment)

@TD-er
Contributor

TD-er commented Mar 31, 2019

@d-a-v
I am now using the following schema for sending the Gratuitous ARP:

  • When the node receives an IP address
  • When a connection attempt fails
  • When a call to WiFi.hostByName() fails
  • When some node in the ESPeasy p2p network is not seen for > 8 minutes (after 11 minutes they are removed from the list of 'known neighbors')

And on top of that, there is a setting to continuously send these ARP packets with an increasing interval.
This interval is reset to 100 msec on every occasion mentioned above. The max is 5000 msec.
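
The interval handling described above (reset to 100 msec on each of the listed occasions, growing up to a 5000 msec ceiling) could be sketched as follows. The comment does not say how the interval grows, so doubling is purely an assumption here, and `ArpBackoff` is a hypothetical name; sending the ARP packet itself is left out.

```cpp
#include <cstdint>

constexpr uint32_t kArpMinIntervalMs = 100;   // value after every reset
constexpr uint32_t kArpMaxIntervalMs = 5000;  // ceiling for the interval

class ArpBackoff {
public:
    // Call on any of the trigger events (new IP, failed connect, ...).
    void reset() { intervalMs_ = kArpMinIntervalMs; }

    // Returns the delay to wait before the next send, then grows the stored
    // interval for the following call (doubling is an assumed growth rule).
    uint32_t next() {
        const uint32_t current = intervalMs_;
        const uint32_t doubled = current * 2;
        intervalMs_ = doubled > kArpMaxIntervalMs ? kArpMaxIntervalMs : doubled;
        return current;
    }

private:
    uint32_t intervalMs_ = kArpMinIntervalMs;
};
```

With these numbers the delays run 100, 200, 400, 800, 1600, 3200 and then stay at 5000 msec until the next reset.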

To be honest, I don't see a lot of differences in connectivity between enabling/disabling this last setting.
Activating the "Eco" mode I recently added has a lot more impact on the connectivity.
When this "Eco" mode is active, the scheduler will call delay() when there is nothing to be run. This means the loop count will be (a lot) lower on idling nodes. (still around 300 - 400 loops/sec) This delay is not longer than 5 msec.
But as soon as the power consumption of the node drops, you will see it is missing packets, regardless of the gratuitous ARP packets.
A few pings to that node will "wake" it and it will receive all packets again until it is going to reduce its power again.

In short, I don't think it does help resolving all missing packets, but it is strangely not missing any ping packet. Even the first ping sent when it is in low power mode does get a reply even when it may need a few 100 msec for such a reply. (sometimes up to 700 msec)
So why it is missing other kinds of packets, I have no clue.

@Suraj151

I am facing this issue when I turn auto-reconnect off and use both WiFi modes (AP + STA):

WiFi.mode(WIFI_AP_STA);
WiFi.setAutoReconnect(false);

Currently the piece of code below somehow works. I call it from loop() every 10 seconds.

void handleWiFiConnectivity(){

  Serial.println( F("\nHandling WiFi Connectivity") );

  // Reconnect manually if either the IP is unset or the station is not connected.
  if( !WiFi.localIP().isSet() || !WiFi.isConnected() ){

    Serial.println( F("Handling WiFi Reconnect Manually.") );

    WiFi.reconnect();

  }else{

    Serial.print(F("IP address: "));
    Serial.println(WiFi.localIP());

  }
}

looking for any better suggestions/solution.

@javierferwolf

> @TD-er yes, I'm pinging the gateway. Not sure what you mean by "internal", I'm using the pasted code to ping the gateway and yes 500ms, experimented with other values, but this one seems enough and esp8266 seems stable.

Good day! neu-rah, can you please explain in more detail how you solved the problem, or post the code that you used? My internet service is very unstable, and it often happens that I am connected to the router but there is no internet. The WDT is then triggered on the ESP8266, but when the internet connection returns I have to reset my ESP8266 manually! Thank you.

@neu-rah

neu-rah commented Apr 7, 2020

@javierferwolf All the code is above, but after some update I had to remove it; I guess it, or something with the same purpose, was added to the core. Feel free to experiment with it though.

@javierferwolf

@neu-rah thanks for replying, I don't have much programming experience so I would like to know where in the code? or in which part of the ESP8266WiFi library do I have to put the mentioned code:

extern "C" {
#include <ping.h>
}

void tick_back(void *opt, void *resp) {
  constexpr int ticks_tolerance=3;//how many fails in a row will we tolerate (or something like that)
  static volatile int ticks_ok=0;
  ping_resp* ping_resp = reinterpret_cast<struct ping_resp*>(resp);
  if (ping_resp->ping_err==-1) {
    if (ticks_ok>0) ticks_ok--;//failed ping: count down, but never below zero
  }
  else if (ticks_ok<ticks_tolerance) ticks_ok++;//good ping: count up to the tolerance
  wifiConnected&=ticks_ok>0;//update my own state
}

//network tick, not using delay stuff and sending only one packet
void nw_tick(IPAddress dest) {
  static ping_option tick_options;
  memset(&tick_options, 0, sizeof(struct ping_option));
  tick_options.count = 1;
  tick_options.coarse_time = 0;
  tick_options.ip = dest;
  tick_options.sent_function = NULL;
  tick_options.recv_function = reinterpret_cast<ping_recv_function>(&tick_back);
  ping_start(&tick_options);
}

@neu-rah

neu-rah commented Apr 19, 2020

@javierferwolf #5912 (comment)

I have the include at the top of the sketch (with the others, if any),

then the functions tick_back and nw_tick somewhere at global scope,

and then I call nw_tick from loop(), giving it the IP address of the gateway. In the main loop I only call this function every 500 ms or so, but the timing was a matter of adjustment.

If I recall it correctly.

@javierferwolf

javierferwolf commented Apr 23, 2020

Hi @neu-rah, thank you for your answers! Sorry, but I can't understand how to implement your code in a sketch. As I understand it, I should call the nw_tick function only every 500 ms, but what do I do with the result? For example, how could I implement it in this simple sketch?

#include <ESP8266WiFi.h>
extern "C" {
  #include <ping.h>
}

const char* ssid = "";
const char* password = "";

boolean wifiConnected = false;

unsigned long previousMillis = 0;        

const long interval = 500;    

void setup() {
  Serial.begin(115200);
  Serial.println();
  Serial.println();  
  connectWifi();  
}


void loop() {  
  
  unsigned long currentMillis = millis();

  if (currentMillis - previousMillis >= interval) {
     previousMillis = currentMillis;
     nw_tick(IPAddress (192,168,1,1));  
    
}

   //here what is needed when the ESP8266 is connected
}

void tick_back(void *opt, void *resp) {
  constexpr int ticks_tolerance=3;//how many fails in a row will we tolerate (or something like that)
  static volatile int ticks_ok=0;
  ping_resp* ping_resp = reinterpret_cast<struct ping_resp*>(resp);
  if (ping_resp->ping_err==-1) {
    if (ticks_ok>0) ticks_ok--;//failed ping: count down, but never below zero
  }
  else if (ticks_ok<ticks_tolerance) ticks_ok++;//good ping: count up to the tolerance
  wifiConnected&=ticks_ok>0;//update my own state
}

//network tick, not using delay stuff and sending only one packet
void nw_tick(IPAddress dest) {
  static ping_option tick_options;  
  memset(&tick_options, 0, sizeof(struct ping_option));
  tick_options.count = 1;
  tick_options.coarse_time = 0;
  tick_options.ip = dest;
  tick_options.sent_function = NULL;
  tick_options.recv_function = reinterpret_cast<ping_recv_function>(&tick_back);
  ping_start(&tick_options);
}

boolean connectWifi() {
  boolean state = true;
  int i = 0;
  WiFi.begin(ssid, password);
  Serial.println("");
  Serial.println("Connecting to WiFi");

  // Wait for connection
  Serial.print("Connecting");
  while (WiFi.status() != WL_CONNECTED) {
    delay(500);
    Serial.print(".");
    if (i > 10) {
      state = false;
      break;
    }
    i++;
  }
  if (state) {
    Serial.println("");
    Serial.print("Connected to ");
    Serial.println(ssid);
    Serial.print("IP address: ");
    Serial.println(WiFi.localIP());
  }
  else {
    Serial.println("");
    Serial.println("Connection failed.");
  }
  return state;
}

thank you Rui!

@yangyud-cn

I'm facing the same issue in the latest 2.5.1 SDK.

@d-a-v self-assigned this May 14, 2020
@d-a-v added this to the 2.7.2 milestone May 14, 2020
@d-a-v
Collaborator

d-a-v commented May 20, 2020

@yangyud-cn You should try again with latest release 2.7.1.
I just retried and the issue is gone.

@PurpleAir is it OK to close this issue ?

@d-a-v added the "waiting for feedback" label (Waiting on additional info. If it's not received, the issue may be closed.) May 23, 2020
@yangyud-cn

yangyud-cn commented May 29, 2020

> @yangyud-cn You should try again with latest release 2.7.1.
> I just retried and the issue is gone.

@d-a-v, I'm using PlatformIO 2.5.1 which is Arduino 2.7.1, according to https://github.com/platformio/platform-espressif8266/releases

Interestingly, enabling DEBUG with ESP8266WebServer actually lowers the chance of failure.

#define DEBUG_ESP_HTTP_SERVER
#include <ESP8266WebServer.h>

This makes me suspect a timing-related issue in the network stack code, possibly a race condition.

Here is the log I collected; I hope it helps. I have also included my monitor and reconnect code.

ping Gateway 192.168.8.67 => 192.168.8.1 ... OK
ping Gateway 192.168.8.67 => 192.168.8.1 ... Fail
Mode: STA
PHY mode: N
Channel: 4
AP id: 0
Status: 5
Auto connect: 1
SSID (10): ZZZZZZZZZZ
Passphrase (12): xxxxxxxxxxxx
BSSID set: 0

ping Gateway 192.168.8.67 => 192.168.8.1 ... Fail

-- Several minutes later The IP lost --

ping Gateway 192.168.8.67 => 192.168.8.1 ... Fail

ping Gateway 192.168.8.67 => 192.168.8.1 ... Wifi not connected : 3

Mode: STA
PHY mode: N
Channel: 4
AP id: 0
Status: 5
Auto connect: 1
SSID (10): ZZZZZZZZZZ
Passphrase (12): xxxxxxxxxxxx
BSSID set: 0
state: 2 -> 0 (0)
Reconn
scandone
state: 0 -> 2 (b0)
state: 2 -> 0 (2)
Wifi not connected : 6
Reconn
Wifi not connected : 6
Reconn
Wifi not connected : 6
Reconn

....

Then I have to fall back to Reboot to recover.

My Ping code:

extern "C" {
  #include <ping.h>
}

static bool _gPingCompleted = true;
static bool _gPingSucceed = false;
static uint32 _gPingRespTime = 0;
static ping_option _gPingOptions;

static void ping_recv_cb(void *opt, void *resp) {
    // Cast the parameters to get some usable info
    ping_resp*   ping_resp = reinterpret_cast<struct ping_resp*>(resp);
    
    // Error or success?
    _gPingSucceed = ping_resp->ping_err != -1;
    _gPingRespTime = ping_resp->resp_time;
    _gPingCompleted = true;
}

bool startPing(IPAddress dest) {
    if (_gPingCompleted) {
        memset(&_gPingOptions, 0, sizeof(struct ping_option));
        
        // Repeat count (how many time send a ping message to destination)
        _gPingOptions.count = 1;
        // Time interval between two ping (seconds??)
        _gPingOptions.coarse_time = 1;
        // Destination machine
        _gPingOptions.ip = dest;
        
        // Callbacks
        _gPingOptions.recv_function = ping_recv_cb;
        _gPingOptions.sent_function = NULL; //reinterpret_cast<ping_sent_function>(&_ping_sent_cb);

        // Let's go!
        if(ping_start(&_gPingOptions)) {
           _gPingCompleted = false; 
        }

        return !_gPingCompleted;
    }

    return false;
}

bool hasPingCompleted() {
    return _gPingCompleted;
}

bool isPingSuccessful() {
    return _gPingSucceed;
}

Fragment of my monitor code:

void netMonitorHandler() {
    DEBUGPRINT(E("Mon: "));
    if(!WiFi.localIP().isSet() || !WiFi.isConnected()) {
        onWiFiReconnect();
        DEBUGPRINTLN(E(" Reconn"));
        return;
    }
    else {
        onWifiConnected();
    }

    if(hasPingCompleted()) {
        // report the result of the previous ping, if there was one
        if (_gPingCount) {
            if (isPingSuccessful()) {
                DEBUGPRINT(E(" OK"));
                _gPingFailCount = 0;
            }
            else {
                DEBUGPRINT(E(" Fail"));
                WiFi.printDiag(Serial);
                _gPingFailCount ++;
            }
        }
        // start a new ping
        IPAddress gwIP = WiFi.gatewayIP();

        DEBUGPRINT(E(" ping Gateway " ));
        DEBUGPRINT(WiFi.localIP());
        DEBUGPRINT(E(" => " ));
        DEBUGPRINT(gwIP);
        startPing(gwIP);
        DEBUGPRINT(E(" ... "));
        _gPingCount ++;
    }
}

void onWiFiReconnect() {
    if (!WiFi.localIP().isSet() || !WiFi.isConnected()) {
        char buf[32];
        DEBUGPRINT(E("Wifi not connected : "));
        DEBUGPRINTLN(WiFi.status());
        if (gWifiConnected) {
            gWifiConnected = false;
            gReconnectRetryCount = 0;
            WiFi.printDiag(Serial);
            Serial.setDebugOutput(true);
            gReconnectCount ++;
            _gPingCount = 0; // reset ping count to start over
        }
        else if (gReconnectRetryCount >= 5)
        {
            DEBUGPRINTLN(E("Wifi not connected after retrying, reboot system ..."));
            ESP.restart();
        }

        // looks like the auto reconnect handler is not reliable
        // unsigned subtraction keeps this check correct when millis() wraps around
        if(millis() - gLastReconnectTime >= 30UL*1000UL) {
            gReconnectRetryCount ++;
            gLastReconnectTime = millis();
            // try reconnect each 30 seconds
            WiFi.reconnect();
        }
    }
}
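
The 30-second reconnect throttle above depends on comparing `millis()` against a stored timestamp. Since `millis()` is a 32-bit counter that rolls over after roughly 49.7 days, the usual way to keep such a check correct is to subtract first and compare the difference. A minimal sketch, with `intervalElapsed` as a hypothetical helper name:

```cpp
#include <cstdint>

// True once at least `intervalMs` milliseconds have passed since `startMs`.
// Unsigned subtraction is performed modulo 2^32, so the result stays correct
// even if the counter has rolled over between `startMs` and `nowMs` (as long
// as the real elapsed time is shorter than one full rollover period).
constexpr bool intervalElapsed(uint32_t nowMs, uint32_t startMs, uint32_t intervalMs) {
    return static_cast<uint32_t>(nowMs - startMs) >= intervalMs;
}
```

In the reconnect code this would read `if (intervalElapsed(millis(), gLastReconnectTime, 30000)) { ... }`, which removes the need for a separate "did the counter wrap" branch.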

@yangyud-cn

yangyud-cn commented May 29, 2020

BTW, I tried the older version 2.6.2 (https://github.com/esp8266/Arduino/releases/tag/2.6.2) and with DEBUG_ESP_HTTP_SERVER enabled it seems much more stable. Without DEBUG_ESP_HTTP_SERVER it fails with a similar issue too. To reproduce the issue, I ran a stress test: an infinite loop of wget requests to the board with no delay in between.

I did manage to crash it (a WDT crash) by running a simultaneous "ping -f" during the wgets, but apart from that it is already pretty stable.

WDT @ 0x40104185, which is located inside this:
.text.lmacEndFrameExchangeSequence
0x00000000401040bc 0x367 C:\users\yuyang.platformio\packages\framework-arduinoespressif8266\tools\sdk\lib\NONOSDK22x_190703\libpp.a(lmac.o)
0x45a (size before relaxing)

@d-a-v
Collaborator

d-a-v commented Jun 7, 2020

@yangyud-cn
Can you provide an MCVE and the steps to make it fail, so that we can recompile and test locally?

@devyte
Collaborator

devyte commented Jul 6, 2020

Per internal discussions, pushing back to v3.

@devyte modified the milestones: 2.7.2, 3.0.0 Jul 6, 2020
@TD-er
Contributor

TD-er commented Jul 6, 2020

Just curious, what is the intended fix, @devyte ?
FYI, I had the suggested work-around of performing an explicit disconnect in my code, and WiFi.status() still occasionally reported the wrong state.
In my setup at least, it reported not connected although it was connected and serving web pages just fine.

My work-around for now (which is far from ideal) is to detect the inconsistency of the wifi state and if it has not improved after some timeout (15 seconds in my setup) it will turn off the WiFi and try again.
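
The timeout-based work-around described above could be structured roughly like this. It is only a sketch under stated assumptions: `InconsistencyWatch` is a hypothetical name, the caller decides what counts as "consistent" (for instance, the reported status agreeing with whether an IP address is set), and the actual WiFi off/on sequence is left to the caller.

```cpp
#include <cstdint>

class InconsistencyWatch {
public:
    explicit InconsistencyWatch(uint32_t timeoutMs) : timeoutMs_(timeoutMs) {}

    // Call periodically with the caller's own consistency judgement and the
    // current millisecond counter. Returns true exactly when the state has
    // been inconsistent for at least the timeout, i.e. when the caller
    // should turn the WiFi off and try again.
    bool update(bool consistent, uint32_t nowMs) {
        if (consistent) {
            tracking_ = false;        // state recovered on its own
            return false;
        }
        if (!tracking_) {
            tracking_ = true;         // first inconsistent sample: start timing
            sinceMs_ = nowMs;
            return false;
        }
        if (static_cast<uint32_t>(nowMs - sinceMs_) >= timeoutMs_) {
            tracking_ = false;        // start a fresh window after the restart
            return true;
        }
        return false;
    }

private:
    uint32_t timeoutMs_;
    uint32_t sinceMs_ = 0;
    bool tracking_ = false;
};
```

With a 15-second timeout this would be constructed as `InconsistencyWatch watch(15000);` and polled from the main loop with `millis()` as the time source.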

It does seem to be a timing issue, as it occurs way more often when I have the UDP syslog feature in my software enabled.
This suspicion of a timing issue could also explain why some nodes appear to suffer from this a lot more than other nodes, as they may be busier with other tasks and/or actually run faster or slower due to differences in flash chips or access point brands.

@devyte
Collaborator

devyte commented Jul 6, 2020

> what is the intended fix

There isn't one. Per previous experience in such cases, this requires investigation, and that can take a while. We don't want to hold up v3 any further.

@vborcea
Contributor

vborcea commented Dec 11, 2020

A workaround for me, when I detect such an anomaly, is to call WiFi.reconnect().
Also, this is still reproducible in 2.6.2.

@d-a-v
Collaborator

d-a-v commented Mar 31, 2021

Closing as a duplicate of #7432; let's follow up there!

@d-a-v closed this as completed Mar 31, 2021