Panic on WPA2 Enterprise Connection #8082

Closed
6 tasks done
Flole998 opened this issue May 28, 2021 · 42 comments · Fixed by #8529

Comments

@Flole998
Contributor

Basic Infos

  • This issue complies with the issue POLICY doc.
  • I have read the documentation at readthedocs and the issue is not addressed there.
  • I have tested that the issue is present in current master branch (aka latest git).
  • I have searched the issue tracker for a similar issue.
  • If there is a stack dump, I have decoded it.
  • I have filled out all fields below.

Platform

  • Hardware: ESP-01
  • Core Version: 3.0.0
  • Development Env: Arduino IDE
  • Operating System: Windows

Settings in IDE

  • Module: Generic ESP8266 Module
  • Flash Mode: dio
  • Flash Size: 1M
  • lwip Variant: v2 Lower Memory
  • Reset Method: ck
  • Flash Frequency: 40Mhz
  • CPU Frequency: 80Mhz
  • Upload Using: SERIAL
  • Upload Speed: 115200

Problem Description

Immediately after connecting to a WPA2 Enterprise encrypted network I receive an exception. This seems to be related to the way free() is called from a function in eap.c (which seems to be part of the SDK). I am not sure if this is a bug here or in the SDK, but apparently an unmapped address was passed to free(). The stack may also have been corrupted somewhere, as the function hierarchy in the trace doesn't really make sense to me.

I have everything working on the super old version 2.3.0 (which uses completely different functions for setting WPA Enterprise up) but I want to update.

I have reproduced this on all Espressif Firmware Versions that are available.

MCVE Sketch

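#include <ESP8266WiFi.h>
extern "C" {
#include "user_interface.h"
#include "wpa2_enterprise.h"
}

// ssid, identity, esp_cert_pem(_len) and esp_key_pem(_len) are defined elsewhere (not shown).

void setup() {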

	Serial.begin(115200);
	Serial.println("Startup!");

	Serial.print("Heap Free: ");
	Serial.println(system_get_free_heap_size());

	enableWiFiAtBootTime();

	wifi_set_opmode_current(STATION_MODE);

	struct station_config wifi_config;

	memset(&wifi_config, 0, sizeof(wifi_config));
	strcpy((char*)wifi_config.ssid, ssid);
	wifi_station_set_config_current(&wifi_config);

	wifi_station_set_wpa2_enterprise_auth(1);

	wifi_station_set_enterprise_identity((uint8_t*)identity, strlen(identity));
	wifi_station_set_enterprise_cert_key(esp_cert_pem, esp_cert_pem_len, esp_key_pem, esp_key_pem_len, NULL, 1);

	wifi_station_disconnect();
	wifi_station_connect();

	Serial.println(F("Waiting for connection..."));


	while (WiFi.status() != WL_CONNECTED) {
		if (millis() > 60000) {
			Serial.println(F("Took way too long. Restarting..."));
			ESP.restart();
		}

		delay(1000);
	}
}

void loop() {
         Serial.println("Connected!");
}

Debug Messages

No poison after block at: 0x406e4a2f, actual data: 0x0 0x0 0x80 0x0

User exception (panic/abort/assert)
--------------- CUT HERE FOR EXCEPTION DECODER ---------------

 Error
   :?:::0x4024b080:etharp_output
   :?:::0x40201a52:raise_exception
   :?:::0x40201aaf:__panic_func
   0x40100e54 get_unpoisoned_check_neighbors
   0x401012f2 umm_free
   :\packages\esp8266\hardware\esp8266\3.0.0\cores\esp8266\umm_malloc/umm_malloc.cpp:574
   0x4010130d umm_poison_free_fl
   :?:::0x4024c4e0:etharp_output
   :?:::0x40229979:wpa_set_bss
   0x40100827 HeapSelectDram
   :\packages\esp8266\hardware\esp8266\3.0.0\cores\esp8266/heap.cpp:370
   :?:::0x4022aac7:wpabuf_free
   :?:::0x4022bb92:wpa2_sm_rx_eapol
   :?:::0x4022bba6:wpa2_sm_rx_eapol
   :?:::0x4022bbe1:wpa2_sm_rx_eapol
   :?:::0x4022b7ac:wpa2_sm_rx_eapol
   :?:::0x40223129:sta_input
   :?:::0x40240ccf:pp_tx_idle_timeout
   :?:::0x4024058f:ppPeocessRxPktHdr
   : ?? ??:0
   : ?? ??:0
   :?:::0x40105b88:call_user_start_local
   :?:::0x40105b8e:call_user_start_local
   :?:::0x4010000d:call_user_start
   0x40235458 cont_ret
   0x40235411 cont_continue
   

@redfast00

Can confirm, I also have this issue

@Flole998
Contributor Author

Flole998 commented Jul 17, 2021

@wujiangang Can you help with this issue? There is a bug introduced somewhere (seems like it's in wpabuf_free) in the nonos-sdk that frees memory that was never allocated (or to be more precise the pointer is pointing to a memory region that's invalid), would be great if that could get fixed.

@Flole998
Contributor Author

@d-a-v Unfortunately it looks like this SDK issue won't get fixed. What would you think about relaxing the free() check a little to work around this issue? Maybe just silently drop if free() is called on invalid memory (and optionally check if it's being called by wpabuf_free())? I could also patch out that call from the library, but I'm not sure if you want to have a modified binary in this repository?

Or maybe you know someone from espressif who you can contact to get this fixed in the SDK?

@d-a-v
Collaborator

d-a-v commented Mar 28, 2022

Or maybe you know someone from espressif who you can contact to get this fixed in the SDK?

I have no more power than you do :]

About relaxing the call to free() in this repo's binary: we should first find out whether it actually helps.
Did you try it?

@Flole998
Contributor Author

Yes, I have tried it (and others have as well, see https://bbs.espressif.com/viewtopic.php?t=5962&start=20#p71905); it's really just this free() call that's causing issues.

@d-a-v
Collaborator

d-a-v commented Mar 28, 2022

Thanks for the pointer!
It would be very interesting for some of us to have it working.
It wouldn't be the first firmware hack.

Unconditionally patching the check_poison() to return 1 will not be accepted.
But I think we can imagine a conditional build which would enable a new function which would temporarily force check_poison()'s return value.

I think something like this would be acceptable:
We'd have to use -DANTIDOTE=1 (thanks to incoming #8504), then:

umm_antidote(true); // disable umm_check_poison()
wifi_station_set_wpa2_enterprise_auth(1);
wifi_station_set_enterprise_identity((uint8*)username, strlen(username));
wifi_station_set_enterprise_username((uint8*)username, strlen(username));
wifi_station_set_enterprise_password((uint8*)password, strlen(password));
wifi_station_set_enterprise_ca_cert((byte*)ca_cert, strlen(ca_cert));
wifi_station_connect();
umm_antidote(false); // re/enable umm_check_poison()

(s/antidote/whatever/g)

@d-a-v d-a-v added this to the 3.1 milestone Mar 28, 2022
@Flole998
Contributor Author

I thought about something like ignoring free() on unmapped memory (as this address is way too high) or ignoring based on the calling function's address (so that wpabuf_free() is basically whitelisted).
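
The two filters suggested here (drop a free() whose pointer cannot be heap memory, or tolerate it only for one calling function) could be sketched as follows. This is a host-side illustration, not core code: the DRAM bounds are an assumption based on the usual ESP8266 linker script values, and `CodeRange`, `is_dram_address()` and `should_ignore_free()` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Assumed bounds of the ESP8266 user data RAM. Treat these as an assumption
// and check them against the linker script of your own build.
constexpr uint32_t kDramStart = 0x3FFE8000u;
constexpr uint32_t kDramEnd   = 0x3FFFC000u;  // one past the last byte

// True when addr could be a heap pointer at all.
bool is_dram_address(uint32_t addr) {
    return addr >= kDramStart && addr < kDramEnd;
}

// Hypothetical code range of the function whose free() calls are tolerated,
// e.g. the address range of wpabuf_free() taken from the map file.
struct CodeRange {
    uint32_t begin;
    uint32_t end;  // one past the last address
};

// Decide whether a suspicious free() should be dropped silently instead of
// panicking: only when the pointer is outside DRAM AND the caller's return
// address lies inside the whitelisted range.
bool should_ignore_free(uint32_t ptr, uint32_t caller, CodeRange allowed) {
    assert(allowed.begin <= allowed.end);
    bool caller_ok = caller >= allowed.begin && caller < allowed.end;
    return !is_dram_address(ptr) && caller_ok;
}
```

Note that the pointers reported further down in this thread (for example 0x3fff2fac) are themselves inside DRAM, so a range test on the pointer alone would not catch this particular case.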

@d-a-v
Collaborator

d-a-v commented Mar 28, 2022

Making umm_free() a weak function calling __umm_free() (as it is currently done for yield()) might be another easier and less invasive solution as it allows users to redefine it to something else, to be able to conditionally call the real __umm_free().
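
A minimal sketch of that weak-function pattern, assuming GCC or Clang. `real_free_impl()` and `guarded_free()` are hypothetical stand-ins for `__umm_free()` and `umm_free()`; the counter only exists to make the forwarding visible.

```cpp
#include <cassert>
#include <cstdlib>

// Counts calls that reached the real implementation (demonstration only).
static int g_real_free_calls = 0;

// Stand-in for the real heap routine (the role __umm_free() would play).
extern "C" void real_free_impl(void* ptr) {
    ++g_real_free_calls;
    std::free(ptr);
}

// Default public entry point. Because it is weak, a sketch may provide its
// own non-weak guarded_free() in another translation unit; the linker then
// picks that one, and it can decide when to forward to real_free_impl().
extern "C" __attribute__((weak)) void guarded_free(void* ptr) {
    real_free_impl(ptr);
}
```

With no override present, every call goes straight through to the real routine, so existing code keeps its behaviour.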

@mhightower83
Contributor

mhightower83 commented Mar 28, 2022

No poison after block at: 0x406e4a2f, actual data: 0x0 0x0 0x80 0x0

0x406e4a2f is not a DRAM address. Earlier umm_malloc did not verify heap addresses on non-debug builds and I believe would have crashed with a corrupted heap. The current version I believe will abort or panic - I have lost track of where I am in pushing changes. The current version does range checking to determine which heap an allocation belongs on. A bad address should be caught earlier.

Since this is in the SDK, any funny stuff could be moved to vPortFree's thin wrapper.
https://github.com/esp8266/Arduino/blob/732db594929965c7d298b6257fa74cfbd1be0398/cores/esp8266/heap.cpp#L372

@mhightower83
Contributor

Never mind it is not that simple. The poison check fails after the block. The block starts in DRAM and extends far beyond DRAM address space. Most likely umm_block data is invalid resulting in an oversized allocation calculation. umm_info() does not show the allocation being freed as allocated. So this could be a double free or a corrupt heap pointer.
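
To illustrate how a block can "start in DRAM and extend far beyond" it, here is a simplified host-side model. It is an assumption for illustration, not the umm_malloc source: the block is laid out as [32-bit length][guard][payload][guard], and the position of the trailing guard is derived from the stored length. The fill byte, guard size and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

constexpr uint8_t  kPoison = 0xE5;  // hypothetical fill byte
constexpr uint32_t kGuard  = 4;     // guard bytes before and after the payload

enum class PoisonResult { Ok, Damaged, OutOfRange };

// Writes the length header and both guards for a payload of payload_len bytes.
void poison_block(uint8_t* block, uint32_t payload_len) {
    uint32_t total = 4 + kGuard + payload_len + kGuard;
    std::memcpy(block, &total, sizeof total);
    std::memset(block + 4, kPoison, kGuard);
    std::memset(block + total - kGuard, kPoison, kGuard);
}

// Checks the trailing guard. Its position comes from the stored length, so a
// clobbered length sends the check to an unrelated address.
PoisonResult check_trailing_poison(const uint8_t* heap, uint32_t heap_size,
                                   uint32_t block_offset) {
    assert(block_offset + 4 <= heap_size);
    uint32_t total;
    std::memcpy(&total, heap + block_offset, sizeof total);
    if (total < 4 + 2 * kGuard) return PoisonResult::Damaged;
    uint64_t guard_at = uint64_t(block_offset) + total - kGuard;
    if (guard_at + kGuard > heap_size) return PoisonResult::OutOfRange;
    for (uint32_t i = 0; i < kGuard; ++i) {
        if (heap[guard_at + i] != kPoison) return PoisonResult::Damaged;
    }
    return PoisonResult::Ok;
}
```

In this model a block that was freed and partly overwritten no longer has a trustworthy length, which would be consistent with a poison check reporting an address far outside the heap.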

@Flole998
Contributor Author

A double-free could probably easily be detected by adding a print() to free(), then we can see if it's indeed called twice (and also who called it the first time)? A corrupt heap pointer would be a lot harder to identify without having access to the source code...
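
Instead of reading the prints by eye, the same idea can be automated with a small table of live allocations. The sketch below is a hypothetical host-side helper (`note_alloc()` and `note_free()` are not part of the core); on the target they would be called from the malloc and free wrappers with the pointer cast to an integer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>

constexpr std::size_t kMaxLive = 64;
static uint32_t g_live[kMaxLive] = {0};  // 0 marks an empty slot

// Record a successful allocation. Returns false when the table is full.
bool note_alloc(uint32_t addr) {
    assert(addr != 0);
    for (auto& slot : g_live) {
        if (slot == 0) {
            slot = addr;
            return true;
        }
    }
    return false;
}

// Record a free. Returns false when addr is not an active allocation, which
// means a double free or a stray pointer; the caller's file and line are
// printed so the offending call site can be found.
bool note_free(uint32_t addr, const char* file, int line) {
    if (addr == 0) return true;  // free(NULL) is a no-op
    for (auto& slot : g_live) {
        if (slot == addr) {
            slot = 0;
            return true;
        }
    }
    std::printf("free of inactive pointer 0x%08x from %s:%d\n",
                (unsigned)addr, file, line);
    return false;
}
```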

@d-a-v
Collaborator

d-a-v commented Mar 29, 2022

Did you try to add a printf("(%p)\n", ptr); in umm_free() and umm_malloc()?
Doing this, with a printf() between each line of the simplest test code, would let us identify and maybe characterize the wrong call.

@Flole998
Contributor Author

Flole998 commented Mar 29, 2022

I don't have the latest version installed at the moment, I guess I need to set up the latest version on a test system (I don't want to overwrite the old working version).

The issue only comes up when a connection is established, so there's no point in adding additional debug output between each line of the MCVE code as it will only show that the issue is happening somewhere during the connection process (I think the stacktrace also shows that).

@mhightower83
Contributor

mhightower83 commented Mar 30, 2022

Using some experimental free() pointer checking code, it looks like eap.c at line 935 is responsible for freeing twice. Of course, this is experimental code, so I could be in error. The first epc1= is the address from the previous call. The caller's address was saved in the freed allocation, so it is possible it was overwritten. The second epc1= is the current (last) caller's address. (My wording is awkward: "previous" comes before "last".)

Heap panic: free/realloc
  The pointer 0x3fff2fac is not an active allocation.
  Last free/realloc caller: 0x401004b8
  File: eap.c:935
  epc1=0x401004b8, epc2=0x00000000, epc3=0x00000000, excvaddr=0x3fff2fac, depc=0x00000000
  epc1=0x401004b8, epc2=0x00000000, epc3=0x00000000, excvaddr=0x3fff2fac, depc=0x00000000

ref: SDK version: 2.2.2-dev(38a443e)

@Flole998
Contributor Author

Are there debug symbols in some file? I'm a little surprised the exact line is shown; that's all closed source, isn't it?

Just so I understand it correctly: free() is called 2 times by the exact same line of the exact same function, and that's causing all the issues? Those register dumps look absolutely identical to me. That would make it a lot harder to patch, though. I am wondering if it even ends at a double-free or if they would do a triple-free if you let them. The only hackaround would be to ignore the first free and only execute the second one (since the first one apparently shouldn't be there), but then things are getting really dirty.

@mhightower83
Contributor

mhightower83 commented Mar 30, 2022

Are there debug symbols in some file? I'm a little surprised the exact line is shown; that's all closed source, isn't it?

No. Yes. However, the malloc API gives us hints. The SDK uses the portable heap library or something like that. The functions have fields that include the module/file name and line number that is calling.

The epc1= lines are fabricated. The format was chosen so the Exception decoder might help someday. The non-zero info is what we know. epc1= is the calling address of free() or vPortFree() and excvaddr= is the supplied pointer.
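
For reference, a line in that fabricated format can be produced with plain snprintf. This is a minimal sketch; `format_heap_panic_regs()` is a hypothetical name, and only the two non-zero fields described above carry information.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Builds a register-dump style line: epc1 carries the address of the caller
// of free()/vPortFree(), excvaddr carries the pointer that was passed in.
// All other fields are always zero.
std::string format_heap_panic_regs(uint32_t caller, uint32_t ptr) {
    char buf[128];
    int n = std::snprintf(
        buf, sizeof buf,
        "epc1=0x%08x, epc2=0x%08x, epc3=0x%08x, excvaddr=0x%08x, depc=0x%08x",
        (unsigned)caller, 0u, 0u, (unsigned)ptr, 0u);
    assert(n > 0 && n < (int)sizeof buf);
    return std::string(buf);
}
```

Called with the caller address 0x401004b8 and the pointer 0x3fff2fac, it yields the same text as the dump shown earlier.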

a triple-free

I think that is a valid concern. I am not sure what I remember, I saw too many unresolved things. Unfortunately, I have run out of time. I have some work that needs to be finished this week. So I have to resist poking at this any further.

void IRAM_ATTR vPortFree(void *ptr, const char* file, int line)

@Flole998
Contributor Author

@mhightower83 Which SDK and core did you use for your tests? I set up a test environment with the latest git master and went through the pre-3 and the latest 2.2.1+119, and I am always getting this exception, which is kind of funny as I am only using a single heap and those functions should be essentially empty:

Exception 29: StoreProhibited: A store referenced a page mapped with an attribute that does not permit stores
PC: 0x4000df64
EXCVADDR: 0x00000000

Decoding stack results
0x401002f4: pvPortZalloc(size_t, char const*, int) at C:\Users\Flole\Documents\Arduino\hardware\esp8266\esp8266\cores\esp8266\umm_malloc/umm_heap_select.h line 91
0x40100914: malloc(size_t) at C:\Users\Flole\Documents\Arduino\hardware\esp8266\esp8266\cores\esp8266\umm_malloc\umm_malloc.cpp line 885
0x40100914: malloc(size_t) at C:\Users\Flole\Documents\Arduino\hardware\esp8266\esp8266\cores\esp8266\umm_malloc\umm_malloc.cpp line 885
0x401002d0: pvPortMalloc(size_t, char const*, int) at C:\Users\Flole\Documents\Arduino\hardware\esp8266\esp8266\cores\esp8266\umm_malloc/umm_heap_select.h line 91
0x40201c22: loop_task(ETSEvent*) at C:\Users\Flole\Documents\Arduino\hardware\esp8266\esp8266\cores\esp8266\core_esp8266_main.cpp line 259
0x40233658: umm_init() at C:\Users\Flole\Documents\Arduino\hardware\esp8266\esp8266\cores\esp8266\umm_malloc\umm_malloc.cpp line 527
0x4010032c: mmu_wrap_irom_fn(void (*)()) at C:\Users\Flole\Documents\Arduino\hardware\esp8266\esp8266\cores\esp8266\mmu_iram.cpp line 205
0x401000b3: app_entry_redefinable() at C:\Users\Flole\Documents\Arduino\hardware\esp8266\esp8266\cores\esp8266\core_esp8266_main.cpp line 375
0x4010016c: ets_post(uint8, ETSSignal, ETSParam) at C:\Users\Flole\Documents\Arduino\hardware\esp8266\esp8266\cores\esp8266\core_esp8266_main.cpp line 227
0x4010016c: ets_post(uint8, ETSSignal, ETSParam) at C:\Users\Flole\Documents\Arduino\hardware\esp8266\esp8266\cores\esp8266\core_esp8266_main.cpp line 227
0x4010016c: ets_post(uint8, ETSSignal, ETSParam) at C:\Users\Flole\Documents\Arduino\hardware\esp8266\esp8266\cores\esp8266\core_esp8266_main.cpp line 227
0x4010016c: ets_post(uint8, ETSSignal, ETSParam) at C:\Users\Flole\Documents\Arduino\hardware\esp8266\esp8266\cores\esp8266\core_esp8266_main.cpp line 227
0x4010016c: ets_post(uint8, ETSSignal, ETSParam) at C:\Users\Flole\Documents\Arduino\hardware\esp8266\esp8266\cores\esp8266\core_esp8266_main.cpp line 227
0x40202a94: uart_write(uart_t*, char const*, size_t) at C:\Users\Flole\Documents\Arduino\hardware\esp8266\esp8266\cores\esp8266\uart.cpp line 547
0x40201420: HardwareSerial::write(unsigned char const*, unsigned int) at C:\Users\Flole\Documents\Arduino\hardware\esp8266\esp8266\cores\esp8266/HardwareSerial.h line 193
0x4020142c: HardwareSerial::write(unsigned char const*, unsigned int) at C:\Users\Flole\Documents\Arduino\hardware\esp8266\esp8266\cores\esp8266/HardwareSerial.h line 193
0x4010016c: ets_post(uint8, ETSSignal, ETSParam) at C:\Users\Flole\Documents\Arduino\hardware\esp8266\esp8266\cores\esp8266\core_esp8266_main.cpp line 227
0x40201d59: __yield() at C:\Users\Flole\Documents\Arduino\hardware\esp8266\esp8266\cores\esp8266/core_esp8266_features.h line 65
0x402010f2: setup() at C:\Users\Flole\Documents\Arduino\ESP8266-Demo/ESP8266-Demo.ino line 265
0x40201df8: loop_wrapper() at C:\Users\Flole\Documents\Arduino\hardware\esp8266\esp8266\cores\esp8266\core_esp8266_main.cpp line 244

The sketch has been slightly adapted:

void setup() {

  Serial.begin(115200);
  Serial.println("Startup!");

  Serial.print("Heap Free: ");
  Serial.println(system_get_free_heap_size());

  enableWiFiAtBootTime();

  wifi_set_opmode_current(STATION_MODE);

  struct station_config wifi_config;

  memset(&wifi_config, 0, sizeof(wifi_config));
  strcpy((char*)wifi_config.ssid, ssid);
  wifi_station_set_config_current(&wifi_config);

  wifi_station_set_wpa2_enterprise_auth(1);

  wifi_station_set_enterprise_identity((uint8_t*)identity, strlen(identity));
  wifi_station_set_enterprise_cert_key(esp_cert_pem, esp_cert_pem_len, esp_key_pem, esp_key_pem_len, NULL, 1);

  wifi_station_disconnect();
  wifi_station_connect();

  Serial.println("Waiting for connection...");


  while (WiFi.status() != WL_CONNECTED) {
    yield();
    
  }
}

void loop() {
         Serial.println("Connected!");
}

@mhightower83
Contributor

From:

  Serial.printf("SDK version: %s\n", system_get_sdk_version());

SDK version: 2.2.2-dev(38a443e)

@Flole998
Contributor Author

I have figured out what caused the crashing: I need to define a debug serial port, otherwise it will just crash with a stack trace like the one above. If I don't define a debug serial port, there are crashes even with that poison check removed.

There seems to be a massive corruption somewhere, I am trying to get additional output for debugging in umm_local.c and modified the lines after 114 like this:

        if (!poison) {
            if (file) {
                DBGLOG_FORCE(true, "Missing poison while trying to free 0x%x in %s:%d\n", (int)vptr, file, line);
                __panic_func(file, line, "");
            } else {
                DBGLOG_FORCE(true, "Missing poison while trying to free 0x%x in <UNKNOWN>\n", (int)vptr);
                abort();
            }
        }

so I should get a nice error message, right? Wrong... All I get is

No poison after block at: 0x40ee4ac7, actual data: 0x0 0x0 0x80 0x0
Missing poison while trying to free 0x3fff45ac in Fatal exception 3(LoadStoreErrorCause):
epc1=0x4000228b, epc2=0x00000000, epc3=0x00000000, excvaddr=0x40249980, depc=0x00000000

--------------- CUT HERE FOR EXCEPTION DECODER ---------------

Exception (3):
epc1=0x4000228b epc2=0x00000000 epc3=0x00000000 excvaddr=0x40249980 depc=0x00000000

....

When I comment those debug lines out again I can actually see a little more:

No poison after block at: 0x40ee47d6, actual data: 0x0 0x0 0x0 0x80

User exception (panic/abort/assert)
--------------- CUT HERE FOR EXCEPTION DECODER ---------------

Panic eap.c:935 

>>>stack>>>
...

I went a little further and simply disallowed free() from anything at line 935 by adding if(line == 935) return; to umm_poison_free_fl(). That works as well. I then added a message there that says "Disallowing free for 0x3fff45bc". That message only appears once, so there is only a single attempt to free that memory (from a line 935). Next I logged the malloc results that looked potentially interesting using the code below (there are lots of 44-byte allocations; I have no clue what those are for, or whether I even introduced a loop with them myself, so I suppressed them):

...
	if((unsigned int)ptr < 0x3fff5000 && (unsigned int)ptr > 0x3fff3000 && size != 44)
		DBGLOG_ERROR("Allocated %d bytes at 0x%lx\n", size, (unsigned long)ptr);

    return ptr;

That gives me this:

Allocated 28 bytes at 0x3fff3044
Allocated 28 bytes at 0x3fff3094
Allocated 24 bytes at 0x3fff3064
Allocated 24 bytes at 0x3fff3014
Allocated 1568 bytes at 0x3fff3424
Allocated 1028 bytes at 0x3fff3424
Allocated 1018 bytes at 0x3fff382c
Allocated 1568 bytes at 0x3fff3c2c
Allocated 1239 bytes at 0x3fff390c
Allocated 1846 bytes at 0x3fff3dec
Allocated 15 bytes at 0x3fff3014
Allocated 16 bytes at 0x3fff302c
Allocated 31 bytes at 0x3fff3064
Allocated 268 bytes at 0x3fff452c
Allocated 28 bytes at 0x3fff463c
Allocated 28 bytes at 0x3fff468c
Allocated 1846 bytes at 0x3fff3dec
Allocated 16 bytes at 0x3fff3064
Allocated 18 bytes at 0x3fff307c
Allocated 35 bytes at 0x3fff3014
Allocated 32 bytes at 0x3fff465c
Allocated 268 bytes at 0x3fff452c
Allocated 15 bytes at 0x3fff465c
Allocated 15 bytes at 0x3fff4674
Allocated 16 bytes at 0x3fff465c
Allocated 16 bytes at 0x3fff4674
Allocated 31 bytes at 0x3fff465c
Allocated 31 bytes at 0x3fff3014
Allocated 18 bytes at 0x3fff3014
Allocated 18 bytes at 0x3fff302c
Allocated 33 bytes at 0x3fff3014
Allocated 16 bytes at 0x3fff465c
Allocated 3200 bytes at 0x3fff390c
Allocated 15 bytes at 0x3fff465c
Allocated 15 bytes at 0x3fff4674
Allocated 16 bytes at 0x3fff465c
Allocated 16 bytes at 0x3fff4674
Allocated 31 bytes at 0x3fff465c
Allocated 31 bytes at 0x3fff3014
Allocated 18 bytes at 0x3fff3014
Allocated 18 bytes at 0x3fff302c
Allocated 33 bytes at 0x3fff3014
Allocated 16 bytes at 0x3fff465c
Allocated 28 bytes at 0x3fff465c
Allocated 652 bytes at 0x3fff4c9c
Allocated 652 bytes at 0x3fff4f2c
Allocated 76 bytes at 0x3fff4594
Allocated 364 bytes at 0x3fff4dbc
Allocated 76 bytes at 0x3fff45e4
Allocated 364 bytes at 0x3fff4dbc
Allocated 76 bytes at 0x3fff4594
Allocated 364 bytes at 0x3fff4dbc
Allocated 28 bytes at 0x3fff465c
Allocated 204 bytes at 0x3fff4a7c
Allocated 236 bytes at 0x3fff4b4c
Allocated 204 bytes at 0x3fff484c
Allocated 236 bytes at 0x3fff4b4c
Allocated 204 bytes at 0x3fff4a7c
Allocated 236 bytes at 0x3fff4b4c
Allocated 204 bytes at 0x3fff484c
Allocated 236 bytes at 0x3fff4b4c
Allocated 236 bytes at 0x3fff4c1c
Allocated 236 bytes at 0x3fff4cec
Allocated 236 bytes at 0x3fff4dbc
Allocated 236 bytes at 0x3fff4e8c
Allocated 236 bytes at 0x3fff4f5c
Allocated 140 bytes at 0x3fff4594
Allocated 76 bytes at 0x3fff4594
Allocated 140 bytes at 0x3fff4594
Allocated 76 bytes at 0x3fff4594
Allocated 140 bytes at 0x3fff4594
Allocated 204 bytes at 0x3fff49ac
Allocated 236 bytes at 0x3fff4c3c
Allocated 204 bytes at 0x3fff4a7c
Allocated 236 bytes at 0x3fff4c3c
Allocated 204 bytes at 0x3fff49ac
Allocated 236 bytes at 0x3fff4c3c
Allocated 204 bytes at 0x3fff4a7c
Allocated 236 bytes at 0x3fff4c3c
Allocated 236 bytes at 0x3fff4d0c
Allocated 236 bytes at 0x3fff4ddc
Allocated 236 bytes at 0x3fff4eac
Allocated 236 bytes at 0x3fff4f7c
Allocated 140 bytes at 0x3fff484c
Allocated 76 bytes at 0x3fff4b4c
Allocated 140 bytes at 0x3fff484c
Allocated 140 bytes at 0x3fff484c
Allocated 76 bytes at 0x3fff4b4c
Allocated 140 bytes at 0x3fff484c
Allocated 140 bytes at 0x3fff484c
Allocated 56 bytes at 0x3fff4594
Allocated 256 bytes at 0x3fff484c
Allocated 1434 bytes at 0x3fff4954
Allocated 1568 bytes at 0x3fff4ef4
Allocated 1568 bytes at 0x3fff3424
Allocated 99 bytes at 0x3fff45d4
Allocated 256 bytes at 0x3fff3424
Allocated 256 bytes at 0x3fff352c
Allocated 76 bytes at 0x3fff45d4
Allocated 1568 bytes at 0x3fff3634
Disallowing free for 0x3fff45dc

It got allocated properly, so I reduced the "trace range" a little and added:

	if((unsigned int)ptr < 0x3fff5000 && (unsigned int)ptr > 0x3fff4000)
		DBGLOG_ERROR("Freeing at 0x%lx\n", (unsigned long)ptr);

So now I will see if it got free()'d before. And there we go:

Allocated 99 bytes at 0x3fff4604
Allocated 256 bytes at 0x3fff3454
Allocated 256 bytes at 0x3fff355c
Freeing at 0x3fff460c
Allocated 76 bytes at 0x3fff4604
Allocated 1568 bytes at 0x3fff3664
Freeing at 0x3fff460c
Freeing at 0x3fff4714
Freeing at 0x3fff4674
Freeing at 0x3fff46e4
Freeing at 0x3fff46c4
Freeing at 0x3fff4884
Freeing at 0x3fff45cc
Disallowing free for 0x3fff460c

The first line allocates it, the first free frees it again, and then it ends with another free. So the last one is most likely the one we actually want. Now I want the line numbers as well:

Freeing at 0x3fff497c by 213
...
Allocated 99 bytes at 0x3fff45f4
Allocated 256 bytes at 0x3fff3444
Allocated 256 bytes at 0x3fff354c
Freeing at 0x3fff45fc by 213
Allocated 76 bytes at 0x3fff45f4
Allocated 1568 bytes at 0x3fff3654
Freeing at 0x3fff45fc by 672
Freeing at 0x3fff4704 by 412
Freeing at 0x3fff4664 by 62
Freeing at 0x3fff46d4 by 412
Freeing at 0x3fff46b4 by 62
Freeing at 0x3fff4874 by 167
Freeing at 0x3fff45bc by 267
Disallowing free for 0x3fff45fc by 935

So we are interested in line 213. Unfortunately there are 2 cases where a line 213 is calling and accessing file only shows the LoadStoreError from above. Anyways, now I am interested in that line 213 only, so I only log those:

Allocated 76 bytes at 0x3fff4b74
Allocated 140 bytes at 0x3fff4874
Allocated 140 bytes at 0x3fff4874
Allocated 56 bytes at 0x3fff45bc
Allocated 256 bytes at 0x3fff4874
Allocated 1434 bytes at 0x3fff497c
Allocated 1568 bytes at 0x3fff4f1c
Allocated 1568 bytes at 0x3fff344c
Freeing at 0x3fff4984 by 213
Allocated 99 bytes at 0x3fff45fc
Allocated 256 bytes at 0x3fff344c
Freeing at 0x3fff4604 by 213
Allocated 77 bytes at 0x3fff45fc
Allocated 1568 bytes at 0x3fff3554
Disallowing free for 0x3fff139c by 935

So just 2 cases where a line 213 frees something (it could still be in different files), unfortunately I still get that LoadStoreError if I try to access the file pointer. Anyways, the __panic_func can do it, so I simply call it to get a full stacktrace:

Freeing at 0x3fff497c by 213

User exception (panic/abort/assert)
--------------- CUT HERE FOR EXCEPTION DECODER ---------------

Panic wpabuf.c:213 

>>>stack>>>

Sweet, the first call is in wpabuf.c:213. I kinda get the feeling that's wpabuf_free() there (just guessing, but it doesn't really matter). Anyways, on to the next one:

Allocated 1568 bytes at 0x3fff344c
Freeing at 0x3fff4984 by 213
Allocated 99 bytes at 0x3fff45fc
Allocated 256 bytes at 0x3fff344c
Freeing at 0x3fff4604 by 213

User exception (panic/abort/assert)
--------------- CUT HERE FOR EXCEPTION DECODER ---------------

Panic wpabuf.c:213 

So same caller as the "bad" call from above. That's not good, so we can't patch these, but we might not even have to since there is still that second call in line 935, so let's see where that is being called from:

Allocated 1568 bytes at 0x3fff3444
Freeing at 0x3fff497c by 213
Allocated 99 bytes at 0x3fff45f4
Allocated 256 bytes at 0x3fff3444
Allocated 256 bytes at 0x3fff354c
Freeing at 0x3fff45fc by 213
Allocated 76 bytes at 0x3fff45f4
Allocated 1568 bytes at 0x3fff3654
Disallowing free for 0x3fff45fc by 935

User exception (panic/abort/assert)
--------------- CUT HERE FOR EXCEPTION DECODER ---------------

Panic eap.c:935 

So eap.c:935 is calling free() on something that has been free'd already. This would be the right call to suppress. So let's look at it in a disassembler:

sub_4022A404:
addi           , a1, a1, 0xF0
s32i           , a12, a1, 4
s32i           , a0, a1, 0 
mov            , a12, a2
l32i           , a2, a2, 0xC0 // Looks like this is the pointer, passed in a2
call0          , wpabuf_free // Call wpabuf_free
l32i           , a2, a12, 0xB8 // Looks like this is the pointer, passed in a2 
l32r           , a3, off_402298FC ; "eap.c" // Load caller name
movi           , a4, 935 // Load line number
movi.n         , a0, 0
s32i           , a0, a12, 0xC0 
l32r           , a0, vPortFree_0 // Get address of vPortFree
callx0         , a0 // Call vPortFree
movi.n         , a2, 0
l32i.n         , a0, a1, 0
s32i           , a2, a12, 0xB8
l32i.n         , a12, a1, 4
addi           , a1, a1, 0x10
ret.n

I added comments for better readability for non-assembler people reading this; a1 is the stack pointer. This looks pretty much like they call wpabuf_free() and then vPortFree() on the same pointer. The second call should not be there at all, so patching that callx0 to a nop sounds like a great idea. But first I need to find this function in libwpa2.a, so I load it in IDA, select the eap.o file, search for 935 (the line number), find it, and see that it's part of the eap_sm_abort function (0x11DB is the offset within that file). So eap_sm_abort is basically just a function that double-frees the memory, and all we really want is a single free. Now I don't know what's the best way to get a fix in for this. Of course I could just provide my patched binary and we're done (at least for one of the SDK versions), but isn't there an option as well to replace the entire eap_sm_abort() function with our own implementation which would just call wpabuf_free() and that solves it for all the broken SDKs? I have a fixed libwpa.a for the 190703 SDK which I could provide as well.

The last question that I was wondering about was if my fix is now leaking memory (I mean I kicked out a call to free(), so the possibility is definitely there), and the answer is unfortunately yes. My first test shows 8 bytes less free heap when I reconnect the first time, 40 bytes less free heap for the following 2 (re-)connections, then 8 again and so on. I guess umm_info() could somehow show me where the leak is? Assuming it is even a leak and not just fragmentation? The pattern 8-40-40-8-40-40... surprises me a little.

@d-a-v
Collaborator

d-a-v commented Mar 31, 2022

Nice hacking 🚀

but isn't there an option as well to replace the entire eap_sm_abort() function with our own implementation which would just call wpabuf_free() and that solves it for all the broken SDKs?

Ultimately yes. If you can do that for the rest of the firmware this would help a lot 😝
We already removed some object files from firmwares but I don't know whether it is applicable to this use-case.

and the answer is unfortunately yes. My first test shows 8 bytes less free heap when I reconnect the first time, 40 bytes less free heap for the following 2 (re-)connections, then 8 again and so on.

Logging malloc/free calls again may allow you to locate what is leaked.
Or you can also let it run for some time and see if the average available heap stays steady.
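
The second suggestion boils down to sampling the free heap after each reconnection and looking at the differences. A small host-side sketch follows; the function names are hypothetical, and on the target the readings would come from something like system_get_free_heap_size().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bytes lost between consecutive free-heap readings (positive = heap shrank).
std::vector<int32_t> heap_losses(const std::vector<uint32_t>& free_heap) {
    std::vector<int32_t> losses;
    for (std::size_t i = 1; i < free_heap.size(); ++i) {
        losses.push_back(static_cast<int32_t>(free_heap[i - 1]) -
                         static_cast<int32_t>(free_heap[i]));
    }
    return losses;
}

// True when the last reading is not lower than the first one.
bool heap_is_steady(const std::vector<uint32_t>& free_heap) {
    assert(!free_heap.empty());
    return free_heap.back() >= free_heap.front();
}
```

A loss pattern that repeats with every reconnection points towards a real leak, whereas fragmentation tends to level out over time.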

@Flole998
Contributor Author

Well unless I'm mistaken the correct implementation is really just a
void eap_sm_abort(eap-data* ptr) { wpabuf_free(ptr->wpabuf); }.

I think the implementation in https://github.com/espressif/esp-idf/blob/2467aa7f6c177ed040e385657304b0ea413e8133/components/wpa_supplicant/src/eap_peer/eap.c#L815 is how it should look. I wonder if we could replace the entire closed-source libwpa, libwpa2 and libwps stuff with this open-source implementation, or if too many changes would be necessary to make it compatible with the other libraries. I believe for ESP32 it can be self-built from the source code in the esp-idf? Or are there parts missing, so that they just built the closed-source libraries for ESP32 based on the repo, and if you want changes you need to get them into that repo somehow?

I've seen the fix shell script (and I believe the PATH export in it is missing a "../", at least if it's supposed to be called from one of the NONOS* directories ;) ). I've looked at it again and the symbol redefinition sounds suitable for this case.

The heap stays steady until I cause a reconnection by kicking it from the AP. I don't know if that's "normal" for this SDK (I wouldn't be surprised at all) or if this is (another) WPA Enterprise issue. The memory tracking isn't very nice: I had to suppress some messages (for 44-byte allocations, for example) as otherwise I'd get crashes, and you get a whole load of messages, so it's super hard to follow them through (I guess I'd have to write a parser that parses my log and keeps track of the leak). If there is a 8 byte leak and a 40 byte leak somewhere I would have to look for 8 and 40 byte allocations? Or is there overhead involved as well, so that the actual requested size is smaller? 8 bytes is really short; if there's overhead involved as well, that'd probably be a 4 or 6 byte allocation.
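
Such a log parser could be sketched as follows. It assumes the two line formats shown in the logs above ("Allocated N bytes at 0x..." and "Freeing at 0x... by L") and the constant offset of 8 between the logged allocation address and the logged free address that those logs show. `outstanding_allocations()` is a hypothetical helper meant to run on the PC, not on the ESP.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <map>
#include <sstream>
#include <string>

// In the logs a block logged as "Allocated ... at X" is later logged as
// "Freeing at X+8"; the parser assumes that constant offset.
constexpr uint32_t kFreeOffset = 8;

// Returns address -> size for every allocation without a matching free.
std::map<uint32_t, unsigned> outstanding_allocations(const std::string& log) {
    std::map<uint32_t, unsigned> live;
    std::istringstream in(log);
    std::string line;
    while (std::getline(in, line)) {
        unsigned size = 0;
        unsigned addr = 0;
        if (std::sscanf(line.c_str(), "Allocated %u bytes at 0x%x",
                        &size, &addr) == 2) {
            live[addr] = size;  // reuse of an address replaces the old entry
        } else if (std::sscanf(line.c_str(), "Freeing at 0x%x", &addr) == 1) {
            assert(addr >= kFreeOffset);
            live.erase(addr - kFreeOffset);
        }
    }
    return live;
}
```

Whatever is left in the returned map after a full connect and disconnect cycle is a candidate for the leak, provided that all allocations and frees were actually logged.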

@mhightower83
Contributor

The input to the function is a structure pointer a2.
Two buffer pointers from that structure are loaded, one from offset 0xC0 and the second from offset 0xB8.

  • a2[0xC0] is passed to wpabuf_free() and
  • a2[0xB8] is passed to vPortFree_0
  • each location a2[0xC0] and a2[0xB8] are set to zero before the function returns.

It should not be a problem unless a2[0xC0] and a2[0xB8] hold the same value.

Unless I am tired and confused
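
Written out as a C-like model, the reading above looks like this. It is only a sketch of what the listing appears to do: the struct, its field names and the counting stand-ins for wpabuf_free() and vPortFree() are hypothetical, and the offsets in the comments are the ones seen on the target.

```cpp
#include <cassert>
#include <cstdint>

// Only the two fields touched by the disassembly are modelled.
struct EapStateModel {
    void* key_data;  // target offset 0xB8, passed to vPortFree(ptr, "eap.c", 935)
    void* last_buf;  // target offset 0xC0, passed to wpabuf_free()
};

// Call counters standing in for the real routines.
static int g_wpabuf_free_calls = 0;
static int g_port_free_calls = 0;
static void count_wpabuf_free(void*) { ++g_wpabuf_free_calls; }
static void count_port_free(void*, const char*, int) { ++g_port_free_calls; }

// Same effect as the listing: release one buffer, release the other, and
// clear both fields before returning.
void model_eap_sm_abort(EapStateModel* sm) {
    assert(sm != nullptr);
    count_wpabuf_free(sm->last_buf);
    sm->last_buf = nullptr;
    count_port_free(sm->key_data, "eap.c", 935);
    sm->key_data = nullptr;
}
```

Seen this way the function is harmless by itself. A second free can only happen if the two fields hold the same value, or if another code path has already released one of the pointers without clearing the field.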

@d-a-v
Collaborator

d-a-v commented Mar 31, 2022

If there is a 8 byte leak and a 40 byte leak somewhere I would have to look for 8 and 40 byte allocations?

I think so, unless realloc() is called in between and possibly change buffer sizes

@Flole998
Contributor Author

@mhightower83 You are totally right, I got confused with the (kinda out of order) instructions right before the vPortFree stuff is loaded, but that's the zeroing and not related to the function call...

I think we are super close though, wpabuf_free() is doing this:

wpabuf_free:
addi           , a1, a1, 0xF0
s32i.n         , a12, a1, 4
s32i.n         , a0, a1, 0
mov.n          , a12, a2
beqz.n         , a2, loc_4022934B
l32i.n         , a2, a2, 8
l32r           , a3, off_402291CC ; "wpabuf.c"
movi           , a4, 212
l32r           , a0, vPortFree_0
callx0         , a0
mov.n          , a2, a12
l32r           , a3, off_402291CC ; "wpabuf.c"
movi           , a4, 213
l32r           , a0, vPortFree_0
callx0         , a0
loc_4022934B:
l32i.n         , a12, a1, 4
l32i.n         , a0, a1, 0
addi           , a1, a1, 0x10
ret.n

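So they are essentially freeing two pointers in there: first the one stored at offset 8 inside the wpabuf, then the wpabuf itself. 0xC0 - 8 would be 0xB8, but the l32i.n adds 8 to the pointer instead of subtracting it, doesn't it?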
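
Read as C++, the disassembly above appears to do roughly the following. This is only a sketch: `WpabufSketch`, its field names and `vPortFreeSketch` are hypothetical stand-ins, and the only facts taken from the listing are the offset-8 load, the NULL check, and the line numbers 212 and 213.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical layout: the disassembly only shows that a pointer lives at
// offset 8. The two leading fields are placeholders that produce that offset.
struct WpabufSketch {
    std::uint32_t field0;  // unknown in the SDK build
    std::uint32_t field4;  // unknown in the SDK build
    void*         inner;   // loaded by "l32i.n a2, a2, 8"
};
static_assert(offsetof(WpabufSketch, inner) == 8, "inner must sit at offset 8");

// Stand-in for the SDK's vPortFree(ptr, file, line): it only records the
// calls so their order can be inspected; nothing is actually released.
struct FreeCall { const void* ptr; int line; };
static FreeCall g_calls[8];
static int g_call_count = 0;

static void vPortFreeSketch(void* ptr, const char* file, int line) {
    (void)file;
    if (g_call_count < 8) g_calls[g_call_count++] = FreeCall{ptr, line};
}

void wpabufFreeSketch(WpabufSketch* buf) {
    if (buf == nullptr) return;                    // beqz.n a2, loc_4022934B
    vPortFreeSketch(buf->inner, "wpabuf.c", 212);  // first callx0
    vPortFreeSketch(buf, "wpabuf.c", 213);         // second callx0
}
```

So the 8 is an offset into the wpabuf object itself and has nothing to do with the 0xB8/0xC0 offsets in the caller's structure.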

@Flole998
Contributor Author

Flole998 commented Apr 1, 2022

Ah sorry, my analysis was partially wrong: 0xB8 is eapKeyData, and apparently it's not the wpabuf that is being freed twice here. Looking again at one of my debug outputs from above:

Freeing at 0x3fff497c by 213
...
Allocated 99 bytes at 0x3fff45f4 // Some wpabuf being allocated
Allocated 256 bytes at 0x3fff3444
Allocated 256 bytes at 0x3fff354c
Freeing at 0x3fff45fc by 213 // Some wpabuf being freed
Allocated 76 bytes at 0x3fff45f4 // That very same memory being reused
Allocated 1568 bytes at 0x3fff3654
Freeing at 0x3fff45fc by 672 // eapKeyData freed from wpa2_sm_rx_eapol
Freeing at 0x3fff4704 by 412
Freeing at 0x3fff4664 by 62
Freeing at 0x3fff46d4 by 412
Freeing at 0x3fff46b4 by 62
Freeing at 0x3fff4874 by 167
Freeing at 0x3fff45bc by 267
Disallowing free for 0x3fff45fc by 935 // eapKeyData freed from eap_sm_abort

And the first free (for our actual memory; ignore the wpabuf that momentarily existed there) happens right after wpa_set_pmk(). Apparently they simply forgot to zero the pointer there. The assembly looks like this:

s32i.n         , a0, a1, 0
...
[few branches here left out]
...
keyOK:
l32i           , a2, a0, 0xB8
call0          , wpa_set_pmk
l32r           , a3, eapcPtr
l32i           , a2, a1, 0
movi           , a4, 672
l32i           , a2, a2, 0xB8
l32r           , a0, vPortFree_0
callx0         , a0
l32i           , a3, a1, 0
movi           , a12, 1
s8i            , a12, a3, 0xAF
j              , loc_40229FCB

To me this looks like a free() call which leaves the pointer untouched afterwards. If the pointer were set to null here, everything would be alright. Fortunately there is some space if we don't care about the line number and file name: the two instructions that load them could be replaced by two instructions that set a register to 0 and write it to offset 0xB8. That would completely mess up the file and line arguments though, since an invalid pointer would be passed as file, so if something ever relies on that being valid it would break completely.

To make things worse, the pointer is also accessed in eap_sm_process_request: when a new key is received, the old memory is freed. So if we leave the assembly above untouched and eap_sm_process_request can somehow get called before the eap_sm_abort where our pointer is nulled, we would have the same double-free again, just at another location.

Long story short: the pointer must be nulled there (or the free must be removed there, which has to be done by hand as it's not like we can replace an entire function here); doing it where I first did it is not a good idea. Maybe I could get it fixed by espressif if I point to the exact line number and state the exact code that needs to be added, but even then the fix would only go into master. I don't think there is any chance they would backport it to 2.2.x, and getting master into Arduino is quite tricky (I tried earlier today and failed with the partition stuff)...
So to move forward, I suggest that I use bsdiff to create a patch for each currently supported SDK version which removes that call to free in wpa2_sm_rx_eapol, and fix_sdk_libs.sh then applies that patch.
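
The failure mode described above can be reproduced on a host with a small C++ sketch. Everything here is hypothetical scaffolding (`TrackingHeap`, `EapSmSketch` and the three functions are not SDK names); only the line numbers 672 and 935 and the "free without nulling" pattern come from the analysis.

```cpp
#include <cstddef>
#include <cstdio>
#include <new>
#include <set>

// Hypothetical heap wrapper that notices when a pointer is released twice.
struct TrackingHeap {
    std::set<void*> live;
    int double_frees = 0;

    void* alloc(std::size_t n) {
        void* p = ::operator new(n);
        live.insert(p);
        return p;
    }
    void release(void* p, int line) {
        if (p == nullptr) return;          // freeing NULL is ignored
        if (live.erase(p) == 0) {          // not live any more: double free
            ++double_frees;
            std::printf("double free of %p by %d\n", p, line);
            return;
        }
        ::operator delete(p);
    }
};

struct EapSmSketch { void* eapKeyData = nullptr; };  // the field at 0xB8

// Mirrors the call site tagged 672: frees, but leaves the pointer set.
void rxEapolBuggy(TrackingHeap& h, EapSmSketch& sm) {
    h.release(sm.eapKeyData, 672);
}

// Same call site with the missing store added.
void rxEapolFixed(TrackingHeap& h, EapSmSketch& sm) {
    h.release(sm.eapKeyData, 672);
    sm.eapKeyData = nullptr;
}

// Mirrors the call site tagged 935: frees and nulls.
void smAbort(TrackingHeap& h, EapSmSketch& sm) {
    h.release(sm.eapKeyData, 935);
    sm.eapKeyData = nullptr;
}
```

Calling `rxEapolBuggy` followed by `smAbort` reports one double free, while `rxEapolFixed` followed by `smAbort` reports none. Removing the free at 672 entirely has the same effect, because the block is then released exactly once at 935.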

@mhightower83
Contributor

If you try to free a NULL pointer, it is ignored.

@Flole998
Contributor Author

Flole998 commented Apr 1, 2022

Yeah, I know. The problem is that the SDK frees it in the assembly I posted above but does not set the pointer to null. If they updated it properly, it would not be freed a second time, as it would be null.

I corrected my previous comment in some places: I was referring to the pointer as "memory", so now it should be clearer what I actually mean. In fact I had quite a few mistakes and unclear wordings in that comment, which I've corrected now. Anyway, it's way too late, so I'll pick this up tomorrow. I'll probably write a simple patch function that checks whether the expected data is located at the desired offset and then updates it accordingly, so this could be integrated into fix_sdk_libs.sh.
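
A "check, then patch" helper of the kind described could look like the C++ sketch below. It is not the code that ended up in fix_sdk_libs.sh; `patchAt` and `PatchResult` are hypothetical names, and the helper works on an in-memory byte image rather than on an .o file.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

enum class PatchResult { Applied, AlreadyPatched, Mismatch };

// Overwrite `expected` with `replacement` at `offset`, but only if the bytes
// currently there are exactly what we expect. Re-running is harmless: an
// image that already carries the replacement is reported and left alone.
PatchResult patchAt(std::vector<std::uint8_t>& image, std::size_t offset,
                    const std::vector<std::uint8_t>& expected,
                    const std::vector<std::uint8_t>& replacement) {
    if (expected.empty() || expected.size() != replacement.size())
        return PatchResult::Mismatch;
    if (offset > image.size() || expected.size() > image.size() - offset)
        return PatchResult::Mismatch;               // would run past the end

    std::uint8_t* at = image.data() + offset;
    if (std::memcmp(at, replacement.data(), replacement.size()) == 0)
        return PatchResult::AlreadyPatched;
    if (std::memcmp(at, expected.data(), expected.size()) != 0)
        return PatchResult::Mismatch;               // different library build

    std::memcpy(at, replacement.data(), replacement.size());
    return PatchResult::Applied;
}
```

Refusing to write on a mismatch is the important part: each libwpa2.a version places the call at a different offset, so a patch meant for one version must not silently corrupt another.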

@mhightower83
Contributor

mhightower83 commented Apr 2, 2022

Sounds like you found a working solution.
As an alternative quick test, I added this to vPortFree: if (line == 672) return;
The esp8266 connected and I started pinging the esp8266 from my WS 2.5 hours ago and it is still running.
And, no reboots. 😄

Update:
It ran for 10+ hours w/o rebooting. In the first 40 minutes after connection, a total of 396 additional bytes were allocated, with no additional activity from lines 672 or 935.

@Flole998
Contributor Author

Flole998 commented Apr 3, 2022

I had a look at the latest libwpa2.a from NONOS-SDK and it looks like the issue is fixed there. So if we pull that in some day the issue will get fixed eventually.

For all the other SDKs I decided the best fix would be to remove that vPortFree() call entirely. I have already prepared the modifications for fix_sdk_libs.sh; I just need to check out a clean master, apply them and then prepare a PR.

Have you also observed that 8-40-40-8... memory leak when you kick the ESP and it reconnects? I think you need code that specifically instructs the ESP to reconnect (so basically put all the stuff in a connect() function and call that whenever there's no connection), otherwise it will not reconnect automatically.

@mhightower83
Contributor

Using this modification to heap.cpp

#define DEBUG_PRINTF ets_uart_printf

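// Print "<str>: <ptr>, <file>:<line>" for a free-related event.
void IRAM_ATTR print_free_loc(const char str[], void *ptr, const char* file, int line)
{
    DEBUG_PRINTF("\n%s: %p, ", str, ptr);
    bool inISR = ETS_INTR_WITHINISR();
    if (NULL == file || (inISR && (uint32_t)file >= 0x40200000)) {
        // No name, or a flash-resident name while inside an ISR:
        // print the raw pointer instead of reading the string.
        DEBUG_PRINTF("File: %p", file);
    } else if (!inISR && (uint32_t)file >= 0x40200000) {
        // Name lives in flash: copy it to the stack before printing.
        char buf[strlen_P(file) + 1];
        strcpy_P(buf, file);
        DEBUG_PRINTF("%s", buf);  // never pass the name as the format string
    } else {
        DEBUG_PRINTF("%s", file);
    }
    DEBUG_PRINTF(":%d\n", line);
}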

void IRAM_ATTR vPortFree(void *ptr, const char* file, int line)
{
    static void *delayed_free = NULL;
#if defined(UMM_POISON_CHECK) || defined(UMM_INTEGRITY_CHECK)
    // While umm_free internally determines the correct heap, UMM_POISON_CHECK
    // and UMM_INTEGRITY_CHECK do not have arguments. They have to rely on the
    // current heap to identify which one to analyze.
    //
    // Should not need this for UMM_POISON_CHECK_LITE, it directly handles
    // multiple heaps. DEBUG_ESP_OOM not tied to any one heap.
    HeapSelectDram ephemeral;
#endif
    if (NULL == ptr) {
    } else
    if (line == 935) {
        if (delayed_free != ptr) {
            print_free_loc("free", delayed_free, file, line);
            heap_vPortFree(delayed_free, file, line);
        }
        delayed_free = NULL;
        print_free_loc("free", ptr, file, line);
    } else
    if (line == 672) {
        if (delayed_free) {
            print_free_loc("leak", delayed_free, file, line);
        }
        print_free_loc("stow", ptr, file, line);
        delayed_free = ptr;
        return;
    }
    return heap_vPortFree(ptr, file, line);
}

I did not see any leaks reported from those two locations.
However, I did see the total free heap go down by 32 bytes after each disconnect/connect cycle.
I check Free Heap from loop() via a serial hotkey read routine.

Side notes on Heap
Heap memory is allocated in blocks of 8 bytes.
Each allocation has a header of 4 bytes.
So one block can hold 4 bytes of data.
Two blocks can hold 12 bytes of data etc.
Debug build enables poison check which adds an overhead of 12 more bytes to the size requested.
4 bytes for allocation size, 4 bytes of poison before the start of user-accessible memory
and 4 bytes of poison after the user-accessible memory.
A lot of the numbers you see with the umm_info() report are block numbers
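
The arithmetic in these side notes can be written down as a small C++ helper. It is a sketch that takes the numbers above at face value (8-byte blocks, 4-byte header, 12 bytes of poison overhead in debug builds); `heapFootprint` and the constant names are hypothetical and not part of umm_malloc.

```cpp
#include <cstddef>

constexpr std::size_t kBlockSize      = 8;   // heap is carved into 8-byte blocks
constexpr std::size_t kHeaderSize     = 4;   // per-allocation header
constexpr std::size_t kPoisonOverhead = 12;  // size word + leading + trailing poison

// Bytes of heap consumed by a request of `requested` bytes: add the header
// (and poison overhead if enabled), then round up to a whole number of blocks.
constexpr std::size_t heapFootprint(std::size_t requested, bool poison) {
    return (requested + kHeaderSize + (poison ? kPoisonOverhead : 0)
            + kBlockSize - 1) / kBlockSize * kBlockSize;
}

static_assert(heapFootprint(4, false) == 8,   "one block holds 4 bytes of data");
static_assert(heapFootprint(12, false) == 16, "two blocks hold 12 bytes of data");
```

Under these assumptions, a 40-byte drop in free heap in a poison-enabled build corresponds to a single request of 17 to 24 bytes, so the requested size to search for in the log is smaller than the observed leak.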

@d-a-v
Collaborator

d-a-v commented Apr 4, 2022

Are these line numbers (935, 672) constant across the different FW versions which we can enable in menus / config,
or are they used here only for debugging until a final, more "general" fix comes?

@Flole998
Contributor Author

Flole998 commented Apr 4, 2022

My patch is unique for each libwpa2.a version. There are currently 3 versions of that file in this repo. However, since the latest NONOS-SDK fixes it, we should never need to patch again for future versions. I only modify libwpa2.a by changing the free call into a nop.

@d-a-v
Collaborator

d-a-v commented Apr 4, 2022

So we have two ways for fixing this bug:

  • by patching each libwpa2 version, replacing the offending call to free() with a nop
  • by recognizing the caller by its debugging line number

the latest NONOS-SDK fixes it

Are you referring to nonos-sdk-v3 ?

@Flole998
Contributor Author

Flole998 commented Apr 4, 2022

Yes, those are the 2 ways. I prefer the first one: with the second one there is the possibility of a "collision" of line numbers, and it adds instructions that get executed on every free.

With "latest" I mean "latest master" of the SDK on GitHub, the one that's not (yet) part of this repo. I have looked at it in a disassembler just to see if they fixed it. So if someone does add the new SDK 3.0.4 (I think that's the current version), that would probably fix this specific problem as well.

@mhightower83
Contributor

  • by recognizing the caller by its debugging line number

My intent with this approach was just to explore whether we might be leaking memory by not freeing at 672.
From what I have seen, we do not; however, it does appear there are leaks elsewhere when cycling through disconnect and connect.
It looks like a NO-OP patched over callx0 should work.

Flole998 added a commit to Flole998/Arduino that referenced this issue Apr 4, 2022
Fixes: esp8266#8082

This patches the callx0 instruction to a nop in eap.o which is part of libwpa2.a.
It looks like espressif fixed the Bug in newer SDK versions, so if we update to the latest NONOS-SDK it is most likely not necessary to add/adapt this patch.
Also modifies the fix_sdk_libs.sh script as it even changed files if no changes were necessary, for example adding multiple system_func1 exports.
@Flole998
Contributor Author

Flole998 commented Apr 4, 2022

I've opened a PR for my changes now as it seems to be a good fix and doesn't seem to cause any regressions.

Flole998 added a commit to Flole998/Arduino that referenced this issue Apr 10, 2022
Fixes: esp8266#8082

This patches the callx0 instruction to a nop in eap.o which is part of libwpa2.a.
It looks like espressif fixed the Bug in newer SDK versions, so if we update to the latest NONOS-SDK it is most likely not necessary to add/adapt this patch.
Also modifies the fix_sdk_libs.sh script as it even changed files if no changes were necessary, for example adding multiple system_func1 exports.
@mhightower83
Contributor

Two more observations for the WPA2-Enterprise option:

  1. During connect, the system sometimes crashes with HWDT. The SYS stack usage often exceeds 4K. I have seen up to ~4640. Adding disable_extra4k_at_link_time(); made things work better for me.
  2. There are two memory leaks. These can be seen when cycling between connect and disconnect. The storage allocated for Identity and password at connect time is not freed.

@Flole998
Contributor Author

For 2.: Did you correctly call the matching clear-methods for all the set-methods? Those are necessary to free the memory according to the SDK documentation.

@mhightower83
Contributor

Yes this set:

  wifi_station_clear_enterprise_identity();
  wifi_station_clear_enterprise_password();
  wifi_station_clear_enterprise_cert_key();
  wifi_station_clear_enterprise_ca_cert();
  wifi_station_clear_enterprise_username();
  wifi_station_clear_enterprise_new_password();

It looks like those only free the memory allocated when calling wifi_station_set_enterprise_... functions. The allocations that leak are made later when you connect. For the SDK I am using, they were allocated by return callers: eap.c:775 and eap.c:757

@Flole998
Contributor Author

Flole998 commented Apr 13, 2022

You are using username/password, right? I was using certificate based authentication when testing. Line 757 is for the identity, I assume you're using the anonymous identity? Not that it would make things better, if you supply an identity it will just change to line 750, but with the same issue as far as I can see... Looks like eap_peer_config_deinit is incomplete...

Line 775 seems to be for the password.

What we can probably do to fix this is rename eap_peer_config_deinit and provide our own version that also clears identity, password and new_password (it currently only clears the username, if I see that correctly), so we no longer leak password, identity and new_password (the latter allocated in line 766 if you supply it).
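
As a rough C++ sketch of what such a replacement would have to do: the struct layout and all names below (`PeerConfigSketch`, `peerConfigDeinitSketch`, `dupString`) are hypothetical, since the real configuration structure inside eap.o is not visible from here. The point is only that every string duplicated at connect time gets released and its pointer cleared.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for the per-connection EAP configuration.
struct PeerConfigSketch {
    char* identity     = nullptr;
    char* username     = nullptr;
    char* password     = nullptr;
    char* new_password = nullptr;
};

// Release one field and clear it so a later deinit or abort cannot free it again.
static void freeAndNull(char*& field) {
    std::free(field);      // free(NULL) is a no-op
    field = nullptr;
}

// Frees every duplicated string, not just the username.
void peerConfigDeinitSketch(PeerConfigSketch& cfg) {
    freeAndNull(cfg.identity);
    freeAndNull(cfg.username);
    freeAndNull(cfg.password);
    freeAndNull(cfg.new_password);
}

// Helper for the usage example: heap-allocated copy of a C string.
char* dupString(const char* s) {
    const std::size_t n = std::strlen(s) + 1;
    char* copy = static_cast<char*>(std::malloc(n));
    if (copy != nullptr) std::memcpy(copy, s, n);
    return copy;
}
```

Because each pointer is nulled after release, calling the deinit twice in a row is harmless, which avoids trading the leak for another double-free.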

It would explain my 40 Byte leak: 23 for the identity string, 12 for poison, 4 for header, that's 39, must be multiple of 8, so 40. It won't explain why you only saw a 32 byte leak though (you mentioned that earlier, not sure if that's still accurate), especially if you also supplied a password. It also doesn't really explain my 8-40-40...-Pattern.

@mhightower83
Contributor

I have been trying TLS, TTLS, and PEAP. They all seem to share some form of leak. PEAP was the one I was last looking at.

The following is for TLS. This is a snippet from the allocation dump I am working on. The [nnnnn] value is the real size specified for the allocation. I think this dump is correct, I have fixed a few issues already to get to this. Take it with a grain of salt.

|          |0x3fff3a3c[00020], 0x4022ec98 0x4022ed60 0x4022ed4c 0x4022f048 - .."@`."@L."@H."@
|          |0x4022f30c File: eap.c:1045 
|          |0x3fff3a64[00064], 0x00000000 0x0000000d 0x3ffeb93c 0x3fff3ab4 - ........<..?.:.?
|          |0x4022e6b1 File: eap.c:206 
|          |0x3fff3ab4[00064], 0x00000000 0x0000001a 0x3ffebc00 0x3fff3b04 - ...........?.;.?
|          |0x4022e6b1 File: eap.c:206 
|          |0x3fff3b04[00064], 0x00000000 0x00000019 0x3ffeb89c 0x3fff3b54 - ...........?T;.?
|          |0x4022e6b1 File: eap.c:206 
|          |0x3fff3b54[00064], 0x00000000 0x00000015 0x3ffeb990 0x00000000 - ...........?....
|          |0x4022e6b1 File: eap.c:206 
|          |0x3fff3c8c[00016], 0x00000000 0x0043afd2 0x4021a048 0x00000000 - ......C.H.!@....
|          |0x4021a419 File: <NULL>:-2 
|          |0x3fff3cac[00023], 0x6e6f6e61 0x756f6d79 0x73654073 0x73657270 - anonymous@espres
|          |0x4022ee35 File: eap.c:757 
|          |0x3fff3eec[00672], 0x3fff42c4 0xffffffff 0x0fd8da88 0x00000000 - .B.?............
|          |0x4021f3e0 File: ieee80211.c:569 
|          |0x3fff419c[00280], 0xffffffff 0x0000ffff 0x00000000 0x00000000 - ................
|          |0x4021f3ff File: ieee80211.c:575 
|          |0x3fff42c4[00068], 0x00000000 0x00000000 0x00000000 0x00000000 - ................
|          |0x40248e77 File: eagle_lwip_if.c:192 
|          |0x3fff431c[00080], 0x00000000 0x3fff37e4 0x00000000 0x3fff37e4 - .....7.?.....7.?
|          |0x40248fde File: eagle_lwip_if.c:242 
|          |0x3fff437c[00052], 0x86d895e5 0x00000000 0x00000001 0x04dd02c7 - ................
|          |0x4021a419 File: <NULL>:-2 
|          |0x3fff43c4[00023], 0x6e6f6e61 0x756f6d79 0x73654073 0x73657270 - anonymous@espres
|          |0x4022ee35 File: eap.c:757 

For TLS, as you go through initial start, Enterprise set, connect, disconnect, and Enterprise clear, the total number of allocations goes 19, 24, 32, 28, 28. As you cycle through connect and disconnect, the number of Identity allocations increases by one each time.

@Flole998
Contributor Author

Well, I know what needs to be freed, but I don't know how to do it yet: eap_peer_config_deinit is not exported, it's only a section in the eap.o object, so weak function overrides and that kind of stuff won't work. I tried to compile my own .o (using xtensa-gcc with the -c option) and then replace the section using objcopy's update-section, but there's relocation information attached to it, and even if I replace that as well it won't work and crashes with an illegal instruction.
I think the best way to do this would be to patch that section so that it calls a new function that is imported and executed. I don't know how to add a new import and have that function's address placed at a specific location, though (I have plenty of space, as I basically replace an entire function with "l32r a0, fun", "callx0 a0" and "ret.n", leaving everything after that unused and available for whatever I want).

That leak also still seems to be there in latest master of the SDK, unlike the double-free which was fixed in the meantime.

d-a-v pushed a commit that referenced this issue Jun 2, 2022
* Fix double-free when connecting to WPA2-Enterprise networks

Fixes: #8082

This patches the callx0 instruction to a nop in eap.o which is part of libwpa2.a.
It looks like espressif fixed the Bug in newer SDK versions, so if we update to the latest NONOS-SDK it is most likely not necessary to add/adapt this patch.
Also modifies the fix_sdk_libs.sh script as it even changed files if no changes were necessary, for example adding multiple system_func1 exports.

* Apply suggestions from code review