
Robustness fixes for netstack #44

Closed
wants to merge 5 commits into from

Conversation

adamgreen
Contributor

I welcome feedback on all of the updates included in this request, but I would really like to get opinions on one in particular. Commit 4813d58 contains a fix for an issue I saw where the maximum token count was exceeded on the semaphore used to communicate between the ethernet ISR and the packet_rx() thread. My current fix is to have packet_rx() consume only one packet per semaphore signal event instead of multiple as it did before. I wonder if it would be better to keep the original loop that handles all received packets when the packet_rx() thread is awakened, but to use a Signal instead of a Semaphore. Thoughts?
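
For reference, here is a rough sketch of the Signal-based alternative I have in mind (hypothetical code using the CMSIS-RTOS signal API; the thread id, RX_SIGNAL flag, struct fields, and per-frame input call are placeholder names, not the code that was actually committed):

    /* The ISR sets a thread flag instead of releasing a counting semaphore;
     * a flag cannot overflow the way a semaphore token count can. */
    void ENET_IRQHandler(void)
    {
        osSignalSet(packet_rx_thread_id, RX_SIGNAL);
    }

    static void packet_rx(void *arg)
    {
        struct lpc_enetdata *lpc_enetif = arg;
        while (1) {
            osSignalWait(RX_SIGNAL, osWaitForever);
            /* drain every frame currently owned by software, as the
             * original loop did */
            while (LPC_EMAC->RxConsumeIndex != LPC_EMAC->RxProduceIndex)
                lpc_enetif_input(lpc_enetif->netif);
        }
    }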

The other changes include:

  • Fully disable LWIP_ASSERTs by defining LWIP_NOASSERT macro in lwipopts.h
  • Reset pbuf length when re-queueing on error. Previously the re-queued buffer size would have been adjusted to the size of the previously received packet instead of the full allocated size of the pbuf object.
  • Enable the SYS_LIGHTWEIGHT_PROT macro in lwipopts.h. This is also a very big change! It turns out that a lot of the global data structures in lwIP weren't being protected by synchronization objects and therefore were not thread safe. (A sketch of both lwipopts.h changes follows this list.)
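
In concrete terms, the two lwipopts.h deltas amount to something like this (a sketch of the option changes only, not a diff of the actual file):

    /* lwipopts.h */
    #define LWIP_NOASSERT         1   /* compile LWIP_ASSERT() out entirely */
    #define SYS_LIGHTWEIGHT_PROT  1   /* wrap shared lwIP state in
                                         SYS_ARCH_PROTECT critical sections */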

Please refer to the commit descriptions for more details on each of these changes.

I have tested these networking changes with a few samples that I have building with gcc4mbed and also the NET_1 test built and tested with the Python scripts in the mbed tree itself. I would run more of the tests from the mbed build system but they don't seem to work with GCC_ARM due to what I believe is a _sbrk() issue.

Thanks for considering these changes!

I was doing some debugging that had me looking at the disassembly of
lpc_rx_queue() from within the debugger.  I was looking for the call to
pbuf_alloc() that we see in the following code snippet:
		p = pbuf_alloc(PBUF_RAW, (u16_t) EMAC_ETH_MAX_FLEN, PBUF_RAM);
		if (p == NULL) {
			LWIP_DEBUGF(UDP_LPC_EMAC | LWIP_DBG_TRACE,
				("lpc_rx_queue: could not allocate RX pbuf (free desc=%d)\n",
				lpc_enetif->rx_free_descs));
			return queued;
		}

		/* pbufs allocated from the RAM pool should be non-chained. */
		LWIP_ASSERT("lpc_rx_queue: pbuf is not contiguous (chained)",
			pbuf_clen(p) <= 1);

When I was looking through the disassembly for this code I noticed a
call to pbuf_clen() in the actual machine code.
=> 0x0000bab0 <+24>:	bl	0x44c0 <pbuf_clen>
   0x0000bab4 <+28>:	ldr	r3, [r4, #112]	; 0x70
   0x0000bab6 <+30>:	ldrh.w	r12, [r5, #10]
   0x0000baba <+34>:	add.w	r2, r3, #9
   0x0000babe <+38>:	add.w	r0, r12, #4294967295	; 0xffffffff

The only call to pbuf_clen() made from this function is the one from
within the LWIP_ASSERT.  When I looked more closely at how this macro
was defined, I saw that the mbed version of the stack had disabled the
LWIP_PLATFORM_ASSERT macro when LWIP_DEBUG was false.  This means that
no action is taken when an assert fails, but the LWIP_ASSERT macro is
still allowed to evaluate the assert expression itself.  Defining the
LWIP_NOASSERT macro fully disables the LWIP_ASSERT macro.
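
For context, the assert plumbing in lwIP's debug.h has roughly this
shape (paraphrased from memory, not an exact copy of the header):

    #ifndef LWIP_NOASSERT
    #define LWIP_ASSERT(message, assertion) \
        do { if (!(assertion)) LWIP_PLATFORM_ASSERT(message); } while (0)
    #else
    #define LWIP_ASSERT(message, assertion)   /* compiled out entirely */
    #endif

With only LWIP_PLATFORM_ASSERT defined away (the previous mbed setup),
the if (!(assertion)) test, and therefore the pbuf_clen(p) call seen in
the disassembly above, still gets compiled and executed even though its
result is never acted upon.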

I saw one of my TCP/IP samples shrink by about 0.5K when I made this
change.

After making my previous commit to completely disable LWIP_ASSERT
macro invocations, I ended up with a warning in pbuf.c where an
err variable was set but only checked for success in an assert.  I
added a "(void)err;" reference to silence this warning.
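
The pattern boils down to this (illustrative only; the call shown is a
made-up stand-in, not the exact pbuf.c code):

    err_t err = some_pbuf_operation(p);            /* hypothetical call */
    LWIP_ASSERT("operation failed", err == ERR_OK);
    (void)err;  /* keeps the set-but-unused warning quiet once
                   LWIP_NOASSERT strips the assert from the build */
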
This option actually enables the use of the lwip_sys_mutex for
protecting concurrent access to such important lwIP resources as:
  select_cb_list (this is the one which originally flagged the problem)
  sockets array
  mem stats (if enabled)
  heap (if LWIP_ALLOW_MEM_FREE_FROM_OTHER_CONTEXT was non-zero)
  memp pool allocs/frees
  netif->loop_last pbuf linked list
  pbuf reference counts
  ...

I first noticed this issue when I hit a crash while slamming the net
stack with a large number of TCP packets (I was actually sending 1k
data buffers from the TCPEchoServer mbed sample.)  It crashed in the
last line of this code snippet from event_callback:
  for (scb = select_cb_list; scb != NULL; scb = scb->next) {
    if (scb->sem_signalled == 0) {

It was crashing because scb had an invalid address so it generated a
bus fault.  I figured that memory was either corrupted or there was
some kind of concurrency issue.  In trying to determine which, I wanted
to walk through the select_cb_list linked list and see where it was
corrupted:
    (gdb) p scb
    $1 = (struct lwip_select_cb *) 0x85100080
    (gdb) p select_cb_list
    $2 = (struct lwip_select_cb *) 0x0

That was interesting: the head of the linked list was now NULL, but it
must have had a non-NULL value when this loop started running or we
would never have gotten to the point where we hit this crash.

This was starting to look like a concurrency issue since the linked
list was modified out from underneath this thread.  Looking through the
source code for this function, I saw use of macros like
SYS_ARCH_PROTECT and SYS_ARCH_UNPROTECT which looked like they should
be providing the thread synchronization.  I disassembled the
event_callback() function in the debugger and saw no sign of the
synchronization API calls that I expected.  A search through the code
for the definition of these SYS_ARCH_UN/PROTECT macros led me to
discover that they were actually ignored unless an
implementation defined them itself (the mbed version doesn't do so) or
the SYS_LIGHTWEIGHT_PROT macro is set to non-zero (the mbed version
didn't do this either).  Flipping the SYS_LIGHTWEIGHT_PROT macro on in
lwipopts.h fixed the crash I kept hitting, increased the size of the
code a bit, and unfortunately slows things down a bit since it now
actually serializes access to these data structures by making calls
to the RTOS sync APIs.
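
For reference, the protection plumbing in lwIP's sys.h looks roughly
like this (paraphrased, not an exact copy of the header):

    #if SYS_LIGHTWEIGHT_PROT
    #define SYS_ARCH_DECL_PROTECT(lev)  sys_prot_t lev
    #define SYS_ARCH_PROTECT(lev)       lev = sys_arch_protect()
    #define SYS_ARCH_UNPROTECT(lev)     sys_arch_unprotect(lev)
    #else
    /* Unless the port supplies its own definitions, these all expand to
     * nothing, which is the state the mbed port was in. */
    #define SYS_ARCH_DECL_PROTECT(lev)
    #define SYS_ARCH_PROTECT(lev)
    #define SYS_ARCH_UNPROTECT(lev)
    #endif
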
Previously the packet_rx() function would wait on the RxSem and when
signalled it would process all available inbound packets.  This used to
cause no problem but once the thread synchronization was turned
on via SYS_LIGHTWEIGHT_PROT, the semaphore actually started to overflow
its maximum token count of 65535.  This caused the mbed_die() flashing
LEDs of death.  The old code was really breaking the producer/consumer
pattern that I typically see with a semaphore since the consumer was
written to consume more than 1 produced object per semaphore wait.
Before the thread synchronization was enabled, the packet_rx() thread
could use a single time slice to process all of these packets and then
loop back around a few more times to decrement the semaphore count
while skipping the packet processing since it had all been done.
Now the packet processing code would cause the thread to give up its
time slice as it hit newly enabled critical sections.  In the end it
was possible for the code to leak 2 semaphore signals for every 1
time that the thread was awakened.  After about 10 seconds of load,
this would cause a leak of 65535 signals.
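
The fix keeps packet_rx() strictly one-for-one with the semaphore.  A
rough sketch, reusing names from the snippets above (the per-frame
input call is an assumed name, not the exact driver code):

    static void packet_rx(void *arg)
    {
        struct lpc_enetdata *lpc_enetif = arg;
        while (1) {
            /* block until the ISR releases exactly one token per frame */
            sys_arch_sem_wait(&lpc_enetif->RxSem, 0);
            /* lpc_low_level_input() re-checks the consume/produce
             * indices, so at most one available frame is processed */
            lpc_enetif_input(lpc_enetif->netif);
        }
    }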

NOTE: Two potential issues with this change:
1) The LPC_EMAC->RxConsumeIndex != LPC_EMAC->RxProduceIndex check was
   removed from packet_rx().  I believe that this is Ok since the same
   condition is later checked in lpc_low_level_input() anyway so it
   won't now try to process more packets than what exist.
2) What if ENET_IRQHandler(void) ends up not signalling the RxSem for
   every packet received?  When would that happen?  I could see it
   happening if the ethernet hardware would try to pend more than 1
   interrupt when the priority was too elevated to process the
   pending requests.  Putting the consumer loop back in packet_rx()
   and using a Signal instead of a Semaphore might be a better
   solution?

I recently pulled an NXP crash fix for their ethernet driver which will
requeue a pbuf to the ethernet driver rather than sending it to the
lwip stack if it can't allocate a new pbuf to keep the ethernet
hardware primed with available packet buffers.  While recently
reviewing this code I noticed that the re-queueing operation used the
size of the last received packet rather than the full allocated size
of the pbuf.  I now reset the pbuf size back to its originally
allocated size before doing this requeue operation.
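
In code terms the change amounts to something like the following
sketch (the requeue helper name is assumed for illustration, not taken
from the driver):

    /* Restore the pbuf to its full allocated size before handing it
     * back to the EMAC; the receive path had trimmed len/tot_len down
     * to the size of the last frame it carried. */
    p->len = (u16_t)EMAC_ETH_MAX_FLEN;
    p->tot_len = (u16_t)EMAC_ETH_MAX_FLEN;
    rx_requeue_pbuf(lpc_enetif, p);   /* hypothetical requeue helper */
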
@adamgreen
Contributor Author

Pablo was buddy testing this fix for me and indicated that he has hit a new hang. I need to investigate that, so I am closing out this pull request.

@adamgreen adamgreen closed this Aug 26, 2013