dcd_nrf5x: race condition again #2778

rgrr · 2024-08-27T05:36:11Z

Operating System

Others

Board

PCA10056

Firmware

Custom firmware which uses CDC-ACM for data transfer. On the other end BLE with old Nordic SDK.

TinyUSB is at the state as of 2024-08-26.

What happened ?

The test procedure does "connect CDC, get BLE device list, connect BLE, transfer little data, disconnect device, disconnect CDC". Not sure if the whole procedure is required, the "disconnect CDC, connect CDC" part seems to be the critical part.

After some iterations (may take a few hundred iterations, each around 15s), TinyUSB has a failed assertion. I nailed that down before already, see #2626, that disable/enable IRQ at the correct place solves the problem. But the reviewer (@kasjer) wasn't happy with the solution, because interrupts were blocked for a long time.

Now there is a little bit more time on my side and I will try to fix this differently.

How to reproduce ?

Test firmware/loop see above

Debug Log as txt file (LOG/CFG_TUSB_DEBUG=2)

I will try to produce a log, currently I can deliver some kind of system state via the debugger.

Failed assertion is in usbd.c:1322, "TU_ASSERT(_usbd_dev.ep_status[epnum][dir].busy == 0)"

If one follows the stack (attached), one can see that

tud_task_ext(): handles DCD_EVENT_XFER_COMPLETE, ep_addr==0x80, which means, that epnum==0 and ep_dir==1
usbd_control_xfer_cb() runs through until call to _status_stage_xact() (line 208)
_status_stage_xact() now wants to initiate usbd_edpt_xfer() with ep_addr==0
usbd_edpt_xfer() fails because this endpoint is still busy

My personal conclusion is, that the static _ctrl_xfer is in an unexpected state, just as if the order of events is important and disrupted by an interrupt which puts its own event in between.

Screenshots

Stack of the assertion:

I have checked existing issues, dicussion and documentation

I confirm I have checked existing issues, dicussion and documentation.

kasjer · 2024-08-27T06:00:01Z

@rgrr If you would be willing to share code I can help with fixing this problem. I use TinyUSB with mynewt and I'm very much interested in fixing this before it affects me in the future.

rgrr · 2024-08-27T06:19:17Z

thank you @kasjer. Problem with the code is, that it is owned by my company. I have to strip everything down to a minimum test case. Will again need some time.

My gut feeling is currently telling me, that it has something to do with disconnect/connect and pending data transfers.

Currently catching the debug output, but if there is bad luck than the problem does not happen with debug log on.

kasjer · 2024-08-27T06:24:45Z

@rgrr I guess it's still easier for you to strip down code that you know that finds problem then for me to come up with the code that may never trigger bad condition. As soon as you have something to share let me know and we (you and me) can get to the bottom of this.

rgrr · 2024-08-27T06:30:47Z

thanks again @kasjer ... I agree and hope to find a simple example.

Small spoiler: disconnect/connect sequence must be the culprit, because EP=0 as the trace above showed, EP=2 is used for data transfer

rgrr · 2024-08-27T09:03:18Z

Haha... and I already wanted to check the "friendly" "patch".

Now I have a debug log. One which is ok, the other run
ning into the bug. They are more or less the same until line 113 where the buggy log queues another request while the "ok" log completes a transfer.

See yourself if you want.

usbbug.txt
usbok.txt

The "ZZZZ" messages are here

bool dcd_edpt_xfer(uint8_t rhport, uint8_t ep_addr, uint8_t* buffer, uint16_t total_bytes) {
....
  if (control_status) {
    kernelprintf("ZZZZ1 %d\n", is_in_isr());

    // Status Phase also requires EasyDMA has to be available as well !!!!
    edpt_dma_start(&NRF_USBD->TASKS_EP0STATUS);

    // The nRF doesn't interrupt on status transmit so we queue up a success response.
    dcd_event_xfer_complete(0, ep_addr, 0, XFER_RESULT_SUCCESS, is_in_isr());

    kernelprintf("ZZZZ2 %d\n", is_in_isr());

I hoped to see something about interrupts, but did not

kasjer · 2024-08-27T11:53:05Z

@rgrr I'm processing your logs to understand the issue, if you can share some code it could help (especially host part). If you have traces from USB or logic analyzers it could also help understand timing. My wild guess right now is DMA access which must be serialized and can affect sequence of events when DMA is not ready at some point.

rgrr · 2024-08-27T16:38:12Z

@kasjer : sorry, both sides are more or less complex. The host side requires a working client, so no easy way.

I have reintroduced the disable/enable interrupt around the above sequence and again the problem disappeared.

But my first interpretation was wrong: the function is not called when it is already active (never observed active interrupt in this function til now). The actual problem is, that an interrupt between the two calls edpt_dma_start() and dcd_event_xfer_complete() inserts something it should not do.

So I'm currently trying to find out, what it could be. And my first try was to exchange the two calls.

rgrr · 2024-08-27T16:41:23Z

Hmmmm... wondering, if dcd_event_handler() has to be thread safe...

rgrr · 2024-08-27T20:41:40Z

Exchanging the two call seems to solve my issue. Currently have 3000 iterations of my test loop. With the original code the bug happens latest after 300 iterations.

rgrr · 2024-08-28T06:17:40Z

@kasjer : Update: it ran all night, had around 20000 CDC disconnect/reconnect cycles (the "bad" original needed around 300+/- cycles to run into the assertion). I count this as verified.

Do you have any explanation?

If you agree, I will prepare a pull request.

kasjer · 2024-08-28T06:23:07Z

I see what you mean. I don't see any drawback of moving dcd_event_xfer_complete() to be executed before edpt_dma_start(&NRF_USBD->TASKS_EP0STATUS);
I'm still trying to figure out how original order breaks the flow.

kasjer · 2024-08-28T08:22:02Z

@rgrr could you please modify line

TU_LOG_USBD("  Queue EP %02X with %u bytes ...\r\n", ep_addr, total_bytes);

to include information returned by is_in_isr() as you did with ZZZZ ?

kasjer · 2024-08-28T08:30:31Z

... and maybe dump content of _usbd_qdef and _usbd_qdef_buf after assertion failed to see if queue get corrupted somehow

rgrr · 2024-08-28T12:58:15Z

Ok... done that with surprising results:

  TU_LOG_USBD("  Queue EP %02X with %u bytes, is_in_isr %d ...\r\n", ep_addr, total_bytes, xx_is_in_isr());

never shows "in isr"!

Also the queue show some obscure values:

wr_idx/rd_idx are beyond depth!?

PS: OSAL is NONE.

kasjer · 2024-08-28T13:02:18Z

I was expecting some usbd_defer_func() call due to DMA access race but queue does not seem to have any.
Now I'm inclined to thing that atomic_flag_test_and_set() is not working as expected.
It would be nice to check if code ever tries to defer DMA.
Maybe code generated for atomic_flag_test_and_set() in xact_out_dma() is somehow broken
Dump of assembler code for function xact_out_dma:

=> 0x00001968 <+0>:	push	{r4, r5, r6, r7}
   0x0000196a <+2>:	ldr	r1, [pc, #172]	@ (0x1a18 <xact_out_dma+176>)
   0x0000196c <+4>:	dmb	ish
   0x00001970 <+8>:	add.w	r2, r1, #288	@ 0x120
   0x00001974 <+12>:	mov.w	r4, #1				// Block that checks if DMA is in use
   0x00001978 <+16>:	ldrexb	r3, [r2]            //
   0x0000197c <+20>:	strexb	r5, r4, [r2]        //
   0x00001980 <+24>:	cmp	r5, #0                  //
   0x00001982 <+26>:	bne.n	0x1978 <xact_out_dma+16>
   0x00001984 <+28>:	dmb	ish
   0x00001988 <+32>:	uxtb	r3, r3
   0x0000198a <+34>:	cbz	r3, 0x19a4 <xact_out_dma+60>
   0x0000198c <+36>:	ldr	r3, [pc, #140]	@ (0x1a1c <xact_out_dma+180>)
   0x0000198e <+38>:	ldr	r2, [r3, #4]
   0x00001990 <+40>:	ubfx	r2, r2, #0, #9
   0x00001994 <+44>:	subs	r2, #0
   0x00001996 <+46>:	mov	r1, r0
   0x00001998 <+48>:	pop	{r4, r5, r6, r7}
   0x0000199a <+50>:	ldr	r0, [pc, #132]	@ (0x1a20 <xact_out_dma+184>)
   0x0000199c <+52>:	it	ne
   0x0000199e <+54>:	movne	r2, #1
   0x000019a0 <+56>:	b.w	0x1880 <usbd_defer_func>
   0x000019a4 <+60>:	cmp	r0, #8
   0x000019a6 <+62>:	ldr	r3, [pc, #124]	@ (0x1a24 <xact_out_dma+188>)
   0x000019a8 <+64>:	bne.n	0x19e0 <xact_out_dma+120>
   0x000019aa <+66>:	ldr.w	r2, [r3, #1216]	@ 0x4c0
   0x000019ae <+70>:	lsls	r0, r2, #15

@rgrr
Can you verify that your function have similar pattern for atomic flags ?

HiFiPhile · 2024-08-28T14:02:26Z

wr_idx/rd_idx are beyond depth!?

For the fifo it's normal, it's using unmasked pointers to avoid 1 space wasted and decople write /read.

rgrr · 2024-08-29T07:56:29Z

@kasjer : the atomic flag functions used are

#define atomic_bool                    nrfx_atomic_flag_t
#define atomic_flag                    nrfx_atomic_flag_t

#define atomic_flag_clear(X)           nrfx_atomic_flag_clear(X)
#define atomic_flag_test_and_set(X)    nrfx_atomic_flag_set_fetch(X)

Code is as follows (sorry, I'm no Cortex-M assembler freak):

          xact_out_dma:
00031be4:   push    {r7, lr}
00031be6:   mov     r7, sp
00031be8:   sub     sp, #16
00031bea:   strb.w  r0, [r7, #-5]
 193        xfer_td_t* xfer = get_td(epnum, TUSB_DIR_OUT);
00031bee:   ldrb.w  r0, [r7, #-5]
00031bf2:   movs    r1, #0
00031bf4:   bl      0x31b72 <get_td>
00031bf8:   str     r0, [sp, #4]
 198        if (atomic_flag_test_and_set(&_dcd.dma_running)) {
00031bfa:   movw    r0, #13428      @ 0x3474
00031bfe:   movt    r0, #8192       @ 0x2000
00031c02:   add.w   r0, r0, #288    @ 0x120
00031c06:   bl      0x374fc <nrfx_atomic_flag_set_fetch>
00031c0a:   cbz     r0, 0x31c34 <xact_out_dma+80>
00031c0c:   b.n     0x31c0e <xact_out_dma+42>
 199          usbd_defer_func((osal_task_func_t) xact_out_dma_wrapper, (void*) (uint32_t) epnum, is_in_isr());
00031c0e:   ldrb.w  r1, [r7, #-5]
00031c12:   movw    r0, #60676      @ 0xed04
00031c16:   movt    r0, #57344      @ 0xe000
 149        return (SCB->ICSR & SCB_ICSR_VECTACTIVE_Msk) ? true : false;
00031c1a:   ldr     r2, [r0, #0]
00031c1c:   bfc     r2, #9, #23
00031c20:   cmp     r2, #0
00031c22:   it      ne
00031c24:   movne   r2, #1
 199          usbd_defer_func((osal_task_func_t) xact_out_dma_wrapper, (void*) (uint32_t) epnum, is_in_isr());
00031c26:   movw    r0, #7625       @ 0x1dc9
00031c2a:   movt    r0, #3
00031c2e:   bl      0x368a4 <usbd_defer_func>
 200          return;

I have inserted my ZZZZ debug output again, this time including _dcd.dma_running.

kasjer · 2024-08-29T08:32:07Z

@rgrr could you please show assembler for nrfx_atomic_flag_set_fetch() and nrfx_atomic_u32_fetch_or() this later function can be compiled in different variants so it's just to check which way for atomic manipulation is used in your code.
One version can disables interrupts while other may use ldrex instructions

rgrr · 2024-08-29T09:16:52Z

@kasjer here they are:

          nrfx_atomic_u32_fetch_or:
000374a8:   sub     sp, #20
000374aa:   str     r0, [sp, #16]
000374ac:   str     r1, [sp, #12]
 99           NRFX_ATOMIC_OP(orr, old_val, new_val, p_data, value);
000374ae:   ldr     r0, [sp, #16]
000374b0:   ldr     r1, [sp, #12]
000374b2:   ldrex   r2, [r0]
000374b6:   orr.w   r3, r2, r1
000374ba:   strex   r12, r3, [r0]
000374be:   teq     r12, #0
000374c2:   bne.n   0x374b2 <nrfx_atomic_u32_fetch_or+10>
000374c4:   str     r2, [sp, #8]
000374c6:   str     r3, [sp, #4]
000374c8:   str.w   r12, [sp]
101           return old_val;
000374cc:   ldr     r0, [sp, #8]
000374ce:   add     sp, #20
000374d0:   bx      lr

...

          nrfx_atomic_flag_set_fetch:
000374fc:   push    {r7, lr}
000374fe:   mov     r7, sp
00037500:   sub     sp, #8
00037502:   str     r0, [sp, #4]
379           return nrfx_atomic_u32_fetch_or(p_data, 1);
00037504:   ldr     r0, [sp, #4]
00037506:   movs    r1, #1
00037508:   bl      0x374a8 <nrfx_atomic_u32_fetch_or>
0003750c:   add     sp, #8
0003750e:   pop     {r7, pc}

In the debug output at ZZZZ, _dcd.dma_running is always false.

Now compiled with "NRFX_ATOMIC_USE_BUILT_IN=1" and checking.

kasjer · 2024-08-29T09:19:32Z

This looks OK, so it must be something else.

kasjer · 2024-08-29T11:06:03Z

I think I understand the problem.
Your solution to swap lines is correct.

Here is what happens:

Set Line Coding request is received
Task schedules read of data (7 bytes)
Task also schedules write to EP 82 (14 bytes)
Interrupt send notification to Task that 7 bytes transfer is done
Task wants to send status (0 bytes that does not takes DMA for long)
Task sends 0 bytes data packet
Interrupt fires probably for EP 82 informing that packet was received by host and send transfer complete event
Interrupt fires again this time is new setup request (host received 0 bytes packet after all) and sends setup request event
Task continues and sends to itself transfer complete event for Set Line Coding
Task process queue starting from EP 82 transfer complete
Task process out of order setup request Get Line Coding
Task now sees old transfer complete for Set Line Coding but probably confuses it with last request eventually

So changing order of lines seems the right thing to do.
@rgrr Thanks you for finding it and solution.

kasjer · 2024-08-29T12:01:33Z

Now that I know how it works it's easy to reproduce not even data endpoints traffic is needed.
Adding TU_LOG() between those lines will suffice to trigger the problem.

  if (control_status) {
    // Status Phase also requires EasyDMA has to be available as well !!!!
    edpt_dma_start(&NRF_USBD->TASKS_EP0STATUS);

    TU_LOG2("Not so fast\n");

    // The nRF doesn't interrupt on status transmit so we queue up a success response.
    dcd_event_xfer_complete(0, ep_addr, 0, XFER_RESULT_SUCCESS, is_in_isr());

Testing app

import serial

tty="COM30"
SERIAL_BAUDRATE=1000000

for i in range(0, 20000) :
    ser = serial.Serial(port=tty,
                        baudrate=SERIAL_BAUDRATE, timeout=0.01)
    ser.close()

rgrr · 2024-08-29T12:08:37Z

Haha, that's a simple testcase! And a very complicated sequence for the bug to happen.

Should I prepare a PR?

kasjer · 2024-08-29T12:11:14Z

Haha, that's a simple testcase! And a very complicated sequence for the bug to happen.

Should I prepare a PR?

Please do.

hathach · 2024-08-30T11:19:43Z

thank you @rgrr and @kasjer for very effective troubleshooting analysist

rgrr added the Bug 🐞 label Aug 27, 2024

Repository owner deleted a comment Aug 27, 2024

rgrr mentioned this issue Aug 29, 2024

dcd_nrf5x: fix race condition #2782

Merged

HiFiPhile closed this as completed in #2782 Aug 29, 2024

HiFiPhile closed this as completed in 1d2c9f9 Aug 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dcd_nrf5x: race condition again #2778

dcd_nrf5x: race condition again #2778

rgrr commented Aug 27, 2024 •

edited

Loading

kasjer commented Aug 27, 2024

rgrr commented Aug 27, 2024

kasjer commented Aug 27, 2024

rgrr commented Aug 27, 2024

rgrr commented Aug 27, 2024 •

edited

Loading

kasjer commented Aug 27, 2024

rgrr commented Aug 27, 2024

rgrr commented Aug 27, 2024

rgrr commented Aug 27, 2024

rgrr commented Aug 28, 2024

kasjer commented Aug 28, 2024

kasjer commented Aug 28, 2024

kasjer commented Aug 28, 2024

rgrr commented Aug 28, 2024 •

edited

Loading

kasjer commented Aug 28, 2024 •

edited

Loading

HiFiPhile commented Aug 28, 2024

rgrr commented Aug 29, 2024

kasjer commented Aug 29, 2024

rgrr commented Aug 29, 2024

kasjer commented Aug 29, 2024

kasjer commented Aug 29, 2024

kasjer commented Aug 29, 2024

rgrr commented Aug 29, 2024

kasjer commented Aug 29, 2024

hathach commented Aug 30, 2024

dcd_nrf5x: race condition again #2778

dcd_nrf5x: race condition again #2778

Comments

rgrr commented Aug 27, 2024 • edited Loading

Operating System

Board

Firmware

What happened ?

How to reproduce ?

Debug Log as txt file (LOG/CFG_TUSB_DEBUG=2)

Screenshots

I have checked existing issues, dicussion and documentation

kasjer commented Aug 27, 2024

rgrr commented Aug 27, 2024

kasjer commented Aug 27, 2024

rgrr commented Aug 27, 2024

rgrr commented Aug 27, 2024 • edited Loading

kasjer commented Aug 27, 2024

rgrr commented Aug 27, 2024

rgrr commented Aug 27, 2024

rgrr commented Aug 27, 2024

rgrr commented Aug 28, 2024

kasjer commented Aug 28, 2024

kasjer commented Aug 28, 2024

kasjer commented Aug 28, 2024

rgrr commented Aug 28, 2024 • edited Loading

kasjer commented Aug 28, 2024 • edited Loading

HiFiPhile commented Aug 28, 2024

rgrr commented Aug 29, 2024

kasjer commented Aug 29, 2024

rgrr commented Aug 29, 2024

kasjer commented Aug 29, 2024

kasjer commented Aug 29, 2024

kasjer commented Aug 29, 2024

rgrr commented Aug 29, 2024

kasjer commented Aug 29, 2024

hathach commented Aug 30, 2024

rgrr commented Aug 27, 2024 •

edited

Loading

rgrr commented Aug 27, 2024 •

edited

Loading

rgrr commented Aug 28, 2024 •

edited

Loading

kasjer commented Aug 28, 2024 •

edited

Loading