Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dcd_nrf5x: race condition again #2778

Closed
1 task done
rgrr opened this issue Aug 27, 2024 · 25 comments · Fixed by #2782
Closed
1 task done

dcd_nrf5x: race condition again #2778

rgrr opened this issue Aug 27, 2024 · 25 comments · Fixed by #2782
Labels

Comments

@rgrr
Copy link
Contributor

rgrr commented Aug 27, 2024

Operating System

Others

Board

PCA10056

Firmware

Custom firmware which uses CDC-ACM for data transfer. On the other end BLE with old Nordic SDK.

TinyUSB is at the state as of 2024-08-26.

What happened ?

The test procedure does "connect CDC, get BLE device list, connect BLE, transfer little data, disconnect device, disconnect CDC". Not sure if the whole procedure is required, the "disconnect CDC, connect CDC" part seems to be the critical part.

After some iterations (may take a few hundred iterations, each around 15s), TinyUSB has a failed assertion. I nailed that down before already, see #2626, that disable/enable IRQ at the correct place solves the problem. But the reviewer (@kasjer) wasn't happy with the solution, because interrupts were blocked for a long time.

Now there is a little bit more time on my side and I will try to fix this differently.

How to reproduce ?

Test firmware/loop see above

Debug Log as txt file (LOG/CFG_TUSB_DEBUG=2)

I will try to produce a log, currently I can deliver some kind of system state via the debugger.

Failed assertion is in usbd.c:1322, "TU_ASSERT(_usbd_dev.ep_status[epnum][dir].busy == 0)"

If one follows the stack (attached), one can see that

  • tud_task_ext(): handles DCD_EVENT_XFER_COMPLETE, ep_addr==0x80, which means, that epnum==0 and ep_dir==1
  • usbd_control_xfer_cb() runs through until call to _status_stage_xact() (line 208)
  • _status_stage_xact() now wants to initiate usbd_edpt_xfer() with ep_addr==0
  • usbd_edpt_xfer() fails because this endpoint is still busy

My personal conclusion is, that the static _ctrl_xfer is in an unexpected state, just as if the order of events is important and disrupted by an interrupt which puts its own event in between.

Screenshots

Stack of the assertion:

image

I have checked existing issues, dicussion and documentation

  • I confirm I have checked existing issues, dicussion and documentation.
@rgrr rgrr added the Bug 🐞 label Aug 27, 2024
@kasjer
Copy link
Collaborator

kasjer commented Aug 27, 2024

@rgrr If you would be willing to share code I can help with fixing this problem. I use TinyUSB with mynewt and I'm very much interested in fixing this before it affects me in the future.

@rgrr
Copy link
Contributor Author

rgrr commented Aug 27, 2024

thank you @kasjer. Problem with the code is, that it is owned by my company. I have to strip everything down to a minimum test case. Will again need some time.

My gut feeling is currently telling me, that it has something to do with disconnect/connect and pending data transfers.

Currently catching the debug output, but if there is bad luck than the problem does not happen with debug log on.

@kasjer
Copy link
Collaborator

kasjer commented Aug 27, 2024

@rgrr I guess it's still easier for you to strip down code that you know that finds problem then for me to come up with the code that may never trigger bad condition. As soon as you have something to share let me know and we (you and me) can get to the bottom of this.

@rgrr
Copy link
Contributor Author

rgrr commented Aug 27, 2024

thanks again @kasjer ... I agree and hope to find a simple example.

Small spoiler: disconnect/connect sequence must be the culprit, because EP=0 as the trace above showed, EP=2 is used for data transfer

Repository owner deleted a comment Aug 27, 2024
@rgrr
Copy link
Contributor Author

rgrr commented Aug 27, 2024

Haha... and I already wanted to check the "friendly" "patch".

Now I have a debug log. One which is ok, the other run
ning into the bug. They are more or less the same until line 113 where the buggy log queues another request while the "ok" log completes a transfer.

See yourself if you want.

usbbug.txt
usbok.txt

The "ZZZZ" messages are here

bool dcd_edpt_xfer(uint8_t rhport, uint8_t ep_addr, uint8_t* buffer, uint16_t total_bytes) {
....
  if (control_status) {
    kernelprintf("ZZZZ1 %d\n", is_in_isr());

    // Status Phase also requires EasyDMA has to be available as well !!!!
    edpt_dma_start(&NRF_USBD->TASKS_EP0STATUS);

    // The nRF doesn't interrupt on status transmit so we queue up a success response.
    dcd_event_xfer_complete(0, ep_addr, 0, XFER_RESULT_SUCCESS, is_in_isr());

    kernelprintf("ZZZZ2 %d\n", is_in_isr());

I hoped to see something about interrupts, but did not

@kasjer
Copy link
Collaborator

kasjer commented Aug 27, 2024

@rgrr I'm processing your logs to understand the issue, if you can share some code it could help (especially host part). If you have traces from USB or logic analyzers it could also help understand timing. My wild guess right now is DMA access which must be serialized and can affect sequence of events when DMA is not ready at some point.

@rgrr
Copy link
Contributor Author

rgrr commented Aug 27, 2024

@kasjer : sorry, both sides are more or less complex. The host side requires a working client, so no easy way.

I have reintroduced the disable/enable interrupt around the above sequence and again the problem disappeared.

But my first interpretation was wrong: the function is not called when it is already active (never observed active interrupt in this function til now). The actual problem is, that an interrupt between the two calls edpt_dma_start() and dcd_event_xfer_complete() inserts something it should not do.

So I'm currently trying to find out, what it could be. And my first try was to exchange the two calls.

@rgrr
Copy link
Contributor Author

rgrr commented Aug 27, 2024

Hmmmm... wondering, if dcd_event_handler() has to be thread safe...

@rgrr
Copy link
Contributor Author

rgrr commented Aug 27, 2024

Exchanging the two call seems to solve my issue. Currently have 3000 iterations of my test loop. With the original code the bug happens latest after 300 iterations.

@rgrr
Copy link
Contributor Author

rgrr commented Aug 28, 2024

@kasjer : Update: it ran all night, had around 20000 CDC disconnect/reconnect cycles (the "bad" original needed around 300+/- cycles to run into the assertion). I count this as verified.

Do you have any explanation?

If you agree, I will prepare a pull request.

@kasjer
Copy link
Collaborator

kasjer commented Aug 28, 2024

I see what you mean. I don't see any drawback of moving dcd_event_xfer_complete() to be executed before edpt_dma_start(&NRF_USBD->TASKS_EP0STATUS);
I'm still trying to figure out how original order breaks the flow.

@kasjer
Copy link
Collaborator

kasjer commented Aug 28, 2024

@rgrr could you please modify line

TU_LOG_USBD("  Queue EP %02X with %u bytes ...\r\n", ep_addr, total_bytes);

to include information returned by is_in_isr() as you did with ZZZZ ?

@kasjer
Copy link
Collaborator

kasjer commented Aug 28, 2024

... and maybe dump content of _usbd_qdef and _usbd_qdef_buf after assertion failed to see if queue get corrupted somehow

@rgrr
Copy link
Contributor Author

rgrr commented Aug 28, 2024

Ok... done that with surprising results:

  TU_LOG_USBD("  Queue EP %02X with %u bytes, is_in_isr %d ...\r\n", ep_addr, total_bytes, xx_is_in_isr());

never shows "in isr"!

Also the queue show some obscure values:

image

image

wr_idx/rd_idx are beyond depth!?

PS: OSAL is NONE.

@kasjer
Copy link
Collaborator

kasjer commented Aug 28, 2024

I was expecting some usbd_defer_func() call due to DMA access race but queue does not seem to have any.
Now I'm inclined to thing that atomic_flag_test_and_set() is not working as expected.
It would be nice to check if code ever tries to defer DMA.
Maybe code generated for atomic_flag_test_and_set() in xact_out_dma() is somehow broken
Dump of assembler code for function xact_out_dma:

=> 0x00001968 <+0>:	push	{r4, r5, r6, r7}
   0x0000196a <+2>:	ldr	r1, [pc, #172]	@ (0x1a18 <xact_out_dma+176>)
   0x0000196c <+4>:	dmb	ish
   0x00001970 <+8>:	add.w	r2, r1, #288	@ 0x120
   0x00001974 <+12>:	mov.w	r4, #1				// Block that checks if DMA is in use
   0x00001978 <+16>:	ldrexb	r3, [r2]            //
   0x0000197c <+20>:	strexb	r5, r4, [r2]        //
   0x00001980 <+24>:	cmp	r5, #0                  //
   0x00001982 <+26>:	bne.n	0x1978 <xact_out_dma+16>
   0x00001984 <+28>:	dmb	ish
   0x00001988 <+32>:	uxtb	r3, r3
   0x0000198a <+34>:	cbz	r3, 0x19a4 <xact_out_dma+60>
   0x0000198c <+36>:	ldr	r3, [pc, #140]	@ (0x1a1c <xact_out_dma+180>)
   0x0000198e <+38>:	ldr	r2, [r3, #4]
   0x00001990 <+40>:	ubfx	r2, r2, #0, #9
   0x00001994 <+44>:	subs	r2, #0
   0x00001996 <+46>:	mov	r1, r0
   0x00001998 <+48>:	pop	{r4, r5, r6, r7}
   0x0000199a <+50>:	ldr	r0, [pc, #132]	@ (0x1a20 <xact_out_dma+184>)
   0x0000199c <+52>:	it	ne
   0x0000199e <+54>:	movne	r2, #1
   0x000019a0 <+56>:	b.w	0x1880 <usbd_defer_func>
   0x000019a4 <+60>:	cmp	r0, #8
   0x000019a6 <+62>:	ldr	r3, [pc, #124]	@ (0x1a24 <xact_out_dma+188>)
   0x000019a8 <+64>:	bne.n	0x19e0 <xact_out_dma+120>
   0x000019aa <+66>:	ldr.w	r2, [r3, #1216]	@ 0x4c0
   0x000019ae <+70>:	lsls	r0, r2, #15

@rgrr
Can you verify that your function have similar pattern for atomic flags ?

@HiFiPhile
Copy link
Collaborator

wr_idx/rd_idx are beyond depth!?

For the fifo it's normal, it's using unmasked pointers to avoid 1 space wasted and decople write /read.

@rgrr
Copy link
Contributor Author

rgrr commented Aug 29, 2024

@kasjer : the atomic flag functions used are

#define atomic_bool                    nrfx_atomic_flag_t
#define atomic_flag                    nrfx_atomic_flag_t

#define atomic_flag_clear(X)           nrfx_atomic_flag_clear(X)
#define atomic_flag_test_and_set(X)    nrfx_atomic_flag_set_fetch(X)

Code is as follows (sorry, I'm no Cortex-M assembler freak):

          xact_out_dma:
00031be4:   push    {r7, lr}
00031be6:   mov     r7, sp
00031be8:   sub     sp, #16
00031bea:   strb.w  r0, [r7, #-5]
 193        xfer_td_t* xfer = get_td(epnum, TUSB_DIR_OUT);
00031bee:   ldrb.w  r0, [r7, #-5]
00031bf2:   movs    r1, #0
00031bf4:   bl      0x31b72 <get_td>
00031bf8:   str     r0, [sp, #4]
 198        if (atomic_flag_test_and_set(&_dcd.dma_running)) {
00031bfa:   movw    r0, #13428      @ 0x3474
00031bfe:   movt    r0, #8192       @ 0x2000
00031c02:   add.w   r0, r0, #288    @ 0x120
00031c06:   bl      0x374fc <nrfx_atomic_flag_set_fetch>
00031c0a:   cbz     r0, 0x31c34 <xact_out_dma+80>
00031c0c:   b.n     0x31c0e <xact_out_dma+42>
 199          usbd_defer_func((osal_task_func_t) xact_out_dma_wrapper, (void*) (uint32_t) epnum, is_in_isr());
00031c0e:   ldrb.w  r1, [r7, #-5]
00031c12:   movw    r0, #60676      @ 0xed04
00031c16:   movt    r0, #57344      @ 0xe000
 149        return (SCB->ICSR & SCB_ICSR_VECTACTIVE_Msk) ? true : false;
00031c1a:   ldr     r2, [r0, #0]
00031c1c:   bfc     r2, #9, #23
00031c20:   cmp     r2, #0
00031c22:   it      ne
00031c24:   movne   r2, #1
 199          usbd_defer_func((osal_task_func_t) xact_out_dma_wrapper, (void*) (uint32_t) epnum, is_in_isr());
00031c26:   movw    r0, #7625       @ 0x1dc9
00031c2a:   movt    r0, #3
00031c2e:   bl      0x368a4 <usbd_defer_func>
 200          return;

I have inserted my ZZZZ debug output again, this time including _dcd.dma_running.

@kasjer
Copy link
Collaborator

kasjer commented Aug 29, 2024

@rgrr could you please show assembler for nrfx_atomic_flag_set_fetch() and nrfx_atomic_u32_fetch_or() this later function can be compiled in different variants so it's just to check which way for atomic manipulation is used in your code.
One version can disables interrupts while other may use ldrex instructions

@rgrr
Copy link
Contributor Author

rgrr commented Aug 29, 2024

@kasjer here they are:

          nrfx_atomic_u32_fetch_or:
000374a8:   sub     sp, #20
000374aa:   str     r0, [sp, #16]
000374ac:   str     r1, [sp, #12]
 99           NRFX_ATOMIC_OP(orr, old_val, new_val, p_data, value);
000374ae:   ldr     r0, [sp, #16]
000374b0:   ldr     r1, [sp, #12]
000374b2:   ldrex   r2, [r0]
000374b6:   orr.w   r3, r2, r1
000374ba:   strex   r12, r3, [r0]
000374be:   teq     r12, #0
000374c2:   bne.n   0x374b2 <nrfx_atomic_u32_fetch_or+10>
000374c4:   str     r2, [sp, #8]
000374c6:   str     r3, [sp, #4]
000374c8:   str.w   r12, [sp]
101           return old_val;
000374cc:   ldr     r0, [sp, #8]
000374ce:   add     sp, #20
000374d0:   bx      lr

...

          nrfx_atomic_flag_set_fetch:
000374fc:   push    {r7, lr}
000374fe:   mov     r7, sp
00037500:   sub     sp, #8
00037502:   str     r0, [sp, #4]
379           return nrfx_atomic_u32_fetch_or(p_data, 1);
00037504:   ldr     r0, [sp, #4]
00037506:   movs    r1, #1
00037508:   bl      0x374a8 <nrfx_atomic_u32_fetch_or>
0003750c:   add     sp, #8
0003750e:   pop     {r7, pc}

In the debug output at ZZZZ, _dcd.dma_running is always false.

Now compiled with "NRFX_ATOMIC_USE_BUILT_IN=1" and checking.

@kasjer
Copy link
Collaborator

kasjer commented Aug 29, 2024

This looks OK, so it must be something else.

@kasjer
Copy link
Collaborator

kasjer commented Aug 29, 2024

I think I understand the problem.
Your solution to swap lines is correct.

Here is what happens:

  • Set Line Coding request is received
  • Task schedules read of data (7 bytes)
  • Task also schedules write to EP 82 (14 bytes)
  • Interrupt send notification to Task that 7 bytes transfer is done
  • Task wants to send status (0 bytes that does not takes DMA for long)
  • Task sends 0 bytes data packet
  • Interrupt fires probably for EP 82 informing that packet was received by host and send transfer complete event
  • Interrupt fires again this time is new setup request (host received 0 bytes packet after all) and sends setup request event
  • Task continues and sends to itself transfer complete event for Set Line Coding
  • Task process queue starting from EP 82 transfer complete
  • Task process out of order setup request Get Line Coding
  • Task now sees old transfer complete for Set Line Coding but probably confuses it with last request eventually

So changing order of lines seems the right thing to do.
@rgrr Thanks you for finding it and solution.

@kasjer
Copy link
Collaborator

kasjer commented Aug 29, 2024

Now that I know how it works it's easy to reproduce not even data endpoints traffic is needed.
Adding TU_LOG() between those lines will suffice to trigger the problem.

  if (control_status) {
    // Status Phase also requires EasyDMA has to be available as well !!!!
    edpt_dma_start(&NRF_USBD->TASKS_EP0STATUS);

    TU_LOG2("Not so fast\n");

    // The nRF doesn't interrupt on status transmit so we queue up a success response.
    dcd_event_xfer_complete(0, ep_addr, 0, XFER_RESULT_SUCCESS, is_in_isr());

Testing app

import serial

tty="COM30"
SERIAL_BAUDRATE=1000000

for i in range(0, 20000) :
    ser = serial.Serial(port=tty,
                        baudrate=SERIAL_BAUDRATE, timeout=0.01)
    ser.close()

@rgrr
Copy link
Contributor Author

rgrr commented Aug 29, 2024

Haha, that's a simple testcase! And a very complicated sequence for the bug to happen.

Should I prepare a PR?

@kasjer
Copy link
Collaborator

kasjer commented Aug 29, 2024

Haha, that's a simple testcase! And a very complicated sequence for the bug to happen.

Should I prepare a PR?

Please do.

@hathach
Copy link
Owner

hathach commented Aug 30, 2024

thank you @rgrr and @kasjer for very effective troubleshooting analysist

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants