dcd_nrf5x: race condition again #2778
Comments
@rgrr If you would be willing to share code I can help with fixing this problem. I use TinyUSB with mynewt and I'm very much interested in fixing this before it affects me in the future. |
Thank you @kasjer. The problem with the code is that it is owned by my company. I have to strip everything down to a minimal test case, which will again take some time. My gut feeling currently tells me that it has something to do with disconnect/connect and pending data transfers. I am currently capturing the debug output, but with bad luck the problem does not happen with the debug log on. |
@rgrr I guess it's still easier for you to strip down code that you know triggers the problem than for me to come up with code that may never trigger the bad condition. As soon as you have something to share let me know and we (you and me) can get to the bottom of this. |
Thanks again @kasjer ... I agree and hope to find a simple example. Small spoiler: the disconnect/connect sequence must be the culprit, because the trace above showed EP=0, while EP=2 is used for data transfer.
Haha... and I already wanted to check the "friendly" "patch". Now I have debug logs: one run which is OK, and the other run. See for yourself if you want. The "ZZZZ" messages are here:
bool dcd_edpt_xfer(uint8_t rhport, uint8_t ep_addr, uint8_t* buffer, uint16_t total_bytes) {
  ....
  if (control_status) {
    kernelprintf("ZZZZ1 %d\n", is_in_isr());
    // Status Phase also requires EasyDMA has to be available as well !!!!
    edpt_dma_start(&NRF_USBD->TASKS_EP0STATUS);
    // The nRF doesn't interrupt on status transmit so we queue up a success response.
    dcd_event_xfer_complete(0, ep_addr, 0, XFER_RESULT_SUCCESS, is_in_isr());
    kernelprintf("ZZZZ2 %d\n", is_in_isr());
I hoped to see something about interrupts, but did not. |
@rgrr I'm processing your logs to understand the issue, if you can share some code it could help (especially host part). If you have traces from USB or logic analyzers it could also help understand timing. My wild guess right now is DMA access which must be serialized and can affect sequence of events when DMA is not ready at some point. |
@kasjer : sorry, both sides are more or less complex. The host side requires a working client, so there is no easy way. I have reintroduced the disable/enable interrupt around the above sequence and again the problem disappeared. But my first interpretation was wrong: the function is not called while it is already active (I have never observed an active interrupt in this function until now). The actual problem is that an interrupt between the two calls edpt_dma_start() and dcd_event_xfer_complete() inserts something it should not. So I'm currently trying to find out what it could be. My first try was to exchange the two calls. |
Hmmmm... wondering, if dcd_event_handler() has to be thread safe... |
Exchanging the two calls seems to solve my issue. I currently have 3000 iterations of my test loop. With the original code the bug happens after 300 iterations at the latest. |
@kasjer : Update: it ran all night, had around 20000 CDC disconnect/reconnect cycles (the "bad" original needed around 300+/- cycles to run into the assertion). I count this as verified. Do you have any explanation? If you agree, I will prepare a pull request. |
I see what you mean. I don't see any drawback of moving |
@rgrr could you please modify line TU_LOG_USBD(" Queue EP %02X with %u bytes ...\r\n", ep_addr, total_bytes); to include information returned by |
... and maybe dump content of |
I was expecting some
@rgrr |
For the fifo it's normal, it's using unmasked pointers to avoid wasting 1 space and to decouple write/read. |
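To illustrate what "unmasked pointers" means here, the following is a minimal generic sketch of a ring buffer with free-running indices. It is not TinyUSB's fifo code: the names `ring_t`, `ring_push`, `ring_pop` and the constant `RING_DEPTH` are hypothetical, and the sketch assumes a power-of-two depth to keep the masking simple.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical depth; this sketch assumes a power of two. */
#define RING_DEPTH 4u

typedef struct {
  uint8_t  data[RING_DEPTH];
  uint16_t wr; /* free-running write index, stored unmasked */
  uint16_t rd; /* free-running read index, stored unmasked */
} ring_t;

/* Number of stored items; unsigned wrap-around keeps this correct. */
static uint16_t ring_count(const ring_t *r) {
  return (uint16_t)(r->wr - r->rd);
}

/* Full and empty are distinguishable without sacrificing a slot. */
static bool ring_full(const ring_t *r)  { return ring_count(r) == RING_DEPTH; }
static bool ring_empty(const ring_t *r) { return r->wr == r->rd; }

/* Only the writer touches wr. The index is masked on access only. */
static bool ring_push(ring_t *r, uint8_t v) {
  if (ring_full(r)) return false;
  r->data[r->wr & (RING_DEPTH - 1u)] = v;
  r->wr++;
  return true;
}

/* Only the reader touches rd. */
static bool ring_pop(ring_t *r, uint8_t *v) {
  if (ring_empty(r)) return false;
  *v = r->data[r->rd & (RING_DEPTH - 1u)];
  r->rd++;
  return true;
}
```

Because the stored indices are never masked, they can look "out of range" in a debugger dump even though the buffer is healthy, which matches the remark above that this is normal for the fifo.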
@kasjer : the atomic flag functions used are
#define atomic_bool nrfx_atomic_flag_t
#define atomic_flag nrfx_atomic_flag_t
#define atomic_flag_clear(X) nrfx_atomic_flag_clear(X)
#define atomic_flag_test_and_set(X) nrfx_atomic_flag_set_fetch(X)
Code is as follows (sorry, I'm no Cortex-M assembler freak):
I have inserted my ZZZZ debug output again, this time including _dcd.dma_running. |
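For readers unfamiliar with the test-and-set pattern behind those macros, here is a minimal sketch using the standard C11 `<stdatomic.h>` flag instead of the nrfx functions. It assumes the flag is used to let only one caller claim the DMA at a time; the names `dma_try_claim` and `dma_release` are hypothetical and do not appear in the driver.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for _dcd.dma_running, using the C11 standard flag type. */
static atomic_flag dma_running = ATOMIC_FLAG_INIT;

/* Hypothetical helper: returns true only for the caller that flipped
 * the flag from clear to set, i.e. the one that now owns the DMA. */
static bool dma_try_claim(void) {
  /* atomic_flag_test_and_set returns the PREVIOUS value of the flag. */
  return !atomic_flag_test_and_set(&dma_running);
}

/* Hypothetical helper: mark the DMA as free again. */
static void dma_release(void) {
  atomic_flag_clear(&dma_running);
}
```

The pattern only works if the "set" operation reports the previous value atomically, which is why inspecting the generated assembler for the flag functions is a reasonable check.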
@rgrr could you please show assembler for |
@kasjer here they are:
In the debug output at ZZZZ, _dcd.dma_running is always false. Now compiled with "NRFX_ATOMIC_USE_BUILT_IN=1" and checking. |
This looks OK, so it must be something else. |
I think I understand the problem. Here is what happens:
So changing order of lines seems the right thing to do. |
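The effect of the call order can be sketched with a small host-side simulation. This is an assumption-laden model, not driver code: it assumes the interrupt that lands between the two calls queues a new SETUP event, and that such an event can only occur once the status stage has been started. All names (`sim_t`, `queue_event`, `start_status_stage`, `maybe_interrupt`, the event kinds) are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event kinds for this simulation. */
typedef enum { EVT_STATUS_COMPLETE, EVT_SETUP_RECEIVED } event_t;

#define QUEUE_LEN 4

typedef struct {
  event_t items[QUEUE_LEN];
  size_t  count;
  bool    status_started; /* hardware was told to run the status stage */
} sim_t;

static void queue_event(sim_t *s, event_t e) {
  if (s->count < QUEUE_LEN) s->items[s->count++] = e;
}

/* Stand-in for edpt_dma_start(&NRF_USBD->TASKS_EP0STATUS). */
static void start_status_stage(sim_t *s) { s->status_started = true; }

/* Simulated interrupt (assumption): a new SETUP can only be reported
 * after the status stage of the previous transfer was started. */
static void maybe_interrupt(sim_t *s) {
  if (s->status_started) queue_event(s, EVT_SETUP_RECEIVED);
}

/* Original order: the interrupt's event overtakes the completion. */
static void original_order(sim_t *s) {
  start_status_stage(s);
  maybe_interrupt(s);                  /* lands between the two calls */
  queue_event(s, EVT_STATUS_COMPLETE); /* dcd_event_xfer_complete()   */
}

/* Swapped order: the completion is queued before the hardware starts. */
static void swapped_order(sim_t *s) {
  queue_event(s, EVT_STATUS_COMPLETE);
  maybe_interrupt(s);                  /* nothing to report yet */
  start_status_stage(s);
  maybe_interrupt(s);                  /* arrives after the completion */
}
```

In this model the original order yields the queue SETUP, STATUS_COMPLETE, while the swapped order yields STATUS_COMPLETE, SETUP, so the stack sees the end of the old control transfer before the start of the next one.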
Now that I know how it works, it's easy to reproduce; not even data endpoint traffic is needed.
if (control_status) {
  // Status Phase also requires EasyDMA has to be available as well !!!!
  edpt_dma_start(&NRF_USBD->TASKS_EP0STATUS);
  // Extra log output delays the next call and widens the window for the interrupt.
  TU_LOG2("Not so fast\n");
  // The nRF doesn't interrupt on status transmit so we queue up a success response.
  dcd_event_xfer_complete(0, ep_addr, 0, XFER_RESULT_SUCCESS, is_in_isr());
Testing app
import serial

tty = "COM30"
SERIAL_BAUDRATE = 1000000
# Open and close the CDC port repeatedly to provoke the race.
for i in range(0, 20000):
    ser = serial.Serial(port=tty, baudrate=SERIAL_BAUDRATE, timeout=0.01)
    ser.close() |
Haha, that's a simple testcase! And a very complicated sequence for the bug to happen. Should I prepare a PR? |
Please do. |
Operating System
Others
Board
PCA10056
Firmware
Custom firmware which uses CDC-ACM for data transfer. On the other end BLE with old Nordic SDK.
TinyUSB is at the state as of 2024-08-26.
What happened ?
The test procedure does "connect CDC, get BLE device list, connect BLE, transfer little data, disconnect device, disconnect CDC". Not sure if the whole procedure is required, the "disconnect CDC, connect CDC" part seems to be the critical part.
After some iterations (it may take a few hundred iterations, each around 15s), TinyUSB has a failed assertion. I already narrowed this down before (see #2626): disabling/enabling the IRQ at the correct place solves the problem. But the reviewer (@kasjer) wasn't happy with the solution, because interrupts were blocked for a long time.
Now there is a little bit more time on my side and I will try to fix this differently.
How to reproduce ?
Test firmware/loop see above
Debug Log as txt file (LOG/CFG_TUSB_DEBUG=2)
I will try to produce a log, currently I can deliver some kind of system state via the debugger.
Failed assertion is in usbd.c:1322, "TU_ASSERT(_usbd_dev.ep_status[epnum][dir].busy == 0)"
If one follows the stack (attached), one can see that
My personal conclusion is that the static _ctrl_xfer is in an unexpected state, just as if the order of events is important and is disrupted by an interrupt which puts its own event in between.
Screenshots
Stack of the assertion:
I have checked existing issues, discussion and documentation