Realtek RTL8195AM - CMSIS-RTOS error: ISR Queue overflow (status: 0x2, task ID: 0x0, object ID: 0x30051484) #5640

Closed
JanneKiiskila opened this issue Dec 2, 2017 · 26 comments

Comments

@JanneKiiskila
Contributor

JanneKiiskila commented Dec 2, 2017

Description

  • Type: Bug
  • Priority: Blocker for releasing support for Mbed Cloud Client / mbed-os-example-client

Bug

Target
REALTEK_RTL8195AM

Toolchain:
GCC_ARM

Toolchain version:
mbed cli Windows installed toolchain 0.43
gcc_arm - same with Linux as well.

mbed-cli version:
(mbed --version)
1.2.2

mbed-os sha:
(git log -n1 --oneline)

2e1c2a1 (HEAD -> master, origin/master, origin/feature-lorawan, origin/HEAD) Merge pull request #5538 from geky/littlefs-staging
41591eb Merge pull request #5602 from artokin/nanostack_release_v704

DAPLink version:

241 - this one is also a bit old
It would be nice if Realtek could update the official DAPLINK you can download via their website.

=========================================================

ROM Version: 0.3

Build ToolChain Version: gcc version 4.8.3 (Realtek ASDK-4.8.3p1 Build 2003)

=========================================================
Check boot type form eFuse
SPI Initial
Image1 length: 0x3308, Image Addr: 0x10000bc8
Image1 Validate OK, Going jump to Image1

Expected behavior

mbed-os-example-client can run for a very long time.
For example reference testing was done with K64F, it works fine.

Actual behavior

...
simulate button_click, new value of counter is 280
simulate button_click, new value of counter is 281
simulate button_click, new value of counter is 282
simulate button_click, new value of counter is 283
CMSIS-RTOS error: ISR Queue overflow (status: 0x2, task ID: 0x0, object ID: 0x30051484)

[mbed_die]  0x0 die here

Steps to reproduce

git clone mbed-os-example-client
modify mbed_app.json with a valid SSID/WIFI-passphrase.
set connectivity method to `WIFI_RTW`
mbed compile -m REALTEK_RTL8195AM -t GCC_ARM

(Have not tried other compilers, though).

@JanneKiiskila JanneKiiskila changed the title Realtek RTL8195AM - Realtek RTL8195AM - CMSIS-RTOS error: ISR Queue overflow (status: 0x2, task ID: 0x0, object ID: 0x30051484) Dec 2, 2017
@JanneKiiskila
Contributor Author

JanneKiiskila commented Dec 2, 2017

@tung7970 @Archcady

[Mirrored to Jira]

@samchuarm

samchuarm commented Dec 4, 2017

@ARMmbed/team-realtek
[Mirrored to Jira]

@Archcady
Contributor

Archcady commented Dec 4, 2017

looking into this issue, will update ASAP.
[Mirrored to Jira]

@samchuarm

samchuarm commented Dec 13, 2017

@Archcady
[Mirrored to Jira]

@Archcady
Contributor

Archcady commented Dec 20, 2017

I am trying to test and debug this issue, but it takes a long time to reproduce. Is there a way to make the handle_timer_click trigger faster so that I can test and debug the point where it fails?
[Mirrored to Jira]

@tung7970
Contributor

tung7970 commented Dec 20, 2017

@Archcady You can try to reduce the wait period in main.cpp. Currently it's 25 secs per loop.

updates.wait(25000);

[Mirrored to Jira]

@Archcady
Contributor

Archcady commented Dec 20, 2017

I changed this to 1000, but it still takes a long time for the failure to trigger.

[Mirrored to Jira]

@JanneKiiskila
Contributor Author

JanneKiiskila commented Dec 20, 2017

We are talking about hours, not days, even with the default value.

[Mirrored to Jira]

@Archcady
Contributor

Archcady commented Jan 3, 2018

I have debugged this issue. The ISR queue overflow happens as a result of the call to "release()" in Semaphore.cpp, which in turn calls "osSemaphoreRelease". I checked our driver on the Realtek side, and we do not call "release()" anywhere in our code. It would be great if anyone on the ARM side could help me understand this issue. These are my findings so far.

P.S. We are unable to debug with pyOCD because it reports an error, so debugging is taking an extended amount of time.
[Mirrored to Jira]

@JanneKiiskila
Contributor Author

JanneKiiskila commented Jan 24, 2018

Hi,

Are you 100% sure you are not using semaphores anywhere else? I can see at least these calls using git grep under TARGET_Realtek.

TARGET_AMEBA/sdk/os/rtx2/rtx2_service.c:                osStatus_t status = osSemaphoreRelease(p_sem->id);
TARGET_AMEBA/sdk/os/rtx2/rtx2_service.c:                osStatus_t status = osSemaphoreRelease(p_sem->id);
T

Any imbalance between acquiring and releasing those semaphores could potentially cause an issue, right?

[Mirrored to Jira]

@MarceloSalazar

MarceloSalazar commented Feb 1, 2018

@Archcady is there any update on this?
[Mirrored to Jira]

@samchuarm

samchuarm commented Feb 2, 2018

Hi Marcelo, the Realtek team is still trying to narrow down where the semaphore mismatch might come from.
[Mirrored to Jira]

@bkht

bkht commented Feb 7, 2018

Hi, I have reproduced the same problem.
Using the on-line compiler, I have successfully run on a NUCLEO-F746ZG: Getting started with mbed Client on mbed OS https://os.mbed.com/teams/mbed-os-examples/code/mbed-os-example-client/
More info:
https://os.mbed.com/questions/80121/mbed-Client-on-mbed-OS-CMSIS-RTOS-error-/

I found that using some library (temperature sensor), to get some real-world data, causes this problem as soon as that library gets called, say mcp9808.readTemp().
That library works fine in a simple program.

[Mirrored to Jira]

@samchuarm

samchuarm commented Feb 8, 2018

So does this mean the ISR queue overflow is not platform dependent? @JanneKiiskila
@Archcady any progress on this issue?
[Mirrored to Jira]

@prashantrar
Contributor

prashantrar commented Feb 9, 2018

From the Realtek side, we are still debugging the issue, but I'll share some of my findings here in case they provide some pointers.

  1. The ISR queue overflow always happens after a fixed amount of time: it takes approximately 62-63 minutes to occur, every single time.

  2. The issue originates in the function "osRtxPostProcess" in Rtx_system.c, where "osRtxErrorNotify" is called and the program terminates.
    void osRtxPostProcess (os_object_t *object) {
      if (isr_queue_put(object) != 0U) {
        if (osRtxInfo.kernel.blocked == 0U) {
          SetPendSV();
        } else {
          osRtxInfo.kernel.pendSV = 1U;
        }
      } else {
        osRtxErrorNotify(osRtxErrorISRQueueOverflow, object);
      }
    }

  3. The reason "osRtxErrorNotify" is called is that, just before the crash, the "if" branch inside the function "" is executed 16 times.
    if (isr_queue_put(object) != 0U) {
      if (osRtxInfo.kernel.blocked == 0U) {
        SetPendSV();
      }
    The "kernel.blocked" check fails, so the same branch is taken 16 times. The ISR queue is defined with a size of 16, and hence the queue overflows.

  4. Surprisingly, if I comment out the call to "osRtxErrorNotify" in the function "osRtxPostProcess", the program runs forever without issues.

  5. Also, if I modify the example code so that the semaphore release is done with a software timer rather than the ticker, this issue doesn't happen. The issue is reproducible only when the ticker is used to release the semaphore.
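
As a rough illustration of findings 2 and 3, here is a minimal host-side sketch. It assumes the ISR queue behaves like a fixed 16-slot buffer that is only emptied by deferred post-processing. `IsrQueue`, `put`, `drain`, `kQueueSize` and `post_without_drain` are hypothetical names made up for this sketch; this is not the RTX implementation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical model of a fixed-size ISR post-processing queue.
// The size 16 matches the queue size reported above; the rest is illustrative.
constexpr std::size_t kQueueSize = 16;

class IsrQueue {
public:
    // Returns true if the object was queued, false if the queue is full.
    bool put(void *object) {
        assert(object != nullptr);
        if (count_ == kQueueSize) {
            return false;  // the case that would end in the overflow error
        }
        slots_[(head_ + count_) % kQueueSize] = object;
        ++count_;
        return true;
    }

    // Models the deferred post-processing emptying the queue.
    void drain() {
        head_ = 0;
        count_ = 0;
    }

    std::size_t size() const { return count_; }

private:
    std::array<void *, kQueueSize> slots_{};
    std::size_t head_ = 0;
    std::size_t count_ = 0;
};

// Posts `n` objects without draining in between and
// returns how many of them were rejected.
std::size_t post_without_drain(IsrQueue &queue, std::size_t n) {
    static int dummy = 0;  // placeholder for an RTOS object
    std::size_t rejected = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (!queue.put(&dummy)) {
            ++rejected;
        }
    }
    return rejected;
}
```

In this model, 16 posts in a row are accepted and the 17th is rejected unless `drain()` runs in between. Ignoring the rejected post, as in finding 4, would presumably just drop that one event rather than fix the underlying burst.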

P.S. I am still debugging the issue, and these are just my findings. If anyone from the Arm team can get some pointers from them or highlight anything that I am missing, please kindly help out.
[Mirrored to Jira]

@JanneKiiskila
Contributor Author

JanneKiiskila commented Feb 9, 2018

There is something that's board specific (or driver specific) - K64F does not have this issue.

But, clearly it's now something that's impacting more than one board, if this happens also with NUCLEO-F746ZG.

[Mirrored to Jira]

@samchuarm

samchuarm commented Feb 13, 2018

Hi @JanneKiiskila, do you think commenting out osRtxErrorNotify in osRtxPostProcess, or switching to a software timer for the semaphore release, could be the fix?
[Mirrored to Jira]

@JanneKiiskila
Contributor Author

JanneKiiskila commented Feb 13, 2018

I will admit my own limited knowledge at this stage and say I don't know. @kjbracey-arm, @geky, @sg- , or other Mbed OS team members would know better.

[Mirrored to Jira]

@kjbracey
Contributor

kjbracey commented Feb 14, 2018

This is something it's quite easy to hit in RTX when using any RTOS operations from interrupt context. The RTOS work is always deferred onto this queue, so if you do 16 consecutive RTOS operations from interrupt before returning to thread context, it overflows.

I've raised one issue here for RTX suggesting how this could be improved, at least for flags. Not sure if the same logic could apply to semaphores. Maybe?

ARM-software/CMSIS_5#283

Pending any RTX improvement, it's usually best to work around the issue by including logic to make sure you don't signal multiple consecutive times from interrupt: some sort of "pending" flag which is cleared by whoever is waiting on the semaphore.
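
A minimal host-side sketch of that kind of "pending" flag, under the assumption that the waiting thread handles all outstanding work each time it is woken. `CoalescedSignal`, `signal_from_isr`, `acknowledge` and `burst` are hypothetical names for illustration, and the counter stands in for the real semaphore release; none of this is Mbed OS or RTX API.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical wrapper: at most one wake-up is outstanding at any time.
class CoalescedSignal {
public:
    // Called from interrupt context. Returns true only when a new wake-up
    // actually has to be posted to the RTOS (for example a semaphore release).
    bool signal_from_isr() {
        // exchange() returns the previous value: if it was already true,
        // a wake-up is still pending and nothing more is queued.
        if (pending_.exchange(true)) {
            return false;
        }
        posts_.fetch_add(1);  // stands in for the real semaphore release
        return true;
    }

    // Called by the waiting thread right after it wakes up,
    // before it starts handling the work.
    void acknowledge() {
        assert(pending_.load());  // a wake-up implies the flag was set
        pending_.store(false);
    }

    unsigned posts() const { return posts_.load(); }

private:
    std::atomic<bool> pending_{false};
    std::atomic<unsigned> posts_{0};
};

// Simulates `n` interrupts firing before the thread gets to run and
// returns how many of them resulted in a post.
unsigned burst(CoalescedSignal &signal, unsigned n) {
    unsigned posted = 0;
    for (unsigned i = 0; i < n; ++i) {
        if (signal.signal_from_isr()) {
            ++posted;
        }
    }
    return posted;
}
```

With this, a burst of interrupts results in a single post no matter how long the thread is delayed. Clearing the flag before handling the work, rather than after, means an interrupt that arrives during handling still produces a fresh wake-up. The trade-off is that wake-ups no longer count interrupts one for one.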

Do we really have no information about where the interrupt-context semaphore release that triggers this is coming from? No backtrace?

[Mirrored to Jira]

@samchuarm

samchuarm commented Feb 26, 2018

@prashantrar @ARMmbed/team-realtek
[Mirrored to Jira]

@prashantrar
Contributor

prashantrar commented Feb 26, 2018

We are having difficulty taking backtraces because the stack is corrupt the moment the crash happens. The crash always originates from the semaphore release, but beyond that the backtrace cannot point to specific functions and usually just shows " ?? ()". I will try to get proper backtraces once again tomorrow and update this ticket.
[Mirrored to Jira]

@prashantrar
Contributor

prashantrar commented Mar 2, 2018

@kjbracey-arm Here is the latest backtrace, taken with all the latest mbed-os components.

#0  osRtxErrorNotify () at .\mbed-os\rtos\TARGET_CORTEX\mbed_rtx_handlers.c  
#1  0x3001bb74 in isrRtxSemaphoreRelease ()  
at .\mbed-os\rtos\TARGET_CORTEX\rtx5\RTX\Source\rtx_semaphore.c:414  
#2  osSemaphoreRelease ()  
at .\mbed-os\rtos\TARGET_CORTEX\rtx5\RTX\Source\rtx_semaphore.c:461  
#3  0x300193f2 in ticker_irq_handler () at .\mbed-os\hal\mbed_ticker_api.c:  
#4  0x30022ff4 in HalTimerIrq2To7Handle_Patch (Data=<optimized out>)  
at ../../TARGET_Realtek/TARGET_AMEBA/TARGET_RTL8195A/device/rtl8195a_ti  
:45  
#5  0x000035de in ?? ()  
Backtrace stopped: previous frame identical to this frame (corrupt stack?)  

[Mirrored to Jira]

@0xc0170
Contributor

0xc0170 commented Jul 25, 2018

@ARMmbed/team-realtek @JanneKiiskila Is this still a blocker, and is the issue still not fixed?
[Mirrored to Jira]

@samchuarm

samchuarm commented Sep 20, 2018

@M-ichae-l, can you confirm whether this issue has been addressed?
[Mirrored to Jira]

@ARMmbed ARMmbed deleted a comment from ciarmcom Oct 2, 2018
@adbridge
Contributor

adbridge commented Oct 4, 2018

Internal Jira reference: https://jira.arm.com/browse/IOTPART-5928

@MarceloSalazar

Closing, as this target won't be supported in Mbed 6 - #12775
