frequent OOM on 842N v2 #1197

Closed
azrdev opened this issue Jul 24, 2017 · 27 comments
Labels
0. type: bug (This is a bug) · 2. status: duplicate (Another similar issue already exists)

Comments

@azrdev

azrdev commented Jul 24, 2017

My TP-Link TL-WR842N v2 with firmware from darmstadt.freifunk.net frequently reboots; usually it doesn't get more than 1 hour of uptime. Nothing useful in the dmesg logs (except maybe lots of daemon.notice netifd: client (1352): cat: write error: Broken pipe), but I got a serial log, to be found at https://git.darmstadt.ccc.de/snippets/9

@mweinelt
Contributor

The node (https://meshviewer.darmstadt.freifunk.net/#/en/map/10feed08eda6) is running with the latest changes from the master branch.

582d096

@Sunz3r
Contributor

Sunz3r commented Jul 25, 2017

I found several nodes with high CPU load (sys load > 95%) when mesh-on-LAN is active. The next-node page won't load and sometimes the node crashes.
After disabling the mesh interface (e.g. "ifconfig eth0 down") the problem disappears.

There is no ugly flag in "batctl tg".

Gluon version: gluon-v2017.1.1+

@azrdev
Author

azrdev commented Jul 25, 2017

@Sunz3r does this always occur when mesh on LAN is active, or only when there are also connections on the LAN interfaces (cable plugged in and/or other batman nodes to communicate with)?

@T-X
Contributor

T-X commented Aug 1, 2017

@azrdev: I see processes called "autoupdater" and "10stop-network" in the provided log. So it seems that it crashed while trying to update?

Can you maybe reliably reproduce the crash when running /usr/sbin/autoupdater manually?

@T-X
Contributor

T-X commented Aug 1, 2017

Also, the 842ND v1/v2 seems to be one of those devices with 8MB of flash but still only 32MB of RAM, which could explain why this type of device runs into issues when trying to update. Compare that to an 841ND, for instance, which also has 32MB of RAM but only needs to store a 4MB image when updating.

@T-X
Contributor

T-X commented Aug 1, 2017

@Sunz3r: Seems like a different issue. Maybe create a new ticket in the issue tracker here on GitHub?

@azrdev
Author

azrdev commented Aug 1, 2017

Can you maybe reliably reproduce the crash when running /usr/sbin/autoupdater manually?

@T-X might be. If so, how would that help us / what should I provide?

@T-X
Contributor

T-X commented Aug 1, 2017

@azrdev: One first interesting thing to find out would be whether the crash happens during or after downloading the image. Can you add some print/write statements writing to /dev/kmsg in /usr/sbin/autoupdater to output some debug messages, so that we know better at what point in the update process things run out of memory?

If it were possible for you to reproduce the issue reliably, then I think it might also make sense to add some patches to increase the verbosity of the out-of-memory trace. For instance, more detailed information about what is using how much memory, not just in userspace but also in kernel space, would be very interesting. Not sure, but maybe it'd be possible to compile an ar71xx image with CONFIG_KERNEL_SLABINFO=y and dump /proc/slabinfo from within the OOM panic handler, too.
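
A minimal sketch of what such instrumentation could look like, assuming /usr/sbin/autoupdater is a Lua script on this firmware (the io.stderr suggestion in the follow-up comment points that way); the kmsg helper and the message texts are purely illustrative:

local function kmsg(msg)
    -- write one line into the kernel ring buffer, so the marker shows up
    -- on the serial console even if userspace logging has already died
    local f = io.open('/dev/kmsg', 'w')
    if f then
        f:write('autoupdater-debug: ' .. msg .. '\n')
        f:close()
    end
end

kmsg('before image download')
-- ... existing autoupdater code ...
kmsg('image download finished, before flashing')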

@T-X
Contributor

T-X commented Aug 1, 2017

PS: @azrdev, if you can reliably trigger it by executing /usr/sbin/autoupdater from the login shell via the serial console, then you might not need to write to /dev/kmsg. In that case it should be sufficient to write to stdout or stderr. You could sprinkle some lines like this into /usr/sbin/autoupdater:

io.stderr:write('We are here - line XXX\n')

@azrdev
Author

azrdev commented Aug 21, 2017

@T-X first results:

Without an uplink and with private wifi disabled (wireless.wan_radio0.disabled='1') it doesn't crash. I enabled private wifi again, and uptime still keeps going up. I still have to plug a cable into the WAN port again, so the setup is the same as before moving it to the debugging location. It has only run the autoupdater once with OOM and once without, so in the current state we don't get useful results.
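
As a side note, that setting can also be inspected and flipped programmatically; a hedged sketch using OpenWrt's libuci Lua bindings, assuming the uci Lua module is installed on the node and the section really is called wan_radio0 as above:

local uci = require('uci').cursor()

-- show the current state of the private wifi on radio0
print('wan_radio0 disabled = ' .. tostring(uci:get('wireless', 'wan_radio0', 'disabled')))

-- disable it again for the next test run, then reload the wireless config
uci:set('wireless', 'wan_radio0', 'disabled', '1')
uci:commit('wireless')
os.execute('wifi')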

@azrdev
Author

azrdev commented Sep 9, 2017

seems like I can (currently) reproduce a crash while receiving the last ~third of the firmware image, i.e. in wget
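
One way to confirm that free memory really collapses during the download (rather than at flash time) would be to leave a small watcher running on the serial console while wget fetches the image; a rough sketch, assuming the stock Lua interpreter and busybox sleep are available on the node:

local function memfree_kb()
    -- read MemFree (in kB) from /proc/meminfo
    for line in io.lines('/proc/meminfo') do
        local kb = line:match('^MemFree:%s+(%d+)')
        if kb then return tonumber(kb) end
    end
end

while true do
    io.write(os.date('%H:%M:%S'), '  MemFree ', tostring(memfree_kb()), ' kB\n')
    io.flush()
    os.execute('sleep 5')   -- Lua has no built-in sleep, use busybox sleep
end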

@rotanid
Member

rotanid commented Sep 9, 2017

It would be interesting to see how another device with the same specs (e.g. a WR841N) performs in the exact same situation (same spot, same configuration).
If the issue is the same, #753 would be the correct issue.

@T-X
Contributor

T-X commented Sep 10, 2017 via email

@azrdev
Author

azrdev commented Sep 10, 2017

@rotanid I'll test that as suggested.

azrdev, this reproducible crash, is it with the private wifi enabled or disabled now? And this node itself has a fastd VPN uplink via its WAN port?

@T-X private wifi enabled, and the fastd VPN uplink at the WAN port is in use, too

@rotanid
Member

rotanid commented Oct 13, 2017

@azrdev what about your check? May I close this issue in favor of #753?

@Adorfer
Contributor

Adorfer commented Oct 13, 2017

For me this issue is more specifically about a certain router model
(and not about all 32MB-RAM units in a certain domain).

Potentially I see a similar issue on a TL-WA901 v5 when running a v4 image...
(this is not good, but since it looks similar, I assume something is "wrong" in the target.)

@rotanid
Member

rotanid commented Oct 13, 2017

Thanks for not reading this ticket before leaving a comment.
We already agreed that he should test the issue with a WR841 in the same location and with the same configuration - if there is no issue then, you might be right.
Please don't simply assume this without a thorough test.

@Adorfer
Contributor

Adorfer commented Oct 13, 2017

Thanks for not reading this ticket before leaving a comment.

I guess I can remember it from reading it the last n times. I guess this was not meant as an ad hominem.

My point was to disagree that this is something like #753, just by the fact that your previous question was not marked as "needs answer" and had not been answered.

In this case it's either
a) broken hardware (an individual defect)
b) a broken target/profile
c) an issue specific to this network spot
d) an overall issue in the L2 domain.

Your suggestion was to close this and move it into d).
That assumption I cannot follow, since - from what I see discussed above - I do not see an overall instability on the network (for certain types of hardware class).

Replacing the unit with an 841 would probably help the same way as a drop-in replacement with an identical 842 v2.
Connecting a serial console might help as well, since OOMs are in many cases not transmitted via the network (ssh logread -f, syslog, ...), because the network stack dies at that very moment.

@rotanid
Member

rotanid commented Oct 13, 2017

Your suggestion was to close this and move it into d).

Simply wrong!
My suggestion was to check whether another device in the same spot/config has the same problems.
If yes, it's d).
If not, it's not d).

@azrdev
Author

azrdev commented Oct 13, 2017

sorry for delaying this, I'll do the test with the 841

@Adorfer
Contributor

Adorfer commented Oct 13, 2017

@rotanid "@azrdev what about your check, may i close this issue in favor of #753 ?"
reads for me "either you perform the suggested check or we will assume this issue to be a totally different one.

But off course this might be a susccessful strategy to reduce number of issues in case there is no feedback for individual ones which sounded "different" when they were opened.

Anyhow, depending on the outcome here, i would consider to open a similar request for a 901v5 (frequent OOM reboots like https://paste.debian.net/989352/ , where a 841v11 in the same spot performs without problems. But since 1) i do not have a second 901v5 to test for individual HW defect, nor a 901v4 to see if it's an issue with the profile, nor is this build a LEDE, but CC: I can not open a topic. i just like to hint, that there might be similar situations on other routers too 'profile specific').
This is not an attempt to hijack the issue or to derail it, just a not, that am rather curious about the outcome of this one.

@azrdev
Author

azrdev commented Nov 11, 2017

So, I had these running now for a month, logging uptime and load (manually, since our dashboard went down).

Both nodes were in the same location as previously, and both had a fastd VPN uplink via Ethernet (WAN port).
The 842 had private wifi disabled and still seems to have crashed occasionally, as the graph shows.
The 841 had private wifi enabled, and it shows some suspiciously low uptime, too: seldom above 24h, and a lot of reboots in a row.

I did not capture serial logs this time, but IMHO the data suggests that the 841 also frequently hangs with private wifi enabled.

841:
[uptime graph: 2017-10-15 caek uptime 2]

842:
[uptime graph: 842nd]
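
For repeating this kind of long-term test without a dashboard, a sketch of the sort of logger that could be run from cron on the node; the file path and format are arbitrary, and note that /tmp is lost on every reboot, so the log has to be pulled off the node regularly:

-- append timestamp, uptime and load average to a small log file;
-- e.g. call this every 5 minutes from /etc/crontabs/root
local function first_line(path)
    local f = io.open(path)
    if not f then return '?' end
    local line = f:read('*l')
    f:close()
    return line
end

local log = io.open('/tmp/uptime-log.txt', 'a')
log:write(os.date('%Y-%m-%d %H:%M:%S'),
          '  uptime ', first_line('/proc/uptime'),
          '  load ', first_line('/proc/loadavg'), '\n')
log:close()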

@Adorfer
Contributor

Adorfer commented Nov 11, 2017

So what's your conclusion? To change it from
"OOM on 842" to
"OOM on 842 with private wifi off and 841 with private wifi on"?

(Sorry, this is not a serious suggestion, but what I take from your reply is: "happens with 841 on same spot as well" - correct?)

@azrdev
Author

azrdev commented Nov 12, 2017

happens with 841 on same spot as well

yes, though I'm not sure if it's in the autoupdater (as with the 842) because I didn't capture a serial log

@rotanid
Member

rotanid commented Nov 12, 2017

so if this isn't related to a specific device, it might as well be the same as #753 and/or #1243 - right?

@azrdev
Author

azrdev commented Nov 14, 2017

@rotanid probably, yes. But IIRC the 842 was already unstable (long) before 2017.1.x, so it would be #753, not #1243.

@rotanid
Member

rotanid commented Nov 14, 2017

@azrdev ok, let's continue this discussion over there then.

rotanid closed this as completed Nov 14, 2017
rotanid added the "0. type: bug" label Nov 14, 2017
rotanid added the "2. status: duplicate" label Nov 14, 2017