frequent OOM on 842N v2 #1197

Closed
azrdev opened this issue Jul 24, 2017 · 27 comments
Labels
0. type: bug (This is a bug) · 2. status: duplicate (Another similar issue already exists)

Comments

@azrdev

azrdev commented Jul 24, 2017

My TP-Link TL-WR842N v2 with firmware from darmstadt.freifunk.net frequently reboots; usually it doesn't get more than 1 hour of uptime. Nothing useful in the dmesg logs (except maybe lots of daemon.notice netifd: client (1352): cat: write error: Broken pipe), but I got a serial log, to be found at https://git.darmstadt.ccc.de/snippets/9

@mweinelt
Contributor

The node (https://meshviewer.darmstadt.freifunk.net/#/en/map/10feed08eda6) is running with the latest changes from the master branch.

582d096

@Sunz3r
Contributor

Sunz3r commented Jul 25, 2017

I found several nodes with high CPU load (sys load > 95%) when mesh-on-LAN is active. The next-node page won't load and sometimes the node crashes.
After disabling the mesh interface (e.g. "ifconfig eth0 down") the problem disappears.

There is no ugly flag in "batctl tg".

Gluon version: gluon-v2017.1.1+

@azrdev
Author

azrdev commented Jul 25, 2017

@Sunz3r does this always occur when mesh on LAN is active, or only when there are also connections on the LAN interfaces (cable plugged in and/or other batman nodes to communicate with)?

@T-X
Contributor

T-X commented Aug 1, 2017

@azrdev: I see processes called "autoupdater" and "10stop-network" in the provided log. So it seems that it crashed while trying to update?

Can you maybe reliably reproduce the crash when running /usr/sbin/autoupdater manually?

@T-X
Contributor

T-X commented Aug 1, 2017

Also, the 842ND v1/v2 seems to be one of those devices with 8MB of flash but still only 32MB of RAM, which could explain why this type of device runs into issues when trying to update. Compare that to an 841ND, for instance, which also has 32MB of RAM but only needs to store a 4MB image when updating.

@T-X
Contributor

T-X commented Aug 1, 2017

@Sunz3r: Seems like a different issue. Maybe create a new ticket in the issue tracker here on GitHub?

@azrdev
Author

azrdev commented Aug 1, 2017

Can you maybe reliably reproduce the crash when running /usr/sbin/autoupdater manually?

@T-X might be. If so, how would that help us / what should I provide?

@T-X
Contributor

T-X commented Aug 1, 2017

@azrdev: One first interesting thing to find out would be whether the crash happens during or after downloading the image. Can you add some print/write statements writing to /dev/kmsg in /usr/sbin/autoupdater to output some debug messages, so that we know better at what point in the update process things run out of memory?

If it were possible for you to reproduce the issue reliably, then I think it might also make sense to add some patches to increase the verbosity of the out-of-memory trace. For instance, more detailed information about what is using how much memory, not just in userspace but also in kernel space, would be very interesting. Not sure, but maybe it'd be possible to compile an ar71xx image with CONFIG_KERNEL_SLABINFO=y and dump /proc/slabinfo from within the OOM panic handler, too.
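
A minimal sketch of what such instrumentation could look like, assuming /usr/sbin/autoupdater is a Lua script on this firmware (the io.stderr suggestion in the follow-up comment points that way); the kmsg helper and the message texts are purely illustrative:

local function kmsg(msg)
    -- write one line into the kernel ring buffer, so the marker shows up
    -- on the serial console even if userspace logging has already died
    local f = io.open('/dev/kmsg', 'w')
    if f then
        f:write('autoupdater-debug: ' .. msg .. '\n')
        f:close()
    end
end

kmsg('before image download')
-- ... existing autoupdater code ...
kmsg('image download finished, before flashing')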

@T-X
Contributor

T-X commented Aug 1, 2017

PS: @azrdev, if you can reliably trigger it by executing /usr/sbin/autoupdater from the login shell via the serial console, then you might not need to write to /dev/kmsg. In that case it should be sufficient to write to stdout or stderr. You could sprinkle some lines like this into /usr/sbin/autoupdater:

io.stderr:write('We are here - line XXX\n')

@azrdev
Author

azrdev commented Aug 21, 2017

@T-X first results:

Without an uplink and with private wifi disabled (wireless.wan_radio0.disabled='1') it doesn't crash. I enabled private wifi again, and uptime still keeps going up. I still have to plug a cable into the WAN port again, so the setup is the same as before moving it to the debugging location. It has only run the autoupdater once with OOM and once without, so in the current state we don't get useful results.
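
As a side note, that setting can also be inspected and flipped programmatically; a hedged sketch using OpenWrt's libuci Lua bindings, assuming the uci Lua module is installed on the node and the section really is called wan_radio0 as above:

local uci = require('uci').cursor()

-- show the current state of the private wifi on radio0
print('wan_radio0 disabled = ' .. tostring(uci:get('wireless', 'wan_radio0', 'disabled')))

-- disable it again for the next test run, then reload the wireless config
uci:set('wireless', 'wan_radio0', 'disabled', '1')
uci:commit('wireless')
os.execute('wifi')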

@azrdev
Author

azrdev commented Sep 9, 2017

seems like I can (currently) reproduce a crash while receiving the last ~third of the firmware image, i.e. in wget
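
One way to confirm that free memory really collapses during the download (rather than at flash time) would be to leave a small watcher running on the serial console while wget fetches the image; a rough sketch, assuming the stock Lua interpreter and busybox sleep are available on the node:

local function memfree_kb()
    -- read MemFree (in kB) from /proc/meminfo
    for line in io.lines('/proc/meminfo') do
        local kb = line:match('^MemFree:%s+(%d+)')
        if kb then return tonumber(kb) end
    end
end

while true do
    io.write(os.date('%H:%M:%S'), '  MemFree ', tostring(memfree_kb()), ' kB\n')
    io.flush()
    os.execute('sleep 5')   -- Lua has no built-in sleep, use busybox sleep
end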

@rotanid
Member

rotanid commented Sep 9, 2017

It would be interesting to see how another device with the same specs (e.g. a WR841N) performs in the exact same situation (same spot, same configuration).
If the issue is the same, #753 would be the correct issue.

@T-X
Contributor

T-X commented Sep 10, 2017 via email

@azrdev
Author

azrdev commented Sep 10, 2017

@rotanid I'll test that as suggested.

azrdev, this reproducible crash, is it with the private wifi enabled or disabled now? And this node itself has a fastd VPN uplink via its WAN port?

@T-X private wifi enabled, and the fastd VPN uplink at the WAN port is in use, too

@rotanid
Member

rotanid commented Oct 13, 2017

@azrdev what about your check? May I close this issue in favor of #753?

@Adorfer
Contributor

Adorfer commented Oct 13, 2017

For me this issue is more specifically about a certain router model
(and not about all 32MB-RAM units in a certain domain).

Potentially I see a similar issue on a TL-WA901 v5 when running a v4 image...
(this is not good, but since it looks similar, I assume something is "wrong" in the target.)

@rotanid
Member

rotanid commented Oct 13, 2017

Thanks for not reading this ticket before leaving a comment.
We already agreed that he should test the issue with a WR841 in the same location and with the same configuration - if there is no issue then, you might be right.
Please don't simply assume this without a thorough test.

@Adorfer
Contributor

Adorfer commented Oct 13, 2017

Thanks for not reading this ticket before leaving a comment.

I guess I can remember it from reading it the last n times. I guess this was not meant as an ad hominem.

My point was to disagree that this is something like #753, just by the fact that your previous question was not marked as "needs answer" and had not been answered.

In this case it's either
a) broken hardware (an individual defect)
b) a broken target/profile
c) an issue specific to this network spot
d) an overall issue in the L2 domain.

Your suggestion was to close this and move it into d).
That assumption I cannot follow, since - from what I see discussed above - I do not see an overall instability on the network (for certain types of hardware class).

Replacing the unit with an 841 would probably help the same way as a drop-in replacement with an identical 842 v2.
Connecting a serial console might help as well, since OOMs are in many cases not transmitted via the network (ssh logread -f, syslog, ...), because the network stack dies at that very moment.

@rotanid
Member

rotanid commented Oct 13, 2017

Your suggestion was to close this and move it into d).

Simply wrong!
My suggestion was to check whether another device in the same spot/config has the same problems.
If yes, it's d).
If not, it's not d).

@azrdev
Author

azrdev commented Oct 13, 2017

sorry for delaying this, I'll do the test with the 841

@Adorfer
Contributor

Adorfer commented Oct 13, 2017

@rotanid "@azrdev what about your check, may i close this issue in favor of #753 ?"
reads for me "either you perform the suggested check or we will assume this issue to be a totally different one.

But off course this might be a susccessful strategy to reduce number of issues in case there is no feedback for individual ones which sounded "different" when they were opened.

Anyhow, depending on the outcome here, i would consider to open a similar request for a 901v5 (frequent OOM reboots like https://paste.debian.net/989352/ , where a 841v11 in the same spot performs without problems. But since 1) i do not have a second 901v5 to test for individual HW defect, nor a 901v4 to see if it's an issue with the profile, nor is this build a LEDE, but CC: I can not open a topic. i just like to hint, that there might be similar situations on other routers too 'profile specific').
This is not an attempt to hijack the issue or to derail it, just a not, that am rather curious about the outcome of this one.

@azrdev
Author

azrdev commented Nov 11, 2017

So, I had these running now for a month, logging uptime and load (manually, since our dashboard went down).

Both nodes were in the same location as previously, and both had a fastd VPN uplink via Ethernet (WAN port).
The 842 had private wifi disabled and still seems to have crashed occasionally, as the graph shows.
The 841 had private wifi enabled, and it shows some suspiciously low uptime, too: seldom above 24h, and a lot of reboots in a row.

I did not capture serial logs this time, but IMHO the data suggests that the 841 also frequently hangs with private wifi enabled.

841:
[uptime graph: 2017-10-15 caek uptime 2]

842:
[uptime graph: 842nd]
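
For repeating this kind of long-term test without a dashboard, a sketch of the sort of logger that could be run from cron on the node; the file path and format are arbitrary, and note that /tmp is lost on every reboot, so the log has to be pulled off the node regularly:

-- append timestamp, uptime and load average to a small log file;
-- e.g. call this every 5 minutes from /etc/crontabs/root
local function first_line(path)
    local f = io.open(path)
    if not f then return '?' end
    local line = f:read('*l')
    f:close()
    return line
end

local log = io.open('/tmp/uptime-log.txt', 'a')
log:write(os.date('%Y-%m-%d %H:%M:%S'),
          '  uptime ', first_line('/proc/uptime'),
          '  load ', first_line('/proc/loadavg'), '\n')
log:close()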

@Adorfer
Contributor

Adorfer commented Nov 11, 2017

So what's your conclusion? To change it from
"OOM on 842" to
"OOM on 842 with private wifi off and 841 with private wifi on"?

(Sorry, this is not a serious suggestion, but what I take from your reply is: "happens with 841 on same spot as well" - correct?)

@azrdev
Author

azrdev commented Nov 12, 2017

happens with 841 on same spot as well

yes, though I'm not sure if it's in the autoupdater (as with the 842) because I didn't capture a serial log

@rotanid
Member

rotanid commented Nov 12, 2017

so if this isn't related to a specific device, it might as well be the same as #753 and/or #1243 - right?

@azrdev
Author

azrdev commented Nov 14, 2017

@rotanid probably, yes. But IIRC the 842 was already unstable (long) before 2017.1.x, so it would be #753, not #1243.

@rotanid
Member

rotanid commented Nov 14, 2017

@azrdev ok, let's continue this discussion over there then.

rotanid closed this as completed Nov 14, 2017
rotanid added the "0. type: bug" label Nov 14, 2017
rotanid added the "2. status: duplicate" label Nov 14, 2017