Null pointer dereference in kernel module #352
Comments
According to the vmlinux, it's crashing here. (One possible call stack is So apparently, some packet page is But what's more baffling is that you got two kernel crashes in so little time. The bug doesn't seem to have anything to do with joold, which means it's likely to not have been introduced in the #340 patches. But if it was still present in Jool 4.1.4, I'm surprised nobody seems to have run into it before. 4.1.4 is almost three months old. Would you be able to attempt to reproduce the bug with Jool 4.1.4, and without joold? |
Unfortunately, I don't have any reliable means to trigger this particular crash. Both of the times when it's crashed so far have been overnight when I've been asleep, and I've woken up in the morning to find the machine unresponsive. If I find that the crashes continue, I'll revert the #340 patches, reinstall, and see what happens without running |
Here's a likely better option: I just uploaded a patch to the issue352 branch. If my suspicions are correct, the patch should catch the bug, print a bunch of debugging information and leak the packet. (Instead of crashing the kernel.) It is not a long-term solution, but should give us valuable insight. Sample output:
You can get the output from |
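The approach described in the comment above (detect the bad state, log it, and let go of the packet without freeing it instead of dereferencing a bad pointer) can be sketched in userspace C. This is only an illustration under stated assumptions: `struct frag`, `check_frags` and the log text are hypothetical names invented here, not taken from the issue352 patch or from the kernel's socket buffer API, and the sketch assumes the offending state is a fragment whose page pointer is NULL.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for one packet fragment; not a real Jool or kernel type. */
struct frag {
    const unsigned char *page; /* NULL models the suspected bad state */
    size_t len;
};

enum verdict {
    VERDICT_CONTINUE, /* all fragments look sane; keep translating */
    VERDICT_LEAKED    /* bad fragment found; caller must neither touch nor free the packet */
};

/*
 * Walk the fragments before anything dereferences them. On the first
 * fragment without a page, log what is known and report that the packet
 * should be abandoned (leaked) rather than processed or freed.
 */
static enum verdict check_frags(const struct frag *frags, size_t count, FILE *log)
{
    for (size_t i = 0; i < count; i++) {
        if (frags[i].page == NULL) {
            /* Hypothetical message format, not the patch's real output. */
            fprintf(log, "Bug #352: fragment %zu of %zu has no page (len=%zu); leaking packet.\n",
                    i, count, frags[i].len);
            return VERDICT_LEAKED;
        }
    }
    return VERDICT_CONTINUE;
}
```

The trade-off is the one stated in the thread: a leaked packet costs some memory, but the machine stays up and the log line survives to be read.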
Okay, I'll patch and rebuild the kernel module on the affected router, and see if that logs anything. |
Still nothing? |
Unfortunately I haven't seen any further cases of this error happening since I installed a version of Jool with the logging patch applied. |
Ok.
Therefore, I'm going to commit the temporary patch to master. A memory leak is a bug, but it's better than a kernel crash at least. I'm expecting to release 4.1.5 roughly at the end of the month. |
Okay, that works for me. |
Looking for #352 by code analysis, I was hampered by the cluttered Netlink API. Managing translator instances and Netlink responses separately led to overcoding for no practical benefit, and managing error messages in yet another separate module was just confusing for no reason. (And dangerous, since it was storing the errors in a global variable. This didn't cause any errors that I'm aware of, though.) This centralizes all of that into a request handling state structure, jnl_state, which lives in the heap, and which mirrors the translation state structure. (xlation, which would benefit from a rename now.) This leads to a better substance-to-boilerplate ratio in the Netlink handlers. Unfortunately, this is just developer vanity. While I certainly prefer the new API, I didn't find #352 in the end, or any other errors for that matter.
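To illustrate the design idea in the commit message above, here is a minimal userspace sketch of a heap-allocated, per-request state object that carries its own error message, so no global variable is needed. Everything in it is an assumption made for illustration: `struct req_state`, its fields and the `req_state_*` helpers are hypothetical and do not reproduce the real `jnl_state` layout or Jool's Netlink code.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

#define REQ_ERRMSG_MAX 128

/* Hypothetical per-request state; the fields are illustrative only. */
struct req_state {
    const char *instance_name;      /* translator instance the request targets */
    int error_code;                 /* 0 while no error has been recorded */
    char error_msg[REQ_ERRMSG_MAX]; /* error text owned by this request alone */
};

/* Allocate a zeroed state on the heap; returns NULL if allocation fails. */
struct req_state *req_state_create(const char *instance_name)
{
    struct req_state *state = calloc(1, sizeof(*state));
    if (state != NULL)
        state->instance_name = instance_name;
    return state;
}

/* Record a formatted error inside the request's own state and return the code. */
int req_state_fail(struct req_state *state, int code, const char *fmt, ...)
{
    va_list args;

    va_start(args, fmt);
    vsnprintf(state->error_msg, sizeof(state->error_msg), fmt, args);
    va_end(args);
    state->error_code = code;
    return code;
}

/* Release the state once the response has been sent. */
void req_state_destroy(struct req_state *state)
{
    free(state);
}
```

Because each request owns its error buffer, two requests handled at the same time cannot overwrite each other's messages, which is the hazard a global error variable carries.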
4.1.5 released; removing from 4.1.5 milestone. Still no output in |
Still nothing, unfortunately. My test machines were moved to kernel 4.9 a few weeks after I opened this issue, in order to work around an unrelated IPv6 forwarding bug which was causing problems (I did briefly try 5.9 instead of moving backwards, but session state replication didn't immediately work and I didn't have time to work out why). I think this issue should probably be closed, as I haven't been able to reproduce this since my original report, and it can be reopened in the future if the bug reappears. |
You're free to close it if you want, but I'd personally rather keep it on the radar for at least a year. (It has the "Status: Stuck" tag, so it's easy to filter out of the TODO list.) |
Okay, I'll leave it open for now; if I have time to look at the 5.9 issues again, then I'll open a separate issue. |
I'm running Jool on a pair of Debian Buster VMs, which are acting as border routers for a test network I'm operating. Both of these machines are running Debian's 4.19.160 kernel, and Jool version 4.1.4 with the issue #340 patches cherry-picked from master (so NAT64 session replication works correctly).
One of these two routers (which I suspect handles more traffic than the other) has spontaneously crashed twice within 24 hours, with the same kernel panic trace in both cases, which seems to indicate a null pointer dereference within the jool_common kernel module.
Both machines are performing SIIT and NAT64 functions simultaneously, using iptables to inject traffic to the appropriate kernel module based on source and destination address:
- [...] pool4 are sent to jool.
- [...] jool_siit.
- [...] pool6 prefix and whose source address is within a prefix configured in the SIIT EAMT are sent to jool_siit.
- [...] pool6 prefix are sent to jool.
iptables on Debian Buster is a synonym for iptables-nft, so the packet matching logic is actually attached to netfilter instead of the old kernel iptables implementation.
Panic trace: