-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
riscv_fork.c: Fix vfork() for kernel mode + SMP #13633
Conversation
There was an error in the fork() routine when system calls are in use: the child context is saved on the child's user stack, which is incorrect, the context must be saved on the kernel stack instead. The result is a full system crash if (when) the child executes on a different CPU which does not have the same MMU mappings active.
@pussuw Thanks! I'll run the Stress Test now for knsh64. FYI I'm using this Stress Test Script, changing https://github.com/apache/nuttx/tree/master to https://github.com/tiiuae/nuttx/tree/riscv_fork_kernel_fix |
@pussuw Sorry knsh64 OSTest is failing intermittently at vfork(): https://gist.github.com/lupyuen/6a74c80dbe35496761c24606056ec2c2
|
@pussuw It's also failing at the Signals Test: https://gist.github.com/lupyuen/83c7e33837018a534eb5df09574efe99
Here's how I built NuttX for knsh64: https://gist.github.com/lupyuen/6721a8e9bd8dd2156fab62c9b765837f |
Yes this happens if you don't revert the 3 commits I mentioned. |
@pussuw Sounds like things are getting complicated :-) Wonder if should revert all these for now, and re-fix one by one?
We are also blocking #13579. Milk-V Duo S has been unstable for a week, I'm getting nervous 😬 |
@hujun260 already found the other issue with fork(). It is that parent->xcp.regs can change due to a context switch. He tried disabling interrupts in e4a0470#diff-247bf40a98451576669424ddccd9ded0ea35b233d2fc4903a625a201d347adc8R117-R120 but this is not enough. Interrupts are not the issue here, it is a potential context switch (which can happen in ISR of course, but more likely in some other synchronization point within the fork() call itself) which changes the value of parent->xcp.regs and then the integer register save area no longer points to where we want it to point, which is the saved user context when the user process first enters the kernel via ecall. |
Yes there are several overlapping issues now, that have emerged due to insufficient testing. The fork() issue is not related to these 3 though, the issue is entirely my doing, the initial fork() implementation for kernel mode is simply a buggy piece of trash. I'll fix it. |
PR #13585 is now overloaded with 2 things:
In addition it does these 2 things in 1 single patch. The PR must be split into 2 parts (2 separate PRs) and 2 patches as the fixes are not related to each other. |
Hi @hujun260 could you help please? I'm sorry it's blocking a couple of fixes. Thanks! |
We need to record the parent's integer register context upon exception entry to a separate non-volatile area. Why? Because xcp.regs can move due to a context switch within the fork() system call, be it either via interrupt or a synchronization point. Fix this by adding a "sregs" area where the saved user context is placed. The critical section within fork() is also unnecessary.
There is a fix for the race condition here as well. |
Looks like a nice and simple testbench. I'll shamelessly start using it locally myself :-P |
Thanks @pussuw: This version seems to be stuck consistently at the Signals Test, I'm not able to test vfork() though: https://gist.github.com/lupyuen/a73fe449b1a916207fcb90ab0a6ebc44 |
Yes the signal handler test is broken, I'm using this locally https://github.com/tiiuae/nuttx/tree/for_lups_testing |
At least for me this ran 200 iterations of vfork_test() no problem. |
Excellent @pussuw! I'm running 200 iterations of https://github.com/tiiuae/nuttx/tree/for_lups_testing and it seems OK for 30 iterations so far. How shall we proceed? |
This PR fixes the vfork issue so I think it can be safely merged without any other side effect. As per the signal test regression, #13585 fixes it, but the PR must be cleaned up first before merging. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My Stress Test of rv-virt:knsh64 (reboot qemu + restart ostest) succeeded for all 200 iterations on https://github.com/tiiuae/nuttx/tree/for_lups_testing Thanks!
Summary
There was an error in the fork() routine when system calls are in use: the child context is saved on the child's user stack, which is incorrect, the context must be saved on the kernel stack instead.
The result is a full system crash if (when) the child executes on a different CPU which does not have the same MMU mappings active.
Impact
Only the RISC-V platform with CONFIG_LIB_SYSCALL=y is affected.
Testing
rv-virt:knsh64
rv-virt:ksmp64