Test4_4 seems to be in deadlock on amd64_ubu14 #274

mxz297 · 2016-11-23T18:30:22Z

@cuviper We observed that current master branch will lead to a deadlock when running test4_4 on Ubuntu 14. Preliminary debugging suggests that the mutator is forever looping for checking whether two threads have finished. Merging your two pull requests about proccontrol does not seem solve this issue.

I attached the debugging log by enabling DYNINST_DEBUG_PROCCONTROL.

Do you have any insights on this issue?

https://gist.github.com/mxz297/23be1aa22fdecee056aa9fbfbe41464a

cuviper · 2016-11-23T19:10:21Z

Is the failure consistent? It seems to be fine on Fedora 25.

A lot of suspicious lines in that log, like these stand out to me:

[process.C:1512-U] - Error: plat_readMem failed!

[int_thread_db.C:1715-U] - Error: Attempt to read user thread info of 19593/19593 before user thread create

[process.C:1076-U] - No threads are running to produce events

wrwilliams · 2016-11-23T21:22:31Z

It's quite consistent (deterministic, even) and only showing up on Ubuntu 14 so far as we've been able to tell. We do seem to have something go sideways in parsing such that it takes quite a long time to make it through libc, but that also seems to be triggering some bad proccontrol-level behavior.

cuviper · 2016-11-23T23:06:50Z

FWIW, the infinite loop is not the new one I added, as that would say "Waiting for neonatal threads" in the log. It's the testsuite's contAndWaitForAllProcs that's stuck. (Maybe you already knew that.)

Do you have an older commit that works on the same machine? If so, can you git-bisect it?

mplegendre · 2016-11-28T18:44:48Z

The lines about failing to read user thread info and getTIDs are red herrings. Dyninst isn't careful about not reading user-thread info before that info is initialized, but it's not harmful.

I think the problem starts on line 5436 of the DYNINST_DEBUG_PROCCONTROL trace. After an exec, ProcControlAPI goes to re-create the breakpoint that monitors library loading, but fails to insert it. In the new process that's inserted into 7faba38e3940. In the previous process that was 7f45619a7940. Since only the upper bits changed, I'm guessing we have the symbol offset correct and the ld.so library's load address incorrect.

The ld.so load address should have come from the interpreter base read via /proc/PID/auxv in AuxvParser::readAuxvInfo in common/src/linuxKludges.C. Unfortunately, there's no good debug printing there, so you may have to add some to see if something's going wrong.

cuviper · 2016-11-28T23:47:12Z

I can reproduce this running as a normal user on Ubuntu 14.04, but as root it's fine. Root indeed gets the same "Error: Attempt to read user thread info...", but does not get the "Error: plat_readMem failed!" I suspected Ubuntu's "yama" security model, and sysctl kernel.yama.ptrace_scope=0 does in fact make it work again.

I'm surprised that ptrace_scope=1 would cause a failure here, because I wouldn't think an exec should affect the parent-child connection. But Fedora kernels now have yama too, disabled by default, and here it also fails plat_readMem when enabled.

cuviper · 2016-11-28T23:49:11Z

Maybe plat_readMem is also a red herring -- if the libraries were not detected properly in the new process, then that could simply be a bad address. But it's clear to me that yama is the gating factor.

mplegendre · 2016-11-29T22:23:14Z

I'd normally expect that if it were ptrace_scope issues then we wouldn't be able to operate on child process 19593 at all, but we handled its fork just fine. It's after the exec that we lose access to the process.

In the above log, the parent process 19591 exits soon before we start having problems with 19593. Could 19591's exit cause a re-parenting of 19593, and trigger the yama/ptrace_scope restrictions that prevent you from debugging processes you're not the parent of? We might be able to test that by throwing a sleep before the exit of the test4_4 parent.

I'm curious, does test pc_fork_exec also fail on these systems?

cuviper · 2016-11-30T02:11:11Z

pc_fork_exec is fine here.

OK, that's interesting about re-parenting though.

The actual call that's failing is process_vm_readv with EPERM. We have a fallback for bulk r/w to use normal ptrace calls after EFAULT, mainly because IIRC process_vm respects page protection and ptrace doesn't. If I extend that to fallback for EPERM too, then the test passes.

Now, ptrace only checks for security access when you're first attaching, whereas process_vm makes the check every time you call it. So yama_ptrace_access_check sees that the process is no longer a descendant of the tracer, since the re-parenting gave the orphan to init (pid 1), and denies process_vm. There's no leeway for the fact that we have an actual ptrace relationship already.

There is an escape hatch in ptracer_exception_found, but this is for relationships pre-arranged by prctl(PR_SET_PTRACER), which can set a specific parent or PR_SET_PTRACER_ANY. That might be an option just for the testsuite, but I don't see any evidence that this is inherited across forks, which makes it less useful.

I'll send a PR for the process_vm EPERM fallback, at least. This is easy and should be perfectly safe, because if we're already attached to a process, there's no other reason I can imagine for an EPERM besides yama obstruction. Note however that these access checks are also used on a lot on procfs files, so it's still possible we'll have other issues.

I'm also tempted to send a kernel patch for yama to allow active ptracers free reign. I'll think on it.

Having sysctl kernel.yama.ptrace_scope=1, one may only call ptrace attach on direct descendants. The same restriction is also checked for `process_vm_readv`/`writev` and certain procfs files. However, if an intermediate parent process already exited, we could end up with a grandchild that we're still ptracing but isn't our descendant, so we can't use the `process_vm` functions anymore -> `EPERM`. We already had a fallback here for `EFAULT`, to just use `ptrace` memory access, so use the same fallback after `EPERM` too. Fixes dyninst#274.

cuviper · 2016-12-01T02:30:32Z

I'm also tempted to send a kernel patch for yama to allow active ptracers free reign. I'll think on it.

https://lkml.org/lkml/2016/11/30/820

jdetter added the bug label Nov 29, 2016

cuviper mentioned this issue Nov 30, 2016

common: use ptrace if yama blocked process_vm_readv/writev #280

Merged

cuviper closed this as completed in #280 Nov 30, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Test4_4 seems to be in deadlock on amd64_ubu14 #274

Test4_4 seems to be in deadlock on amd64_ubu14 #274

mxz297 commented Nov 23, 2016 •

edited

Loading

cuviper commented Nov 23, 2016

wrwilliams commented Nov 23, 2016

cuviper commented Nov 23, 2016

mplegendre commented Nov 28, 2016

cuviper commented Nov 28, 2016

cuviper commented Nov 28, 2016

mplegendre commented Nov 29, 2016

cuviper commented Nov 30, 2016

cuviper commented Dec 1, 2016

Test4_4 seems to be in deadlock on amd64_ubu14 #274

Test4_4 seems to be in deadlock on amd64_ubu14 #274

Comments

mxz297 commented Nov 23, 2016 • edited Loading

cuviper commented Nov 23, 2016

wrwilliams commented Nov 23, 2016

cuviper commented Nov 23, 2016

mplegendre commented Nov 28, 2016

cuviper commented Nov 28, 2016

cuviper commented Nov 28, 2016

mplegendre commented Nov 29, 2016

cuviper commented Nov 30, 2016

cuviper commented Dec 1, 2016

mxz297 commented Nov 23, 2016 •

edited

Loading