Add new patcher memory hooks #1495

hjelmn · 2016-03-25T20:10:05Z

This PR will add support for binary patching on ppc, x86, x86_64, and ia64. The work is not yet complete.

lanl-ompi · 2016-03-25T20:19:27Z

Test FAILed.

hjelmn · 2016-03-25T20:21:01Z

Looks like some work will be needed to handle the disable dlopen case. Just need to ensure -ldl is still there.

hjelmn · 2016-03-25T20:21:27Z

@gpaulsen Does Mark have a github account?

lanl-ompi · 2016-03-25T20:27:34Z

Test FAILed.

lanl-ompi · 2016-03-25T20:52:24Z

Test FAILed.

jsquyres · 2016-03-25T21:19:10Z

On Mar 25, 2016, at 4:21 PM, rhc54 [email protected] wrote:

Looks like some work will be needed to handle the disable dlopen case. Just need to ensure -ldl is still there.

It would really be good if we didn't have to do that (always ensure that -ldl is there).

You can't build statically if you have to -ldl.

hjelmn · 2016-03-25T22:03:27Z

Is there an alternate way to look up a symbol without using dlsym?

jsquyres · 2016-03-25T22:04:24Z

Not that I'm aware of.

Put differently: is it ok to say that static builds can't use leave_pinned?

hjelmn · 2016-03-25T22:05:21Z

Not sure. lanl has always used static builds to limit the overhead on the filesystem.

hjelmn · 2016-03-25T22:06:06Z

Maybe static components should be different than disable dlopen?

hjelmn · 2016-03-25T22:07:49Z

How about we add a configure option like --enable-mca-static-components? Just changes all the build modes to static.... If there isn't already an option like that that isn't --disable-dlopen.

jsquyres · 2016-03-25T22:36:05Z

Do you static builds, or disable-dlopen builds? Remember:

--disable-dlopen disables the dlopen() code (as the name implies) and slurps all the components into their project libraries
--enable-static --disable-shared does the same thing as --disable-dlopen and it builds .a files instead of .so files

If you're building .a files, the run-time linker can't share read-only copies of the libraries between processes (i.e., you eat up more RAM when you run num_cores MPI processes per server because you consume num_cores * (sizeof(libopen-pal) + sizeof(libopen-rte) + sizeof(libmpi)) instead of just (sizeof(libopen-pal) + sizeof(libopen-rte) + sizeof(libmpi))).

hjelmn · 2016-03-25T22:42:57Z

Ideally we want .so's for the projects but .a's for the components. The only way to get that now is --disable-dlopen.

hjelmn · 2016-03-25T22:43:50Z

Well. I could make a list of all static components but that is a pain.

jsquyres · 2016-03-25T23:26:40Z

I'm confused -- how do you get .a's for the components?

--disable-dlopen doesn't install the individual components, because they're all slurped into the project libraries
--enable-static --disable-shared similarly doesn't install individual components

We only install .so versions of the components when we build OMPI with dlopen() support. We never install .a versions of the components.

hjelmn · 2016-03-26T00:10:47Z

No, I don't want .a's installed. I want all components to be part of their respective .so. That is not possible unless --disable-dlopen is used.

jsquyres · 2016-03-26T00:22:35Z

Sounds like --enable-static does most of what you want:

slurps plugins into project libraries
builds .a and .so libraries
leaves dlopen() enabled

My whole point here is: we really, really shouldn't require -ldl for all cases. It will mess with the static build cases (i.e., building project .a libraries).

markalle · 2016-03-28T16:55:22Z

I ran an experiment using &munmap instead of dlsym(RTLD_NEXT, "munmap") and on -static binaries it worked. The opal_patch_symbol() function could take a function pointer instead of a string as input and the caller could decide which to pass in.

Is that something that would fit with the Open MPI build-time options? I think a -static build is the only situation where using &munmap instead of dlsym() would work though.

jsquyres · 2016-03-28T20:11:43Z

@hjelmn Looks like there's a legit segv in the Mellanox jenkins. Is this due to some kind of interaction with UCX?

hjelmn · 2016-03-29T22:40:25Z

@markalle I refactored the code a little but I the linux patcher component is not working on Power8 with RHELL 6.7. I am working on the fix now.

hjelmn · 2016-03-30T04:56:20Z

:bot:retest:

jsquyres · 2016-03-31T18:54:34Z

We had a call today with me, @hjelmn @gpaulsen @markalle to discuss.

tl;dr

Everything looks good so far. We need a few more things that @hjelmn thinks he can finish tomorrow.

More detail

Here's the items left:

We need delayed patcher setup / during first rcache usage. This will allow OMPI to effectively guarantee that it is setup as the last memory intercepter (e.g., since UCX's current memory interception routine may not play nice if we're first... we're not 100% sure that it doesn't play nice, though -- it may be fine for OMPI to be first, but let's play safe and be last).
We will need to delete memory/linux (i.e., the ptmalloc component) because it can't handle delayed setup.
Need to look at current (old) UCX to confirm that OMPI registering 2nd is always going to be ok.
- NOTE: if orte starts using BTLs, we might not be able to force 2nd registration
Add a patcher base downcall to the UCX PML that basically indicates "I'm using UCX-style patching", which can therefore intercept the selection logic in patcher/linux and/or patcher/overwrite)
Similarly, the upcoming PAMI PML can do the same thing.
- @markalle will check with the PAMI people about all of this.
@hjelmn is going to open a PR against UCX with the save/restore functionality that it is currently missing (but is included in this PR).
LANL and IBM should run tests to make sure layering runs as expected.

markalle · 2016-04-01T17:38:54Z

I downloaded and built at 36abc43. I don't have a UCX setup readily available but I ran openib. For me the 'overwrite' patcher worked, but 'linux' failed.

I ran one rank on each of two hosts, with
--mca pml ob1 --mca btl openib,self --mca patcher linux

When I run a test that involves a lot of malloc/free of the send/recv buffers the 'overwrite' patcher gave data corruption.

I haven't thoroughly reviewed the code updates on 'overwrite' for putting the bits in a struct and allowing save/restore, but it seems right.

hjelmn · 2016-04-05T01:12:39Z

Yeah, one-sided is only through BTLs at the moment. Leave pinned it used for large user buffers in that path or large buffers in the ob1 send path.

hjelmn · 2016-04-08T20:21:55Z

Ok, got this working with ucx and the overwrite component. That will be sufficient for working with UCX. I will delete the linux component and re-add it once I figure out why it isn't quite working.

hjelmn · 2016-04-09T17:57:48Z

Ok, that Mellanox failure is interesting. I can reproduce on my systems. Using gdb I was able to determine what is happening. During pthread_create the heap is extended and as part of the process mmap is called on the heap with PROT_NONE (which is triggering an invalidate). The protection is removed by a called to mprotect. I think the right answer is not to hook mmap since cache entries within the heap are still valid even if they are protected PROT_NONE they just can't be used for communication. @markalle Thoughts?

hjelmn · 2016-04-09T18:02:08Z

@hppritcha I think this is converging on a working solution. I hope to move this into master on Sunday so it starts going through MTT in preparation of the PR to v2.x.

lanl-ompi · 2016-04-09T18:09:54Z

Test FAILed.

markalle · 2016-04-10T01:06:51Z

Between mmap(,,PROT_NONE,,,) and madvise(,,MADV_DONTNEED) I guess I had a lot of unnecessary interception. I can't get any testcase to prove it's needed. Both were added out of excessive caution of the unknown rather than a known issue. And unnecessary clearing of cached pins, especially in cases that don't happen much, doesn't really hurt anything generally.

I didn't follow the particulars of why the mmap(,,PROT_NONE,,,) interception failed, and I'm a little concerned since in general we should expect the interception to happen from "practically anywhere". But having said that I don't have any objection to removing it. I think it was indeed just another unnecessary interception.

hjelmn · 2016-04-10T01:48:41Z

The intercept failed in a threaded test. One thread was communicating out of a heap pointer during a call to pthread_create. pthread_create calls mmap (PROT_NONE) on the head the calls mprotect(). Not sure why it does that but it caused an abort because of the ongoing communication.

hjelmn · 2016-04-10T01:56:06Z

BTW, the mremap fix seems to have fixed the linux patcher. I am leaving the overwrite patcher as the default because it stacks with UCX.... Or not. Was running with UCX. Weird that it works when ucx is in use.

hjelmn · 2016-04-11T13:46:24Z

BTW, the way I get the linux patcher component to work with UCX is to patch dlsym to look for UCX's attempt to get the original symbol. Requires that our hooks are first but it appears to work every time.

markalle · 2016-04-12T00:44:38Z

But is there any reason the thread couldn't call munmap() on some other piece of memory at the same location where you saw the mmap(,,PROT_NONE,,,)? It sounds like you're saying it's a bad spot to handle a memory-event from, and happily the PROT_NONE was an event we can safely ignore, but in general if legitimate memory events could happen at this location and our processing can break because it's too early in thread creation that's still a concern.

I don't follow what happens very far down in opal_mem_hooks_release_hook(), but if there's much at all going on there I'd be tempted to whittle it into an atomic FIFO that gets processed later. I did a little experimenting with the opal struct for that with multiple threads and was pretty impressed. Eg I used

typedef struct item_s {
    opal_list_item_t super;
    void *addr;
    size_t len;
} item_t;
opal_lifo_t free_list;
opal_fifo_t memevent_list;

several threads generate fake memory events and adding them to the fifo:

        item = (item_t*) opal_lifo_pop_atomic(&free_list);
        item->addr  and  item->len  =  fake memory event;
        opal_fifo_push_atomic(&memevent_list, &item->super);

while another thread reads off the events, eg

        item = (item_t*) opal_fifo_pop_atomic(&memevent_list);
        opal_lifo_push_atomic(&free_list, &item->super);

I guess I shouldn't be surprised the calls did what they claim. But I always feel like I need to test things like that.

The data structure did well, and I'm hoping ultimately there's not much more than cmpxchg and maybe short sections of lock-holding going on in their implementation. In general I'd want as little as possible going on upon receiving a memory event.

This commit adds support for runtime binary patching. The support is broken down into two parts: util/opal_patcher.[ch] which contains the functionality for runtime patching of symbols, and mca/memory/patcher which patches the various symbols needed to provide support for memory hooks. This work is preliminary and is based off work donated by IBM. The patcher code is disabled if dlopen is disabled. Signed-off-by: Nathan Hjelm <[email protected]>

This commit makes it possible to set relative priorities for components. Before the addition of the patched component there was only one component that would run on any system but that is no longer the case. When determining which component to open each component's query function is called and the one that returns the highest priority is opened. The default priority of the patcher component is set slightly higher than the old ptmalloc2/ummunotify component. This commit fixes a long-standing break in the abstration of the memory components. ompi_mpi_init.c was referencing the linux malloc hook initilize function to ensure the hooks are initialized for libmpi.so. The abstraction break has been fixed by adding a memory base function that calls the open memory component's malloc hook init function if it has one. The code is not yet complete but is intended to support ptmalloc in 2.0.0. In that case the base function will always call the ptmalloc hook init if exists. Signed-off-by: Nathan Hjelm <[email protected]>

Signed-off-by: Nathan Hjelm <[email protected]>

The --enable-static gives us what we want: statically linked components. Signed-off-by: Nathan Hjelm <[email protected]>

This commit adds a framework to abstract runtime code patching. Components in the new framework can provide functions for either patching a named function or a function pointer. The later functionality is not being used but may provide a way to allow memory hooks when dlopen functionality is disabled. This commit adds two different flavors of code patching. The first is provided by the overwrite component. This component overwrites the first several instructions of the target function with code to jump to the provided hook function. The hook is expected to provide the full functionality of the hooked function. The linux patcher component is based on the memory hooks in ucx. It only works on linux and operates by overwriting function pointers in the symbol table. In this case the hook is free to call the original function using the function pointer returned by dlsym. Both components restore the original functions when the patcher framework closes. Changes had to be made to support Power/PowerPC with the Linux dynamic loader patcher. Some of the changes: - Move code necessary for powerpc/power support to the patcher base. The code is needed by both the overwrite and linux components. - Move patch structure down to base and move the patch list to mca_patcher_base_module_t. The structure has been modified to include a function pointer to the function that will unapply the patch. This allows the mixing of multiple different types of patches in the patch_list. - Update linux patching code to keep track of the matching between got entry and original (unpatched) address. This allows us to completely clean up the patch on finalize. All patchers keep track of the changes they made so that they can be reversed when the patcher framework is closed. At this time there are bugs in the Linux dynamic loader patcher so its priority is lower than the overwrite patcher. Signed-off-by: Nathan Hjelm <[email protected]>

This commit removes the ptmalloc2 memory hooks. This is necessary in order to support lazy registration of memory hooks. A feature that is not supported by the ptmalloc hooks but is supported by the new patcher hooks. Signed-off-by: Nathan Hjelm <[email protected]>

This commit fixes bugs that can cause crashes and memory corruption when the mremap hook is called. The problem occurs because of the ellipses (...) in the mremap intercept function. The ellipses cover the optional new_addr argument on Linux. This commit removes the ellipses and adds an explicit 5th argument. This commit also adds a hook for shmdt. The code only works on Linux at the moment as it needs to read /proc/self/maps to determine the size of the shared memory segment. Additionally, this commit removes the mmap hook. There is no apparent benefit for detecting mmap(..., PROT_NONE, ...) and it seems to cause problems when threads are in use. Signed-off-by: Nathan Hjelm <[email protected]>

Because of the removal of the linux memory component it is no longer necessary to initialize the memory component in opal_init(). This commit moves the initialization to the creation of the first rcache component. Signed-off-by: Nathan Hjelm <[email protected]>

hjelmn · 2016-04-13T23:23:43Z

Ok, squished this down quite a bit. Will merge after Jenkins.

jsquyres · 2016-04-14T22:03:52Z

@markalle Did @gpaulsen answer your questions IRL? We talked about the status of these hooks in detail on the Tuesday call.

markalle · 2016-04-15T19:20:02Z

Mmm, kind of. I'm satisfied that being in a thread whose heap is PROT_NONE is a special state from which we can't function normally. But I'm not positive that by not intercepting mmap(PROT_NONE) we guarantee we'll never intercept a memory event from a time when a thread is in this extra special bad state.

If I thought the state was detectable and the interception routines could do something other than segv, I'd still push for that since I like to be extra robust in case unexpected things happen. But I do tend to agree that I don't actually expect an munmap() to happen while the heap is PROT_NONE.

markalle · 2016-04-15T20:21:43Z

I was looking at the "from_alloc" field, and asked Yossi/etc what MXM's requirements were. They indicated false positives would be okay but not false negatives. So we need to set it any time it's possible we're inside malloc/free/etc.

So I think intercept_munmap() needs to use true.

markalle · 2016-04-18T20:47:25Z

@hjelmn Thoughts on opal_mem_hooks_release_hook (,,true) in intercept_munmap()? I think "true" is required since it could be in malloc/free/etc at the time. And generally I'd expect most munmap()s happen from inside a free().

hjelmn · 2016-04-18T20:59:00Z

@markalle Yeah, a malloc implementation could munmap a pointer. Will update.

jsquyres · 2016-04-18T21:19:50Z

Followup: @hjelmn filed #1557 to fix this issue.

jsquyres · 2016-04-18T21:21:24Z

Also -- just to link everything together: here's the corresponding 2.x PR: open-mpi/ompi-release#1079

hjelmn added the Severity: critical label Mar 25, 2016

hjelmn added this to the v2.0.0 milestone Mar 25, 2016

jsquyres mentioned this pull request Mar 26, 2016

munmap is not being intercepted for cache refresh #299

Closed

hjelmn force-pushed the new_hooks branch 2 times, most recently from ace7d7a to 5000b45 Compare March 28, 2016 18:12

jsquyres added the Severity: blocker label Mar 28, 2016

hjelmn force-pushed the new_hooks branch from 1454ee1 to 36abc43 Compare March 30, 2016 20:24

hjelmn force-pushed the new_hooks branch from 36abc43 to 73bbaaa Compare April 9, 2016 03:13

hjelmn force-pushed the new_hooks branch from d0b4ed3 to 4b5ffab Compare April 9, 2016 18:11

hjelmn added 8 commits April 13, 2016 17:14

opal/patch: add call to check if binary patching is supported

4cac623

Signed-off-by: Nathan Hjelm <[email protected]>

contrib/platform: don't disable dlopen

b1670f8

The --enable-static gives us what we want: statically linked components. Signed-off-by: Nathan Hjelm <[email protected]>

hjelmn force-pushed the new_hooks branch from 4b5ffab to c2b6fbb Compare April 13, 2016 23:23

hjelmn merged commit 1e6b4f2 into open-mpi:master Apr 14, 2016

Add new patcher memory hooks #1495

Add new patcher memory hooks #1495

Conversation

hjelmn commented Mar 25, 2016

lanl-ompi commented Mar 25, 2016

hjelmn commented Mar 25, 2016

hjelmn commented Mar 25, 2016

lanl-ompi commented Mar 25, 2016

lanl-ompi commented Mar 25, 2016

jsquyres commented Mar 25, 2016

hjelmn commented Mar 25, 2016

jsquyres commented Mar 25, 2016

hjelmn commented Mar 25, 2016

hjelmn commented Mar 25, 2016

hjelmn commented Mar 25, 2016

jsquyres commented Mar 25, 2016

hjelmn commented Mar 25, 2016

hjelmn commented Mar 25, 2016

jsquyres commented Mar 25, 2016

hjelmn commented Mar 26, 2016

jsquyres commented Mar 26, 2016

markalle commented Mar 28, 2016

jsquyres commented Mar 28, 2016

hjelmn commented Mar 29, 2016

hjelmn commented Mar 30, 2016

jsquyres commented Mar 31, 2016

tl;dr

More detail

markalle commented Apr 1, 2016

hjelmn commented Apr 5, 2016

hjelmn commented Apr 8, 2016

hjelmn commented Apr 9, 2016

hjelmn commented Apr 9, 2016

lanl-ompi commented Apr 9, 2016

markalle commented Apr 10, 2016

hjelmn commented Apr 10, 2016

hjelmn commented Apr 10, 2016

hjelmn commented Apr 11, 2016

markalle commented Apr 12, 2016

hjelmn commented Apr 13, 2016

jsquyres commented Apr 14, 2016

markalle commented Apr 15, 2016

markalle commented Apr 15, 2016

markalle commented Apr 18, 2016

hjelmn commented Apr 18, 2016

jsquyres commented Apr 18, 2016

jsquyres commented Apr 18, 2016