Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new patcher memory hooks #1495

Merged
merged 8 commits into from
Apr 14, 2016
Merged

Add new patcher memory hooks #1495

merged 8 commits into from
Apr 14, 2016

Conversation

hjelmn
Copy link
Member

@hjelmn hjelmn commented Mar 25, 2016

This PR will add support for binary patching on ppc, x86, x86_64, and ia64. The work is not yet complete.

@hjelmn hjelmn added this to the v2.0.0 milestone Mar 25, 2016
@lanl-ompi
Copy link
Contributor

Test FAILed.

@hjelmn
Copy link
Member Author

hjelmn commented Mar 25, 2016

Looks like some work will be needed to handle the disable dlopen case. Just need to ensure -ldl is still there.

@hjelmn
Copy link
Member Author

hjelmn commented Mar 25, 2016

@gpaulsen Does Mark have a github account?

@lanl-ompi
Copy link
Contributor

Test FAILed.

1 similar comment
@lanl-ompi
Copy link
Contributor

Test FAILed.

@jsquyres
Copy link
Member

On Mar 25, 2016, at 4:21 PM, rhc54 [email protected] wrote:

Looks like some work will be needed to handle the disable dlopen case. Just need to ensure -ldl is still there.

It would really be good if we didn't have to do that (always ensure that -ldl is there).

You can't build statically if you have to -ldl.

@hjelmn
Copy link
Member Author

hjelmn commented Mar 25, 2016

Is there an alternate way to look up a symbol without using dlsym?

@jsquyres
Copy link
Member

Not that I'm aware of.

Put differently: is it ok to say that static builds can't use leave_pinned?

@hjelmn
Copy link
Member Author

hjelmn commented Mar 25, 2016

Not sure. lanl has always used static builds to limit the overhead on the filesystem.

@hjelmn
Copy link
Member Author

hjelmn commented Mar 25, 2016

Maybe static components should be different than disable dlopen?

@hjelmn
Copy link
Member Author

hjelmn commented Mar 25, 2016

How about we add a configure option like --enable-mca-static-components? Just changes all the build modes to static.... If there isn't already an option like that that isn't --disable-dlopen.

@jsquyres
Copy link
Member

Do you static builds, or disable-dlopen builds? Remember:

  • --disable-dlopen disables the dlopen() code (as the name implies) and slurps all the components into their project libraries
  • --enable-static --disable-shared does the same thing as --disable-dlopen and it builds .a files instead of .so files

If you're building .a files, the run-time linker can't share read-only copies of the libraries between processes (i.e., you eat up more RAM when you run num_cores MPI processes per server because you consume num_cores * (sizeof(libopen-pal) + sizeof(libopen-rte) + sizeof(libmpi)) instead of just (sizeof(libopen-pal) + sizeof(libopen-rte) + sizeof(libmpi))).

@hjelmn
Copy link
Member Author

hjelmn commented Mar 25, 2016

Ideally we want .so's for the projects but .a's for the components. The only way to get that now is --disable-dlopen.

@hjelmn
Copy link
Member Author

hjelmn commented Mar 25, 2016

Well. I could make a list of all static components but that is a pain.

@jsquyres
Copy link
Member

I'm confused -- how do you get .a's for the components?

  • --disable-dlopen doesn't install the individual components, because they're all slurped into the project libraries
  • --enable-static --disable-shared similarly doesn't install individual components

We only install .so versions of the components when we build OMPI with dlopen() support. We never install .a versions of the components.

@hjelmn
Copy link
Member Author

hjelmn commented Mar 26, 2016

No, I don't want .a's installed. I want all components to be part of their respective .so. That is not possible unless --disable-dlopen is used.

@jsquyres
Copy link
Member

Sounds like --enable-static does most of what you want:

  • slurps plugins into project libraries
  • builds .a and .so libraries
  • leaves dlopen() enabled

My whole point here is: we really, really shouldn't require -ldl for all cases. It will mess with the static build cases (i.e., building project .a libraries).

@markalle
Copy link
Contributor

I ran an experiment using &munmap instead of dlsym(RTLD_NEXT, "munmap") and on -static binaries it worked. The opal_patch_symbol() function could take a function pointer instead of a string as input and the caller could decide which to pass in.

Is that something that would fit with the Open MPI build-time options? I think a -static build is the only situation where using &munmap instead of dlsym() would work though.

@hjelmn hjelmn force-pushed the new_hooks branch 2 times, most recently from ace7d7a to 5000b45 Compare March 28, 2016 18:12
@jsquyres
Copy link
Member

@hjelmn Looks like there's a legit segv in the Mellanox jenkins. Is this due to some kind of interaction with UCX?

@hjelmn
Copy link
Member Author

hjelmn commented Mar 29, 2016

@markalle I refactored the code a little but I the linux patcher component is not working on Power8 with RHELL 6.7. I am working on the fix now.

@hjelmn
Copy link
Member Author

hjelmn commented Mar 30, 2016

:bot:retest:

@jsquyres
Copy link
Member

We had a call today with me, @hjelmn @gpaulsen @markalle to discuss.

tl;dr

Everything looks good so far. We need a few more things that @hjelmn thinks he can finish tomorrow.

More detail

Here's the items left:

  1. We need delayed patcher setup / during first rcache usage. This will allow OMPI to effectively guarantee that it is setup as the last memory intercepter (e.g., since UCX's current memory interception routine may not play nice if we're first... we're not 100% sure that it doesn't play nice, though -- it may be fine for OMPI to be first, but let's play safe and be last).
  2. We will need to delete memory/linux (i.e., the ptmalloc component) because it can't handle delayed setup.
  3. Need to look at current (old) UCX to confirm that OMPI registering 2nd is always going to be ok.
    • NOTE: if orte starts using BTLs, we might not be able to force 2nd registration
  4. Add a patcher base downcall to the UCX PML that basically indicates "I'm using UCX-style patching", which can therefore intercept the selection logic in patcher/linux and/or patcher/overwrite)
  5. Similarly, the upcoming PAMI PML can do the same thing.
    • @markalle will check with the PAMI people about all of this.
  6. @hjelmn is going to open a PR against UCX with the save/restore functionality that it is currently missing (but is included in this PR).
  7. LANL and IBM should run tests to make sure layering runs as expected.

@markalle
Copy link
Contributor

markalle commented Apr 1, 2016

I downloaded and built at 36abc43. I don't have a UCX setup readily available but I ran openib. For me the 'overwrite' patcher worked, but 'linux' failed.

I ran one rank on each of two hosts, with
--mca pml ob1 --mca btl openib,self --mca patcher linux

When I run a test that involves a lot of malloc/free of the send/recv buffers the 'overwrite' patcher gave data corruption.

I haven't thoroughly reviewed the code updates on 'overwrite' for putting the bits in a struct and allowing save/restore, but it seems right.

@hjelmn
Copy link
Member Author

hjelmn commented Apr 5, 2016

Yeah, one-sided is only through BTLs at the moment. Leave pinned it used for large user buffers in that path or large buffers in the ob1 send path.

@hjelmn
Copy link
Member Author

hjelmn commented Apr 8, 2016

Ok, got this working with ucx and the overwrite component. That will be sufficient for working with UCX. I will delete the linux component and re-add it once I figure out why it isn't quite working.

@hjelmn
Copy link
Member Author

hjelmn commented Apr 9, 2016

Ok, that Mellanox failure is interesting. I can reproduce on my systems. Using gdb I was able to determine what is happening. During pthread_create the heap is extended and as part of the process mmap is called on the heap with PROT_NONE (which is triggering an invalidate). The protection is removed by a called to mprotect. I think the right answer is not to hook mmap since cache entries within the heap are still valid even if they are protected PROT_NONE they just can't be used for communication. @markalle Thoughts?

@hjelmn
Copy link
Member Author

hjelmn commented Apr 9, 2016

@hppritcha I think this is converging on a working solution. I hope to move this into master on Sunday so it starts going through MTT in preparation of the PR to v2.x.

@lanl-ompi
Copy link
Contributor

Test FAILed.

@markalle
Copy link
Contributor

Between mmap(,,PROT_NONE,,,) and madvise(,,MADV_DONTNEED) I guess I had a lot of unnecessary interception. I can't get any testcase to prove it's needed. Both were added out of excessive caution of the unknown rather than a known issue. And unnecessary clearing of cached pins, especially in cases that don't happen much, doesn't really hurt anything generally.

I didn't follow the particulars of why the mmap(,,PROT_NONE,,,) interception failed, and I'm a little concerned since in general we should expect the interception to happen from "practically anywhere". But having said that I don't have any objection to removing it. I think it was indeed just another unnecessary interception.

@hjelmn
Copy link
Member Author

hjelmn commented Apr 10, 2016

The intercept failed in a threaded test. One thread was communicating out of a heap pointer during a call to pthread_create. pthread_create calls mmap (PROT_NONE) on the head the calls mprotect(). Not sure why it does that but it caused an abort because of the ongoing communication.

@hjelmn
Copy link
Member Author

hjelmn commented Apr 10, 2016

BTW, the mremap fix seems to have fixed the linux patcher. I am leaving the overwrite patcher as the default because it stacks with UCX.... Or not. Was running with UCX. Weird that it works when ucx is in use.

@hjelmn
Copy link
Member Author

hjelmn commented Apr 11, 2016

BTW, the way I get the linux patcher component to work with UCX is to patch dlsym to look for UCX's attempt to get the original symbol. Requires that our hooks are first but it appears to work every time.

@markalle
Copy link
Contributor

But is there any reason the thread couldn't call munmap() on some other piece of memory at the same location where you saw the mmap(,,PROT_NONE,,,)? It sounds like you're saying it's a bad spot to handle a memory-event from, and happily the PROT_NONE was an event we can safely ignore, but in general if legitimate memory events could happen at this location and our processing can break because it's too early in thread creation that's still a concern.

I don't follow what happens very far down in opal_mem_hooks_release_hook(), but if there's much at all going on there I'd be tempted to whittle it into an atomic FIFO that gets processed later. I did a little experimenting with the opal struct for that with multiple threads and was pretty impressed. Eg I used

typedef struct item_s {
    opal_list_item_t super;
    void *addr;
    size_t len;
} item_t;
opal_lifo_t free_list;
opal_fifo_t memevent_list;

several threads generate fake memory events and adding them to the fifo:

        item = (item_t*) opal_lifo_pop_atomic(&free_list);
        item->addr  and  item->len  =  fake memory event;
        opal_fifo_push_atomic(&memevent_list, &item->super);

while another thread reads off the events, eg

        item = (item_t*) opal_fifo_pop_atomic(&memevent_list);
        opal_lifo_push_atomic(&free_list, &item->super);

I guess I shouldn't be surprised the calls did what they claim. But I always feel like I need to test things like that.

The data structure did well, and I'm hoping ultimately there's not much more than cmpxchg and maybe short sections of lock-holding going on in their implementation. In general I'd want as little as possible going on upon receiving a memory event.

This commit adds support for runtime binary patching. The support is
broken down into two parts: util/opal_patcher.[ch] which contains the
functionality for runtime patching of symbols, and mca/memory/patcher
which patches the various symbols needed to provide support for memory
hooks. This work is preliminary and is based off work donated by IBM.

The patcher code is disabled if dlopen is disabled.

Signed-off-by: Nathan Hjelm <[email protected]>
This commit makes it possible to set relative priorities for
components. Before the addition of the patched component there was
only one component that would run on any system but that is no longer
the case. When determining which component to open each component's
query function is called and the one that returns the highest priority
is opened. The default priority of the patcher component is set
slightly higher than the old ptmalloc2/ummunotify component.

This commit fixes a long-standing break in the abstration of the
memory components. ompi_mpi_init.c was referencing the linux malloc
hook initilize function to ensure the hooks are initialized for
libmpi.so. The abstraction break has been fixed by adding a memory
base function that calls the open memory component's malloc hook init
function if it has one. The code is not yet complete but is intended
to support ptmalloc in 2.0.0. In that case the base function will
always call the ptmalloc hook init if exists.

Signed-off-by: Nathan Hjelm <[email protected]>
The --enable-static gives us what we want: statically linked components.

Signed-off-by: Nathan Hjelm <[email protected]>
This commit adds a framework to abstract runtime code patching.
Components in the new framework can provide functions for either
patching a named function or a function pointer. The later
functionality is not being used but may provide a way to allow memory
hooks when dlopen functionality is disabled.

This commit adds two different flavors of code patching. The first is
provided by the overwrite component. This component overwrites the
first several instructions of the target function with code to jump to
the provided hook function. The hook is expected to provide the full
functionality of the hooked function.

The linux patcher component is based on the memory hooks in ucx. It
only works on linux and operates by overwriting function pointers in
the symbol table. In this case the hook is free to call the original
function using the function pointer returned by dlsym.

Both components restore the original functions when the patcher
framework closes.

Changes had to be made to support Power/PowerPC with the Linux
dynamic loader patcher. Some of the changes:

 - Move code necessary for powerpc/power support to the patcher
   base. The code is needed by both the overwrite and linux
   components.

 - Move patch structure down to base and move the patch list to
   mca_patcher_base_module_t. The structure has been modified to
   include a function pointer to the function that will unapply the
   patch. This allows the mixing of multiple different types of
   patches in the patch_list.

 - Update linux patching code to keep track of the matching between
   got entry and original (unpatched) address. This allows us to
   completely clean up the patch on finalize.

All patchers keep track of the changes they made so that they can be
reversed when the patcher framework is closed.

At this time there are bugs in the Linux dynamic loader patcher so
its priority is lower than the overwrite patcher.

Signed-off-by: Nathan Hjelm <[email protected]>
This commit removes the ptmalloc2 memory hooks. This is necessary in
order to support lazy registration of memory hooks. A feature that is
not supported by the ptmalloc hooks but is supported by the new
patcher hooks.

Signed-off-by: Nathan Hjelm <[email protected]>
This commit fixes bugs that can cause crashes and memory corruption
when the mremap hook is called. The problem occurs because of the
ellipses (...) in the mremap intercept function. The ellipses cover
the optional new_addr argument on Linux. This commit removes the
ellipses and adds an explicit 5th argument.

This commit also adds a hook for shmdt. The code only works on Linux
at the moment as it needs to read /proc/self/maps to determine the
size of the shared memory segment.

Additionally, this commit removes the mmap hook. There is no
apparent benefit for detecting mmap(..., PROT_NONE, ...) and it
seems to cause problems when threads are in use.

Signed-off-by: Nathan Hjelm <[email protected]>
Because of the removal of the linux memory component it is no longer
necessary to initialize the memory component in opal_init(). This
commit moves the initialization to the creation of the first rcache
component.

Signed-off-by: Nathan Hjelm <[email protected]>
@hjelmn
Copy link
Member Author

hjelmn commented Apr 13, 2016

Ok, squished this down quite a bit. Will merge after Jenkins.

@hjelmn hjelmn merged commit 1e6b4f2 into open-mpi:master Apr 14, 2016
@jsquyres
Copy link
Member

@markalle Did @gpaulsen answer your questions IRL? We talked about the status of these hooks in detail on the Tuesday call.

@markalle
Copy link
Contributor

Mmm, kind of. I'm satisfied that being in a thread whose heap is PROT_NONE is a special state from which we can't function normally. But I'm not positive that by not intercepting mmap(PROT_NONE) we guarantee we'll never intercept a memory event from a time when a thread is in this extra special bad state.

If I thought the state was detectable and the interception routines could do something other than segv, I'd still push for that since I like to be extra robust in case unexpected things happen. But I do tend to agree that I don't actually expect an munmap() to happen while the heap is PROT_NONE.

@markalle
Copy link
Contributor

I was looking at the "from_alloc" field, and asked Yossi/etc what MXM's requirements were. They indicated false positives would be okay but not false negatives. So we need to set it any time it's possible we're inside malloc/free/etc.

So I think intercept_munmap() needs to use true.

@markalle
Copy link
Contributor

@hjelmn Thoughts on opal_mem_hooks_release_hook (,,true) in intercept_munmap()? I think "true" is required since it could be in malloc/free/etc at the time. And generally I'd expect most munmap()s happen from inside a free().

@hjelmn
Copy link
Member Author

hjelmn commented Apr 18, 2016

@markalle Yeah, a malloc implementation could munmap a pointer. Will update.

@jsquyres
Copy link
Member

Followup: @hjelmn filed #1557 to fix this issue.

@jsquyres
Copy link
Member

Also -- just to link everything together: here's the corresponding 2.x PR: open-mpi/ompi-release#1079

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants