Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

osc/rdma: adjustment on btl selection logic #9696

Closed

Conversation

wzamazon
Copy link
Contributor

@wzamazon wzamazon commented Nov 23, 2021

This PR contain 3 commits:

first one update btl's active message so it always support remote completion

The other 2 update the btl selection logic in osc/rdma.

@wzamazon
Copy link
Contributor Author

CI will fail because this PR depends on #9694

@ibm-ompi
Copy link

The IBM CI (GNU/Scale) build failed! Please review the log, linked below.

Gist: https://gist.github.com/ed6555c1b3042a027157d39ff4589bde

@ibm-ompi
Copy link

The IBM CI (XL) build failed! Please review the log, linked below.

Gist: https://gist.github.com/86e0709aa974c58a53976db5cf22c439

@ibm-ompi
Copy link

The IBM CI (PGI) build failed! Please review the log, linked below.

Gist: https://gist.github.com/23fe703dd200c2de96ce74ec90ba8dbd

* use of registration-based RDMA (these BTLs will not be used) and will use
* any remaining BTL. By default the BTLs used will be tcp and sm but any single
* (or pair) of BTLs may be used.
* This function is used when there ompi_osc_rdm_query_btls() failed to find
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

failed to find a single btl that can communicate with all peers and supports remote completion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! fixed in new revision.

}

++btls_found;
if (module) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know Nathan did this wrong, but we explicitly compare to NULL in Open MPI.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! fixed in new revision.


OSC_RDMA_VERBOSE(MCA_BASE_VERBOSE_INFO, "found alternate btl %s", btls_to_use[i]);
OSC_RDMA_VERBOSE(MCA_BASE_VERBOSE_INFO, "disabing btl's native support of RDMA and ATOMIC");
item->btl_module->btl_flags &= ~(MCA_BTL_FLAGS_RDMA | MCA_BTL_FLAGS_GET | MCA_BTL_FLAGS_PUT | MCA_BTL_FLAGS_ATOMIC_FOPS );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really not comfortable with this approach; we should not be modifying the BTL module's flags. Those are owned by the BTL itself. Why do we need to do this (rather than use our own state for capabilities)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is necessary because mca_am_rdma_init will only fill the gaps, e.g. if a btl does not support put/get/atomics, mca_am_rdma_init add the support. mca_am_rdma_init will not overwrite btl's native support.

but what osc/rdma want is for an alternate btl to always use AM rdma, do not use its native put/get/atomics.

Maybe we can add an flag to mca_am_rdma_init to force it overwrite a btl's native implementation of rdma/atomics?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should go the flags to am_rdma_init route. I also don't know that we want to overwrite the btl's native implementation; remember that the PML is using the BTL at the same time.

I think we want a proxy btl for the one-sided to use, which is just the underlying BTL and all the AM structures. So less am_rdma_init() and more btl* btl_base_am_rdma_alloc(base_btl);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the advice. I will add a function btl_base_am_rdma_alloc then

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bwbarrett I tried to allocate a new btl module, but encountered an issue.

The newly created btls are stored in "module->selected_btls. Later when creating peer, osc/rdma need to pick a btl (from module->selected_btls) for each peer. This is currently done by comparing pointer in module->selected_btlswith pointer of btl registered with bml. Because the newly created btls are not registered with bml, osc/rdma cannot find a btl (and an endpoint) for the peer. (this happens in functionompi_osc_rdma_peer_btl_endpoint`)

I wonder what is the best way to handle this issue?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we'd need to go the whole way and import into the BML if we go down that path. I sent a long email to you and Nathan about some alternatives that might be less costly. Let's run that to ground and come back to this discussion.

@wzamazon wzamazon force-pushed the osc_rdma_adjust_btl_selection_logic branch from b334866 to 59a4528 Compare November 24, 2021 00:11
@ibm-ompi
Copy link

The IBM CI (GNU/Scale) build failed! Please review the log, linked below.

Gist: https://gist.github.com/4978a3cec5b53fa9b1b3a979a2ceb2f9

@ibm-ompi
Copy link

The IBM CI (XL) build failed! Please review the log, linked below.

Gist: https://gist.github.com/09c1eb4b879d0c01b6243d77e637922b

@ibm-ompi
Copy link

The IBM CI (PGI) build failed! Please review the log, linked below.

Gist: https://gist.github.com/17465f4ecc97b9fe4cf924273b1608f8

@wzamazon
Copy link
Contributor Author

address #8983 and #7830

@gpaulsen
Copy link
Member

gpaulsen commented Dec 2, 2021

@wzamazon

  CC       osc_rdma_component.lo
In file included from ../../../../opal/util/arch.h:25:0,
                 from osc_rdma_component.c:48:
osc_rdma_component.c: In function 'ompi_osc_rdma_query_alternate_btls':
osc_rdma_component.c:965:50: error: 'MCA_BTL_FLAGS_RDMA_REMOTE_COMPLETION' undeclared (first use in this function)
             assert(item->btl_module->btl_flags & MCA_BTL_FLAGS_RDMA_REMOTE_COMPLETION);
                                                  ^
osc_rdma_component.c:965:50: note: each undeclared identifier is reported only once for each function it appears in
osc_rdma_component.c: In function 'ompi_osc_rdma_query_btls':
osc_rdma_component.c:1008:108: error: 'MCA_BTL_FLAGS_RDMA_REMOTE_COMPLETION' undeclared (first use in this function)
                     (item->btl_module->btl_flags & (MCA_BTL_FLAGS_ATOMIC_FOPS | MCA_BTL_FLAGS_ATOMIC_OPS | MCA_BTL_FLAGS_RDMA_REMOTE_COMPLETION))) {
                                                                                                            ^
make[2]: *** [osc_rdma_component.lo] Error 1
make[2]: *** Waiting for unfinished jobs....
make[2]: Leaving directory `/__w/1/ompi/ompi/mca/osc/rdma'

@wzamazon
Copy link
Contributor Author

wzamazon commented Dec 2, 2021

bot:aws:retest

@bwbarrett bwbarrett marked this pull request as draft December 2, 2021 16:20
@wzamazon
Copy link
Contributor Author

wzamazon commented Dec 2, 2021

@gpaulsen

The flag MCA_BTL_FLAGS_RDMA_REMOTE_COMPLETION was introduced in

#9694

which was not merged when this PR was opened, thus caused the compilation error.

Retest should pass

@wzamazon
Copy link
Contributor Author

wzamazon commented Dec 2, 2021

bot:ibm:retest

@wzamazon
Copy link
Contributor Author

wzamazon commented Dec 2, 2021

@gpaulsen I am not sure how to restart the ibm and mellanox CI. Can you help?

@awlauria
Copy link
Contributor

awlauria commented Dec 2, 2021

bot:ibm:retest

@awlauria
Copy link
Contributor

awlauria commented Dec 2, 2021

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@wzamazon
Copy link
Contributor Author

wzamazon commented Dec 2, 2021

@awlauria Thank you!

This patch makes active message (AM) to not use get to implement put,
because AM put based on get does not support remote completion.

By making this change, active message put will support remote
completion.

Other active message operations: get, fop and cswap support
remote completion by the nature of the completion.

Therefore, this patch ensures active message RDMA/atomics always
support remote completion.

Signed-off-by: Wei Zhang <[email protected]>
This patch introduced a new struct ompi_osc_rdma_btl_wrapper_t,
which contains all the information osc/rdma uses in a
mca_btl_base_module_t object.

With this change, a ompi_osc_rdma_btl_wrapper_t struct will
be created when a mca_btl_base_module was selected, and stored
in "selected_btl"

osc/rdma then operates on an ompi_osc_rdma_btl_wrapper_t
object instead of the mca_btl_base_module_t object directly.

This change is a preparation for an upcoming change on btl
selection logic.

Signed-off-by: Wei Zhang <[email protected]>
This patch makes the following adjustments to the btl selection logic.

First, when selecting a primary btl, this patch requires the primary
btl to support remote completion.

Second, when selecting alternate btls, this patch allowes any
btl to be used as an alternate btl. Prior to this patch,
only a list of pre-defined btls can be used as alternate btls.

Finally, when a btl is used as an atlernate btl, this patch
create a btl wrapper such that osc/rdma will always use
active message RDMA/atomics on this alternate btl.

The reason for these changes are:

First, these changes ensured the selected btls of osc/rdma always support
remote completion (because active message RDMA/atomics supports remote completion).
Remote completion is essential for several key components of osc/rdma:

   the usage of cpu atomics to update peer's state,
   the usage of local leader to update peer's state and,
   its fence implementation.

Therefore the assurance of btl's support of remote completion greatly
simplified osc/rdma's implementation.

Second, these changes eliminated the need to save and exchange more
than 1 memory reigstration, because active message RDMA/atomic does
not require explicit memory registration.

Third, these changes ensured the correctness of atomic operations.
When multiple alternate btls are used, atomicity cannot be guarenteed
accross each btl's native implementation of atomics.

Finally, these changes allowed more btls to be used as alternate
btls, especially btl/self.

Signed-off-by: Wei Zhang <[email protected]>
@wzamazon wzamazon force-pushed the osc_rdma_adjust_btl_selection_logic branch from 59a4528 to b98bba5 Compare December 12, 2021 04:09
@wzamazon wzamazon marked this pull request as ready for review December 12, 2021 05:26
@wzamazon
Copy link
Contributor Author

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@wzamazon wzamazon requested review from jsquyres and removed request for bwbarrett December 12, 2021 05:27
@wzamazon wzamazon closed this Feb 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants