osc/rdma, btl/tcp: fix various issues with osc/rdma #9400

wzamazon · 2021-09-21T23:27:09Z

This PR addresses #8983 and #7830

hjelmn · 2021-09-22T01:24:03Z

Will have to look through this. We use a single state endpoint per node with real RDMA networks for both correctness and performance reasons. Does this PR change all cases or just the multi-btl case?

wzamazon · 2021-09-22T13:52:08Z

This PR is change the no CPU atomics code path, for single and multi-btl.

The intention is to fix the data corruption when the single btl is btl/tcp #8983.

@awlauria Would you please test the PR on your POWER system to see if it fixed that issue.

awlauria · 2021-09-22T15:04:21Z

I can confirm it works on a single node. However, running across two nodes I see the same issue pop up again at > 2 ranks.

That is a huge improvement, and might be a separate issue related to #7830

wzamazon · 2021-09-22T16:03:02Z

@awlauria Thank you for testing! I will look into the 2 nodes case.

@hjelmn I added the "WIP-DNM" label as we discuss what is the right path to take. I am especially interested in the ugni case you mentioned, which requires using local leader and will be broken by this PR.

wzamazon · 2021-09-22T23:54:11Z

@awlauria The 2 nodes failure is because local leader is also used on multiple nodes communication. To give an example, say we have 4 ranks split on 2 nodes: rank 0 and 1 are on node 0, rank 2 and 3 are on node 1.

rank 0 and rank 2 are local leader.

rank 2 is going to do one-sided operation on rank 1, for which rank2 communicate with rank 1 directly.

After rank 2 is done, rank 2 must update counter on rank 1. Now instead of communicate directly with rank 1, rank 2 communicate with rank 0 (who is leader of rank 1) to update a memory region on rank 0, which maps to the counter on rank 1.

I just updated the PR to not use local leader for multi node, so the new PR should work for tcp on 2 nodes.

However, I believe removing local leader would cause performance degradation for existing btls that can use (and benefit) from it, which is undesirable.

Because btl/tcp is the outlier here (that does not work with local leader) , I think it is better to introduce a flag to indicate a btl cannot use local leader, and have btl/tcp toggle it on.

Meanwhile, instead of removing the local leader with the new PR, we add it alongside with local leader, and only use the new PR with btl/tcp (identified by the new flag).

@hjelmn does the plan sound right to you?

The MCA_BTL_ATOMIC_SUPPORTS_GLOB flag indicates whether a btl atomics can be mixed with CPU's atomics operations. The active message RDMA does support this capability, therefore this patch unset the flag for active message RDMA. btl/ofi's RDMA does support this capability because it requested delivery complete from libfabric, therefore this patch set the flag for btl/ofi. Signed-off-by: Wei Zhang <[email protected]>

wzamazon · 2021-09-23T20:28:47Z

I realized that I do not need to introduce new btl flag, and I can use existing flags that indicate active message put and atomics are used, because this the case local leader optimization does not work.

@hjelmn The latest version of PR limits the change to when active message put/atomics is used. Please take a look!

@awlauria Would you please test the latest version of the PR? It should work both on single node and on multiple nodes.

Thank you!

wzamazon · 2021-09-24T16:25:16Z

The Pull Request Build Checker failure is because some git checkout issue.

> git checkout -f 5b07dc4d86538e5517cbf8ab8466274b9cc89fdf # timeout=10
FATAL: Could not checkout 5b07dc4d86538e5517cbf8ab8466274b9cc89fdf
hudson.plugins.git.GitException: Command "git checkout -f 5b07dc4d86538e5517cbf8ab8466274b9cc89fdf" returned status code 128:
stdout: 
stderr: fatal: reference is not a tree: 5b07dc4d86538e5517cbf8ab8466274b9cc89fdf

Is there a way to re-trigger the test?

awlauria · 2021-09-24T16:37:53Z

bot:aws:retest

The flag "using_cpu_atomics" indicates whether CPU atomcis can be use to update counters for osc/rdma peers on the same instance. This patch fixed two bugs in the way "using_cpu_atomics" is evaluated in allocate_state_shared(): First, current code used "btl->btl_flags", which is a typo. The right fla to use is "btl->btl_atomic_flags". Second, current code always set "using_cpu_atomics" to true on single node, which is not right because even on single node, the selected btl does not necessary support mixed use of atomics. Signed-off-by: Wei Zhang <[email protected]>

osc_rdma_new_peer() creates a new peer. It should fail when the following 3 conditions are all met: 1. cannot find an endpoint 2. the selected btl does not support atomic global 3. the peer is not self However, the current code fails when either 2 or 3 is met. This patch fixed it. Signed-off-by: Wei Zhang <[email protected]>

The local leader optimization means that: on each node, a process was designated as the local leader, who setup shared memory, and other processes on the same node would map their states to local leader's shared memory. When a process try to update a peer process's state, the process will do that through atomic actions on local leader's memory. The peer's state is then updated through shard memory. Essentially, using local leader optimization means two different channels are used to transfer data and to update peer's update. This optimization is incorrect for BTL that uses active message RDMA . Active message RDMA uses send/receive to emulate put and atomics, and its put implementation does not delivery complete, e.g. when the initiator got completion for a put action, it only means data has been sent. it does not mean the data has been delivered to the target buffer. Therefore, if peer's state is updated through a different communication channel, it can happen that peer's state is updated before the put action is completed on the peer, which will cause data corruption. This patch made the change that: for active message RDMA, peer's state is updated using the same channel data is transferred (e.g diabling the local leader optimization). To achieve that, each process need to have the pointer to each peer's state, for which this patch introduced a function gather_peer_state(). Note because active message RDMA does not use memory registration, the state_handle is not gathered. This patch then sets peer's state pointer using gathered information, and use the same endpoint to update data and transfer data. Signed-off-by: Wei Zhang <[email protected]>

Current shm file name used by osc/rdma is not unique when multiple communicator are on the same instance. This patch add the local leader's pid to the file name to address the issue. Signed-off-by: Wei Zhang <[email protected]>

Currently, each peer keep a state_btl_index and a data_btl_index. The btl_index is used to retrive btl used for data transfer and state updating, by calling function mpi_osc_rdma_selected_btl(). This patch simplify the code by directly storing the pointer to the btl inside peer, thus bypassing the call to mpi_osc_rdma_selected_btl(). The function mpi_osc_rdma_selected_btl() is then removed. Signed-off-by: Wei Zhang <[email protected]>

ompi_osc_rdma_peer_btl_endpoint() is used to select btl and endpoint to communicate with a peer. This patch added a change to ompi_osc_rdma_peer_btl_endpoint() that: if no btl/endpoint has been selected for self communication, and if bml has the btl/self, then use btl/self for self communication. It also made a change to ompi_osc_rdma_new_peer(): Currently if no btl can be found and the peer is self, the function still continues. This patch made the function to fail in this case, because the ability to do self communication is essential for osc/rdma. Signed-off-by: Wei Zhang <[email protected]>

The flag MCA_BTL_ATOMIC_SUPPORTS_GLOB indicates that a btl supports mix use CPU atomics with its own atomics. The btl/self supports this because for btl/self the two are the same thing. Therefore this patch added the flag to btl_atomic_flags of btl/self. Signed-off-by: Wei Zhang <[email protected]>

Currently, use_cpu_atomics is a variable in osc_rdma_module. However, whether CPU atomics can be used is a peer level decision. So this patch remove use_cpu_atomics from osc_rdma_module, and evaluate it for each peer. Signed-off-by: Wei Zhang <[email protected]>

gpaulsen · 2021-09-28T16:21:45Z

@wzamazon Is this PR ready for review, or are you still iterating on it?

wzamazon · 2021-09-28T16:24:29Z

It is ready for review.

wzamazon · 2021-09-28T16:43:34Z

I am still having some errors when running ompi-tests/onesided tests, and I am investigating.

But commits in this PR are necessary to fix existing issues regardless what the investigation turned out.

wzamazon · 2021-09-28T17:24:47Z

@gpaulsen I had a discussion with @bwbarrett the current approach of using btl/self is not desirable, so I will try a different (more general) approach. Please hold off reviewing.

bwbarrett · 2021-09-28T17:14:08Z

opal/mca/btl/ofi/btl_ofi_module.c

@@ -393,7 +393,8 @@ mca_btl_ofi_module_t *mca_btl_ofi_module_alloc(int mode)
        module->super.btl_flags |= MCA_BTL_FLAGS_ATOMIC_FOPS | MCA_BTL_FLAGS_ATOMIC_OPS
                                   | MCA_BTL_FLAGS_RDMA;

-        module->super.btl_atomic_flags = MCA_BTL_ATOMIC_SUPPORTS_ADD | MCA_BTL_ATOMIC_SUPPORTS_SWAP
+        module->super.btl_atomic_flags = MCA_BTL_ATOMIC_SUPPORTS_GLOB


I don't think this is right, and the commit comment for this change is definitely wrong. The EFA provider can guarantee GLOB not because of the delivery complete request, but because atomics are implemented in software. For providers that implement atomics in hardware, GLOB may not be supportable.

bwbarrett · 2021-09-28T17:22:01Z

opal/mca/btl/base/btl_base_am_rdma.c

@@ -1114,7 +1114,7 @@ int mca_btl_base_am_rdma_init(mca_btl_base_module_t *btl)

        /* emulated RDMA atomics can support the full range of atomics. for
         * now only a handful are supported. */
-        btl->btl_atomic_flags = MCA_BTL_ATOMIC_SUPPORTS_GLOB | MCA_BTL_ATOMIC_SUPPORTS_CSWAP
+        btl->btl_atomic_flags = MCA_BTL_ATOMIC_SUPPORTS_CSWAP


I'm not sure I follow; something else is missing here. The AM atomics are consistent with the CPU atomics, since the AM atomics are just implemented as CPU atomics. It's possible that some completion semantic is not correct for the OSC component, but that's a different problem.

Whatever the reason, someone is going to come to the same conclusion as I did and re-insert the GLOB flag. So once we're done figuring out the missing semantic that led you here, we need to update the comment on line 1116.

bwbarrett · 2021-09-28T19:25:10Z

ompi/mca/osc/rdma/osc_rdma_peer.c

-    if (OPAL_UNLIKELY(OMPI_SUCCESS != ret && !((module->selected_btls[0]->btl_atomic_flags & MCA_BTL_ATOMIC_SUPPORTS_GLOB) &&
-                                               peer_id == ompi_comm_rank (module->comm)))) {
+    if (OPAL_UNLIKELY(OMPI_SUCCESS != ret &&
+                      !(module->selected_btls[0]->btl_atomic_flags & MCA_BTL_ATOMIC_SUPPORTS_GLOB) &&


Assuming your earlier patch was applied, is this even somewhere we can get? I'm not sure why this is a run-time test instead of an assert?

bwbarrett · 2021-09-28T23:49:54Z

ompi/mca/osc/rdma/osc_rdma.h

+     * local leader to update peer's state through shared memory region.
+     * BTLs that uses active message RDMA cannot support such optimization,
+     * because active message RDMA uses send/receive to emulate put and
+     * atomics, so the atomcis and RMA operation must be through the same


I'm not sure this is the right solution to the problem. The problem with the local leader optimization is that it assume ordering that AM doesn't give, but it's unclear to me that the entire implementation of OSC RDMA can deal with the AM ordering guarantees. I think we need to step back and answer the ordering guarantees question before adding this complexity.

bwbarrett · 2021-09-28T23:52:32Z

ompi/mca/osc/rdma/osc_rdma_component.c

+        return OMPI_ERR_OUT_OF_RESOURCE;
+    }
+
+    ret = module->comm->c_coll->coll_allgather(&module->state, sizeof(uintptr_t), MPI_BYTE,


I'm guessing the entire component isn't heterogeneous safe, but it's sad to see this code. It's not hard to use MPI's datatype engine to do the conversions, rather than sending a byte stream.

bwbarrett · 2021-09-28T23:56:42Z

ompi/mca/osc/rdma/osc_rdma_peer.c

-        return OMPI_ERR_UNREACH;
-    }
+    if (module->use_local_leader) {
+        /* each node is responsible for holding a part of the rank -> node/local rank mapping array. this code


pull this out into its own function

bwbarrett · 2021-09-28T23:57:39Z

ompi/mca/osc/rdma/osc_rdma_component.c

                            mca_osc_rdma_component.backing_directory, ompi_process_info.nodename,
-                            OMPI_PROC_MY_NAME->jobid, ompi_comm_get_cid(module->comm));
+                            OMPI_PROC_MY_NAME->jobid, ompi_comm_get_cid(module->comm), getpid());


The commit message for this change talks about multiple communicators on teh same instance, but we're adding a pid. I'm not sure how that helps?

bwbarrett · 2021-09-29T00:01:36Z

ompi/mca/osc/rdma/osc_rdma_peer.h

-    /** index into BTL array */
-    uint8_t state_btl_index;
+    /** btl used for rdma */
+    struct mca_btl_base_module_t *data_btl;


What's the difference in the size of the ompi_osc_rdma_peer_t structure with this change and how does that compare with the overall size of the structure? I think it's going to be 12 bytes bigger, but I may be getting the padding / alignment wrong, because I did it in my head. While 12 bytes isn't much, there is one of these structures per peer, and that can be millions of peers. Now, I'd argue that this structure is way too big anyway, but ignoring that, I'm pretty sure this was a space optimization and we should think about the implications of that.

bwbarrett · 2021-09-29T00:07:01Z

ompi/mca/osc/rdma/osc_rdma_peer.c

@@ -73,7 +73,19 @@ static int ompi_osc_rdma_peer_btl_endpoint (struct ompi_osc_rdma_module_t *modul
        }
    }

-    /* unlikely but can happen when creating a peer for self */
+    if (peer_id == ompi_comm_rank (module->comm)) {


I'm really concerned about this approach. I suppose it works, but it feels like the right place for this decision is during the initial btl selection logic.

bwbarrett · 2021-09-29T00:08:18Z

opal/mca/btl/self/btl_self_component.c

@@ -107,6 +107,7 @@ static int mca_btl_self_component_register(void)
    mca_btl_self.btl_rdma_pipeline_frag_size = INT_MAX;
    mca_btl_self.btl_min_rdma_pipeline_size = 0;
    mca_btl_self.btl_flags = MCA_BTL_FLAGS_RDMA | MCA_BTL_FLAGS_SEND_INPLACE | MCA_BTL_FLAGS_SEND;
+    mca_btl_self.btl_atomic_flags = MCA_BTL_ATOMIC_SUPPORTS_GLOB;


This doesn't make sense to me, since the self BTL doesn't support atomics?

awlauria · 2021-09-30T13:22:53Z

@wzamazon sorry - was out these past few days, and unable to test before I left. Will try it out today or tomorrow, thanks!

wzamazon · 2021-10-12T15:34:21Z

This PR need some rework from beginning. Closing for now.

wzamazon requested review from hjelmn, bwbarrett and awlauria September 21, 2021 23:27

wzamazon mentioned this pull request Sep 21, 2021

osc rdma/tcp: Wrong answer with c_post_start in MTT #8983

Closed

wzamazon added the ⚠️ WIP-DNM! label Sep 22, 2021

wzamazon force-pushed the osc_rdma_single_channel_shared_comm branch from e31a4de to b95b89a Compare September 22, 2021 23:43

wzamazon force-pushed the osc_rdma_single_channel_shared_comm branch from b95b89a to ef68652 Compare September 23, 2021 20:25

wzamazon removed the ⚠️ WIP-DNM! label Sep 23, 2021

wzamazon changed the title ~~osc/rdma, btl/tcp: use one communcation channel to update peer's counter if cpu atomics is not used~~ osc/rdma, btl/tcp: use one communcation channel to update peer's counter if active message RDMA is being used Sep 23, 2021

wzamazon force-pushed the osc_rdma_single_channel_shared_comm branch from ef68652 to 0e5d66a Compare September 24, 2021 13:17

bosilca mentioned this pull request Sep 24, 2021

Fix OSC RDMA to work in cases where one BTL cannot reach all communicator group members #7830

Closed

wzamazon added 8 commits September 27, 2021 19:25

osc/rdma: add pid to shm file name

3c0ae03

Current shm file name used by osc/rdma is not unique when multiple communicator are on the same instance. This patch add the local leader's pid to the file name to address the issue. Signed-off-by: Wei Zhang <[email protected]>

wzamazon force-pushed the osc_rdma_single_channel_shared_comm branch from 0e5d66a to 5d9e5a2 Compare September 27, 2021 20:35

wzamazon changed the title ~~osc/rdma, btl/tcp: use one communcation channel to update peer's counter if active message RDMA is being used~~ osc/rdma, btl/tcp: fix various issues with osc/rdma Sep 27, 2021

bwbarrett requested changes Sep 29, 2021

View reviewed changes

wzamazon mentioned this pull request Sep 30, 2021

ompi-tests/oneside/test_start2 hang with large group size on ompi master #9447

Closed

wzamazon closed this Oct 12, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

osc/rdma, btl/tcp: fix various issues with osc/rdma #9400

osc/rdma, btl/tcp: fix various issues with osc/rdma #9400

wzamazon commented Sep 21, 2021 •

edited

Loading

hjelmn commented Sep 22, 2021

wzamazon commented Sep 22, 2021

awlauria commented Sep 22, 2021

wzamazon commented Sep 22, 2021

wzamazon commented Sep 22, 2021

wzamazon commented Sep 23, 2021

wzamazon commented Sep 24, 2021

awlauria commented Sep 24, 2021

gpaulsen commented Sep 28, 2021

wzamazon commented Sep 28, 2021

wzamazon commented Sep 28, 2021

wzamazon commented Sep 28, 2021

bwbarrett Sep 28, 2021

bwbarrett Sep 28, 2021

bwbarrett Sep 28, 2021

bwbarrett Sep 28, 2021

bwbarrett Sep 28, 2021

bwbarrett Sep 28, 2021

bwbarrett Sep 28, 2021

bwbarrett Sep 29, 2021

bwbarrett Sep 29, 2021

bwbarrett Sep 29, 2021

awlauria commented Sep 30, 2021

wzamazon commented Oct 12, 2021

osc/rdma, btl/tcp: fix various issues with osc/rdma #9400

osc/rdma, btl/tcp: fix various issues with osc/rdma #9400

Conversation

wzamazon commented Sep 21, 2021 • edited Loading

hjelmn commented Sep 22, 2021

wzamazon commented Sep 22, 2021

awlauria commented Sep 22, 2021

wzamazon commented Sep 22, 2021

wzamazon commented Sep 22, 2021

wzamazon commented Sep 23, 2021

wzamazon commented Sep 24, 2021

awlauria commented Sep 24, 2021

gpaulsen commented Sep 28, 2021

wzamazon commented Sep 28, 2021

wzamazon commented Sep 28, 2021

wzamazon commented Sep 28, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

awlauria commented Sep 30, 2021

wzamazon commented Oct 12, 2021

wzamazon commented Sep 21, 2021 •

edited

Loading