Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

osc/rdma, btl/tcp: fix various issues with osc/rdma #9400

Closed

Conversation

wzamazon
Copy link
Contributor

@wzamazon wzamazon commented Sep 21, 2021

This PR addresses #8983 and #7830

@hjelmn
Copy link
Member

hjelmn commented Sep 22, 2021

Will have to look through this. We use a single state endpoint per node with real RDMA networks for both correctness and performance reasons. Does this PR change all cases or just the multi-btl case?

@wzamazon
Copy link
Contributor Author

This PR is change the no CPU atomics code path, for single and multi-btl.

The intention is to fix the data corruption when the single btl is btl/tcp #8983.

@awlauria Would you please test the PR on your POWER system to see if it fixed that issue.

@awlauria
Copy link
Contributor

I can confirm it works on a single node. However, running across two nodes I see the same issue pop up again at > 2 ranks.

That is a huge improvement, and might be a separate issue related to #7830

@wzamazon
Copy link
Contributor Author

@awlauria Thank you for testing! I will look into the 2 nodes case.

@hjelmn I added the "WIP-DNM" label as we discuss what is the right path to take. I am especially interested in the ugni case you mentioned, which requires using local leader and will be broken by this PR.

@wzamazon wzamazon force-pushed the osc_rdma_single_channel_shared_comm branch from e31a4de to b95b89a Compare September 22, 2021 23:43
@wzamazon
Copy link
Contributor Author

@awlauria The 2 nodes failure is because local leader is also used on multiple nodes communication. To give an example, say we have 4 ranks split on 2 nodes: rank 0 and 1 are on node 0, rank 2 and 3 are on node 1.

rank 0 and rank 2 are local leader.

rank 2 is going to do one-sided operation on rank 1, for which rank2 communicate with rank 1 directly.

After rank 2 is done, rank 2 must update counter on rank 1. Now instead of communicate directly with rank 1, rank 2 communicate with rank 0 (who is leader of rank 1) to update a memory region on rank 0, which maps to the counter on rank 1.

I just updated the PR to not use local leader for multi node, so the new PR should work for tcp on 2 nodes.

However, I believe removing local leader would cause performance degradation for existing btls that can use (and benefit) from it, which is undesirable.

Because btl/tcp is the outlier here (that does not work with local leader) , I think it is better to introduce a flag to indicate a btl cannot use local leader, and have btl/tcp toggle it on.

Meanwhile, instead of removing the local leader with the new PR, we add it alongside with local leader, and only use the new PR with btl/tcp (identified by the new flag).

@hjelmn does the plan sound right to you?

The MCA_BTL_ATOMIC_SUPPORTS_GLOB flag indicates whether a btl atomics
can be mixed with CPU's atomics operations.

The active message RDMA does support this capability, therefore
this patch unset the flag for active message RDMA.

btl/ofi's RDMA does support this capability because it requested
delivery complete from libfabric, therefore this patch set the flag
for btl/ofi.

Signed-off-by: Wei Zhang <[email protected]>
@wzamazon
Copy link
Contributor Author

I realized that I do not need to introduce new btl flag, and I can use existing flags that indicate active message put and atomics are used, because this the case local leader optimization does not work.

@hjelmn The latest version of PR limits the change to when active message put/atomics is used. Please take a look!

@awlauria Would you please test the latest version of the PR? It should work both on single node and on multiple nodes.

Thank you!

@wzamazon wzamazon changed the title osc/rdma, btl/tcp: use one communcation channel to update peer's counter if cpu atomics is not used osc/rdma, btl/tcp: use one communcation channel to update peer's counter if active message RDMA is being used Sep 23, 2021
@wzamazon wzamazon force-pushed the osc_rdma_single_channel_shared_comm branch from ef68652 to 0e5d66a Compare September 24, 2021 13:17
@wzamazon
Copy link
Contributor Author

The Pull Request Build Checker failure is because some git checkout issue.

> git checkout -f 5b07dc4d86538e5517cbf8ab8466274b9cc89fdf # timeout=10
FATAL: Could not checkout 5b07dc4d86538e5517cbf8ab8466274b9cc89fdf
hudson.plugins.git.GitException: Command "git checkout -f 5b07dc4d86538e5517cbf8ab8466274b9cc89fdf" returned status code 128:
stdout: 
stderr: fatal: reference is not a tree: 5b07dc4d86538e5517cbf8ab8466274b9cc89fdf

Is there a way to re-trigger the test?

@awlauria
Copy link
Contributor

bot:aws:retest

The flag "using_cpu_atomics" indicates whether CPU
atomcis can be use to update counters for osc/rdma
peers on the same instance.

This patch fixed two bugs in the way "using_cpu_atomics"
is evaluated in allocate_state_shared():

First, current code used "btl->btl_flags", which is a typo.
The right fla to use is "btl->btl_atomic_flags".

Second, current code always set "using_cpu_atomics" to true
on single node, which is not right because even on single
node, the selected btl does not necessary support
mixed use of atomics.

Signed-off-by: Wei Zhang <[email protected]>
osc_rdma_new_peer() creates a new peer. It should fail
when the following 3 conditions are all met:

1. cannot find an endpoint
2. the selected btl does not support atomic global
3. the peer is not self

However, the current code fails when either 2 or 3 is met.
This patch fixed it.

Signed-off-by: Wei Zhang <[email protected]>
The local leader optimization means that:

  on each node, a process was designated as the local leader,
  who setup shared memory, and other processes on the same
  node would map their states to local leader's shared memory.

  When a process try to update a peer process's state, the
  process will do that through atomic actions on local leader's
  memory. The peer's state is then updated through shard memory.

Essentially, using local leader optimization means two different
channels are used to transfer data and to update peer's update.

This optimization is incorrect for BTL that uses active message RDMA
. Active message RDMA uses send/receive to emulate put and atomics,
and its put implementation does not delivery complete, e.g. when the
initiator got completion for a put action, it only means data has been sent.
it does not mean the data has been delivered to the target buffer.

Therefore, if peer's state is updated through a different communication
channel, it can happen that peer's state is updated before the put
action is completed on the peer, which will cause data corruption.

This patch made the change that: for active message RDMA, peer's state
is updated using the same channel data is transferred (e.g diabling
the local leader optimization).

To achieve that, each process need to have the pointer to each peer's
state, for which this patch introduced a function gather_peer_state().
Note because active message RDMA does not use memory registration,
the state_handle is not gathered.

This patch then sets peer's state pointer using gathered information,
and use the same endpoint to update data and transfer data.

Signed-off-by: Wei Zhang <[email protected]>
Current shm file name used by osc/rdma is not unique when multiple
communicator are on the same instance. This patch add the local
leader's pid to the file name to address the issue.

Signed-off-by: Wei Zhang <[email protected]>
Currently, each peer keep a state_btl_index and a data_btl_index.
The btl_index is used to retrive btl used for data transfer and
state updating, by calling function mpi_osc_rdma_selected_btl().

This patch simplify the code by directly storing the pointer to
the btl inside peer, thus bypassing the call to
mpi_osc_rdma_selected_btl().

The function mpi_osc_rdma_selected_btl() is then removed.

Signed-off-by: Wei Zhang <[email protected]>
ompi_osc_rdma_peer_btl_endpoint() is used to select btl and endpoint
to communicate with a peer.

This patch added a change to ompi_osc_rdma_peer_btl_endpoint() that:
if no btl/endpoint has been selected for self communication,
and if bml has the btl/self, then use btl/self for self communication.

It also made a change to ompi_osc_rdma_new_peer():

Currently if no btl can be found and the peer is self, the function
still continues. This patch made the function to fail in this case,
because the ability to do self communication is essential for osc/rdma.

Signed-off-by: Wei Zhang <[email protected]>
The flag MCA_BTL_ATOMIC_SUPPORTS_GLOB indicates that a btl supports
mix use CPU atomics with its own atomics. The btl/self supports this
because for btl/self the two are the same thing. Therefore this
patch added the flag to btl_atomic_flags of btl/self.

Signed-off-by: Wei Zhang <[email protected]>
Currently, use_cpu_atomics is a variable in osc_rdma_module.
However, whether CPU atomics can be used is a peer level decision.
So this patch remove use_cpu_atomics from osc_rdma_module, and
evaluate it for each peer.

Signed-off-by: Wei Zhang <[email protected]>
@wzamazon wzamazon force-pushed the osc_rdma_single_channel_shared_comm branch from 0e5d66a to 5d9e5a2 Compare September 27, 2021 20:35
@wzamazon wzamazon changed the title osc/rdma, btl/tcp: use one communcation channel to update peer's counter if active message RDMA is being used osc/rdma, btl/tcp: fix various issues with osc/rdma Sep 27, 2021
@gpaulsen
Copy link
Member

@wzamazon Is this PR ready for review, or are you still iterating on it?

@wzamazon
Copy link
Contributor Author

It is ready for review.

@wzamazon
Copy link
Contributor Author

I am still having some errors when running ompi-tests/onesided tests, and I am investigating.

But commits in this PR are necessary to fix existing issues regardless what the investigation turned out.

@wzamazon
Copy link
Contributor Author

@gpaulsen I had a discussion with @bwbarrett the current approach of using btl/self is not desirable, so I will try a different (more general) approach. Please hold off reviewing.

@@ -393,7 +393,8 @@ mca_btl_ofi_module_t *mca_btl_ofi_module_alloc(int mode)
module->super.btl_flags |= MCA_BTL_FLAGS_ATOMIC_FOPS | MCA_BTL_FLAGS_ATOMIC_OPS
| MCA_BTL_FLAGS_RDMA;

module->super.btl_atomic_flags = MCA_BTL_ATOMIC_SUPPORTS_ADD | MCA_BTL_ATOMIC_SUPPORTS_SWAP
module->super.btl_atomic_flags = MCA_BTL_ATOMIC_SUPPORTS_GLOB
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is right, and the commit comment for this change is definitely wrong. The EFA provider can guarantee GLOB not because of the delivery complete request, but because atomics are implemented in software. For providers that implement atomics in hardware, GLOB may not be supportable.

@@ -1114,7 +1114,7 @@ int mca_btl_base_am_rdma_init(mca_btl_base_module_t *btl)

/* emulated RDMA atomics can support the full range of atomics. for
* now only a handful are supported. */
btl->btl_atomic_flags = MCA_BTL_ATOMIC_SUPPORTS_GLOB | MCA_BTL_ATOMIC_SUPPORTS_CSWAP
btl->btl_atomic_flags = MCA_BTL_ATOMIC_SUPPORTS_CSWAP
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow; something else is missing here. The AM atomics are consistent with the CPU atomics, since the AM atomics are just implemented as CPU atomics. It's possible that some completion semantic is not correct for the OSC component, but that's a different problem.

Whatever the reason, someone is going to come to the same conclusion as I did and re-insert the GLOB flag. So once we're done figuring out the missing semantic that led you here, we need to update the comment on line 1116.

if (OPAL_UNLIKELY(OMPI_SUCCESS != ret && !((module->selected_btls[0]->btl_atomic_flags & MCA_BTL_ATOMIC_SUPPORTS_GLOB) &&
peer_id == ompi_comm_rank (module->comm)))) {
if (OPAL_UNLIKELY(OMPI_SUCCESS != ret &&
!(module->selected_btls[0]->btl_atomic_flags & MCA_BTL_ATOMIC_SUPPORTS_GLOB) &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming your earlier patch was applied, is this even somewhere we can get? I'm not sure why this is a run-time test instead of an assert?

* local leader to update peer's state through shared memory region.
* BTLs that uses active message RDMA cannot support such optimization,
* because active message RDMA uses send/receive to emulate put and
* atomics, so the atomcis and RMA operation must be through the same
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is the right solution to the problem. The problem with the local leader optimization is that it assume ordering that AM doesn't give, but it's unclear to me that the entire implementation of OSC RDMA can deal with the AM ordering guarantees. I think we need to step back and answer the ordering guarantees question before adding this complexity.

return OMPI_ERR_OUT_OF_RESOURCE;
}

ret = module->comm->c_coll->coll_allgather(&module->state, sizeof(uintptr_t), MPI_BYTE,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing the entire component isn't heterogeneous safe, but it's sad to see this code. It's not hard to use MPI's datatype engine to do the conversions, rather than sending a byte stream.

return OMPI_ERR_UNREACH;
}
if (module->use_local_leader) {
/* each node is responsible for holding a part of the rank -> node/local rank mapping array. this code
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pull this out into its own function

mca_osc_rdma_component.backing_directory, ompi_process_info.nodename,
OMPI_PROC_MY_NAME->jobid, ompi_comm_get_cid(module->comm));
OMPI_PROC_MY_NAME->jobid, ompi_comm_get_cid(module->comm), getpid());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commit message for this change talks about multiple communicators on teh same instance, but we're adding a pid. I'm not sure how that helps?

/** index into BTL array */
uint8_t state_btl_index;
/** btl used for rdma */
struct mca_btl_base_module_t *data_btl;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference in the size of the ompi_osc_rdma_peer_t structure with this change and how does that compare with the overall size of the structure? I think it's going to be 12 bytes bigger, but I may be getting the padding / alignment wrong, because I did it in my head. While 12 bytes isn't much, there is one of these structures per peer, and that can be millions of peers. Now, I'd argue that this structure is way too big anyway, but ignoring that, I'm pretty sure this was a space optimization and we should think about the implications of that.

@@ -73,7 +73,19 @@ static int ompi_osc_rdma_peer_btl_endpoint (struct ompi_osc_rdma_module_t *modul
}
}

/* unlikely but can happen when creating a peer for self */
if (peer_id == ompi_comm_rank (module->comm)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really concerned about this approach. I suppose it works, but it feels like the right place for this decision is during the initial btl selection logic.

@@ -107,6 +107,7 @@ static int mca_btl_self_component_register(void)
mca_btl_self.btl_rdma_pipeline_frag_size = INT_MAX;
mca_btl_self.btl_min_rdma_pipeline_size = 0;
mca_btl_self.btl_flags = MCA_BTL_FLAGS_RDMA | MCA_BTL_FLAGS_SEND_INPLACE | MCA_BTL_FLAGS_SEND;
mca_btl_self.btl_atomic_flags = MCA_BTL_ATOMIC_SUPPORTS_GLOB;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't make sense to me, since the self BTL doesn't support atomics?

@awlauria
Copy link
Contributor

@wzamazon sorry - was out these past few days, and unable to test before I left. Will try it out today or tomorrow, thanks!

@wzamazon
Copy link
Contributor Author

This PR need some rework from beginning. Closing for now.

@wzamazon wzamazon closed this Oct 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants