pml/ob1: fix potential double return of RDMA fragment on get operation failure

The mca_pml_ob1_recv_request_get_frag_failed method is responsible for returning
or queueing the fragment but mca_pml_ob1_rget_completion was freeing it
unconditionally. This will lead to a double return of the fragment to the free
list and may lead to other errors if the fragment was queued for retry. This
commit fixes the issue by only returning the fragment if it did not fail.

Signed-off-by: Nathan Hjelm <[email protected]>
hjelmn committed Sep 19, 2024
1 parent 27efeb9 commit b7f8cae
Showing 1 changed file with 3 additions and 2 deletions.
5 changes: 3 additions & 2 deletions ompi/mca/pml/ob1/pml_ob1_recvreq.c
@@ -413,6 +413,7 @@ static void mca_pml_ob1_rget_completion (mca_btl_base_module_t* btl, struct mca_
     /* check completion status */
     if (OPAL_UNLIKELY(OMPI_SUCCESS != status)) {
         status = mca_pml_ob1_recv_request_get_frag_failed (frag, status);
+        /* fragment was returned or queued by the above call */
         if (OPAL_UNLIKELY(OMPI_SUCCESS != status)) {
             size_t skipped_bytes = recvreq->req_send_offset - recvreq->req_rdma_offset;
             opal_output_verbose(mca_pml_ob1_output, 1, "pml:ob1: %s: operation failed with code %d", __func__, status);
@@ -435,12 +436,12 @@ static void mca_pml_ob1_rget_completion (mca_btl_base_module_t* btl, struct mca_
         mca_pml_ob1_send_fin (recvreq->req_recv.req_base.req_proc,
                               bml_btl, frag->rdma_hdr.hdr_rget.hdr_frag,
                               frag->rdma_length, 0, 0);
+
+        MCA_PML_OB1_RDMA_FRAG_RETURN(frag);
     }
 
     recv_request_pml_complete_check(recvreq);
-
-    MCA_PML_OB1_RDMA_FRAG_RETURN(frag);
 
     MCA_PML_OB1_PROGRESS_PENDING(bml_btl);
 }
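
The ownership rule from the commit message can be illustrated with a minimal, self-contained sketch. Everything named toy_* below is hypothetical and only mimics the shape of the fix; it is not the ob1 implementation, and the can_retry flag is an assumed stand-in for whatever the real failure handler uses to decide between queueing and giving up.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an RDMA fragment; not the ob1 type. */
typedef struct toy_frag {
    int  returns;   /* times handed back to the (imaginary) free list */
    bool queued;    /* true while parked for a retry */
} toy_frag_t;

/* Hypothetical free-list return. The asserts make a double return, or a
 * return of a fragment that is still queued, fail loudly. */
static void toy_frag_return(toy_frag_t *frag)
{
    assert(0 == frag->returns);
    assert(!frag->queued);
    frag->returns++;
}

/* Failure handler: it owns the fragment from here on. It either queues the
 * fragment for retry (returns 0) or returns it and reports the error. */
static int toy_get_frag_failed(toy_frag_t *frag, int status, bool can_retry)
{
    if (can_retry) {
        frag->queued = true;
        return 0;
    }
    toy_frag_return(frag);
    return status;
}

/* Completion callback after the fix: the fragment is returned here only on
 * the success path, because the failure handler already disposed of it. */
static int toy_rget_completion(toy_frag_t *frag, int status, bool can_retry)
{
    if (0 != status) {
        /* fragment was returned or queued by the call below */
        status = toy_get_frag_failed(frag, status, can_retry);
    } else {
        toy_frag_return(frag);
    }
    return status;
}
```

With an unconditional return at the end of toy_rget_completion, as in the pre-fix code, both failure cases would trip the asserts in toy_frag_return: once for a second return and once for returning a fragment that is still queued.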
