
gpu-compute: Fix for failing assertion in GPUCoalescer #518

Closed
wants to merge 3 commits

Conversation

mysoreanoop (Contributor)

Commit #200 introduces coalesced atomics to the same address from within a wavefront (WF); atomicLog was part of it, recording the changes to the data block sequentially. However, for non-returning atomics this log isn't cleared, which manifests as a failing assertion.

Fixes issue #508, adds to PR #397

Change-Id: Ib2aa470120163552488c3e69913d87b70b293c6e
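As an illustration of the bug the commit message describes, here is a minimal, self-contained toy model (the class and method names below are stand-ins for illustration only, not gem5's actual DataBlock API):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Toy stand-in for the data block's atomic log.
class ToyDataBlock
{
  public:
    ~ToyDataBlock()
    {
        // Mirrors the failing assertion from issue #508: the log must
        // already be drained when the block is destroyed; otherwise the
        // new[]'d entries are also leaked.
        assert(m_atomicLog.empty());
    }

    // The protocol records the pre-atomic bytes for every atomic,
    // returning or not.
    void recordAtomic(const uint8_t *preImage, std::size_t size)
    {
        uint8_t *entry = new uint8_t[size];
        std::memcpy(entry, preImage, size);
        m_atomicLog.push_back(entry);
    }

    uint8_t *popAtomicLogEntryFront()
    {
        uint8_t *front = m_atomicLog.front();
        m_atomicLog.erase(m_atomicLog.begin());
        return front;
    }

  private:
    std::vector<uint8_t *> m_atomicLog;
};

int main()
{
    uint8_t block[64] = {};
    ToyDataBlock data;
    data.recordAtomic(block, sizeof(block)); // log populated on any atomic

    // The ATOMIC_RETURN path already pops and frees its entry. This PR
    // adds the same drain on the ATOMIC_NO_RETURN path, so the
    // destructor's assertion no longer fires:
    delete [] data.popAtomicLogEntryFront();
    return 0;
}
```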

@mysoreanoop (Contributor, Author) commented Oct 30, 2023

Due to the fallthrough case in the change, with my compiler, I get the implicit-fallthrough warning as an error -- which can be fixed by modifying the SConstruct file like so:

diff --git a/SConstruct b/SConstruct
index e7fa4b53b7..47bed3014b 100755
--- a/SConstruct
+++ b/SConstruct
@@ -452,6 +452,7 @@ for variant_path in variant_paths:
         env.Append(CCFLAGS=['-Werror',
                              '-Wno-error=deprecated-declarations',
                              '-Wno-error=deprecated',
+                             '-Wno-error=implicit-fallthrough',
                             ])

Should I add this change as well, or replicate the statements within the cases?
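For reference, here is a minimal example of the warning in question: an unannotated case fallthrough, plus the C++17 `[[fallthrough]]` attribute, which would silence it at just this one site (a third option besides relaxing SConstruct or duplicating the statements):

```cpp
#include <iostream>

// Compile with -Wimplicit-fallthrough (promoted to an error by -Werror):
// the fallthrough below is rejected unless explicitly annotated.
void
classify(int type)
{
    switch (type) {
      case 0:
        std::cout << "atomic, no return\n";
        [[fallthrough]];    // marks the fallthrough as intentional
      case 1:
        std::cout << "store-style completion\n";
        break;
      default:
        break;
    }
}

int main() { classify(0); }
```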

Harshil2107 added the gpu-compute (gem5's GPU Compute Code) label Oct 30, 2023
mattsinc self-assigned this Oct 30, 2023
@mattsinc (Contributor) previously approved these changes Oct 30, 2023 and left a comment

This seems ok, but would appreciate @abmerop checking.

Inline review thread on src/mem/ruby/system/GPUCoalescer.cc (resolved)
@mattsinc (Contributor)

@DanKouch just making sure you see this.

Harshil2107 linked an issue Oct 30, 2023 that may be closed by this pull request
@powerjg (Contributor) commented Oct 30, 2023

There should only be 1 commit. Whoever merges should use squash merge!

@powerjg (Contributor) commented Oct 30, 2023

I assume, but I want to double check... This is unrelated to the memory leak discussed in #514 (see the linked comment there). Please let me know one way or the other.

@powerjg (Contributor) left a comment

I think this is also related to #449. Should the same fix be applied to Sequencer.cc?

@abmerop (Member) left a comment

We should fix a memory leak while modifying this code. Thanks for the contribution!

Inline review thread on src/mem/ruby/system/GPUCoalescer.cc (outdated, resolved)
@DanKouch (Contributor)

Would this conflict with CHI now clearing the atomic log within the protocol here in #514?

@mattsinc (Contributor) commented Nov 1, 2023

> I think this is also related to #449. Should the same fix be applied to Sequencer.cc?

Since this issue is specifically related to GPU atomics, there could be a similar memory leak in the CPU, but it would have to be for something other than GPU atomics. Where do you want us to look in the Sequencer?

@mattsinc (Contributor) commented Nov 1, 2023

> Would this conflict with CHI now clearing the atomic log within the protocol here in #514?

In theory, since this commit is about GPU atomics and CHI does not support GPUs, it should be a similar concept but unrelated code-wise?

@abmerop (Member) commented Nov 1, 2023

> I assume, but I want to double check... This is unrelated to the memory leak discussed in #514 (see the linked comment there). Please let me know one way or the other.

I believe it is the same memory leak, so it would have to be handled for other atomic no return requests (I guess in Sequencer?)

Maybe another fix would be to have an additional boolean parameter for the DataBlock and WriteMask atomic methods which indicates whether or not to populate the log at all, defaulting to false. The protocols would have to explicitly set it to true for atomics with return, though, which may be a lot of changes. I don't know how many protocols actually use atomics, though.
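A rough sketch of that idea (all names and signatures below are assumed for illustration; they are not gem5's actual DataBlock/WriteMask interfaces):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Sketch: the atomic-apply method gates log population behind a flag that
// defaults to false, so only callers that need the pre-image (returning
// atomics) allocate log entries -- and are responsible for draining them.
class DataBlockSketch
{
  public:
    void
    atomicPartial(const uint8_t *newData, std::size_t offset,
                  std::size_t len, bool useAtomicLog = false)
    {
        if (useAtomicLog) {
            uint8_t *entry = new uint8_t[len];
            std::memcpy(entry, m_data + offset, len); // save the pre-image
            m_atomicLog.push_back(entry);
        }
        std::memcpy(m_data + offset, newData, len);   // apply the update
    }

  private:
    uint8_t m_data[64] = {};
    std::vector<uint8_t *> m_atomicLog;
};
```

With that default, protocols that never read the log back simply stop allocating entries; only the returning-atomic paths (e.g. in VIPER) would opt in.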

Inline review thread on src/mem/ruby/system/GPUCoalescer.cc (resolved)
@mattsinc (Contributor) left a comment

@mysoreanoop: this looks good to me, but just making sure you saw @powerjg's request that you also update Sequencer.cc? I believe that is blocking this patch.

@mysoreanoop (Contributor, Author)

@powerjg -- I'm not sure what's needed in Sequencer.cc. Is that where we could clear data->m_atomicLog? I'm not sure. Do you have any insight on this?

@powerjg (Contributor) commented Nov 7, 2023

@abmerop said the following:

> I believe it is the same memory leak, so it would have to be handled for other atomic no return requests (I guess in Sequencer?)

If there is a memory leak that is in both Sequencer and in GPUSequencer, it should be fixed in both.

I am not an expert on this code. @BujSet contributed the original atomic log code (#200). If this introduced a memory leak, we need to fix it.

Also, it would be great if you could answer this specific question: is the memory leak resolved by this change the same as the memory leak mentioned in #514? If you need more information about #514, then talk to Giacomo in that PR.

@mattsinc (Contributor) commented Nov 7, 2023

> @abmerop said the following:
>
> > I believe it is the same memory leak, so it would have to be handled for other atomic no return requests (I guess in Sequencer?)
>
> If there is a memory leak that is in both Sequencer and in GPUSequencer, it should be fixed in both.
>
> I am not an expert on this code. @BujSet contributed the original atomic log code (#200). If this introduced a memory leak, we need to fix it.
>
> Also, it would be great if you could answer this specific question: is the memory leak resolved by this change the same as the memory leak mentioned in #514? If you need more information about #514, then talk to Giacomo in that PR.

This specific problem will not be fixed by #514 because 514 only updates the CHI protocol, which is not used in GPUs. So there is a separate fix needed here.

I am not sure where exactly @abmerop is envisioning a fix being done in Sequencer.cc for the atomic log -- I assume it's related to `} else if (pkt->isAtomicOp()) {`?

@BujSet (Contributor) commented Nov 7, 2023

When I wrote the changes to the GPUCoalescer, I had noted my assumption that Store and AtomicNoReturn requests follow the same path. However, looking through the VIPER protocol files, it seems that AtomicNoReturn and AtomicReturn requests are treated the same (so logs are created and populated even on AtomicNoReturn requests). I'm guessing this assumption doesn't hold in other protocols either.

I think the correct fix will require changes to the cache protocols to treat AtomicReturns and AtomicNoReturns differently, rather than changing the GPUCoalescer. As discussed in #508, the RubyRequestType already has the information needed to differentiate these, so you would just need separate actions in the protocols themselves. Not creating the logs when you don't need them is a cleaner solution than creating logs in both cases and deleting them when not needed.

@abmerop (Member) commented Nov 7, 2023

My suggestion was to modify the WriteMask atomic method to add an additional parameter indicating whether or not the log should be used, with false as the default. The log only seems to be necessary for the GPU, so we would set the parameter to true in VIPER. Then no changes to CHI, Sequencer, or GPUCoalescer are needed to fix the leak.

@mysoreanoop (Contributor, Author)

Seems like at the cache closest to the core (TCP), the accepted request has a RubyRequestType, and when it passes on to the next cache (TCC), it switches to a CoherenceRequestType, which does not distinguish between RETURN and NO_RETURN. Interestingly, though, CoherenceRequestType has two atomic types: Atomic (used ubiquitously) and AtomicWriteBack, which is unused. We could repurpose the latter for those atomics needing a return (and hence the log), and skip the log otherwise. How does that sound?

@abmerop (Member) commented Nov 9, 2023

The GLC bit determines if it is return or no-return, and I believe that is already propagated, so we could use that.

@mysoreanoop (Contributor, Author) commented Nov 9, 2023

Alright then, I can add in the changes to performAtomic and related functions.

mysoreanoop closed this Nov 9, 2023
mysoreanoop reopened this Nov 9, 2023
@@ -599,6 +599,10 @@ GPUCoalescer::hitCallback(CoalescedRequest* crequest,
// data response is not needed.
case RubyRequestType_ATOMIC_NO_RETURN:
assert(pkt->isAtomicOp());
log = data.popAtomicLogEntryFront();
delete [] log;
data.setData(pkt->getPtr<uint8_t>(), offset, pkt_size);
@mysoreanoop (Contributor, Author):
@BujSet is this necessary for ATOMIC_NO_RETURN?

@BujSet (Contributor):

I think this would solve the assertion error if you need results quickly; however, it still only addresses the problem after the fact. I started a patch that will fix this issue by not creating the log when it's not needed (i.e., when the request is an AtomicNoReturn). I'm still working on it, though, since it touches quite a few different files: the key issue is that the VIPER messages reuse the CPU message structure, which ignores the type of atomic in the incoming RubyRequest, and in the event that the request is a miss, the atomic action happens on the response path, so the type of atomic request issued must be preserved throughout the entire trip of the request. Will provide an update when I have a working solution.

@mysoreanoop (Contributor, Author):
Got it, I'll wait for your update and take down this PR then. I can also help make those changes if you need extra hands; let me know.

@BujSet (Contributor):
See my PR here: #546. @abmerop, when I was looking through the protocol code for the directories, I noticed that only the SLC bit is set and read. Rather than implementing checks for the GLC bit, I just added some extra types to distinguish the atomic request type. Do you think this would be fine, or would it be preferable to add support for setting and checking the GLC bit?

@abmerop (Member):
Since you've already done the work, I think it is cleaner not to pass the GLC bit beyond the TCC, as it isn't needed otherwise. I thought it was already passed up, hence my earlier comment.

@mysoreanoop (Contributor, Author):
Cool, I will close this PR for now. @BujSet I meant to ask if line 604 above is necessary for the NO_RETURN case; if not, you could add a break statement.
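For concreteness, the shape being suggested might look like the following (a hypothetical sketch of the switch, assuming setData really is unnecessary for the no-return case; the types are stubbed so it compiles standalone and gem5 calls appear only as comments):

```cpp
#include <cassert>

enum RubyRequestType {
    RubyRequestType_ST,
    RubyRequestType_ATOMIC_NO_RETURN,
};

// Stubbed sketch of the hitCallback switch: the no-return case drains its
// log entry (as in this PR) and then breaks instead of falling through to
// the store-style setData call on line 604 of the diff above.
void
sketchHitCallback(RubyRequestType type, bool isAtomicOp)
{
    switch (type) {
      case RubyRequestType_ATOMIC_NO_RETURN:
        assert(isAtomicOp);
        // delete [] data.popAtomicLogEntryFront();  // drain the log entry
        break;                                       // no setData needed
      case RubyRequestType_ST:
        // data.setData(pkt->getPtr<uint8_t>(), offset, pkt_size);
        break;
    }
}

int main()
{
    sketchHitCallback(RubyRequestType_ATOMIC_NO_RETURN, true);
    return 0;
}
```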

@BujSet (Contributor):
Oh, I'm not sure. In line with the comment, I just followed the same format as what happens for a store. But thinking about it more, maybe setData doesn't need to be called for either store or atomic-no-return requests. Will look into this.

@mysoreanoop (Contributor, Author)

Closing in favor of #546.

mysoreanoop closed this Nov 9, 2023
Labels
gpu-compute gem5's GPU Compute Code
Development

Successfully merging this pull request may close these issues.

m_atomicLog not getting cleared
7 participants