-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gpu-compute: Fix for failing assertion in GPUCoalescer #518
Conversation
Commit gem5#200 introduces colaesced atomics to same addr from within a WF; atomicLog was part of it to record the changes on datablock sequentially; However, for non-returning atomics, this log isn't cleared, which manifests as a failing assertion. Fixes issue gem5#508, adds to PR gem5#397 Change-Id: Ib2aa470120163552488c3e69913d87b70b293c6e
Due to the fallthrough case in the change, with my compiler, I get the implicit-fallthrough warning as an error -- which can be fixed by modifying the SConstruct file like so:
Should I add this change as well, or replicate the statements within the cases? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems ok, but would appreciate @abmerop checking.
@DanKouch just making sure you see this. |
There should only be 1 commit. Whoever merges should use squash merge! |
I assume, but I want to double check... This is unrelated to the memory leak discussed in #514 (#514 (comment)). Please let me know one way or the other. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is also related to #449. Should the same fix be applied to Sequencer.cc
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should fix a memory leak while modifying this code. Thanks for the contribution!
Since this issue is specifically related to GPU atomics, there could be a similar memory leak in the CPU, but it would have to be for something other than GPU atomics. Where do you want us to look in the Sequencer? |
I believe it is the same memory leak, so it would have to be handled for other atomic no return requests (I guess in Sequencer?) Maybe another fix would be to have an additional boolean parameter for DataBlock and WriteMask atomic methods which indicates whether or not to populate the log at all and default to false. The protocols would have to explicitly set to true for atomics with return though which may be a lot of changes. I don't know how many protocols actual use atomics though. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mysoreanoop: this looks good to me, but just making sure you saw @powerjg 's request that you also update Sequencer.cc? I believe that is blocking this patch.
@powerjg -- I'm not sure what's needed in Sequencer.cc gem5/src/mem/ruby/system/Sequencer.cc Line 749 in 8b63299 |
@abmerop said the following.
If there is a memory leak that is in both Sequencer and in GPUSequencer, it should be fixed in both. I am not an expert on this code. @BujSet contributed the original atomic log code. If this introduced a memory leak, we need to fix it. #200 Also, it would be great if you could answer this specific question: Is the memory leak resolved in the change the same as the memory leak mentioned in #514. If you need more information about #514, then talk to Giacomo in that PR. |
This specific problem will not be fixed by #514 because 514 only updates the CHI protocol, which is not used in GPUs. So there is a separate fix needed here. I am not sure where exactly @abmerop is envisioning a fix being done in Sequencer.cc as the atomic log -- I assume it's related to gem5/src/mem/ruby/system/Sequencer.cc Line 721 in 8b63299
|
When I wrote the changes to GPUCoalescer, I had noted my assumption that Store and AtomicNoReturn requests follow the same path. However, looking through the VIPER protocol files, it seems that AtomicNoReturn and AtomicReturn requests are treated the same (so logs are created and populated even on AtomicNoReturn requests). And I'm guessing this assumption doesn't hold in other protocols as well. I think the correct fix will require changes to the cache protocols to treat AtomicReturns and AtomicNoReturns differently, rather than changing the GPUCoalescer. As discussed in #508, the RubyRequestType already has the information needed to differentiate these, so I think you would just need to have separate action in the protocols themselves. I think not create the logs if you don't need them is a cleaner solution than creating logs in both cases, and deleting them when not needed. |
My suggestion was to modify the WriteMask atomic method to add an additional parameter whether or not the log should be used, with false as the default. The log only seems to be necessary for GPU, so we would set the parameter to true in VIPER. Then there are no changes to CHI, Sequencer, nor GPUCoalescer to fix the leak |
Seems like at the cache closest to the core (TCP), the accepted request has a RubyRequestType, and when it passes on to the next cache (TCC), it switches to CoherenceRequestType which does not distinguish between RETURN and NO_RETURN. However, interestingly, it has 2 types: Atomic (used ubiquitously) and AtomicWriteBack -- which is unused. We could rename this for those atomics needing a return (=log), else don't use log. How does that sound? |
The GLC bit determines if it is return or no return and I believe that is already propagated, so we could use that |
Alright then, I can add in the changes to |
@@ -599,6 +599,10 @@ GPUCoalescer::hitCallback(CoalescedRequest* crequest, | |||
// data response is not needed. | |||
case RubyRequestType_ATOMIC_NO_RETURN: | |||
assert(pkt->isAtomicOp()); | |||
log = data.popAtomicLogEntryFront(); | |||
delete [] log; | |||
data.setData(pkt->getPtr<uint8_t>(), offset, pkt_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BujSet is this necessary for ATOMIC_NO_RETURN?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would solve the assertion error if you need results quickly, however, this still only addresses the problem after the fact. I started a patch that will fix this issue by not creating the log when it's not needed (i.e. when the request is an AtomicNoReturn). I'm still working on this though since it touches quite a few different files (key issue is that the VIPER messages reuse the CPU message structure which ignores the type of atomic in the incoming RubyRequest, and in the event that the request is a miss the atomic action happens on the response path, so the delineation of what type of atomic request is issued must be preserved throughout the entire trip of the request). Will provide an update when I have a working solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, I'll wait for your update and take down this PR then.
I can also help make those changes if you need extra hands; let me know
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my PR here #546. @abmerop, When I was looking through the protocol code for the directories, I noticed that only the SLC bit is set and read. Rather than implementing checks for the GLC bit, I just added some extra types to distinguish the atomic request type. Do you think this would be fine, or would it be preferable to add in the support for the GLC bit set and checks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you've already done the work, I think it is cleaner not to pass the GLC bit beyond TCC as it isn't needed otherwise. I thought it was already passed up hence my earlier comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, I will close this PR for now. @BujSet I meant to ask if line 604 above is necessary for the NO_RETURN case; if not, you could add a break statement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh. I'm not sure. In line with the comment, I just followed the same format as what happens for a store. But thinking about more, maybe setData doesn't need to be called for either store or atomic no return requests. Will look into this.
Closing in favor of #546. |
Commit #200 introduces colaesced atomics to same addr from within a WF; atomicLog was part of it to record the changes on datablock sequentially; However, for non-returning atomics, this log isn't cleared, which manifests as a failing assertion.
Fixes issue #508, adds to PR #397
Change-Id: Ib2aa470120163552488c3e69913d87b70b293c6e