
gpu-compute: Fix for failing assertion in GPUCoalescer #518

Closed
wants to merge 3 commits

Conversation

mysoreanoop (Contributor)

Commit #200 introduces coalesced atomics to the same address from within a wavefront (WF); atomicLog was part of it, recording the changes to the data block sequentially. However, for non-returning atomics this log isn't cleared, which manifests as a failing assertion.

Fixes issue #508, adds to PR #397

Change-Id: Ib2aa470120163552488c3e69913d87b70b293c6e
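As an illustration of the bug the commit message describes, here is a minimal, self-contained toy model (the class and method names below are stand-ins for illustration only, not gem5's actual DataBlock API):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Toy stand-in for the data block's atomic log.
class ToyDataBlock
{
  public:
    ~ToyDataBlock()
    {
        // Mirrors the failing assertion from issue #508: the log must
        // already be drained when the block is destroyed; otherwise the
        // new[]'d entries are also leaked.
        assert(m_atomicLog.empty());
    }

    // The protocol records the pre-atomic bytes for every atomic,
    // returning or not.
    void recordAtomic(const uint8_t *preImage, std::size_t size)
    {
        uint8_t *entry = new uint8_t[size];
        std::memcpy(entry, preImage, size);
        m_atomicLog.push_back(entry);
    }

    uint8_t *popAtomicLogEntryFront()
    {
        uint8_t *front = m_atomicLog.front();
        m_atomicLog.erase(m_atomicLog.begin());
        return front;
    }

  private:
    std::vector<uint8_t *> m_atomicLog;
};

int main()
{
    uint8_t block[64] = {};
    ToyDataBlock data;
    data.recordAtomic(block, sizeof(block)); // log populated on any atomic

    // The ATOMIC_RETURN path already pops and frees its entry. This PR
    // adds the same drain on the ATOMIC_NO_RETURN path, so the
    // destructor's assertion no longer fires:
    delete [] data.popAtomicLogEntryFront();
    return 0;
}
```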

@mysoreanoop (Contributor, Author) commented Oct 30, 2023

Due to the fallthrough case in the change, with my compiler, I get the implicit-fallthrough warning as an error -- which can be fixed by modifying the SConstruct file like so:

diff --git a/SConstruct b/SConstruct
index e7fa4b53b7..47bed3014b 100755
--- a/SConstruct
+++ b/SConstruct
@@ -452,6 +452,7 @@ for variant_path in variant_paths:
         env.Append(CCFLAGS=['-Werror',
                              '-Wno-error=deprecated-declarations',
                              '-Wno-error=deprecated',
+                             '-Wno-error=implicit-fallthrough',
                             ])

Should I add this change as well, or replicate the statements within the cases?
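For reference, here is a minimal example of the warning in question: an unannotated case fallthrough, plus the C++17 `[[fallthrough]]` attribute, which would silence it at just this one site (a third option besides relaxing SConstruct or duplicating the statements):

```cpp
#include <iostream>

// Compile with -Wimplicit-fallthrough (promoted to an error by -Werror):
// the fallthrough below is rejected unless explicitly annotated.
void
classify(int type)
{
    switch (type) {
      case 0:
        std::cout << "atomic, no return\n";
        [[fallthrough]];    // marks the fallthrough as intentional
      case 1:
        std::cout << "store-style completion\n";
        break;
      default:
        break;
    }
}

int main() { classify(0); }
```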

Harshil2107 added the gpu-compute (gem5's GPU Compute Code) label Oct 30, 2023
mattsinc self-assigned this Oct 30, 2023
@mattsinc (Contributor) previously approved these changes Oct 30, 2023 and left a comment

This seems ok, but would appreciate @abmerop checking.

Inline review thread on src/mem/ruby/system/GPUCoalescer.cc (resolved)
@mattsinc (Contributor)

@DanKouch just making sure you see this.

Harshil2107 linked an issue Oct 30, 2023 that may be closed by this pull request
@powerjg (Contributor) commented Oct 30, 2023

There should only be 1 commit. Whoever merges should use squash merge!

@powerjg (Contributor) commented Oct 30, 2023

I assume, but I want to double check... This is unrelated to the memory leak discussed in #514 (see the linked comment there). Please let me know one way or the other.

@powerjg (Contributor) left a comment

I think this is also related to #449. Should the same fix be applied to Sequencer.cc?

@abmerop (Member) left a comment

We should fix a memory leak while modifying this code. Thanks for the contribution!

Inline review thread on src/mem/ruby/system/GPUCoalescer.cc (outdated, resolved)
@DanKouch (Contributor)

Would this conflict with CHI now clearing the atomic log within the protocol here in #514?

@mattsinc (Contributor) commented Nov 1, 2023

> I think this is also related to #449. Should the same fix be applied to Sequencer.cc?

Since this issue is specifically related to GPU atomics, there could be a similar memory leak in the CPU, but it would have to be for something other than GPU atomics. Where do you want us to look in the Sequencer?

@mattsinc (Contributor) commented Nov 1, 2023

> Would this conflict with CHI now clearing the atomic log within the protocol here in #514?

In theory, since this commit is about GPU atomics and CHI does not support GPUs, it should be a similar concept but unrelated code-wise?

@abmerop (Member) commented Nov 1, 2023

> I assume, but I want to double check... This is unrelated to the memory leak discussed in #514 (see the linked comment there). Please let me know one way or the other.

I believe it is the same memory leak, so it would have to be handled for other atomic no return requests (I guess in Sequencer?)

Maybe another fix would be to have an additional boolean parameter for the DataBlock and WriteMask atomic methods which indicates whether or not to populate the log at all, defaulting to false. The protocols would have to explicitly set it to true for atomics with return, though, which may be a lot of changes. I don't know how many protocols actually use atomics, though.
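A rough sketch of that idea (all names and signatures below are assumed for illustration; they are not gem5's actual DataBlock/WriteMask interfaces):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Sketch: the atomic-apply method gates log population behind a flag that
// defaults to false, so only callers that need the pre-image (returning
// atomics) allocate log entries -- and are responsible for draining them.
class DataBlockSketch
{
  public:
    void
    atomicPartial(const uint8_t *newData, std::size_t offset,
                  std::size_t len, bool useAtomicLog = false)
    {
        if (useAtomicLog) {
            uint8_t *entry = new uint8_t[len];
            std::memcpy(entry, m_data + offset, len); // save the pre-image
            m_atomicLog.push_back(entry);
        }
        std::memcpy(m_data + offset, newData, len);   // apply the update
    }

  private:
    uint8_t m_data[64] = {};
    std::vector<uint8_t *> m_atomicLog;
};
```

With that default, protocols that never read the log back simply stop allocating entries; only the returning-atomic paths (e.g. in VIPER) would opt in.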

Inline review thread on src/mem/ruby/system/GPUCoalescer.cc (resolved)
@mattsinc (Contributor) left a comment

@mysoreanoop: this looks good to me, but just making sure you saw @powerjg's request that you also update Sequencer.cc? I believe that is blocking this patch.

@mysoreanoop (Contributor, Author)

@powerjg -- I'm not sure what's needed in Sequencer.cc. Is that where we could clear data->m_atomicLog? I'm not sure. Do you have any insight on this?

@powerjg (Contributor) commented Nov 7, 2023

@abmerop said the following:

> I believe it is the same memory leak, so it would have to be handled for other atomic no return requests (I guess in Sequencer?)

If there is a memory leak that is in both Sequencer and in GPUSequencer, it should be fixed in both.

I am not an expert on this code. @BujSet contributed the original atomic log code (#200). If this introduced a memory leak, we need to fix it.

Also, it would be great if you could answer this specific question: is the memory leak resolved by this change the same as the memory leak mentioned in #514? If you need more information about #514, then talk to Giacomo in that PR.

@mattsinc (Contributor) commented Nov 7, 2023

> @abmerop said the following:
>
> > I believe it is the same memory leak, so it would have to be handled for other atomic no return requests (I guess in Sequencer?)
>
> If there is a memory leak that is in both Sequencer and in GPUSequencer, it should be fixed in both.
>
> I am not an expert on this code. @BujSet contributed the original atomic log code (#200). If this introduced a memory leak, we need to fix it.
>
> Also, it would be great if you could answer this specific question: is the memory leak resolved by this change the same as the memory leak mentioned in #514? If you need more information about #514, then talk to Giacomo in that PR.

This specific problem will not be fixed by #514 because 514 only updates the CHI protocol, which is not used in GPUs. So there is a separate fix needed here.

I am not sure where exactly @abmerop is envisioning a fix being done in Sequencer.cc for the atomic log -- I assume it's related to `} else if (pkt->isAtomicOp()) {`?

@BujSet (Contributor) commented Nov 7, 2023

When I wrote the changes to the GPUCoalescer, I had noted my assumption that Store and AtomicNoReturn requests follow the same path. However, looking through the VIPER protocol files, it seems that AtomicNoReturn and AtomicReturn requests are treated the same (so logs are created and populated even on AtomicNoReturn requests). I'm guessing this assumption doesn't hold in other protocols either.

I think the correct fix will require changes to the cache protocols to treat AtomicReturns and AtomicNoReturns differently, rather than changing the GPUCoalescer. As discussed in #508, the RubyRequestType already has the information needed to differentiate these, so you would just need separate actions in the protocols themselves. Not creating the logs when you don't need them is a cleaner solution than creating logs in both cases and deleting them when not needed.

@abmerop (Member) commented Nov 7, 2023

My suggestion was to modify the WriteMask atomic method to add an additional parameter indicating whether or not the log should be used, with false as the default. The log only seems to be necessary for the GPU, so we would set the parameter to true in VIPER. Then no changes to CHI, Sequencer, or GPUCoalescer are needed to fix the leak.

@mysoreanoop (Contributor, Author)

Seems like at the cache closest to the core (TCP), the accepted request has a RubyRequestType, and when it passes on to the next cache (TCC), it switches to a CoherenceRequestType, which does not distinguish between RETURN and NO_RETURN. Interestingly, though, CoherenceRequestType has two atomic types: Atomic (used ubiquitously) and AtomicWriteBack, which is unused. We could repurpose the latter for those atomics needing a return (and hence the log), and skip the log otherwise. How does that sound?

@abmerop (Member) commented Nov 9, 2023

The GLC bit determines if it is return or no-return, and I believe that is already propagated, so we could use that.

@mysoreanoop (Contributor, Author) commented Nov 9, 2023

Alright then, I can add in the changes to performAtomic and related functions.

mysoreanoop closed this Nov 9, 2023
mysoreanoop reopened this Nov 9, 2023
@@ -599,6 +599,10 @@ GPUCoalescer::hitCallback(CoalescedRequest* crequest,
// data response is not needed.
case RubyRequestType_ATOMIC_NO_RETURN:
assert(pkt->isAtomicOp());
log = data.popAtomicLogEntryFront();
delete [] log;
data.setData(pkt->getPtr<uint8_t>(), offset, pkt_size);
@mysoreanoop (Contributor, Author):
@BujSet is this necessary for ATOMIC_NO_RETURN?

@BujSet (Contributor):

I think this would solve the assertion error if you need results quickly; however, it still only addresses the problem after the fact. I started a patch that will fix this issue by not creating the log when it's not needed (i.e., when the request is an AtomicNoReturn). I'm still working on it, though, since it touches quite a few different files: the key issue is that the VIPER messages reuse the CPU message structure, which ignores the type of atomic in the incoming RubyRequest, and in the event that the request is a miss, the atomic action happens on the response path, so the type of atomic request issued must be preserved throughout the entire trip of the request. Will provide an update when I have a working solution.

@mysoreanoop (Contributor, Author):
Got it, I'll wait for your update and take down this PR then. I can also help make those changes if you need extra hands; let me know.

@BujSet (Contributor):
See my PR here: #546. @abmerop, when I was looking through the protocol code for the directories, I noticed that only the SLC bit is set and read. Rather than implementing checks for the GLC bit, I just added some extra types to distinguish the atomic request type. Do you think this would be fine, or would it be preferable to add support for setting and checking the GLC bit?

@abmerop (Member):
Since you've already done the work, I think it is cleaner not to pass the GLC bit beyond the TCC, as it isn't needed otherwise. I thought it was already passed up, hence my earlier comment.

@mysoreanoop (Contributor, Author):
Cool, I will close this PR for now. @BujSet I meant to ask if line 604 above is necessary for the NO_RETURN case; if not, you could add a break statement.
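For concreteness, the shape being suggested might look like the following (a hypothetical sketch of the switch, assuming setData really is unnecessary for the no-return case; the types are stubbed so it compiles standalone and gem5 calls appear only as comments):

```cpp
#include <cassert>

enum RubyRequestType {
    RubyRequestType_ST,
    RubyRequestType_ATOMIC_NO_RETURN,
};

// Stubbed sketch of the hitCallback switch: the no-return case drains its
// log entry (as in this PR) and then breaks instead of falling through to
// the store-style setData call on line 604 of the diff above.
void
sketchHitCallback(RubyRequestType type, bool isAtomicOp)
{
    switch (type) {
      case RubyRequestType_ATOMIC_NO_RETURN:
        assert(isAtomicOp);
        // delete [] data.popAtomicLogEntryFront();  // drain the log entry
        break;                                       // no setData needed
      case RubyRequestType_ST:
        // data.setData(pkt->getPtr<uint8_t>(), offset, pkt_size);
        break;
    }
}

int main()
{
    sketchHitCallback(RubyRequestType_ATOMIC_NO_RETURN, true);
    return 0;
}
```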

@BujSet (Contributor):
Oh, I'm not sure. In line with the comment, I just followed the same format as what happens for a store. But thinking about it more, maybe setData doesn't need to be called for either store or atomic-no-return requests. Will look into this.

@mysoreanoop (Contributor, Author)

Closing in favor of #546.

mysoreanoop closed this Nov 9, 2023
Labels
gpu-compute gem5's GPU Compute Code
Development

Successfully merging this pull request may close these issues.

m_atomicLog not getting cleared
7 participants