
Align XDP Queues with RSS Better #4484

Open
nibanks wants to merge 15 commits into main
Conversation

nibanks (Member) commented Aug 22, 2024

Description

Updates the XDP datapath layer (on Windows) to try to query the RSS processor before fully creating the queues, so that the real queues can be correctly aligned with RSS.

Testing

CI/CD

Documentation

N/A
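
A rough sketch of the idea (not this PR's actual diff): open the interface, read its RSS indirection table, and record the processor each entry points at before creating the XDP queues. The XdpInterfaceOpen/XdpRssGet calls and the PROCESSOR_NUMBER layout of the indirection table are assumptions about the XDP-for-Windows API surface in use here; most error handling is omitted.

HANDLE InterfaceHandle = NULL;
XDP_RSS_CONFIGURATION* RssConfig = NULL;
uint32_t RssConfigSize = 0;

// Open the interface and size the RSS configuration buffer (two-call pattern).
Status = Xdp->XdpApi->XdpInterfaceOpen(InterfaceIndex, &InterfaceHandle);
if (QUIC_FAILED(Status)) { goto Error; }
Xdp->XdpApi->XdpRssGet(InterfaceHandle, NULL, &RssConfigSize);
RssConfig = (XDP_RSS_CONFIGURATION*)CXPLAT_ALLOC_NONPAGED(RssConfigSize, QUIC_POOL_DATAPATH);
if (RssConfig == NULL) { Status = QUIC_STATUS_OUT_OF_MEMORY; goto Error; }
Status = Xdp->XdpApi->XdpRssGet(InterfaceHandle, RssConfig, &RssConfigSize);
if (QUIC_FAILED(Status)) { goto Error; }

// The indirection table says which processors RSS will actually indicate
// packets on; queue i should be created/affinitized against the i-th distinct
// processor found here rather than against an arbitrary CPU.
const PROCESSOR_NUMBER* Table =
    (const PROCESSOR_NUMBER*)((uint8_t*)RssConfig + RssConfig->IndirectionTableOffset);
const uint32_t EntryCount = RssConfig->IndirectionTableSize / sizeof(PROCESSOR_NUMBER);
for (uint32_t i = 0; i < EntryCount; i++) {
    // Record Table[i].Group / Table[i].Number for queue-to-processor alignment.
}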

codecov bot commented Aug 23, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.60%. Comparing base (733da8d) to head (3b66352).
Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4484      +/-   ##
==========================================
+ Coverage   86.37%   86.60%   +0.22%     
==========================================
  Files          56       56              
  Lines       15537    17354    +1817     
==========================================
+ Hits        13420    15029    +1609     
- Misses       2117     2325     +208     


nibanks marked this pull request as ready for review August 23, 2024 18:26
nibanks requested a review from a team as a code owner August 23, 2024 18:26
}
if (QueueIndex == UINT32_MAX) {
Status = QUIC_STATUS_INVALID_STATE;
QuicTraceEvent(
nibanks (Member, Author)

@mtfriesen I'm hitting this error in test (definitely over duonic; netperf is failing as well, but I don't have logs for that). I'm assuming that means we're trying to bind a queue to a processor that doesn't have an RSS queue? I'm trying to decide how to handle this. Thoughts?

Contributor

More queues than processors in the indirection table is completely legal. I'd just leave any leftover queues unaffinitized and hope for the best.

nibanks (Member, Author)

I'm actually thinking it might be the opposite. We have more processors than queues. I actually don't know what duonic does here. Do you?

Contributor

Hmm, ok, I'm not that familiar with this QUIC code. NICs just indicate packets on the processors in the indirection table - the rest are idle.

}
}
if (!Found) {
CXPLAT_FRE_ASSERT(FALSE); // TODO - What do we do if there is no partition for this processor?
nibanks (Member, Author)

@mtfriesen thoughts here? We allow an app to specify the CPUs they want us to run on. What do we do if there is an RSS queue whose CPU we aren't running on?

Contributor

I think the question you are asking here is the question I answered in the incorrect location above - a queue without an assigned processor is effectively not a queue.

if (0 == uReturn) {
uint32_t Flags = XSK_BIND_FLAG_TX;
Status = Xdp->XdpApi->XskBind(TxXsk, InterfaceIndex, i, Flags);
if (QUIC_FAILED(Status)) { // No more queues. Break out.
mtfriesen (Contributor) commented Sep 11, 2024

there could be errors other than just the queue count being exceeded. probably worth logging the error code and current index explicitly.

nibanks (Member, Author)

You want me to just log this one?

Contributor

this and the "activate" are the only two that might fail in nontrivial and unpredictable ways. it's probably worth logging them all if you have time.
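
A sketch of what that logging might look like, assuming msquic's QuicTraceLogVerbose macro; the event names are illustrative:

Status = Xdp->XdpApi->XskBind(TxXsk, InterfaceIndex, i, XSK_BIND_FLAG_TX);
if (QUIC_FAILED(Status)) {
    // Might just mean we ran past the last queue, but it could be a real
    // failure; logging the index and status lets us tell the two apart.
    QuicTraceLogVerbose(
        XdpQueueProbeBindFailed,
        "[ xdp] XskBind failed for queue %u, 0x%x",
        i,
        Status);
    break;
}
Status = Xdp->XdpApi->XskActivate(TxXsk, XSK_ACTIVATE_FLAG_NONE);
if (QUIC_FAILED(Status)) {
    QuicTraceLogVerbose(
        XdpQueueProbeActivateFailed,
        "[ xdp] XskActivate failed for queue %u, 0x%x",
        i,
        Status);
    break;
}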

free(rssTable);
}
XSK_NOTIFY_RESULT_FLAGS OutFlags;
Status = Xdp->XdpApi->XskNotifySocket(TxXsk, XSK_NOTIFY_FLAG_POKE_TX|XSK_NOTIFY_FLAG_WAIT_TX, 1000, &OutFlags);
Contributor

1000 magic number - should be #define ?

nibanks (Member, Author)

Will do.
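
For reference, the agreed-upon change might look like this; the macro name is illustrative:

#define TX_NOTIFY_TIMEOUT_MS 1000 // Replaces the bare magic number.

Status = Xdp->XdpApi->XskNotifySocket(
    TxXsk,
    XSK_NOTIFY_FLAG_POKE_TX | XSK_NOTIFY_FLAG_WAIT_TX,
    TX_NOTIFY_TIMEOUT_MS,
    &OutFlags);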


Cleanup:
const CXPLAT_PROCESSOR_GROUP_INFO* Group = &CxPlatProcessorGroupInfo[ProcNumber.Group];
Queues[i] = Group->Offset + (ProcNumber.Number % Group->Count);
Contributor

why is the modulo necessary? i probably just need to go read more of the QUIC code, but i would have expected all ProcNumber.Numbers to be less than Group->Count.

nibanks (Member, Author)

I think it shouldn't be necessary.

Contributor

is it for hot add? where QUIC has a fixed set of CPUs per group that might see more CPUs added to the group later?

nibanks (Member, Author)

No, we fix this list at process start.

Contributor

right, so if the group is fixed at size 1 during process start, and the OS later adds proc 2 to the group, then that's why you have the modulo?
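
For reference, the arithmetic: the modulo only has an effect when ProcNumber.Number is at or above the Group->Count captured at process start (the hot-add scenario raised above); otherwise it is a no-op.

// Group captured at startup: Offset = 0, Count = 4 (processors 0..3).
// Normal case:  ProcNumber.Number = 2 -> 0 + (2 % 4) = 2 (modulo is a no-op).
// Hot-add case: ProcNumber.Number = 5 -> 0 + (5 % 4) = 1 (wraps back onto a
//               known processor instead of indexing past the partition list).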

@@ -480,6 +372,9 @@ CxPlatDpRawInterfaceInitialize(
Interface->OffloadStatus.Transmit.NetworkLayerXsum = Xdp->SkipXsum;
Interface->Xdp = Xdp;

uint32_t Processors[256]; // TODO - Use max processor count
Contributor

unfinished TODO
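
One way to resolve the TODO, sketched under the assumption that a processor-count helper such as CxPlatProcCount() and the CXPLAT_ALLOC_NONPAGED/CXPLAT_FREE pattern are available in this context:

const uint32_t ProcCount = CxPlatProcCount();
uint32_t* Processors =
    (uint32_t*)CXPLAT_ALLOC_NONPAGED(ProcCount * sizeof(uint32_t), QUIC_POOL_DATAPATH);
if (Processors == NULL) {
    Status = QUIC_STATUS_OUT_OF_MEMORY;
    goto Error;
}
// ... populate Processors[0..QueueCount-1] from the RSS table and use it ...
CXPLAT_FREE(Processors, QUIC_POOL_DATAPATH);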

@@ -521,6 +427,7 @@ CxPlatDpRawInterfaceInitialize(
for (uint8_t i = 0; i < Interface->QueueCount; i++) {
XDP_QUEUE* Queue = &Interface->Queues[i];

Queue->RssProcessor = (uint16_t)Processors[i]; // TODO - Should memory be aligned with this?
Contributor

as in, should all allocations for this queue be aligned to that processor? ideally yes, for sure.

nibanks (Member, Author)

Exactly, and this is a larger issue too. For all memory allocations that are meant to be used on a particular processor, should we temporarily affinitize the thread, allocate the memory, and then revert the affinitization?

Contributor

in user mode, there are fancy APIs to allocate from a specific node, even if you're not currently running on that node. in kernel mode, it's more difficult to allocate pool from an arbitrary node, though i think newer OSes have better APIs. one sec.

that said, XDP itself treats the calling thread as the "ideal" thread when it performs its own allocations for queues, so temporarily setting the affinity would probably solve the problem the most universally and with the least fuss.

mtfriesen (Contributor) commented Sep 11, 2024

yeah, PoolExtendedParameterNumaNode is supported for ExAllocatePool3. i'd still just set the thread affinity and be done.
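
A minimal user-mode sketch of the temporarily-set-the-affinity approach being suggested; the helper name is hypothetical, and kernel-mode code might instead use KeSetSystemGroupAffinityThread or ExAllocatePool3 with PoolExtendedParameterNumaNode as noted above.

#include <string.h>
#include <windows.h>

// Pin the calling thread to the queue's RSS processor while allocating (and
// touching) the memory, so first-touch places the pages on that processor's
// NUMA node, then restore the previous affinity.
void*
AllocAlignedToProcessor(
    PROCESSOR_NUMBER ProcNumber,
    SIZE_T Size
    )
{
    GROUP_AFFINITY NewAffinity = {0};
    GROUP_AFFINITY OldAffinity;
    NewAffinity.Group = ProcNumber.Group;
    NewAffinity.Mask = (KAFFINITY)1 << ProcNumber.Number;

    if (!SetThreadGroupAffinity(GetCurrentThread(), &NewAffinity, &OldAffinity)) {
        return NULL; // Couldn't move; caller can fall back to a plain allocation.
    }

    void* Memory = VirtualAlloc(NULL, Size, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    if (Memory != NULL) {
        memset(Memory, 0, Size); // Touch pages while pinned so first-touch lands here.
    }

    SetThreadGroupAffinity(GetCurrentThread(), &OldAffinity, NULL); // Always revert.
    return Memory;
}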

nibanks (Member, Author)

So, if we're all on a single NUMA, how much would this actually matter?

Contributor

if you have UMA (idk if that's a proper term) then yeah, just do nothing.

nibanks (Member, Author)

For VMs that are currently pinned to just one of the NUMA nodes, are we sure there's no per-processor memory perf impact involved? Also, can we be guaranteed that the VM actually knows the NUMA topology of the hardware?

Contributor

AFAIK the VM can't do anything to optimize that scenario. It's all hidden and controlled by the host/hypervisor.
