Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nbft: Always stick to HFI-defined host_traddr #2180

Open
tbzatek opened this issue Jan 5, 2024 · 13 comments
Open

nbft: Always stick to HFI-defined host_traddr #2180

tbzatek opened this issue Jan 5, 2024 · 13 comments

Comments

@tbzatek
Copy link
Contributor

tbzatek commented Jan 5, 2024

nvme-cli/nbft.c

Lines 162 to 190 in dcdad6f

/*
* With TCP/DHCP, it can happen that the OS
* obtains a different local IP address than the
* firmware had. Retry without host_traddr.
*/
if (ret == -1 && errno == ENVME_CONNECT_WRITE &&
!strcmp((*ss)->transport, "tcp") &&
strlen(hfi->tcp_info.dhcp_server_ipaddr) > 0) {
nvme_free_ctrl(c);
trcfg.host_traddr = NULL;
cl = lookup_ctrl(h, &trcfg);
if (cl && nvme_ctrl_get_name(cl))
continue;
c = nvme_create_ctrl(r, (*ss)->subsys_nqn, (*ss)->transport,
(*ss)->traddr,
NULL, NULL, (*ss)->trsvcid);
if (!c) {
errno = ENOMEM;
goto out_free;
}
errno = 0;
ret = nvmf_add_ctrl(h, c, cfg);
if (ret == 0 && verbose >= 1)
fprintf(stderr,
"connect with host_traddr=\"%s\" failed, success after omitting host_traddr\n",
host_traddr);
}

I'm not sure this is 100% correct. The possibility of receiving different IP address between DHCP requests (UEFI vs. linux or simply different DHCP clients) is valid and should be covered.

However, consider the following scenario:

  • two NICs, two IP subnets and corresponding HFIs for each one
  • two SSNSs defined for the isolated subnets and corresponding HFI indexes

Setting proper host_traddr for connection attempts is crucial to indicate the source interface. Once omitted by the mentioned code snippet, connection might be established from a different interface, ignoring the desired HFI. In case of incorrect routing the second target may be silently reached from a different network interface than expected.

This needs properly parsed HFI indexes first, see linux-nvme/libnvme#766

@martin-belanger
Copy link

Hi @tbzatek - Regarding the host_traddr parameter:

Setting proper host_traddr for connection attempts is crucial to indicate the source interface.

For TCP (and TCP only), host_traddr is not used to select the "interface" but to select the "source address" of the connection. One must use host_iface to select the interface. host_traddr is used when an interface has multiple addresses configured (a primary and 1 or more secondary) and we want to prevent the primary address from being used for the connection.

@tbzatek
Copy link
Contributor Author

tbzatek commented Jan 11, 2024

For TCP (and TCP only), host_traddr is not used to select the "interface" but to select the "source address" of the connection. One must use host_iface to select the interface. host_traddr is used when an interface has multiple addresses configured (a primary and 1 or more secondary) and we want to prevent the primary address from being used for the connection.

Ah, thanks, I'm missing the low-level insight really. Still, wouldn't the end result effectively be equal, i.e. the kernel would use a network interface the host_traddr belongs to? Setting host_iface in userspace might involve some extra lookup.

Anyway, I think we'll first need to solve timberland-sig/edk2#29 and linux-nvme/libnvme#766 before fixing the loose end here.

@martin-belanger
Copy link

Still, wouldn't the end result effectively be equal, i.e. the kernel would use a network interface the host_traddr belongs to?

Not quite. When host_iface is not specified, the kernel looks up the destination address (traddr) in the routing table to determine the best interface for the connection. After the interface has been determined, then the kernel will try to set the source address to host_traddr, if specified. Otherwise, it sets the source address to the primary address assigned to that interface. Note that if host_traddr is specified but is invalid (i.e. it does not belong to the selected interface), then the whole operation will fail.

Setting host_iface in userspace might involve some extra lookup.

Correct! If all you have is a source address and you want to use that to force the connection on a specific interface regardless of the routing tables, then you need to find out which interface owns that address and use it with host_iface (check nvme_iface_matching_addr() for an example).

@mwilck
Copy link
Contributor

mwilck commented Jan 12, 2024

I don't we should completely remove the fallback code. Doing that would strongly increase the risk of boot failures, which is never a good thing.

However, I would suggest to walk the list of HFIs and use the fallback (without host_iface / host_traddr) only if connect attempts through all HFIs have failed.
That would currently come down to falling back on every failed connection, because we haven't seen a correctly populated secondary HFI list so far.

This would only work if multipath was configured in the NBFT via secondary HFI association. In theory, the NVMe boot spec allows 3 ways to set up multipath:

  1. secondary HFI association,
  2. multiple SNSS entries referring to different HFIs via the primary HFI field (this is the most likely setup in practice these days),
  3. multiple NBFT tables listing the same SNSS with different HFIs.

My proposed suggestion would cover only case 1. Covering the other cases would be complicated because we'd need to collect all NBFT data and do "deduplication" on them somehow before attempting to connect.

@tbzatek
Copy link
Contributor Author

tbzatek commented Jan 23, 2024

When host_iface is not specified, the kernel looks up the destination address (traddr) in the routing table to determine the best interface for the connection. After the interface has been determined, then the kernel will try to set the source address to host_traddr, if specified. Otherwise, it sets the source address to the primary address assigned to that interface.

Thanks Martin, useful insight! In our case, the host_traddr should equal to the primary address of the network interface. At least for static IP address assignments.

In case of DHCP, the HFI IP address (as being assigned to the EFI DHCP client) may differ from what's actually assigned to the particular network interface in OS stage (due to DHCP client differences and different identifiers sent out). We don't want the kernel host driver to look up a different interface that is best for the traddr, we want to stick to the selected interface (as defined by HFI), only need to omit the specified host_traddr if it doesn't match what's actually available in the system. We want to let the connection fail rather than letting the kernel to select a foreign interface (e.g. due to messed-up routing) and keep user in a false sense of security (multipath).

At least that's my proposal, if it makes sense.

Note that if host_traddr is specified but is invalid (i.e. it does not belong to the selected interface), then the whole operation will fail.

Yes, this is what I'm currently seeing due to the mixed-up HFI indexes in the NBFT table, i.e. wrong host_traddr specified from a foreign HFI/network interface. Caused by the above mentioned bugs.

Setting host_iface in userspace might involve some extra lookup.

Correct! If all you have is a source address and you want to use that to force the connection on a specific interface regardless of the routing tables, then you need to find out which interface owns that address and use it with host_iface (check nvme_iface_matching_addr() for an example).

Nice, I think this might be the way to go.

@tbzatek
Copy link
Contributor Author

tbzatek commented Jan 23, 2024

I don't we should completely remove the fallback code. Doing that would strongly increase the risk of boot failures, which is never a good thing.

Agree, the reason for the fallback is still valid and needed.

However, I would suggest to walk the list of HFIs and use the fallback (without host_iface / host_traddr) only if connect attempts through all HFIs have failed.

Agree with moving the fallback at the end, after all regular attempts. I would still like to keep the host_iface constraint, see above. Subject to further discussion.

That would currently come down to falling back on every failed connection, because we haven't seen a correctly populated secondary HFI list so far.

Yes, it will currently fail until the firmware and the libnvme sides are fixed. I would suggest to postpone the proper fix to nvme-cli until everything else is fixed.

This would only work if multipath was configured in the NBFT via secondary HFI association. In theory, the NVMe boot spec allows 3 ways to set up multipath:
1. secondary HFI association,
2. multiple SNSS entries referring to different HFIs via the primary HFI field (this is the most likely setup in practice these days),
3. multiple NBFT tables listing the same SNSS with different HFIs.

My proposed suggestion would cover only case 1. Covering the other cases would be complicated because we'd need to collect all NBFT data and do "deduplication" on them somehow before attempting to connect.

Right, I think what I have been talking about so far was for the case 2. We'll need to find a fix that suits both scenarios.

@stuarthayes
Copy link
Contributor

Setting host_iface in userspace might involve some extra lookup.

Correct! If all you have is a source address and you want to use that to force the connection on a specific interface regardless of the routing tables, then you need to find out which interface owns that address and use it with host_iface (check nvme_iface_matching_addr() for an example).

Nice, I think this might be the way to go.

Maybe we use the other information in the HFI TCP transport info descriptor (MAC address and/or PCI address) to look up and set host_iface in discover_from_nbft()?

@mwilck
Copy link
Contributor

mwilck commented Jan 29, 2024

Maybe we use the other information in the HFI TCP transport info descriptor (MAC address and/or PCI address) to look up and set host_iface in discover_from_nbft()?

Yes, we should do that. It matters only for multipath scenarios, and I believe that there will be only a few (corner) cases where this makes an actual difference, but if we want to be correct, we have to do it.

@martin-belanger has pointed out that libnvme already has the logic to do this kind of interface matching when opening connections, so I would assume that this already works. Or am I missing something?

@mwilck
Copy link
Contributor

mwilck commented Feb 1, 2024

Additional remarks based on a recent SUSE bug report. My memory has deceived me, I thougt we had made more progress in the meantime.

The host_traddr field is basically useless in NBFT/DHCP scenarios. The only clean way to solve this would be an implementation of TP8027 on both the Firmware and OS sides, which I don't expect in the near future.

The current state is:

  • EDK2 doesn't implement DHCP option 61; it is thus not an RFC4361 compliant DHCP client 1.
  • wicked uses a "stable DUID-LLT" (RFC3315) by default (this default can be changed using global options). The DUID-LLT is generated during first boot, saved on disk, and used from that time onward. wicked duid create uuid --product-id --update can be used to switch to a DUID-UUID based on the DMI product ID.
  • NetworkManager uses complex logic for setting the DHCP client ID involving various global and connection-specific configuration parameters, but AFAICS it can't be configured to use a DUID based on the DMI ID (/etc/machine-id only).

IOW, for DHCP configured interfaces at least, we might as well ignore the host_traddr field completely, as the IP address will almost certainly be different between FW and OS.

Footnotes

  1. Note that option 61 predates RFC 4361. It was already part of RFC 2131 in 1997.

@tbzatek
Copy link
Contributor Author

tbzatek commented Feb 22, 2024

Maybe we use the other information in the HFI TCP transport info descriptor (MAC address and/or PCI address) to look up and set host_iface in discover_from_nbft()?

Agree, this looks like a clean way.

The host_traddr field is basically useless in NBFT/DHCP scenarios. The only clean way to solve this would be an implementation of TP8027 on both the Firmware and OS sides, which I don't expect in the near future.

Right, as the address might be absent in the system, resolving corresponding interface might fail. In my original case, I still had the address matched, just mixed up due to wrong HFI indexes.

IOW, for DHCP configured interfaces at least, we might as well ignore the host_traddr field completely, as the IP address will almost certainly be different between FW and OS.

So until TP8027 is ready for use, do we agree to go with resolving host_iface from HFI MAC in the meantime?

I still think this won't currently work with EDK2 due to mixed HFI indexes (and also timberland-sig/edk2#29), but this appears to be correct with Dell firmware from what I've seen so far.

@tbzatek
Copy link
Contributor Author

tbzatek commented Apr 25, 2024

FYI, for the discovery code in #2315 I retained the original workaround for the time being. There are now three places with the fallback.

@igaw
Copy link
Collaborator

igaw commented Jul 12, 2024

Good to close?

@tbzatek
Copy link
Contributor Author

tbzatek commented Jul 12, 2024

Good to close?

Nope, this has not been implemented yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants