DNS lookups fail for api.nuget.org using Alpine based dotnet Docker images in AWS us-east-1 #9396

Open
ggatus-atlassian opened this issue Feb 22, 2023 · 20 comments

ggatus-atlassian commented Feb 22, 2023

Impact

I'm unable to use api.nuget.org from inside of an Alpine based docker image running in AWS us-east-1.

Describe the bug

We've found an issue where running an Alpine based dotnet image inside AWS us-east-1 (e.g. running an image on an EC2 instance with Docker) causes DNS lookups for api.nuget.org to fail, breaking many tools that integrate with NuGet. I've noticed this behaviour affecting builds running in Bitbucket Pipelines (our CI/CD service), and have reproduced similar issues directly on EC2. It happens when using Route53 as the DNS resolver (the default when starting up a new EC2 instance).

It appears the problem is due to Alpine's inability to handle truncated DNS responses. Running dig to perform a DNS lookup for api.nuget.org, we see the tc flag set in the response header, indicating a truncated DNS response. The following was executed from an EC2 instance in us-east-1; we've found truncation does not occur in us-west-2. In the response below, we don't receive any A records for api.nuget.org because of the truncation.

+ dig +noedns +ignore api.nuget.org
; <<>> DiG 9.18.11 <<>> +noedns +ignore api.nuget.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 42774
;; flags: qr tc rd ra; QUERY: 1, ANSWER: 8, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;api.nuget.org.			IN	A
;; ANSWER SECTION:
api.nuget.org.		22	IN	CNAME	nugetapiprod.trafficmanager.net.
nugetapiprod.trafficmanager.net. 22 IN	CNAME	apiprod-mscdn.azureedge.net.
apiprod-mscdn.azureedge.net. 300 IN	CNAME	apiprod-mscdn.afd.azureedge.net.
apiprod-mscdn.afd.azureedge.net. 6 IN	CNAME	star-azureedge-prod.trafficmanager.net.
star-azureedge-prod.trafficmanager.net.	55 IN CNAME shed.dual-low.part-0012.t-0009.fdv2-t-msedge.net.
shed.dual-low.part-0012.t-0009.fdv2-t-msedge.net. 172 IN CNAME global-entry-afdthirdparty-fallback-first.trafficmanager.net.
global-entry-afdthirdparty-fallback-first.trafficmanager.net. 49 IN CNAME shed.dual-low.part-0012.t-0009.fb-t-msedge.net.
shed.dual-low.part-0012.t-0009.fb-t-msedge.net.	49 IN CNAME part-0012.t-0009.fb-t-msedge.net.
;; Query time: 0 msec
;; SERVER: 10.30.0.2#53(10.30.0.2) (UDP)
;; WHEN: Wed Feb 22 04:13:48 UTC 2023
;; MSG SIZE  rcvd: 366

Running the same query from us-west-2 returns a complete response with A records:

; <<>> DiG 9.18.11 <<>> +noedns +ignore api.nuget.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 44616
;; flags: qr rd ra; QUERY: 1, ANSWER: 8, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;api.nuget.org.			IN	A
;; ANSWER SECTION:
api.nuget.org.		179	IN	CNAME	nugetapiprod.trafficmanager.net.
nugetapiprod.trafficmanager.net. 179 IN	CNAME	apiprod-mscdn.azureedge.net.
apiprod-mscdn.azureedge.net. 300 IN	CNAME	apiprod-mscdn.afd.azureedge.net.
apiprod-mscdn.afd.azureedge.net. 30 IN	CNAME	star-azureedge-prod.trafficmanager.net.
star-azureedge-prod.trafficmanager.net.	10 IN CNAME shed.dual-low.part-0012.t-0009.fdv2-t-msedge.net.
shed.dual-low.part-0012.t-0009.fdv2-t-msedge.net. 30 IN	CNAME part-0012.t-0009.fdv2-t-msedge.net.
part-0012.t-0009.fdv2-t-msedge.net. 42 IN A	13.107.238.40
part-0012.t-0009.fdv2-t-msedge.net. 42 IN A	13.107.237.40
;; Query time: 0 msec
;; SERVER: 10.30.0.2#53(10.30.0.2) (UDP)
;; WHEN: Wed Feb 22 04:24:17 UTC 2023
;; MSG SIZE  rcvd: 356

This prevents Alpine based Docker images running in us-east-1 with Route53 as the DNS resolver from communicating with NuGet. Switching to an alternative DNS provider such as Cloudflare at 1.1.1.1, or hardcoding api.nuget.org in /etc/hosts, resolves the problem. It's unclear whether this is a problem with AWS, NuGet, or a combination of the two.
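
For anyone applying the /etc/hosts workaround inside a container, here is a rough sketch. The resolved IP is illustrative only - CDN addresses rotate, so look one up at the time you apply it - and this assumes bind-tools is installed for dig:

apk add --no-cache bind-tools
IP="$(dig +short api.nuget.org @1.1.1.1 | tail -n 1)"   # last line of the CNAME chain should be an A record
echo "$IP api.nuget.org" >> /etc/hosts                  # pin the hostname for this container only
getent hosts api.nuget.org                              # should now print the pinned address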

Perhaps something has changed that caused the NuGet DNS responses to increase in size, breaking Alpine? Comparing the responses above from us-east-1 vs us-west-2, the us-east-1 response contains several additional CNAME entries. Alpine truncates DNS responses that exceed 512 bytes in size (see https://christoph.luppri.ch/fixing-dns-resolution-for-ruby-on-alpine-linux). In this case, we are unable to use any dotnet alpine image to talk to nuget from AWS in us-east-1.

Repro Steps

Steps to reproduce:

  • Launch an EC2 instance with Docker installed into the AWS us-east-1 region.
  • Start any Alpine based image on the instance.
  • Run wget api.nuget.org (a condensed sketch follows this list).
  • Observe that hostname resolution fails.
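
A condensed version of the repro, assuming an EC2 host in us-east-1 with Docker installed (the exact error text may vary):

docker run --rm alpine:latest wget api.nuget.org
# when DNS resolution fails, this reports something like: wget: bad address 'api.nuget.org'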

Expected Behavior

We can successfully call api.nuget.org (although it will fail with an HTTP 4xx response without appropriate credentials and path).

Screenshots

No response

Additional Context and logs

We've detected this issue inside of Bitbucket Pipelines, and can reproduce this directly on EC2 instances across unrelated AWS accounts where Route53 is used as a DNS resolver.

@joelverhagen
Member

Thanks for the heads up, @ggatus-atlassian! The part of the DNS stack that differs between regions is managed by a dependency of ours (our CDN provider), not by the NuGet.org team directly. We'll work with this partner to understand how to mitigate the problem.

⚠️ Note to others that encounter this issue, please post here if you are encountering this problem outside of the AWS us-east-1 region (e.g. from a home/office devbox, a different cloud provider, a different AWS region etc.).

@RiadGahlouz
Contributor

@ggatus-atlassian I was able to reproduce the larger response size with a VPN in Virginia but was unable to reproduce the truncated response in an Alpine Linux VM. Have you been able to reproduce the problem outside of an AWS VM (i.e. removing Route53 from the picture)?
[screenshot: dig output showing MSG SIZE rcvd: 814]

As you can see in the screenshot, the MSG SIZE is 814, which is higher than the limit you described. Do you know if there is a way to verify what the DNS response size limit is on a given Linux VM?

@ggatus-atlassian
Author

We've been unable to reproduce the problem outside of AWS us-east-1. We've been speaking with AWS support; they have run tests across us-east-1, us-east-2 and us-west-1 and only found the large CNAME chain being returned in us-east-1. AWS is also performing additional DNS response manipulation and truncation, which complicates this further beyond just the issues with Alpine.

This patch/bug report details the limitation of musl libc in Alpine - its 512B limit on responses and the lack of a fallback mechanism: https://git.musl-libc.org/cgit/musl/commit/src/network/res_msend.c?id=51d4669fb97782f6a66606da852b5afd49a08001

Can you provide some more details of the test environment you are using? What version of Alpine are you running, and what cloud platform are you running on? If you try running wget api.nuget.org, does it successfully resolve the host?

The 814-byte message size above is odd; potentially dig isn't using musl's resolver, or the version of Alpine you are running has been patched.
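
A few checks that might help narrow down which resolver path is being exercised - note that dig (from bind-tools) implements its own stub resolver, so it can behave differently from musl's getaddrinfo, which is what wget and (as far as I understand) dotnet go through:

cat /etc/os-release                 # confirm which Alpine release the image is based on
ldd "$(which dig)"                  # shows musl linkage, but dig still uses its own resolver logic
getent hosts api.nuget.org          # goes through musl's stub resolver, i.e. the path that appears to fail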

@RiadGahlouz
Contributor

@ggatus-atlassian To help us in our investigation, could you provide the packet captures of your DNS resolution requests? (e.g: Using WireShark)

@nslusher-sf

nslusher-sf commented Feb 23, 2023

We are also seeing this issue today. Our containers are running in CircleCI - I'm not sure what AWS region they are running in.

We are using the latest .NET SDK Alpine image (mcr.microsoft.com/dotnet/sdk:6.0-alpine) sha256:246295e97b7e11ea713d7436a69b0a2a4102c317da730eefc87e6fb047a0bf7d

~ # dig +noedns +ignore api.nuget.org

; <<>> DiG 9.18.11 <<>> +noedns +ignore api.nuget.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 11427
;; flags: qr tc rd ra; QUERY: 1, ANSWER: 8, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;api.nuget.org.			IN	A

;; ANSWER SECTION:
api.nuget.org.		268	IN	CNAME	nugetapiprod.trafficmanager.net.
nugetapiprod.trafficmanager.net. 268 IN	CNAME	apiprod-mscdn.azureedge.net.
apiprod-mscdn.azureedge.net. 181 IN	CNAME	apiprod-mscdn.afd.azureedge.net.
apiprod-mscdn.afd.azureedge.net. 30 IN	CNAME	star-azureedge-prod.trafficmanager.net.
star-azureedge-prod.trafficmanager.net.	44 IN CNAME shed.dual-low.part-0012.t-0009.fdv2-t-msedge.net.
shed.dual-low.part-0012.t-0009.fdv2-t-msedge.net. 233 IN CNAME global-entry-afdthirdparty-fallback-first.trafficmanager.net.
global-entry-afdthirdparty-fallback-first.trafficmanager.net. 60 IN CNAME shed.dual-low.part-0012.t-0009.fb-t-msedge.net.
shed.dual-low.part-0012.t-0009.fb-t-msedge.net.	60 IN CNAME part-0012.t-0009.fb-t-msedge.net.

;; Query time: 8 msec
;; SERVER: 127.0.0.11#53(127.0.0.11) (UDP)
;; WHEN: Thu Feb 23 22:08:30 UTC 2023
;; MSG SIZE  rcvd: 366

It looks like this is an Alpine DNS related issue.
This post sums up what appears to be happening here - truncated UDP DNS responses that are effectively empty. In those cases a retry should be made over TCP, but Alpine doesn't currently support TCP DNS queries.
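
For what it's worth, forcing the same lookup over TCP (dig supports this directly) seems like a quick way to confirm the full answer is available and only the UDP path is the problem:

dig +tcp api.nuget.org                # full CNAME chain plus A records, no truncation
dig +noedns +ignore api.nuget.org     # UDP query; in our case comes back with the tc flag and no A records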

@RichardD012

RichardD012 commented Feb 23, 2023

We're seeing this issue specifically from us-east-1 and the mcr.microsoft.com/dotnet/sdk:7.0-alpine3.17 image

; <<>> DiG 9.18.11 <<>> +noedns +ignore api.nuget.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 35865
;; flags: qr tc rd ra; QUERY: 1, ANSWER: 8, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;api.nuget.org.			IN	A

;; ANSWER SECTION:
api.nuget.org.		165	IN	CNAME	nugetapiprod.trafficmanager.net.
nugetapiprod.trafficmanager.net. 165 IN	CNAME	apiprod-mscdn.azureedge.net.
apiprod-mscdn.azureedge.net. 300 IN	CNAME	apiprod-mscdn.afd.azureedge.net.
apiprod-mscdn.afd.azureedge.net. 30 IN	CNAME	star-azureedge-prod.trafficmanager.net.
star-azureedge-prod.trafficmanager.net.	10 IN CNAME shed.dual-low.part-0012.t-0009.fdv2-t-msedge.net.
shed.dual-low.part-0012.t-0009.fdv2-t-msedge.net. 163 IN CNAME global-entry-afdthirdparty-fallback-first.trafficmanager.net.
global-entry-afdthirdparty-fallback-first.trafficmanager.net. 48 IN CNAME shed.dual-low.part-0012.t-0009.fb-t-msedge.net.
shed.dual-low.part-0012.t-0009.fb-t-msedge.net.	48 IN CNAME part-0012.t-0009.fb-t-msedge.net.

;; Query time: 3 msec
;; SERVER: 172.24.24.2#53(172.24.24.2) (UDP)
;; WHEN: Thu Feb 23 22:25:29 UTC 2023
;; MSG SIZE  rcvd: 366

This issue first started happening intermittently on Sunday, Feb. 19th.

Edit: Some more detail. We're seeing this on our GitHub Actions instances. Re-running a failed job produces the same dig +noedns +ignore api.nuget.org output as above, yet the NuGet restore sometimes succeeds. The vast majority of runs fail, but maybe 10% of retries work, still with the same dig output. We've added curl calls to api.nuget.org and those subsequently succeed.

@joelverhagen
Member

Hey folks, we're continuing the conversation with our CDN provider but they need some additional information. We're also checking if they can root-cause the problem without this additional info (but that's still unclear).

@ggatus-atlassian, @nslusher-sf, @RichardD012 - would it be possible for you to provide a packet capture (pcap file) of the DNS queries that reproduce the problem? The CDN provider mentioned Wireshark but perhaps there are other ways in Alpine Linux.

Also, is it a feasible workaround to override the DNS resolver at a system level to be CloudFlare/Google/OpenDNS's public DNS instead of the Route53 resolver while we're investigating the problem?
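
For anyone trying that, a rough sketch of the two usual places to override the resolver for Docker workloads (1.1.1.1 and 8.8.8.8 are just Cloudflare's and Google's public resolvers, shown as examples):

# per container: pass an explicit resolver to a single run
docker run --rm --dns 1.1.1.1 mcr.microsoft.com/dotnet/sdk:6.0-alpine \
  getent hosts api.nuget.org
# host wide: add "dns": ["1.1.1.1", "8.8.8.8"] to /etc/docker/daemon.json and restart the Docker daemon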

@joelverhagen joelverhagen pinned this issue Feb 24, 2023
@joelverhagen joelverhagen changed the title [NuGet.org Bug]: DNS lookups for api.nuget.org using Alpine based dotnet Docker images in AWS us-east-1 fail DNS lookups fail for api.nuget.org using Alpine based dotnet Docker images in AWS us-east-1 Feb 24, 2023
@KondorosiAttila

KondorosiAttila commented Feb 24, 2023

The dotnet restore command still fails with: "Unable to load the service index for source https://api.nuget.org/v3/index.json."

We are using the builder image mcr.microsoft.com/dotnet/sdk:6.0-alpine3.13 on Jenkins instances in the us-east-1 region.

@Boojapho

I don't believe this is an issue with AWS, but rather something odd with the Alpine image. On my local machine, I can run docker run -it --rm mcr.microsoft.com/dotnet/sdk:6.0-alpine curl -o /dev/null -w "%{http_code}" -sL https://api.nuget.org/v3/index.json. Most of the time it returns 000 because it cannot resolve the address; once in a while, I get a 200 status code.

If you change the image to mcr.microsoft.com/dotnet/sdk:6.0, you can get a 200 status code every time. I checked .NET 5.0 and got the same results (Alpine fails, but non-Alpine works).

@joelverhagen
Member

@Boojapho - using another non-Alpine image is indeed a workaround. I think the reason Alpine behaves differently is that it has a lower DNS response size limit. Per @ggatus-atlassian in the original post:

Alpine truncates DNS responses that exceed 512 bytes in size (see https://christoph.luppri.ch/fixing-dns-resolution-for-ruby-on-alpine-linux). In this case, we are unable to use any dotnet alpine image to talk to nuget from AWS in us-east-1.

@Boojapho

@joelverhagen Excellent point. I dug into it a little more and found that my DNS responses were part of the issue. I modified my DNS source and was able to get a reliable resolution with Alpine.

@mhoeper

mhoeper commented Feb 24, 2023

We have been seeing this issue as well for the past two days on Alpine Linux. Sometimes nuget.org resolves after retry attempts, but most of the time it fails.

@joelverhagen: Is there an update from the CDN provider (MS)? They should be interested in solving this, as they provide the .NET Alpine Linux image mcr.microsoft.com/dotnet/sdk:6.0-alpine...

@joelverhagen
Member

@mhoeper, could you provide the following information?

  • Are you running the Docker image inside AWS's us-east-1 region?
  • Are you using a CI product like CircleCI/BitBucket Pipelines? If not, where is your Docker host running?
  • Could you provide a packet capture of the DNS resolution (a packet capture of a NuGet restore including the failed resolves should suffice)? This will help us and our CDN provider understand more deeply what is going on. tcpdump should work (a rough capture sketch follows this list).
  • Could you provide the output of this command on the impacted Alpine VM? dig +noedns +ignore api.nuget.org
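
Something like the following should produce a usable pcap, run either from the Docker host or inside the container (assuming tcpdump is available, e.g. via apk add tcpdump, and the container has the capabilities needed to capture):

tcpdump -i any -w nuget-dns.pcap port 53 &      # capture all DNS traffic to a file
TCPDUMP_PID=$!
wget api.nuget.org || true                      # reproduce the failing lookup (or run your normal restore here)
kill "$TCPDUMP_PID"                             # stop the capture; attach nuget-dns.pcap here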

As a workaround while we investigate, you could consider using a non-Alpine Docker image (such as mcr.microsoft.com/dotnet/sdk:6.0) or try overriding your DNS resolver to something like CloudFlare/Google/OpenDNS instead of the default Route53 resolver (assuming you are running in AWS).

@joelverhagen
Member

Folks: we've mitigated the impact by failing over to our secondary CDN provider in Virginia (which is the state where AWS us-east-1 resides). We'll continue to investigate the situation with our primary CDN provider.

If you are still facing issues, please let us know in this thread.

@mhoeper

mhoeper commented Feb 24, 2023

  • We are running the Docker image mcr.microsoft.com/dotnet/sdk:6.0-alpine in the us-east-1 region.
  • We do not run CircleCI/Bitbucket Pipelines; instead, the Docker host is running on an Amazon Linux EC2 instance.
  • As we build our solution inside the Docker image, we cannot easily capture the DNS resolution right now.

I think I have successfully mitigated this for now by creating a NuGet.Config in the solution folder that clears nuget.org as a package source and adds your CDN host apiprod-mscdn.azureedge.net instead. Maybe this works for those who do not want to switch away from Alpine because of this.

<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <packageSources>
    <clear />  
    <add key="nuget.org" value="https://apiprod-mscdn.azureedge.net/v3/index.json" protocolVersion="3" />
  </packageSources>
</configuration>

@joelverhagen
Member

I think I have successfully mitigated this for now by creating a NuGet.Config in the solution folder that clears nuget.org as a package source and adds your CDN host apiprod-mscdn.azureedge.net instead.

Using an undocumented DNS name can lead to other problems in the future. Additionally, this won't fully work because of how our protocol uses linked data (i.e. there are URLs followed for some scenarios that still reference api.nuget.org because it is baked into the response body).

If this works for you, feel free to do it, but we can't make any guarantees that the DNS name you've used there will keep working forever.

@ilia-cy

ilia-cy commented Feb 24, 2023

Folks: we've mitigated the impact by failing over to our secondary CDN provider in Virginia (which is the state where AWS us-east-1 resides). We'll continue to investigate the situation with our primary CDN provider.

If you are still facing issues, please let us know in this thread.

Just ran several CircleCI based pipelines that were failing constantly before and they all passed.
So it seems that the mitigation indeed works.
Thanks!

@mhoeper

mhoeper commented Feb 27, 2023

This works for us as well.

@joelverhagen: Looks like Alpine is planning a story to retry over TCP when the truncated bit is received. However, until this is implemented, will NuGet.org try not to exceed the 512-byte limit? Otherwise, we would plan to migrate off Alpine....

@joelverhagen
Member

However, until this is implemented, will NuGet.org try not to exceed the 512-byte limit? Otherwise, we would plan to migrate off Alpine....

@mhoeper, it's currently not possible for us to guarantee that the 512-byte limit will not be exceeded. After further conversations with our primary CDN provider, this case occurs when there is "shedding", which appears to be a relatively rare case where the CDN determines it needs to provide an alternate DNS chain, likely due to high load in the area. This would align with the fact that the impacted customers are in a highly popular AWS region.

However, given the relatively narrow scope of the impact (Alpine Linux plus AWS regions which encounter CDN shedding), we may need to revert to the previous state if no better solution is found. We've mitigated the current situation by using our secondary CDN provider, which happens to have a smaller DNS response size. But we can't use this solution forever for scalability reasons.

After doing some research online, this seems to be a common problem for Alpine users (not just NuGet.org, not just .NET, not just Docker). I believe the retry over TCP is the proper solution for Alpine, but I can't speak authoritatively since I'm not an expert in musl libc (Alpine's libc implementation, which is the source of this behaviour) or Alpine's desired use-cases. I also don't know the timeline for Alpine/musl addressing this problem. It is likely a much longer timeline than we want to be using our secondary CDN provider.

I'll gently suggest moving to a non-Alpine Docker image in the short term to avoid any of these DNS problems. Alpine should probably be fine for runtime cases where NuGet.org is not needed but for SDK cases, it's probably best to avoid Alpine until this issue is resolved one way or another.

We're speaking internally to our partners about alternatives both in the CDN space and in the Docker image configuration. I can't guarantee any solution since these settings are outside of the NuGet.org team's control.

@zhhyu
Contributor

zhhyu commented Nov 21, 2023

Hi @ggatus-atlassian, @nslusher-sf, @RichardD012, @KondorosiAttila, @Boojapho, @mhoeper, @ilia-cy! Our apologies again for the inconvenience! Please take a look at issue #9736 for the root cause and next steps. Feel free to reach out to us at [email protected] or by commenting on the discussion issue NuGet/Home#12985. Thanks!
