Experiencing slow network under VZ virtualization #1333
Do you have repro instructions you can share, e.g. a docker-compose file using only public images? Is the problem specific to vz, or does it also happen with qemu?
I'm only able to reproduce on vz. It seems to successfully respond to queries for a while, then the "Stopping udp proxy" messages start showing up. Additionally, when stopping this vm the command hangs until terminated and logs:
The command hangs, but when querying for the status of the vm it thinks it's stopped. I'm not really able to reproduce using images available publicly. I attempted merging a couple of these https://github.com/docker/awesome-compose together to emulate a lot of images being pulled, but was unable to repro the super long hanging. However, it does seem to be generally slower than qemu. And I'm still seeing those "Stopping udp proxy" messages.
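For anyone trying to build a public repro, a minimal compose file that pulls several distinct public images might look like the sketch below (the service names and image choices are illustrative, not taken from this thread):

```yaml
# Hypothetical repro: several services pulling distinct public images,
# to generate concurrent image-pull traffic through the VM's network.
services:
  web:
    image: nginx:alpine
  cache:
    image: redis:alpine
  db:
    image: postgres:15-alpine
  queue:
    image: rabbitmq:3-alpine
```

Running `docker compose up -d` against this after a fresh instance start should exercise many simultaneous pulls, which is the situation where the hangs were reported.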
Thanks! This looks like it is a separate issue from #1285, as the crash is in the […]
@terev Here again, there are 2 different issues: […]
@balajiv113 it does seem like it might just be DNS. Here's my small experiment:

QEMU
[…]

Let me know if there's anything you'd like from me.
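The lookup-timing experiment above can be reproduced with a small script run inside the guest under both backends; here is a minimal sketch (the hostname and attempt count are arbitrary choices, not from the thread):

```python
import socket
import time


def time_lookup(hostname: str, attempts: int = 5) -> float:
    """Return the average getaddrinfo() latency in milliseconds."""
    total = 0.0
    for _ in range(attempts):
        start = time.monotonic()
        socket.getaddrinfo(hostname, 80, proto=socket.IPPROTO_TCP)
        total += time.monotonic() - start
    return total / attempts * 1000


if __name__ == "__main__":
    # Run inside the guest under both vz and qemu to compare;
    # swap in a real external hostname to exercise the host resolver.
    print(f"avg lookup: {time_lookup('localhost'):.2f} ms")
```

Comparing the numbers from a vz guest and a qemu guest makes the resolver-latency difference concrete.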
Thanks this helps. Possible Reason @AkihiroSuda / @jandubois VZ with inbuild DNS
|
I think replacing our host resolver with the gvisor implementation is at least the medium-term goal. Can we still define static names like […]? However, is this slowness of the UDP proxy expected? DNS does not cause a lot of traffic, so where does all this latency come from?
I'm very new to networking in linux and the networking architecture implemented by this project, but I have a few questions that may be dumb (feel free to ignore). That change appears to bring lookups very close to QEMU performance, which seems great. But I'm curious why the HTTP round trip is also so much slower than in QEMU? Does this have to do with differences between VZ's and QEMU's network setup that facilitates connections from the guest, or something to do with the different virtualizations? I would have expected VZ to be the same or better.
Not sure. Currently investigating this; will update if I find something.
@terev Exactly this. The way the network is developed/configured in the framework used is entirely different; the working is more or less the same.
@terev The difference in colima: when colima uses gvproxy, it is making use of gvproxy's in-built DNS (the same one I mentioned above, with improved performance) […]
@jandubois - I did a performance test of guest -> host for UDP; both slirp and gvisor-tap-vsock are performing the same. So I would take this as expected slowness.
@balajiv113 Hmm, so are you saying Colima uses slirp for VZ networking then? And this is the difference, because lima uses gvproxy for VZ by default?
@terev Not exactly.

Working in lima:
- Guest -> google.com -> slirp (for QEMU) / gvisor-tap-vsock (for VZ) -> forwards udp -> lima hostagent resolves DNS

Working in colima:
- Guest -> google.com -> slirp (for QEMU --network-driver slirp) / gvisor-tap-vsock (for VZ) -> forwards udp -> lima hostagent resolves DNS
- QEMU (--network-driver gvproxy): no UDP forwarding

So only in colima QEMU with --network-driver gvproxy will the performance be faster, as there is no UDP forwarding happening.
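To make the "forwards udp" hop concrete, here is a toy loopback sketch (not lima's actual implementation) of a datagram relay sitting between a client and an upstream responder; every query pays for the extra relay round trip, which is the hop the gvproxy in-built DNS path avoids:

```python
import socket
import threading


def udp_echo_server(sock: socket.socket) -> None:
    # Stands in for the host-side resolver: echoes each datagram back.
    while True:
        data, addr = sock.recvfrom(2048)
        sock.sendto(data, addr)


def udp_forwarder(listen_sock: socket.socket, upstream_addr) -> None:
    # The extra hop: relay each client datagram upstream, return the reply.
    up = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    up.settimeout(2)
    while True:
        data, client = listen_sock.recvfrom(2048)
        up.sendto(data, upstream_addr)
        reply, _ = up.recvfrom(2048)
        listen_sock.sendto(reply, client)


def run_demo() -> bytes:
    echo = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    echo.bind(("127.0.0.1", 0))
    fwd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    fwd.bind(("127.0.0.1", 0))
    threading.Thread(target=udp_echo_server, args=(echo,), daemon=True).start()
    threading.Thread(target=udp_forwarder,
                     args=(fwd, echo.getsockname()), daemon=True).start()
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client.settimeout(2)
    # The "query" traverses client -> forwarder -> echo and back.
    client.sendto(b"query", fwd.getsockname())
    reply, _ = client.recvfrom(2048)
    return reply


if __name__ == "__main__":
    print(run_demo())
```

Timing many round trips through `run_demo`'s relay versus hitting the echo socket directly is one way to see how much latency a forwarding hop alone contributes.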
@balajiv113 Ohhh, ok, I think I understand. It seems like forwarding traffic is the ideal choice for DNS, but there's a way to respond from within the guest which may help. But other TCP and UDP connections have a large amount of added latency due to the virtual network forwarding. Weird that there's so much latency added for forwarded traffic. I assume that's a huge rabbit hole to investigate. Should an issue be opened with the gvisor-tap-vsock project?
@terev As far as I checked, the latency is consistent between both frameworks (slirp & gvisor-tap-vsock).
@balajiv113 Oh damn, alright, that's unfortunate. I'm curious how this compares to Docker Desktop. Are there downsides to responding to DNS from the gateway? Doing this seems like it'd be worth it since it's a significant latency improvement.
Not really, which is why we should be switching to it. AFAIK the only piece missing right now is being able to define additional static names (like […]).
FWIW I just ran the same experiment with lima on 0.14.2 and DNS lookups are fast on VZ: […]
This still appears to be occurring for us. The issue seems to get worse the longer the vm is running.
FWIW, a bunch of people at my company are seeing this under vz. Some people have switched back to qemu (though qemu has its own set of issues). As someone who doesn't fully understand this issue, I'm not sure I can be of much help, but if you folks think of anything I can do to help, let me know. Maybe I can try to figure out a bash script that reproduces the issue as quickly as possible after a colima restart.
The network stack for vz was updated in #1383 (targeted for v0.16).
Oh neat thanks!
I added the following to my ~/.lima/_config/networks.yaml file:

```yaml
networks:
  ...
  user-v2:
    mode: user-v2
```

And using the following lima config:

```yaml
cpus: 2
memory: 6GiB
# Example to run lima instance with experimental user-v2 network enabled
images:
- location: "https://cloud-images.ubuntu.com/releases/22.04/release/ubuntu-22.04-server-cloudimg-amd64.img"
  arch: "x86_64"
- location: "https://cloud-images.ubuntu.com/releases/22.04/release/ubuntu-22.04-server-cloudimg-arm64.img"
  arch: "aarch64"
vmType: "vz"
mountType: "virtiofs"
rosetta:
  # Enable Rosetta for Linux.
  # Hint: try `softwareupdate --install-rosetta` if Lima gets stuck at `Installing rosetta...`
  enabled: true
  # Register rosetta to /proc/sys/fs/binfmt_misc
  binfmt: true
mounts:
- location: "~"
- location: "/tmp/lima"
  writable: true
networks:
- lima: user-v2
```

But it seems dns lookups are still slow: […]
FWIW, turning the host resolver off with gvproxy seems to help with latency, but after a while we often end up seeing hangs as a result of some network request.
True, as part of the new network stack DNS resolution is not yet moved to this new model. I will provide this support in a follow-up soon.
Can you check the same with the new network stack in lima, as mentioned above?
Yeah, turning the host resolver off for user-v2 shows a significant latency improvement too. […]
@terev can you confirm if in #1333 (comment) you meant that this slow DNS resolution issue is totally not reproducible when using a […]
I've just freshly installed lima and started the […]

Also seeing this in […]
@herrernst it looks like they're working on a PR that will hopefully address this issue for good. But in the meantime you can improve DNS lookup performance by passing a dns config and disabling the host resolver when creating an instance. https://github.com/lima-vm/lima/blob/master/docs/network.md#dns-19216853
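That workaround can be sketched in the instance's lima.yaml; the resolver address below is just an example, and this is my reading of the linked doc, so double-check the keys against your lima version:

```yaml
# Disable the (slow) host resolver so DNS is not proxied over UDP.
hostResolver:
  enabled: false
# Upstream servers handed to the guest; only used when hostResolver is off.
dns:
- 1.1.1.1
```

With this in place, the guest queries the listed server directly instead of going through the hostagent's UDP proxy.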
This issue should be fixed in the latest master. Do give it a try and let us know if you face similar issues.
@balajiv113 Amazing thank you for your work! I'll give it a try soon. Is there any config required or is this the default network/DNS config with the VZ driver? |
For vz, no config is needed; it should automatically get switched to this DNS resolution.
@balajiv113 Thanks, DNS and network in general seem to be quick now, and also no more udp timeout errors in logs. |
@balajiv113 So far I've had some success with this change. When it works it seems awesome. However, I've experienced situations where, after I start the instance, the network works briefly then seems to stop responding. When this occurs I see the following messages in the ha.stderr.log file:

```
{"level":"info","msg":"Forwarding \"/run/lima-guestagent.sock\" (guest) to \"/Users/trevorfoster/.lima/colima/ga.sock\" (host)","time":"2023-09-05T02:33:14-04:00"}
{"level":"debug","msg":"guest agent info: \u0026{LocalPorts:[{IP:0.0.0.0 Port:22} {IP::: Port:22}]}","time":"2023-09-05T02:33:14-04:00"}
{"level":"debug","msg":"guest agent event: {Time:2023-09-05 06:33:14.359155463 +0000 UTC LocalPortsAdded:[{IP:0.0.0.0 Port:22} {IP::: Port:22}] LocalPortsRemoved:[] Errors:[]}","time":"2023-09-05T02:33:14-04:00"}
{"level":"info","msg":"Not forwarding TCP 0.0.0.0:22","time":"2023-09-05T02:33:14-04:00"}
{"level":"info","msg":"Not forwarding TCP [::]:22","time":"2023-09-05T02:33:14-04:00"}
{"level":"error","msg":"cannot receive packets from , disconnecting: cannot read size from socket: read unixgram -\u003e: use of closed network connection","time":"2023-09-05T02:33:48-04:00"}
{"level":"error","msg":"FD connection closed with errorcannot read size from socket: read unixgram -\u003e: use of closed network connection","time":"2023-09-05T02:33:48-04:00"}
{"level":"error","msg":"write unixgram -\u003e: write: no buffer space available","time":"2023-09-05T02:33:48-04:00"}
```

The command I'm using to start the vm is […]. I noticed that I don't have user-v2: in my ~/.lima/_config/networks.yaml file. Does that matter? Thank you again for your work.
@terev This looks more like a different issue, related to the network freeze happening in vz: #1609. Also, if possible, do share the following details: […]
No, it doesn't, as I believe you are using the default network only, not user-v2.
@balajiv113 Gotcha, that does seem to track. I'm using an M1 MacBook with 16 GB of memory and 8 cores. I'm starting a large docker compose project which involves pulling lots of docker images. Looking at Activity Monitor while starting, cpu usage is quite high; it gets close to fully using all 4 allocated cores.
If it's sharable, please do share. I will give it a try with my M1 and M2 and see if that happens.
@balajiv113 Unfortunately I'm unable to share the exact compose project. I was considering trying to reproduce with many projects from https://github.com/docker/awesome-compose, though that may run into docker hub rate limits.
@balajiv113: I found that after upgrading lima from 0.17.2 to HEAD-de1b3ee, DNS wouldn't work with existing VMs. I had to delete and recreate the VM for DNS to work at all. Is that expected?
@aaronlehmann |
@AkihiroSuda |
Description

Lima: HEAD-0da3240
Colima: version=HEAD-cf522e8,driver=vz,mounts=virtiofs
macOS: Ventura 13.1

I'm running HEAD of lima and colima and I think I'm still seeing something similar to issue #1285 frequently. The most frequent occurrence of this issue for me is when pulling many docker images while starting a docker compose project. The command hangs and sometimes times out, producing a message like: […]

In the host agent logs ha.stderr.log I see many of the following message:

```
{"level":"debug","msg":"Stopping udp proxy (read udp 192.168.5.2:58193: i/o timeout)","time":"2023-01-26T01:12:39-05:00"}
```
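If you want to gauge how often these timeouts occur, a throwaway helper (hypothetical, not part of lima) can count the matching hostagent log lines:

```python
import json
import sys


def count_udp_timeouts(lines) -> int:
    """Count hostagent JSON log lines that report a stopped UDP proxy."""
    count = 0
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines mixed into the log
        if isinstance(entry, dict) and "Stopping udp proxy" in entry.get("msg", ""):
            count += 1
    return count


if __name__ == "__main__":
    # Usage sketch: python count_timeouts.py < ~/.lima/<instance>/ha.stderr.log
    print(count_udp_timeouts(sys.stdin))
```

Comparing counts before and after a config change (e.g. disabling the host resolver) gives a rough signal of whether the UDP proxy timeouts improved.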