
Experiencing slow network under VZ virtualization #1333

Closed
Tracked by #1610
terev opened this issue Jan 26, 2023 · 40 comments

@terev
Contributor

terev commented Jan 26, 2023

Description

Lima: HEAD-0da3240
Colima: version=HEAD-cf522e8,driver=vz,mounts=virtiofs
MacOS: Ventura 13.1

I'm running HEAD of lima and colima, and I think I'm still frequently seeing something similar to issue #1285. The most common occurrence for me is when pulling many Docker images while starting a Docker Compose project. The command hangs and sometimes times out, producing a message like:

Error response from daemon: Head "********************": dial tcp: lookup gcr.io on 192.168.5.3:53: read udp 192.168.5.15:40202->192.168.5.3:53: i/o timeout
Error: exit status 18

In the host agent log ha.stderr.log I see many of the following messages:
{"level":"debug","msg":"Stopping udp proxy (read udp 192.168.5.2:58193: i/o timeout)","time":"2023-01-26T01:12:39-05:00"}

@jandubois
Member

Do you have repro instructions you can share, e.g. a docker-compose file using only public images?

Is the problem specific to vz, or does it also happen with qemu?

@terev
Contributor Author

terev commented Jan 26, 2023

I'm only able to reproduce on vz. It responds to queries successfully for a while, then the Stopping udp proxy messages start showing up. Additionally, when stopping this VM the command hangs until terminated and logs:

panic: runtime/cgo: misuse of an invalid Handle

goroutine 17 [running, locked to thread]:
runtime/cgo.Handle.Value(...)
	/opt/homebrew/Cellar/go/1.19.5/libexec/src/runtime/cgo/handle.go:124
github.com/Code-Hex/vz/v3.changeStateOnObserver(0x0, 0x14000002680?)
	/Users/trevorfoster/Library/Caches/Homebrew/go_mod_cache/pkg/mod/github.com/!code-!hex/vz/[email protected]/virtualization.go:160 +0xec

The command hangs, but when querying the status of the VM it reports as stopped.

I'm not really able to reproduce using publicly available images. I tried merging a couple of the projects from https://github.com/docker/awesome-compose to emulate a lot of images being pulled, but I was unable to repro the very long hangs. However, it does seem to be generally slower than qemu, and I'm still seeing those Stopping udp proxy messages.

@jandubois
Member

Thanks! This looks like it is a separate issue from #1285, as the crash is in the vz code.

@terev changed the title from 'Experiencing "Lima get stuck when certain frequency of DNS resolution occurs" on HEAD' to 'Experiencing slow network under VZ virtualization' on Jan 27, 2023
@balajiv113
Member

@terev
Is the slow network only related to DNS resolution, or are the download speeds slow as well?

Here again, there are two different issues:

  1. The actual issue with DNS
  2. The panic on VM stop (same as Lima VM hangs and when stopped produces a cgo panic #1299); this panic has nothing to do with the slowness issue

@terev
Contributor Author

terev commented Jan 27, 2023

@balajiv113 it does seem like it might just be DNS. Here's my small experiment:
VZ

$ colima start issuetest -t vz --mount-type virtiofs
$ colima ssh -p issuetest
$ for i in $(seq 1 10); do curl -s -w 'lookup: %{time_namelookup} total: %{time_total}\n' -o /dev/null https://www.google.com ; done
lookup: 1.009936 total: 1.180874
lookup: 1.014016 total: 1.200459
lookup: 1.012715 total: 1.210918
lookup: 1.014798 total: 1.194656
lookup: 1.010661 total: 1.188050
lookup: 1.014198 total: 1.194364
lookup: 1.012562 total: 1.179951
lookup: 1.010642 total: 1.193922
lookup: 1.006803 total: 1.212509
lookup: 1.013336 total: 1.188702

QEMU

$ colima start issuetest -t qemu
$ colima ssh -p issuetest
$ for i in $(seq 1 10); do curl -s -w 'lookup: %{time_namelookup} total: %{time_total}\n' -o /dev/null https://www.google.com ; done
lookup: 0.011018 total: 0.202526
lookup: 0.003772 total: 0.165436
lookup: 0.003735 total: 0.200770
lookup: 0.003029 total: 0.190208
lookup: 0.005203 total: 0.189349
lookup: 0.001524 total: 0.169970
lookup: 0.001482 total: 0.173584
lookup: 0.005046 total: 0.185870
lookup: 0.002242 total: 0.181922
lookup: 0.003841 total: 0.194537

Let me know if there's anything you'd like from me.
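
One way to separate the DNS lookup from the HTTP round trip is to query the resolvers directly (a sketch assuming dig is installed in the guest; 192.168.5.3 is the in-guest nameserver shown in the error above):

$ time dig +short www.google.com @192.168.5.3   # via the lima gateway / udp proxy
$ time dig +short www.google.com @1.1.1.1       # bypassing the gateway resolver, for comparison

If the first is consistently about a second slower, the latency is in the forwarding path rather than in the upstream resolver.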

@balajiv113
Member

Thanks, this helps.

Possible Reason
It looks like gvisor-tap-vsock forwards connections for guest -> host access. In the DNS case, a udp-proxy is used to forward the packets, and that might be the reason for this slowness.
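
To illustrate where the per-query cost comes from, here is a minimal, hedged sketch of a UDP forwarding loop in Go. It is not the gvisor-tap-vsock implementation; the function name, listen port, and 10-second idle timeout are made up for the example. Each query is read from the guest-facing socket, copied to a host-side socket, and the reply copied back, and the proxy shuts down on an i/o timeout, which is the kind of event behind the Stopping udp proxy messages above. This only shows the shape of the forwarding path; it does not by itself explain the full second of added latency.

// udpproxy_sketch.go - illustrative only, not the gvisor-tap-vsock code
package main

import (
	"log"
	"net"
	"time"
)

// forwardUDP copies datagrams between a guest-facing socket and an upstream
// host-side socket, one query/reply pair at a time.
func forwardUDP(guest net.PacketConn, upstreamAddr string, idle time.Duration) error {
	upstream, err := net.Dial("udp", upstreamAddr)
	if err != nil {
		return err
	}
	defer upstream.Close()

	buf := make([]byte, 65535)
	for {
		guest.SetReadDeadline(time.Now().Add(idle))
		n, from, err := guest.ReadFrom(buf)
		if err != nil {
			// on an idle timeout this is where a "Stopping udp proxy"
			// style message would be logged
			log.Printf("stopping udp proxy (%v)", err)
			return err
		}
		if _, err := upstream.Write(buf[:n]); err != nil {
			return err
		}
		upstream.SetReadDeadline(time.Now().Add(idle))
		m, err := upstream.Read(buf)
		if err != nil {
			return err
		}
		if _, err := guest.WriteTo(buf[:m], from); err != nil {
			return err
		}
	}
}

func main() {
	// listen on a local UDP port and forward queries to a public resolver
	pc, err := net.ListenPacket("udp", "127.0.0.1:5353")
	if err != nil {
		log.Fatal(err)
	}
	defer pc.Close()
	forwardUDP(pc, "1.1.1.1:53", 10*time.Second)
}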

@AkihiroSuda / @jandubois
Should we use the built-in DNS for the vz driver? In that case the requests would be answered by the gateway itself, so no forwarding would happen.

VZ with built-in DNS

lookup: 0.034468 total: 0.503908
lookup: 0.002234 total: 0.505414
lookup: 0.002497 total: 0.451307
lookup: 0.002519 total: 0.448397
lookup: 0.002834 total: 0.503139
lookup: 0.002498 total: 0.501404
lookup: 0.002626 total: 0.502992
lookup: 0.003732 total: 0.434553
lookup: 0.002362 total: 0.472442
lookup: 0.002220 total: 0.500188

@jandubois
Member

Should we use the built-in DNS for the vz driver? In that case the requests would be answered by the gateway itself.

I think replacing our host resolver with the gvisor implementation is at least the medium-term goal. Can we still define static names like host.lima.internal, though? The user can define additional names, either as IP addresses or as (CNAME) aliases, in the lima.yaml file.

However, is this slowness of the UDP proxy expected? DNS does not cause a lot of traffic, so where does all this latency come from?
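
For reference, the kind of lima.yaml entries being talked about looks roughly like this (a sketch based on the hostResolver section of the template; the extra name and its address are just examples):

hostResolver:
  enabled: true
  # user-defined names; values can be IP addresses or aliases to other names
  hosts:
    my.registry.internal: 192.168.5.2
    host.name: host.lima.internal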

@terev
Contributor Author

terev commented Jan 28, 2023

I'm very new to networking on Linux and to the networking architecture implemented by this project, so I have a few questions that may be dumb (feel free to ignore). That change appears to bring lookups very close to QEMU performance, which seems great. But I'm curious why the HTTP round trip is also so much slower than under QEMU. Does this have to do with differences between VZ and QEMU's network setup for connections from the guest, or something to do with the different virtualizations? I would have expected VZ to be the same or better.

@balajiv113
Member

slowness of the UDP proxy expected

Not sure. Currently investigating this; I will update if I find something.

Does this have to do with differences between VZ and QEMU's network setup

@terev Exactly this. The way the network is set up and configured in the framework used is entirely different, even though the behavior is more or less the same.

@balajiv113
Member

@terev
Just did a dig on both vz and QEMU. DNS is working as expected actually :)

Difference?
The actual difference is coming from colima, not from lima.
With lima, both QEMU and VZ take about the same time (around lookup: 1.009936 total: 1.180874).

Difference in colima
Colima has its own gvproxy network driver (the same framework used by the lima vz driver). It is used by default for QEMU profiles but is not supported for the vz driver.

When colima uses gvproxy, it makes use of gvproxy's built-in DNS (the same one I mentioned above with improved performance: the requests are answered by the gateway itself, so no forwarding happens). That's why colima with QEMU replies faster than colima with VZ.
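
To make "the gateway answers DNS itself" concrete, here is a rough sketch of how a program embedding gvisor-tap-vsock can hand it static DNS zones when building the virtual network configuration. This is not lima's or colima's actual wiring, and the subnet and addresses are illustrative; the types.Zone/types.Record shapes follow the gvisor-tap-vsock examples:

// dnszone_sketch.go - illustrative wiring only, not lima/colima code
package main

import (
	"net"

	"github.com/containers/gvisor-tap-vsock/pkg/types"
)

func buildConfig() *types.Configuration {
	return &types.Configuration{
		Subnet:    "192.168.5.0/24",
		GatewayIP: "192.168.5.3",
		// Zones answered directly by the gateway, so a guest query for
		// host.lima.internal never has to cross the UDP proxy.
		DNS: []types.Zone{
			{
				Name: "lima.internal.",
				Records: []types.Record{
					{Name: "host", IP: net.ParseIP("192.168.5.2")},
				},
			},
		},
	}
}

func main() { _ = buildConfig() }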

@balajiv113
Member

is this slowness of the UDP proxy expected?

@jandubois - I did a performance test of guest -> host UDP; both slirp and gvisor-tap-vsock perform the same, so I would take this as expected slowness.
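
For anyone who wants to reproduce that comparison, a rough guest -> host UDP check with iperf3 (assuming iperf3 is installed on both sides and the host is reachable at 192.168.5.2 from the guest on the default lima network):

# on the macOS host
$ iperf3 -s
# inside the guest
$ iperf3 -c 192.168.5.2 -u -b 200M -t 10

Running the same pair under a qemu instance and a vz instance gives a direct comparison of the two forwarding paths.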

@terev
Contributor Author

terev commented Jan 30, 2023

@balajiv113 Hmm, so are you saying Colima uses slirp for VZ networking then? And that's the difference, because lima uses gvproxy for VZ by default?

@balajiv113
Member

@terev Not exactly.

Working in lima
Below is the request flow for both VZ and QEMU:

Guest -> google.com -> slirp (for QEMU) / gvisor-tap-vsock (for VZ) -> forwards udp -> lima hostagent resolves DNS

Working in colima

Guest -> google.com -> slirp (for QEMU --network-driver slirp) / gvisor-tap-vsock (for VZ) -> forwards udp -> lima hostagent resolves DNS

QEMU (--network-driver gvproxy)
Guest -> google.com -> slirp / gvisor-tap-vsock -> resolves DNS

So only colima QEMU with --network-driver gvproxy is faster, because there is no UDP forwarding happening.

@terev
Contributor Author

terev commented Jan 30, 2023

@balajiv113 Ohhh, OK, I think I understand. It seems like forwarding DNS traffic is the problem, but there's a way to respond from within the virtual network itself, which may help. Other TCP and UDP connections, though, still see a large amount of added latency due to the virtual network forwarding.

Weird that there's so much latency added for forwarded traffic. I assume that's a huge rabbit hole to investigate. Should an issue be opened with the gvisor-tap-vsock project?

@balajiv113
Member

@terev As far as I checked, the latency is consistent between both frameworks (slirp & gvisor-tap-vsock).

@terev
Contributor Author

terev commented Jan 31, 2023

@balajiv113 Oh damn alright that's unfortunate. I'm curious how this compares to Docker Desktop.

Are there downsides to responding to DNS from the gateway? Doing this seems like it'd be worth it, since it's a significant latency improvement.

@jandubois
Member

Are there downsides to responding to DNS from the gateway?

Not really, which is why we should be switching to it. AFAIK the only piece missing right now is to be able to define additional static names (like host.lima.internal, or additional entries from lima.yaml).

@terev
Contributor Author

terev commented Feb 6, 2023

FWIW I just ran the same experiment with lima 0.14.2 and DNS lookups are fast on VZ:

/ # for i in $(seq 1 10); do curl -s -w 'lookup: %{time_namelookup} total: %{time_total}\n' -o /dev/null https://www.google.com ; done
lookup: 0.039751 total: 0.279438
lookup: 0.016078 total: 0.251529
lookup: 0.012076 total: 0.308732
lookup: 0.012782 total: 0.233175
lookup: 0.025466 total: 0.297233
lookup: 0.014166 total: 0.260644
lookup: 0.022045 total: 0.259245
lookup: 0.014491 total: 0.265971
lookup: 0.013092 total: 0.236433
lookup: 0.014356 total: 0.252456

@terev
Contributor Author

terev commented May 1, 2023

This is still occurring for us, and it seems to get worse the longer the VM is running.

@benh4ckett

FWIW, a bunch of people at my company are seeing this under vz. Some have switched back to qemu (though qemu has its own set of issues). As someone who doesn't fully understand this issue I'm not sure I can be of much help, but if you folks think of anything I can do, let me know. Maybe I can try to put together a bash script that reproduces the issue as quickly as possible after a colima restart; a rough sketch is below.
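
In case it's useful, a sketch of the kind of repro script I have in mind, using only public images (the image names and round count are arbitrary, and Docker Hub rate limits may apply):

#!/usr/bin/env bash
# pull a batch of public images repeatedly and log DNS/total timings between rounds
set -euo pipefail
images=(alpine:3.18 nginx:1.25 redis:7 postgres:15 python:3.11-slim)
for round in $(seq 1 20); do
  for img in "${images[@]}"; do
    docker rmi -f "$img" >/dev/null 2>&1 || true
    time docker pull "$img" >/dev/null
  done
  curl -s -w "round $round lookup: %{time_namelookup} total: %{time_total}\n" \
    -o /dev/null https://www.google.com
done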

@AkihiroSuda
Member

The network stack for vz was updated in #1383 (targeted for v0.16).
Does the issue still occur with this PR?

@terev
Contributor Author

terev commented May 4, 2023

Oh neat thanks!

limactl --version
limactl version HEAD-79bc1e1

I added the following to my ~/.lima/_config/networks.yaml file:

networks:
  ...
  user-v2:
    mode: user-v2

And using the following lima config:

cpus: 2
memory: 6GiB
# Example to run lima instance with experimental user-v2 network enabled
images:
  - location: "https://cloud-images.ubuntu.com/releases/22.04/release/ubuntu-22.04-server-cloudimg-amd64.img"
    arch: "x86_64"
  - location: "https://cloud-images.ubuntu.com/releases/22.04/release/ubuntu-22.04-server-cloudimg-arm64.img"
    arch: "aarch64"
vmType: "vz"
mountType: "virtiofs"
rosetta:
  # Enable Rosetta for Linux.
  # Hint: try `softwareupdate --install-rosetta` if Lima gets stuck at `Installing rosetta...`
  enabled: true
  # Register rosetta to /proc/sys/fs/binfmt_misc
  binfmt: true
mounts:
  - location: "~"
  - location: "/tmp/lima"
    writable: true
networks:
  - lima: user-v2

But it seems DNS lookups are still slow:

for i in $(seq 1 10); do curl -s -w 'lookup: %{time_namelookup} total: %{time_total}\n' -o /dev/null https://www.google.com ; done
lookup: 1.004340 total: 1.277296
lookup: 1.004924 total: 1.275450
lookup: 1.005430 total: 1.288956
lookup: 1.006277 total: 1.269120
lookup: 1.004637 total: 1.247322
lookup: 1.005678 total: 1.274065
lookup: 1.004407 total: 1.272516
lookup: 1.013946 total: 1.270648
lookup: 1.002987 total: 1.278580
lookup: 1.004400 total: 1.263776

@terev
Contributor Author

terev commented May 4, 2023

FWIW, turning the host resolver off with gvproxy seems to help with latency, but after a while we often end up seeing hanging as a result of some network request.

@balajiv113
Member

But it seems DNS lookups are still slow:

True; as part of the new network stack, DNS resolution has not yet been moved to this new model. I will provide this support in a follow-up soon.

hanging as a result of some network request.

Can you check the same with the new network stack in lima, as mentioned above?

@terev
Contributor Author

terev commented May 4, 2023

Yeah, turning the host resolver off for user-v2 shows a significant latency improvement too.

for i in $(seq 1 10); do curl -s -w 'lookup: %{time_namelookup} total: %{time_total}\n' -o /dev/null https://www.google.com ; done
lookup: 0.033342 total: 0.290496
lookup: 0.000825 total: 0.266588
lookup: 0.000858 total: 0.270565
lookup: 0.000861 total: 0.268383
lookup: 0.000876 total: 0.277635
lookup: 0.000808 total: 0.242906
lookup: 0.000641 total: 0.260983
lookup: 0.000769 total: 0.239489
lookup: 0.000787 total: 0.247760
lookup: 0.000792 total: 0.241116

@kconley-sq

FWIW I just ran the same experiment with lima 0.14.2 and DNS lookups are fast on VZ:

/ # for i in $(seq 1 10); do curl -s -w 'lookup: %{time_namelookup} total: %{time_total}\n' -o /dev/null https://www.google.com ; done
lookup: 0.039751 total: 0.279438
lookup: 0.016078 total: 0.251529
lookup: 0.012076 total: 0.308732
lookup: 0.012782 total: 0.233175
lookup: 0.025466 total: 0.297233
lookup: 0.014166 total: 0.260644
lookup: 0.022045 total: 0.259245
lookup: 0.014491 total: 0.265971
lookup: 0.013092 total: 0.236433
lookup: 0.014356 total: 0.252456

@terev can you confirm whether in #1333 (comment) you meant that this slow DNS resolution issue is not reproducible at all when using a vz VM via Colima after reverting your lima version to 0.14.2?

@herrernst

I've just freshly installed lima and started the experimental/vz template, and I still have slow DNS resolution. limactl version 0.17.2

lookup: 1.005215 total: 1.181721
lookup: 1.004708 total: 1.176699
lookup: 1.004669 total: 1.206167
lookup: 1.004570 total: 1.197191

Also seeing this in ha.stderr.log:

{"level":"debug","msg":"Stopping udp proxy (read udp 192.168.5.2:52877: i/o timeout)","time":"2023-08-30T11:33:02+02:00"}
{"level":"debug","msg":"Stopping udp proxy (read udp 192.168.5.2:52877: i/o timeout)","time":"2023-08-30T11:33:02+02:00"}
{"level":"debug","msg":"Stopping udp proxy (read udp 192.168.5.2:52877: i/o timeout)","time":"2023-08-30T11:33:03+02:00"}
{"level":"debug","msg":"Stopping udp proxy (read udp 192.168.5.2:52877: i/o timeout)","time":"2023-08-30T11:33:05+02:00"}
{"level":"debug","msg":"Stopping udp proxy (read udp 192.168.5.2:52877: i/o timeout)","time":"2023-08-30T11:33:06+02:00"}

@terev
Contributor Author

terev commented Aug 31, 2023

@herrernst It looks like they're working on a PR that will hopefully address this issue for good, but in the meantime you can improve DNS lookup performance by passing a dns config and disabling the host resolver when creating an instance: https://github.com/lima-vm/lima/blob/master/docs/network.md#dns-19216853
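
For anyone looking for the exact keys, the relevant lima.yaml override looks roughly like this (the resolver addresses are just examples; the linked doc is authoritative):

hostResolver:
  enabled: false
dns:
  - 1.1.1.1
  - 8.8.8.8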

@balajiv113
Member

This issue should be fixed in the latest master: 336c669.

Do give it a try and let us know if you still face similar issues.

@terev
Contributor Author

terev commented Sep 3, 2023

@balajiv113 Amazing, thank you for your work! I'll give it a try soon. Is any config required, or is this the default network/DNS setup with the VZ driver?

@balajiv113
Member

For vz no config is needed; it should automatically switch to this DNS resolution.

@herrernst

@balajiv113 Thanks, DNS and the network in general seem to be quick now, and there are no more udp timeout errors in the logs.

@terev
Contributor Author

terev commented Sep 5, 2023

@balajiv113 So far I've had some success with this change; when it works it seems awesome. However, I've experienced situations where the network works briefly after I start the instance and then seems to stop responding. When this occurs I see the following messages in the ha.stderr.log file:

{"level":"info","msg":"Forwarding \"/run/lima-guestagent.sock\" (guest) to \"/Users/trevorfoster/.lima/colima/ga.sock\" (host)","time":"2023-09-05T02:33:14-04:00"}
{"level":"debug","msg":"guest agent info: \u0026{LocalPorts:[{IP:0.0.0.0 Port:22} {IP::: Port:22}]}","time":"2023-09-05T02:33:14-04:00"}
{"level":"debug","msg":"guest agent event: {Time:2023-09-05 06:33:14.359155463 +0000 UTC LocalPortsAdded:[{IP:0.0.0.0 Port:22} {IP::: Port:22}] LocalPortsRemoved:[] Errors:[]}","time":"2023-09-05T02:33:14-04:00"}
{"level":"info","msg":"Not forwarding TCP 0.0.0.0:22","time":"2023-09-05T02:33:14-04:00"}
{"level":"info","msg":"Not forwarding TCP [::]:22","time":"2023-09-05T02:33:14-04:00"}
{"level":"error","msg":"cannot receive packets from , disconnecting: cannot read size from socket: read unixgram -\u003e: use of closed network connection","time":"2023-09-05T02:33:48-04:00"}
{"level":"error","msg":"FD connection closed with errorcannot read size from socket: read unixgram -\u003e: use of closed network connection","time":"2023-09-05T02:33:48-04:00"}
{"level":"error","msg":"write unixgram -\u003e: write: no buffer space available","time":"2023-09-05T02:33:48-04:00"}

The command I'm using to start the VM is: colima start --memory 8 --cpu 4 --mount-type virtiofs --vm-type vz --vz-rosetta
Lima version: limactl version HEAD-de1b3ee
Colima version: colima version 0.5.5

I noticed that I don't have user-v2: in my ~/.lima/_config/networks.yaml file. Does that matter?

Thank you again for your work.

@balajiv113
Member

@terev This looks more like a different issue, related to the network freeze happening in vz #1609.

Also, if possible, do share the following details:

  • What kind of program was running in the VM during the network failure
  • Are you using an M1? Share machine details like CPU and memory

Does that matter?

No it doesn't, as I believe you are using only the default network, not user-v2.

@terev
Contributor Author

terev commented Sep 5, 2023

@balajiv113 Gotcha, that does seem to track. I'm using an M1 MacBook with 16 GB of memory and 8 cores. I'm starting a large Docker Compose project which involves pulling lots of Docker images. Looking at Activity Monitor while it starts, CPU usage is quite high; it gets close to fully using all 4 allocated cores.

@balajiv113
Member

I'm starting a large Docker Compose project which involves pulling lots of Docker images

If it's shareable, please do share. I will give it a try with my M1 and M2 and see if it happens.

@terev
Contributor Author

terev commented Sep 5, 2023

@balajiv113 Unfortunately I'm unable to share the exact compose project. I was considering trying to reproduce with a number of projects from https://github.com/docker/awesome-compose, though that may run into Docker Hub rate limits.

@aaronlehmann

@balajiv113: I found that after upgrading lima from 0.17.2 to HEAD-de1b3ee, DNS wouldn't work with existing VMs. I had to delete and recreate the VM for DNS to work at all. Is that expected?

@balajiv113
Member

@aaronlehmann
Looks like the iptables rules are not updated properly for old instances. I will check if it can be fixed.

@balajiv113
Member

@AkihiroSuda
I think this can be closed, as we now use the gvisor-tap-vsock DNS resolver for the user-v2 network and the vz default network, so it will be faster now.
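
For anyone verifying on their own instance, a quick way to check that the guest is being answered by the gateway resolver rather than through the forwarded host resolver (assuming dig is available in the guest):

$ cat /etc/resolv.conf                              # should point at the gateway, e.g. 192.168.5.3
$ dig www.google.com | grep -E 'SERVER|Query time'  # query time should be a few ms, not ~1s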
