
Experiencing slow network under VZ virtualization #1333

Closed
Tracked by #1610
terev opened this issue Jan 26, 2023 · 40 comments

@terev
Contributor

terev commented Jan 26, 2023

Description

Lima: HEAD-0da3240
Colima: version=HEAD-cf522e8,driver=vz,mounts=virtiofs
MacOS: Ventura 13.1

I'm running HEAD of lima and colima, and I think I'm still frequently seeing something similar to issue #1285. The most common occurrence for me is when pulling many Docker images while starting a Docker Compose project. The command hangs and sometimes times out, producing a message like:

Error response from daemon: Head "********************": dial tcp: lookup gcr.io on 192.168.5.3:53: read udp 192.168.5.15:40202->192.168.5.3:53: i/o timeout
Error: exit status 18

In the host agent log ha.stderr.log I see many of the following messages:
{"level":"debug","msg":"Stopping udp proxy (read udp 192.168.5.2:58193: i/o timeout)","time":"2023-01-26T01:12:39-05:00"}

@jandubois
Member

Do you have repro instructions you can share, e.g. a docker-compose file using only public images?

Is the problem specific to vz, or does it also happen with qemu?

@terev
Contributor Author

terev commented Jan 26, 2023

I'm only able to reproduce on vz. It responds to queries successfully for a while, then the Stopping udp proxy messages start showing up. Additionally, when stopping this VM the command hangs until terminated and logs:

panic: runtime/cgo: misuse of an invalid Handle

goroutine 17 [running, locked to thread]:
runtime/cgo.Handle.Value(...)
	/opt/homebrew/Cellar/go/1.19.5/libexec/src/runtime/cgo/handle.go:124
github.com/Code-Hex/vz/v3.changeStateOnObserver(0x0, 0x14000002680?)
	/Users/trevorfoster/Library/Caches/Homebrew/go_mod_cache/pkg/mod/github.com/!code-!hex/vz/[email protected]/virtualization.go:160 +0xec

The command hangs, but when querying the status of the VM it reports as stopped.

I'm not really able to reproduce using publicly available images. I tried merging a couple of the projects from https://github.com/docker/awesome-compose to emulate a lot of images being pulled, but I was unable to repro the very long hangs. However, it does seem to be generally slower than qemu, and I'm still seeing those Stopping udp proxy messages.

@jandubois
Member

Thanks! This looks like it is a separate issue from #1285, as the crash is in the vz code.

@terev changed the title from 'Experiencing "Lima get stuck when certain frequency of DNS resolution occurs" on HEAD' to 'Experiencing slow network under VZ virtualization' on Jan 27, 2023
@balajiv113
Member

@terev
Is the slow network only related to DNS resolution, or are the download speeds slow as well?

Here again, there are two different issues:

  1. The actual issue with DNS
  2. The panic on VM stop (same as Lima VM hangs and when stopped produces a cgo panic #1299); this panic has nothing to do with the slowness issue

@terev
Contributor Author

terev commented Jan 27, 2023

@balajiv113 it does seem like it might just be DNS. Here's my small experiment:
VZ

$ colima start issuetest -t vz --mount-type virtiofs
$ colima ssh -p issuetest
$ for i in $(seq 1 10); do curl -s -w 'lookup: %{time_namelookup} total: %{time_total}\n' -o /dev/null https://www.google.com ; done
lookup: 1.009936 total: 1.180874
lookup: 1.014016 total: 1.200459
lookup: 1.012715 total: 1.210918
lookup: 1.014798 total: 1.194656
lookup: 1.010661 total: 1.188050
lookup: 1.014198 total: 1.194364
lookup: 1.012562 total: 1.179951
lookup: 1.010642 total: 1.193922
lookup: 1.006803 total: 1.212509
lookup: 1.013336 total: 1.188702

QEMU

$ colima start issuetest -t qemu
$ colima ssh -p issuetest
$ for i in $(seq 1 10); do curl -s -w 'lookup: %{time_namelookup} total: %{time_total}\n' -o /dev/null https://www.google.com ; done
lookup: 0.011018 total: 0.202526
lookup: 0.003772 total: 0.165436
lookup: 0.003735 total: 0.200770
lookup: 0.003029 total: 0.190208
lookup: 0.005203 total: 0.189349
lookup: 0.001524 total: 0.169970
lookup: 0.001482 total: 0.173584
lookup: 0.005046 total: 0.185870
lookup: 0.002242 total: 0.181922
lookup: 0.003841 total: 0.194537

Let me know if there's anything you'd like from me.
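
One way to separate the DNS lookup from the HTTP round trip is to query the resolvers directly (a sketch assuming dig is installed in the guest; 192.168.5.3 is the in-guest nameserver shown in the error above):

$ time dig +short www.google.com @192.168.5.3   # via the lima gateway / udp proxy
$ time dig +short www.google.com @1.1.1.1       # bypassing the gateway resolver, for comparison

If the first is consistently about a second slower, the latency is in the forwarding path rather than in the upstream resolver.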

@balajiv113
Member

Thanks, this helps.

Possible Reason
It looks like gvisor-tap-vsock forwards connections for guest -> host access. In the DNS case, a udp-proxy is used to forward the packets, and that might be the reason for this slowness.
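
To illustrate where the per-query cost comes from, here is a minimal, hedged sketch of a UDP forwarding loop in Go. It is not the gvisor-tap-vsock implementation; the function name, listen port, and 10-second idle timeout are made up for the example. Each query is read from the guest-facing socket, copied to a host-side socket, and the reply copied back, and the proxy shuts down on an i/o timeout, which is the kind of event behind the Stopping udp proxy messages above. This only shows the shape of the forwarding path; it does not by itself explain the full second of added latency.

// udpproxy_sketch.go - illustrative only, not the gvisor-tap-vsock code
package main

import (
	"log"
	"net"
	"time"
)

// forwardUDP copies datagrams between a guest-facing socket and an upstream
// host-side socket, one query/reply pair at a time.
func forwardUDP(guest net.PacketConn, upstreamAddr string, idle time.Duration) error {
	upstream, err := net.Dial("udp", upstreamAddr)
	if err != nil {
		return err
	}
	defer upstream.Close()

	buf := make([]byte, 65535)
	for {
		guest.SetReadDeadline(time.Now().Add(idle))
		n, from, err := guest.ReadFrom(buf)
		if err != nil {
			// on an idle timeout this is where a "Stopping udp proxy"
			// style message would be logged
			log.Printf("stopping udp proxy (%v)", err)
			return err
		}
		if _, err := upstream.Write(buf[:n]); err != nil {
			return err
		}
		upstream.SetReadDeadline(time.Now().Add(idle))
		m, err := upstream.Read(buf)
		if err != nil {
			return err
		}
		if _, err := guest.WriteTo(buf[:m], from); err != nil {
			return err
		}
	}
}

func main() {
	// listen on a local UDP port and forward queries to a public resolver
	pc, err := net.ListenPacket("udp", "127.0.0.1:5353")
	if err != nil {
		log.Fatal(err)
	}
	defer pc.Close()
	forwardUDP(pc, "1.1.1.1:53", 10*time.Second)
}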

@AkihiroSuda / @jandubois
Should we use the built-in DNS for the vz driver? In that case the requests would be answered by the gateway itself, so no forwarding would happen.

VZ with built-in DNS

lookup: 0.034468 total: 0.503908
lookup: 0.002234 total: 0.505414
lookup: 0.002497 total: 0.451307
lookup: 0.002519 total: 0.448397
lookup: 0.002834 total: 0.503139
lookup: 0.002498 total: 0.501404
lookup: 0.002626 total: 0.502992
lookup: 0.003732 total: 0.434553
lookup: 0.002362 total: 0.472442
lookup: 0.002220 total: 0.500188

@jandubois
Member

Should we use the built-in DNS for the vz driver? In that case the requests would be answered by the gateway itself.

I think replacing our host resolver with the gvisor implementation is at least the medium-term goal. Can we still define static names like host.lima.internal, though? The user can define additional names, either as IP addresses or as (CNAME) aliases, in the lima.yaml file.

However, is this slowness of the UDP proxy expected? DNS does not cause a lot of traffic, so where does all this latency come from?
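
For reference, the kind of lima.yaml entries being talked about looks roughly like this (a sketch based on the hostResolver section of the template; the extra name and its address are just examples):

hostResolver:
  enabled: true
  # user-defined names; values can be IP addresses or aliases to other names
  hosts:
    my.registry.internal: 192.168.5.2
    host.name: host.lima.internal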

@terev
Contributor Author

terev commented Jan 28, 2023

I'm very new to networking on Linux and to the networking architecture implemented by this project, so I have a few questions that may be dumb (feel free to ignore). That change appears to bring lookups very close to QEMU performance, which seems great. But I'm curious why the HTTP round trip is also so much slower than under QEMU. Does this have to do with differences between VZ and QEMU's network setup for connections from the guest, or something to do with the different virtualizations? I would have expected VZ to be the same or better.

@balajiv113
Member

slowness of the UDP proxy expected

Not sure. Currently investigating this; I will update if I find something.

Does this have to do with differences between VZ and QEMU's network setup

@terev Exactly this. The way the network is set up and configured in the framework used is entirely different, even though the behavior is more or less the same.

@balajiv113
Member

@terev
Just did a dig on both vz and QEMU. DNS is working as expected actually :)

Difference?
The actual difference is coming from colima, not from lima.
With lima, both QEMU and VZ take about the same time (around lookup: 1.009936 total: 1.180874).

Difference in colima
Colima has its own gvproxy network driver (the same framework used by the lima vz driver). It is used by default for QEMU profiles but is not supported for the vz driver.

When colima uses gvproxy, it makes use of gvproxy's built-in DNS (the same one I mentioned above with improved performance: the requests are answered by the gateway itself, so no forwarding happens). That's why colima with QEMU replies faster than colima with VZ.
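
To make "the gateway answers DNS itself" concrete, here is a rough sketch of how a program embedding gvisor-tap-vsock can hand it static DNS zones when building the virtual network configuration. This is not lima's or colima's actual wiring, and the subnet and addresses are illustrative; the types.Zone/types.Record shapes follow the gvisor-tap-vsock examples:

// dnszone_sketch.go - illustrative wiring only, not lima/colima code
package main

import (
	"net"

	"github.com/containers/gvisor-tap-vsock/pkg/types"
)

func buildConfig() *types.Configuration {
	return &types.Configuration{
		Subnet:    "192.168.5.0/24",
		GatewayIP: "192.168.5.3",
		// Zones answered directly by the gateway, so a guest query for
		// host.lima.internal never has to cross the UDP proxy.
		DNS: []types.Zone{
			{
				Name: "lima.internal.",
				Records: []types.Record{
					{Name: "host", IP: net.ParseIP("192.168.5.2")},
				},
			},
		},
	}
}

func main() { _ = buildConfig() }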

@balajiv113
Member

is this slowness of the UDP proxy expected?

@jandubois - I did a performance test of guest -> host UDP; both slirp and gvisor-tap-vsock perform the same, so I would take this as expected slowness.
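
For anyone who wants to reproduce that comparison, a rough guest -> host UDP check with iperf3 (assuming iperf3 is installed on both sides and the host is reachable at 192.168.5.2 from the guest on the default lima network):

# on the macOS host
$ iperf3 -s
# inside the guest
$ iperf3 -c 192.168.5.2 -u -b 200M -t 10

Running the same pair under a qemu instance and a vz instance gives a direct comparison of the two forwarding paths.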

@terev
Contributor Author

terev commented Jan 30, 2023

@balajiv113 Hmm, so are you saying Colima uses slirp for VZ networking then? And that's the difference, because lima uses gvproxy for VZ by default?

@balajiv113
Member

@terev Not exactly.

Working in lima
Below is the request flow for both VZ and QEMU:

Guest -> google.com -> slirp (for QEMU) / gvisor-tap-vsock (for VZ) -> forwards udp -> lima hostagent resolves DNS

Working in colima

Guest -> google.com -> slirp (for QEMU --network-driver slirp) / gvisor-tap-vsock (for VZ) -> forwards udp -> lima hostagent resolves DNS

QEMU (--network-driver gvproxy)
Guest -> google.com -> slirp / gvisor-tap-vsock -> resolves DNS

So only colima QEMU with --network-driver gvproxy is faster, because there is no UDP forwarding happening.

@terev
Contributor Author

terev commented Jan 30, 2023

@balajiv113 Ohhh, OK, I think I understand. It seems like forwarding DNS traffic is the problem, but there's a way to respond from within the virtual network itself, which may help. Other TCP and UDP connections, though, still see a large amount of added latency due to the virtual network forwarding.

Weird that there's so much latency added for forwarded traffic. I assume that's a huge rabbit hole to investigate. Should an issue be opened with the gvisor-tap-vsock project?

@balajiv113
Member

@terev As far as I checked, the latency is consistent between both frameworks (slirp & gvisor-tap-vsock).

@terev
Contributor Author

terev commented Jan 31, 2023

@balajiv113 Oh damn alright that's unfortunate. I'm curious how this compares to Docker Desktop.

Are there downsides to responding to DNS from the gateway? Doing this seems like it'd be worth it, since it's a significant latency improvement.

@jandubois
Member

Are there downsides to responding to DNS from the gateway?

Not really, which is why we should be switching to it. AFAIK the only piece missing right now is to be able to define additional static names (like host.lima.internal, or additional entries from lima.yaml).

@terev
Contributor Author

terev commented Feb 6, 2023

FWIW I just ran the same experiment with lima 0.14.2 and DNS lookups are fast on VZ:

/ # for i in $(seq 1 10); do curl -s -w 'lookup: %{time_namelookup} total: %{time_total}\n' -o /dev/null https://www.google.com ; done
lookup: 0.039751 total: 0.279438
lookup: 0.016078 total: 0.251529
lookup: 0.012076 total: 0.308732
lookup: 0.012782 total: 0.233175
lookup: 0.025466 total: 0.297233
lookup: 0.014166 total: 0.260644
lookup: 0.022045 total: 0.259245
lookup: 0.014491 total: 0.265971
lookup: 0.013092 total: 0.236433
lookup: 0.014356 total: 0.252456

@terev
Contributor Author

terev commented May 1, 2023

This is still occurring for us, and it seems to get worse the longer the VM is running.

@benh4ckett

FWIW, a bunch of people at my company are seeing this under vz. Some have switched back to qemu (though qemu has its own set of issues). As someone who doesn't fully understand this issue I'm not sure I can be of much help, but if you folks think of anything I can do, let me know. Maybe I can try to put together a bash script that reproduces the issue as quickly as possible after a colima restart; a rough sketch is below.
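
In case it's useful, a sketch of the kind of repro script I have in mind, using only public images (the image names and round count are arbitrary, and Docker Hub rate limits may apply):

#!/usr/bin/env bash
# pull a batch of public images repeatedly and log DNS/total timings between rounds
set -euo pipefail
images=(alpine:3.18 nginx:1.25 redis:7 postgres:15 python:3.11-slim)
for round in $(seq 1 20); do
  for img in "${images[@]}"; do
    docker rmi -f "$img" >/dev/null 2>&1 || true
    time docker pull "$img" >/dev/null
  done
  curl -s -w "round $round lookup: %{time_namelookup} total: %{time_total}\n" \
    -o /dev/null https://www.google.com
done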

@AkihiroSuda
Member

The network stack for vz was updated in #1383 (targeted for v0.16).
Does the issue still occur with this PR?

@terev
Contributor Author

terev commented May 4, 2023

Oh neat thanks!

limactl --version
limactl version HEAD-79bc1e1

I added the following to my ~/.lima/_config/networks.yaml file:

networks:
  ...
  user-v2:
    mode: user-v2

And using the following lima config:

cpus: 2
memory: 6GiB
# Example to run lima instance with experimental user-v2 network enabled
images:
  - location: "https://cloud-images.ubuntu.com/releases/22.04/release/ubuntu-22.04-server-cloudimg-amd64.img"
    arch: "x86_64"
  - location: "https://cloud-images.ubuntu.com/releases/22.04/release/ubuntu-22.04-server-cloudimg-arm64.img"
    arch: "aarch64"
vmType: "vz"
mountType: "virtiofs"
rosetta:
  # Enable Rosetta for Linux.
  # Hint: try `softwareupdate --install-rosetta` if Lima gets stuck at `Installing rosetta...`
  enabled: true
  # Register rosetta to /proc/sys/fs/binfmt_misc
  binfmt: true
mounts:
  - location: "~"
  - location: "/tmp/lima"
    writable: true
networks:
  - lima: user-v2

But it seems DNS lookups are still slow:

for i in $(seq 1 10); do curl -s -w 'lookup: %{time_namelookup} total: %{time_total}\n' -o /dev/null https://www.google.com ; done
lookup: 1.004340 total: 1.277296
lookup: 1.004924 total: 1.275450
lookup: 1.005430 total: 1.288956
lookup: 1.006277 total: 1.269120
lookup: 1.004637 total: 1.247322
lookup: 1.005678 total: 1.274065
lookup: 1.004407 total: 1.272516
lookup: 1.013946 total: 1.270648
lookup: 1.002987 total: 1.278580
lookup: 1.004400 total: 1.263776

@terev
Contributor Author

terev commented May 4, 2023

FWIW, turning the host resolver off with gvproxy seems to help with latency, but after a while we often end up seeing hanging as a result of some network request.

@balajiv113
Member

But it seems DNS lookups are still slow:

True; as part of the new network stack, DNS resolution has not yet been moved to this new model. I will provide this support in a follow-up soon.

hanging as a result of some network request.

Can you check the same with the new network stack in lima, as mentioned above?

@terev
Contributor Author

terev commented May 4, 2023

Yeah, turning the host resolver off for user-v2 shows a significant latency improvement too.

for i in $(seq 1 10); do curl -s -w 'lookup: %{time_namelookup} total: %{time_total}\n' -o /dev/null https://www.google.com ; done
lookup: 0.033342 total: 0.290496
lookup: 0.000825 total: 0.266588
lookup: 0.000858 total: 0.270565
lookup: 0.000861 total: 0.268383
lookup: 0.000876 total: 0.277635
lookup: 0.000808 total: 0.242906
lookup: 0.000641 total: 0.260983
lookup: 0.000769 total: 0.239489
lookup: 0.000787 total: 0.247760
lookup: 0.000792 total: 0.241116

@kconley-sq

FWIW I just ran the same experiment with lima 0.14.2 and DNS lookups are fast on VZ:

/ # for i in $(seq 1 10); do curl -s -w 'lookup: %{time_namelookup} total: %{time_total}\n' -o /dev/null https://www.google.com ; done
lookup: 0.039751 total: 0.279438
lookup: 0.016078 total: 0.251529
lookup: 0.012076 total: 0.308732
lookup: 0.012782 total: 0.233175
lookup: 0.025466 total: 0.297233
lookup: 0.014166 total: 0.260644
lookup: 0.022045 total: 0.259245
lookup: 0.014491 total: 0.265971
lookup: 0.013092 total: 0.236433
lookup: 0.014356 total: 0.252456

@terev can you confirm whether in #1333 (comment) you meant that this slow DNS resolution issue is not reproducible at all when using a vz VM via Colima after reverting your lima version to 0.14.2?

@herrernst

I've just freshly installed lima and started the experimental/vz template, and I still have slow DNS resolution. limactl version 0.17.2

lookup: 1.005215 total: 1.181721
lookup: 1.004708 total: 1.176699
lookup: 1.004669 total: 1.206167
lookup: 1.004570 total: 1.197191

Also seeing this in ha.stderr.log:

{"level":"debug","msg":"Stopping udp proxy (read udp 192.168.5.2:52877: i/o timeout)","time":"2023-08-30T11:33:02+02:00"}
{"level":"debug","msg":"Stopping udp proxy (read udp 192.168.5.2:52877: i/o timeout)","time":"2023-08-30T11:33:02+02:00"}
{"level":"debug","msg":"Stopping udp proxy (read udp 192.168.5.2:52877: i/o timeout)","time":"2023-08-30T11:33:03+02:00"}
{"level":"debug","msg":"Stopping udp proxy (read udp 192.168.5.2:52877: i/o timeout)","time":"2023-08-30T11:33:05+02:00"}
{"level":"debug","msg":"Stopping udp proxy (read udp 192.168.5.2:52877: i/o timeout)","time":"2023-08-30T11:33:06+02:00"}

@terev
Contributor Author

terev commented Aug 31, 2023

@herrernst It looks like they're working on a PR that will hopefully address this issue for good, but in the meantime you can improve DNS lookup performance by passing a dns config and disabling the host resolver when creating an instance: https://github.com/lima-vm/lima/blob/master/docs/network.md#dns-19216853
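
For anyone looking for the exact keys, the relevant lima.yaml override looks roughly like this (the resolver addresses are just examples; the linked doc is authoritative):

hostResolver:
  enabled: false
dns:
  - 1.1.1.1
  - 8.8.8.8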

@balajiv113
Member

This issue should be fixed in the latest master: 336c669.

Do give it a try and let us know if you still face similar issues.

@terev
Contributor Author

terev commented Sep 3, 2023

@balajiv113 Amazing, thank you for your work! I'll give it a try soon. Is any config required, or is this the default network/DNS setup with the VZ driver?

@balajiv113
Member

For vz no config is needed; it should automatically switch to this DNS resolution.

@herrernst

@balajiv113 Thanks, DNS and the network in general seem to be quick now, and there are no more udp timeout errors in the logs.

@terev
Contributor Author

terev commented Sep 5, 2023

@balajiv113 So far I've had some success with this change; when it works it seems awesome. However, I've experienced situations where the network works briefly after I start the instance and then seems to stop responding. When this occurs I see the following messages in the ha.stderr.log file:

{"level":"info","msg":"Forwarding \"/run/lima-guestagent.sock\" (guest) to \"/Users/trevorfoster/.lima/colima/ga.sock\" (host)","time":"2023-09-05T02:33:14-04:00"}
{"level":"debug","msg":"guest agent info: \u0026{LocalPorts:[{IP:0.0.0.0 Port:22} {IP::: Port:22}]}","time":"2023-09-05T02:33:14-04:00"}
{"level":"debug","msg":"guest agent event: {Time:2023-09-05 06:33:14.359155463 +0000 UTC LocalPortsAdded:[{IP:0.0.0.0 Port:22} {IP::: Port:22}] LocalPortsRemoved:[] Errors:[]}","time":"2023-09-05T02:33:14-04:00"}
{"level":"info","msg":"Not forwarding TCP 0.0.0.0:22","time":"2023-09-05T02:33:14-04:00"}
{"level":"info","msg":"Not forwarding TCP [::]:22","time":"2023-09-05T02:33:14-04:00"}
{"level":"error","msg":"cannot receive packets from , disconnecting: cannot read size from socket: read unixgram -\u003e: use of closed network connection","time":"2023-09-05T02:33:48-04:00"}
{"level":"error","msg":"FD connection closed with errorcannot read size from socket: read unixgram -\u003e: use of closed network connection","time":"2023-09-05T02:33:48-04:00"}
{"level":"error","msg":"write unixgram -\u003e: write: no buffer space available","time":"2023-09-05T02:33:48-04:00"}

The command I'm using to start the VM is: colima start --memory 8 --cpu 4 --mount-type virtiofs --vm-type vz --vz-rosetta
Lima version: limactl version HEAD-de1b3ee
Colima version: colima version 0.5.5

I noticed that I don't have user-v2: in my ~/.lima/_config/networks.yaml file. Does that matter?

Thank you again for your work.

@balajiv113
Member

@terev This looks more like a different issue, related to the network freeze happening in vz #1609.

Also, if possible, do share the following details:

  • What kind of program was running in the VM during the network failure
  • Are you using an M1? Share machine details like CPU and memory

Does that matter?

No it doesn't, as I believe you are using only the default network, not user-v2.

@terev
Contributor Author

terev commented Sep 5, 2023

@balajiv113 Gotcha, that does seem to track. I'm using an M1 MacBook with 16 GB of memory and 8 cores. I'm starting a large Docker Compose project which involves pulling lots of Docker images. Looking at Activity Monitor while it starts, CPU usage is quite high; it gets close to fully using all 4 allocated cores.

@balajiv113
Member

I'm starting a large Docker Compose project which involves pulling lots of Docker images

If it's shareable, please do share. I will give it a try with my M1 and M2 and see if it happens.

@terev
Contributor Author

terev commented Sep 5, 2023

@balajiv113 Unfortunately I'm unable to share the exact compose project. I was considering trying to reproduce with a number of projects from https://github.com/docker/awesome-compose, though that may run into Docker Hub rate limits.

@aaronlehmann

@balajiv113: I found that after upgrading lima from 0.17.2 to HEAD-de1b3ee, DNS wouldn't work with existing VMs. I had to delete and recreate the VM for DNS to work at all. Is that expected?

@balajiv113
Member

@aaronlehmann
Looks like the iptables rules are not updated properly for old instances. I will check if it can be fixed.

@balajiv113
Member

@AkihiroSuda
I think this can be closed, as we now use the gvisor-tap-vsock DNS resolver for the user-v2 network and the vz default network, so it will be faster now.
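
For anyone verifying on their own instance, a quick way to check that the guest is being answered by the gateway resolver rather than through the forwarded host resolver (assuming dig is available in the guest):

$ cat /etc/resolv.conf                              # should point at the gateway, e.g. 192.168.5.3
$ dig www.google.com | grep -E 'SERVER|Query time'  # query time should be a few ms, not ~1s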
