Snapshot load latency is too high. #2027

Closed
sandreim opened this issue Jul 16, 2020 · 9 comments · Fixed by #2037
Labels
Priority: High

Comments

@sandreim
Contributor

sandreim commented Jul 16, 2020

I have observed high load times while doing some testing on an i7-8700 CPU @ 3.20GHz host.
MicroVM size: 2 vCPUs, 512 MB RAM.

13:32:48 snapshot_load: INFO Latency 1/5: 13.352 ms
13:32:49 snapshot_load: INFO Latency 2/5: 29.512 ms
13:32:50 snapshot_load: INFO Latency 3/5: 29.562 ms
13:32:51 snapshot_load: INFO Latency 4/5: 31.069 ms
13:32:51 snapshot_load: INFO Latency 5/5: 33.036 ms

We are targeting < 5ms load times.
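
For reference, a minimal sketch of one way to time a snapshot load end to end: send PUT /snapshot/load on the Firecracker API socket and measure the round trip. The socket path, snapshot/memory file paths, and the exact JSON field names below are illustrative assumptions, not the harness that produced the numbers above.

    // Time a PUT /snapshot/load request against a running Firecracker process.
    // Paths and JSON fields are placeholders; check the API spec of the
    // Firecracker version in use for the exact request body.
    use std::io::{Read, Write};
    use std::os::unix::net::UnixStream;
    use std::time::Instant;

    fn main() -> std::io::Result<()> {
        let body = r#"{"snapshot_path":"/tmp/vmstate","mem_file_path":"/tmp/mem"}"#;
        let request = format!(
            "PUT /snapshot/load HTTP/1.1\r\nContent-Type: application/json\r\nContent-Length: {}\r\n\r\n{}",
            body.len(),
            body
        );

        let mut stream = UnixStream::connect("/tmp/firecracker.socket")?;
        let start = Instant::now();
        stream.write_all(request.as_bytes())?;

        // Read the response head; a 204 means the snapshot was loaded.
        let mut response = [0u8; 1024];
        let n = stream.read(&mut response)?;
        let elapsed = start.elapsed();

        println!("{}", String::from_utf8_lossy(&response[..n]));
        println!("end-to-end load latency: {:.3} ms", elapsed.as_secs_f64() * 1e3);
        Ok(())
    }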

sandreim added the Performance: Misc and Priority: High labels on Jul 16, 2020
@ustiugov
Contributor

This is what we measured as well (server-grade Haswell E5 host, 256 GB RAM; guest: 1 vCPU, 512 MB RAM). However, the latency of repeated snapshot loads goes down to ~7-10 ms thanks to the (host OS) page cache.
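
As a methodology note, here is a minimal sketch (assuming root on the host) of the page-cache flush that separates cold-cache numbers from the warm ~7-10 ms ones mentioned above:

    // Flush the host page cache so the next snapshot load reads the memory
    // file from disk rather than from cache. Requires root.
    use std::fs;
    use std::process::Command;

    fn drop_page_cache() -> std::io::Result<()> {
        // Flush dirty pages first so the drop is effective.
        Command::new("sync").status()?;
        // Writing "3" drops the page cache plus dentries and inodes.
        fs::write("/proc/sys/vm/drop_caches", "3")
    }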

@sandreim
Contributor Author

sandreim commented Jul 16, 2020

@ustiugov Thanks for mentioning this. How did you measure the load latency? Did you use the FC metrics?

The numbers still look higher than what we expected; some profiling should reveal where we actually spend so many cycles.

@ustiugov
Contributor

@sandreim I made end-to-end measurements with a simple driver Go program that issued HTTP requests via the Firecracker vsock. We plan to break this delay down to a finer granularity later.

I know that some snapshotting metrics have been merged, but I haven't used them yet. In my understanding, these metrics are too coarse, so I doubt they would be enough to define the room for improvement. We would need to time each device/component restoration for that.

@sandreim
Contributor Author

The current metrics offer a minimal level of granularity for latency measurements:

  • API level - measuring the time spent from receiving the HTTP request to sending the HTTP reply
  • VMM level - measuring the time spent to load the snapshot file, restore all vCPUs (paused) and all devices

For debugging and perf regression testing I think it makes sense to have some fine-grained metrics: per vCPU / device type / state deserialization. I will discuss it with the team and see what options are available.
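
To illustrate the kind of fine-grained restore metrics described above, here is a minimal sketch of a scoped timer wrapping each vCPU/device restore phase. This is illustrative only, not Firecracker's actual metrics code; the phase names and the eprintln sink are placeholders.

    use std::time::Instant;

    // Records the elapsed time for one restore phase when it goes out of scope.
    struct RestoreTimer<'a> {
        phase: &'a str,
        start: Instant,
    }

    impl<'a> RestoreTimer<'a> {
        fn new(phase: &'a str) -> Self {
            Self { phase, start: Instant::now() }
        }
    }

    impl Drop for RestoreTimer<'_> {
        fn drop(&mut self) {
            // A real implementation would feed a metrics sink instead of stderr.
            eprintln!("{} restore took {} us", self.phase, self.start.elapsed().as_micros());
        }
    }

    fn main() {
        {
            let _t = RestoreTimer::new("vcpu-0");
            // ... restore vCPU state here ...
        }
        {
            let _t = RestoreTimer::new("virtio-block");
            // ... restore block device state here ...
        }
    }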

@raduweiss
Contributor

Let's keep this open since we're still not exactly at the right numbers. To set expectations, we will prioritize #1997 first, and then get back to this. Any user testing data will help.

@ustiugov it would be great if you could post the numbers after #2037, mentioning what microVM size (vCPU count, memory size, net/block/vsock device counts) you're loading and what system you're running on. It looks like there's a lot of variance here.

@sandreim's dev machine:

snapshot_load: INFO Latency 1/50: 2.971 ms
snapshot_load: INFO Latency 2/50: 3.5 ms
snapshot_load: INFO Latency 3/50: 3.433 ms
snapshot_load: INFO Latency 4/50: 3.268 ms
[...]
snapshot_load: INFO Latency 15/50: 9.436 ms
[...]

But on m5d.metal, just the KVM VM creation takes 20+ ms, and looking into the hotpath segments of that, it's dominated by KVM:

[Resume hotpath] Kvm vm creation time: 20791 us

raduweiss reopened this on Jul 26, 2020
@sandreim
Contributor Author

My dev machine runs Linux 5.3. I think it is worth checking how these numbers look on newer kernels.

@ustiugov
Contributor

@raduweiss @sandreim thank you!

Host: Intel Xeon E5-2680 v3, 256 GB RAM, SATA3 SSD; Ubuntu 18, kernel 4.15
Guest: 1 vCPU, 512 MB RAM, 1 disk, 1 net.
The Docker image I used to build the rootfs is ustiugov/helloworld:var_workload (it has a gRPC server and Python 3 on top of Alpine).

When measuring, I fix the CPU frequency and use some other kernel boot args to stabilize the measurements (you may want to do the same to reduce variation).

Here are the results for the end-to-end load HTTP request to Firecracker, in milliseconds:
With the host OS page cache forcefully flushed before each measurement:

33.9
22.2
13.8
13.7
13.7
14.0
13.6

It seems that the first couple of requests experience some warm-up delay, but after that the latency is pretty stable.

If I don't flush the page cache on the host:

4.9
4.3
4.1
4.0
4.9
4.1
4.0

@serban300
Contributor

There are big differences depending on the host kernel version. For example, on an AMD Ryzen 1700X host, for a VM with 2 vCPUs and 256 MB of memory, I'm getting a 30 ms restore time on host kernel 5.4 vs 4 ms on host kernel 4.19.

@serban300
Contributor

Closing in favor of #2129
