Snapshot load latency is too high. #2027

Closed
sandreim opened this issue Jul 16, 2020 · 9 comments · Fixed by #2037
Labels
Priority: High

Comments

@sandreim
Contributor

sandreim commented Jul 16, 2020

I have observed high load times while doing some testing on an i7-8700 CPU @ 3.20GHz host.
MicroVM size: 2 vCPUs, 512 MB RAM.

13:32:48 snapshot_load: INFO Latency 1/5: 13.352 ms
13:32:49 snapshot_load: INFO Latency 2/5: 29.512 ms
13:32:50 snapshot_load: INFO Latency 3/5: 29.562 ms
13:32:51 snapshot_load: INFO Latency 4/5: 31.069 ms
13:32:51 snapshot_load: INFO Latency 5/5: 33.036 ms

We are targeting < 5ms load times.
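
For reference, a minimal sketch of one way to time a snapshot load end to end: send PUT /snapshot/load on the Firecracker API socket and measure the round trip. The socket path, snapshot/memory file paths, and the exact JSON field names below are illustrative assumptions, not the harness that produced the numbers above.

    // Time a PUT /snapshot/load request against a running Firecracker process.
    // Paths and JSON fields are placeholders; check the API spec of the
    // Firecracker version in use for the exact request body.
    use std::io::{Read, Write};
    use std::os::unix::net::UnixStream;
    use std::time::Instant;

    fn main() -> std::io::Result<()> {
        let body = r#"{"snapshot_path":"/tmp/vmstate","mem_file_path":"/tmp/mem"}"#;
        let request = format!(
            "PUT /snapshot/load HTTP/1.1\r\nContent-Type: application/json\r\nContent-Length: {}\r\n\r\n{}",
            body.len(),
            body
        );

        let mut stream = UnixStream::connect("/tmp/firecracker.socket")?;
        let start = Instant::now();
        stream.write_all(request.as_bytes())?;

        // Read the response head; a 204 means the snapshot was loaded.
        let mut response = [0u8; 1024];
        let n = stream.read(&mut response)?;
        let elapsed = start.elapsed();

        println!("{}", String::from_utf8_lossy(&response[..n]));
        println!("end-to-end load latency: {:.3} ms", elapsed.as_secs_f64() * 1e3);
        Ok(())
    }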

sandreim added the Performance: Misc and Priority: High labels on Jul 16, 2020
@ustiugov
Contributor

This is what we measured as well (server-grade Haswell E5 host, 256 GB RAM; guest: 1 vCPU, 512 MB RAM). However, the latency of repeated snapshot loads goes down to ~7-10 ms thanks to the (host OS) page cache.
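
As a methodology note, here is a minimal sketch (assuming root on the host) of the page-cache flush that separates cold-cache numbers from the warm ~7-10 ms ones mentioned above:

    // Flush the host page cache so the next snapshot load reads the memory
    // file from disk rather than from cache. Requires root.
    use std::fs;
    use std::process::Command;

    fn drop_page_cache() -> std::io::Result<()> {
        // Flush dirty pages first so the drop is effective.
        Command::new("sync").status()?;
        // Writing "3" drops the page cache plus dentries and inodes.
        fs::write("/proc/sys/vm/drop_caches", "3")
    }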

@sandreim
Contributor Author

sandreim commented Jul 16, 2020

@ustiugov Thanks for mentioning this. How did you measure the load latency? Did you use the FC metrics?

The numbers still look higher than what we expected; some profiling should reveal where we actually spend so many cycles.

@ustiugov
Contributor

@sandreim I made end-to-end measurements with a simple driver Go program that issued HTTP requests via the Firecracker vsock. We plan to break this delay down to a finer granularity later.

I know that some snapshotting metrics have been merged, but I haven't used them yet. In my understanding, these metrics are too coarse, so I doubt they would be enough to define the room for improvement. We would need to time each device/component restoration for that.

@sandreim
Contributor Author

The current metrics offer a minimal level of granularity for latency measurements:

  • API level - measuring the time spent from receiving the HTTP request to sending the HTTP reply
  • VMM level - measuring the time spent to load the snapshot file, restore all vCPUs (paused) and all devices

For debugging and perf regression testing I think it makes sense to have some fine-grained metrics: per vCPU / device type / state deserialization. I will discuss it with the team and see what options are available.
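
To illustrate the kind of fine-grained restore metrics described above, here is a minimal sketch of a scoped timer wrapping each vCPU/device restore phase. This is illustrative only, not Firecracker's actual metrics code; the phase names and the eprintln sink are placeholders.

    use std::time::Instant;

    // Records the elapsed time for one restore phase when it goes out of scope.
    struct RestoreTimer<'a> {
        phase: &'a str,
        start: Instant,
    }

    impl<'a> RestoreTimer<'a> {
        fn new(phase: &'a str) -> Self {
            Self { phase, start: Instant::now() }
        }
    }

    impl Drop for RestoreTimer<'_> {
        fn drop(&mut self) {
            // A real implementation would feed a metrics sink instead of stderr.
            eprintln!("{} restore took {} us", self.phase, self.start.elapsed().as_micros());
        }
    }

    fn main() {
        {
            let _t = RestoreTimer::new("vcpu-0");
            // ... restore vCPU state here ...
        }
        {
            let _t = RestoreTimer::new("virtio-block");
            // ... restore block device state here ...
        }
    }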

@raduweiss
Contributor

Let's keep this open since we're still not exactly at the right numbers. To set expectations, we will prioritize #1997 first, and then get back to this. Any user testing data will help.

@ustiugov it would be great if you could post the numbers after #2037, mentioning what microVM size (vCPU count, memory size, net/block/vsock device counts) you're loading and what system you're running on. It looks like there's a lot of variance here.

@sandreim's dev machine:

snapshot_load: INFO Latency 1/50: 2.971 ms
snapshot_load: INFO Latency 2/50: 3.5 ms
snapshot_load: INFO Latency 3/50: 3.433 ms
snapshot_load: INFO Latency 4/50: 3.268 ms
[...]
snapshot_load: INFO Latency 15/50: 9.436 ms
[...]

But on m5d.metal, just the KVM VM creation takes 20+ ms, and looking into the hotpath segments of that, it's dominated by KVM:

[Resume hotpath] Kvm vm creation time: 20791 us

raduweiss reopened this on Jul 26, 2020
@sandreim
Contributor Author

My dev machine runs Linux 5.3. I think it is worth checking how these numbers look on newer kernels.

@ustiugov
Contributor

@raduweiss @sandreim thank you!

Host: Intel Xeon E5-2680 v3, 256 GB RAM, SATA3 SSD; Ubuntu 18, kernel 4.15
Guest: 1 vCPU, 512 MB RAM, 1 disk, 1 net.
The Docker image I used to build the rootfs is ustiugov/helloworld:var_workload (it has a gRPC server and Python 3 on top of Alpine).

When measuring, I fix the CPU frequency and use some other kernel boot args to stabilize the measurements (you may want to do the same to reduce variation).

Here are the results for the end-to-end load HTTP request to Firecracker, in milliseconds:
With the host OS page cache forcefully flushed before each measurement:

33.9
22.2
13.8
13.7
13.7
14.0
13.6

It seems that the first couple of requests experience some warm-up delay, but after that the latency is pretty stable.

If I don't flush the page cache on the host:

4.9
4.3
4.1
4.0
4.9
4.1
4.0

@serban300
Contributor

There are big differences depending on the host kernel version. For example, on an AMD Ryzen 1700X host, for a VM with 2 vCPUs and 256 MB of memory, I'm getting a 30 ms restore time on host kernel 5.4 vs 4 ms on host kernel 4.19.

@serban300
Contributor

Closing in favor of #2129
