Snapshot load latency is too high. #2027
This is what we measured as well (server-grade Haswell E5 host, 256GB RAM; guest: 1 vCPU, 512MB RAM). However, the latency of repeated snapshot loads goes down to ~7-10ms thanks to the (host OS) page cache.
@ustiugov Thanks for mentioning this. How did you measure the load latency? Did you use the FC metrics? The numbers still look higher than what we expected, and some profiling would reveal where we actually spend so many cycles.
@sandreim I made end-to-end measurements with a simple driver Go program that issued HTTP requests via the Firecracker vsock. We plan to break this delay down to a finer granularity later. I know that some snapshotting metrics have been merged, but I haven't used them yet. In my understanding, these metrics are too coarse, so I doubt they would be enough to define the room for improvement. We would need to time each device/component restoration for that.
The current metrics offer a minimal level of granularity for latency measurements.
For debugging and perf regression testing I think it makes sense to have some fine-grained metrics: per vCPU, device type, and state deserialization. I will discuss it with the team and see what options are available.
Let's keep this open since we're still not exactly at the right numbers. To set expectations, we will prioritize #1997 first, and then get back to this. Any user testing data will help. @ustiugov it would be great if you could post the numbers after #2037, mentioning what microVM size (vCPU count, memory size, net/block/vsock device counts) you're loading and what system you're running on. It looks like there's a lot of variance here. @sandreim's dev machine:
But on m5d.metal, just the KVM VM creation time is 20+ ms, and looking into the hotpath segments of that, it's dominated by KVM.
My dev machine has Linux 5.3. I think it is worth trying to see how these numbers look on newer kernels.
@raduweiss @sandreim thank you! Host: Intel Xeon E5-2680 v3, 256GB RAM, SATA3 SSD; Ubuntu 18, kernel v4.15. When measuring, I fix the CPU frequency and use some other kernel boot args to stabilize the measurements (you may want to do the same to reduce variation). Here are the results for the end-to-end load HTTP request to Firecracker, in milliseconds:
It seems that the first couple of requests experience some warm-up delay, but after that the latency is pretty stable. If I don't flush the page cache on the host:
There are big differences depending on the host kernel version. For example, on an AMD Ryzen 1700X host, for a VM with 2 vCPUs and 256MB of memory, I'm getting a 30ms restore time on
Closing in favor of #2129 |
I have observed high load times while doing some testing on an i7-8700 CPU @ 3.20GHz host. MicroVM size: 2 vCPU, 512MB RAM. We are targeting < 5ms load times.