Reduce the overhead of shutting down the Firecracker process #1164

Closed
wkozaczuk opened this issue Jul 9, 2019 · 4 comments
@wkozaczuk

Based on some casual measurements, it takes 40-60 ms from the moment exit() is called to the moment the Linux host considers Firecracker dead. For more details, please see this and that comment from a somewhat related issue.
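For reference, here is a minimal sketch (not from this issue) of one way to approximate that gap: fork a child that calls _exit() immediately and time how long the parent spends before waitpid() returns. With nothing to tear down the gap is tiny; for a process that has set up KVM state or namespaces it grows. This assumes a Rust binary with the libc crate as a dependency.

// Hedged sketch, not Firecracker code: approximate the time between a child
// calling _exit() and the parent being able to reap it.
use std::time::Instant;

fn main() {
    unsafe {
        let pid = libc::fork();
        assert!(pid >= 0, "fork failed");
        if pid == 0 {
            // Child: skip all user-space cleanup and exit right away, so the
            // measured time is dominated by kernel-side teardown.
            libc::_exit(0);
        }
        // Parent: time how long it takes until the child can be reaped.
        let start = Instant::now();
        let mut status: libc::c_int = 0;
        assert_eq!(libc::waitpid(pid, &mut status, 0), pid, "waitpid failed");
        println!("child reaped after {:?}", start.elapsed());
    }
}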

@andreeaflorescu added the Type: Bug label Jul 15, 2019
@alexandruag added the Priority: Low label Jul 15, 2019
@renzhengeek

Hi @wkozaczuk

First, I'm sorry for asking an out-of-scope question here, but your post comparing boot times between the QEMU microvm and Firecracker really interested me and guided me here.

I'm also interested in evaluating their boot/startup times in a way that is fair to both. What I've come up with so far:

  • use the same rootfs image
  • use the same kernel
  • build a minimally configured QEMU, and a default build of Firecracker

However, the specific problems below come up, say, if I want to measure the time to run a hello-world binary:

  • I saw in your post that the test runs with OSv. Could you please share a test version without OSv, so that people without OSv knowledge can easily reproduce the measurement? :-)
  • no matter whether I boot with QEMU or Firecracker, I always get an interactive login prompt, and only after logging in can I run the binary; is there a way to run the test in batch mode so that we can easily get the overall time?

I see you gave a very detailed time breakdown elsewhere, which is very good. I would much appreciate it if you could share some of your experience. :-)

@serban300
Contributor

I've been experimenting a bit with this issue, adding unsafe {libc::_exit(0)}; at different places during the VM startup process in order to compare the shutdown overhead.

Adding unsafe {libc::_exit(0)}; before setup_interrupt_controller(&mut vm)?; leads to almost no overhead:

17:41:43.955618 exit_group(0)           = ?
17:41:43.958232 +++ exited with 0 +++

Adding unsafe {libc::_exit(0)}; after setup_interrupt_controller(&mut vm)?; leads to an overhead of about 20ms:

17:48:01.881943 exit_group(0)           = ?
17:48:01.900969 +++ exited with 0 +++

Next, injecting unsafe {libc::_exit(0)}; after

create_pio_dev_manager_with_legacy_devices(&vm, serial_device, reset_evt)
            .map_err(Internal)?

leads to a 30ms overhead:

17:58:44.942634 exit_group(0)           = ?
17:58:44.976892 +++ exited with 0 +++

I guess it has something to do with the cost of destroying these objects. But we would have to dive deeper.
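For illustration, the probe used above could be wrapped in something like the hypothetical helper below (exit_probe and the EXIT_AFTER environment variable are made up for this sketch, not part of Firecracker), so the exit point can be selected at run time once calls are sprinkled at the candidate steps:

// Hypothetical helper, not part of the Firecracker codebase: abort the process
// right after a chosen setup step, so the remaining time reported by strace
// can be attributed to tearing down whatever was created up to that point.
fn exit_probe(step: &str) {
    // e.g. EXIT_AFTER=setup_interrupt_controller ./firecracker ...
    if std::env::var("EXIT_AFTER").map(|v| v == step).unwrap_or(false) {
        // _exit() skips Rust destructors; the latency that remains is the
        // kernel cleaning up KVM state, memory mappings, namespaces, etc.
        unsafe { libc::_exit(0) };
    }
}

Calling exit_probe("setup_interrupt_controller") right after that step, for example, should reproduce the second measurement above without adding a new call site for every experiment.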

@serban300
Contributor

Hi @wkozaczuk,

I did a bit more research on this.

As far as I understand, the process indeed counts as fully destroyed only when the parent fetches the exit status via wait4 here. But this seems correct, since from what I understand wait4 returns only after kvm_destroy_vm() has finished executing, and that function is what cleans up the old VM.

As for the added latency, I could track a big chunk of it (about 20 ms) down to the following call chain:

kvm_destroy_vm()
-> kvm_arch_sync_events(kvm)
-> kvm_free_pit(kvm)
-> kvm_pit_set_reinject(pit, false)
-> kvm_unregister_irq_mask_notifier(kvm, 0, &pit->mask_notifier)
-> synchronize_srcu(&kvm->irq_srcu)

As far as I know, this is the expected RCU behavior and synchronize_srcu() ends up waiting for a full grace period. We stumble upon a similar issue here.

I will close this issue since it doesn't seem related to Firecracker. Feel free to reopen it if needed.

@serban300 serban300 self-assigned this Jan 11, 2021
@Nugine commented Mar 15, 2021

I ran into the same problem elsewhere. Here is what I have found. Maybe it will help you.

The program below spawns a child process, optionally in a new IPC namespace; the child then replaces itself with /bin/ls.

argv[1] controls whether to create a new IPC namespace.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define err_exit(msg)                                                          \
    do {                                                                       \
        perror(msg);                                                           \
        exit(EXIT_FAILURE);                                                    \
    } while (0)

/* Child entry point: replace the child with /bin/ls. */
static int child_fn(void *arg) {
    (void)arg; /* unused */
    char *const argv[] = {"/bin/ls", NULL};
    char *const env[] = {NULL};
    execve(argv[0], argv, env);
    err_exit("execve"); /* only reached if execve fails */
}

#define STACK_SIZE (1024 * 1024)

int main(int argc, char *argv[]) {
    /* Pass "NEWIPC" as argv[1] to put the child in a new IPC namespace. */
    int new_ipc = argc == 2 && strcmp(argv[1], "NEWIPC") == 0;

    char *stack;
    char *stack_top;
    pid_t pid;

    stack = malloc(STACK_SIZE);
    if (stack == NULL)
        err_exit("malloc");
    stack_top = stack + STACK_SIZE; /* clone() expects the top of the stack */

    int flags = SIGCHLD;

    if (new_ipc) {
        flags |= CLONE_NEWIPC;
    }

    pid = clone(child_fn, stack_top, flags, argv[1]);
    if (pid == -1)
        err_exit("clone");
    printf("clone() returned %ld\n", (long)pid);

    /* The time spent here includes the kernel tearing the child down. */
    if (waitpid(pid, NULL, 0) == -1)
        err_exit("waitpid");
    printf("child has terminated\n");

    return 0;
}

root@Nugine-Ub2:~/tests/ipc_ns_delay# ls
ipc.c
root@Nugine-Ub2:~/tests/ipc_ns_delay# gcc ipc.c

Execute ./a.out with strace.

root@Nugine-Ub2:~/tests/ipc_ns_delay# strace -tt -f -e wait4,execve,exit_group,exit,clone ./a.out
19:32:58.555131 execve("./a.out", ["./a.out"], 0x7ffdf36ef710 /* 42 vars */) = 0
19:32:58.558474 clone(child_stack=0x7f4099218000, flags=SIGCHLD) = 30483
strace: Process 30483 attached
clone() returned 30483
[pid 30483] 19:32:58.558900 execve("/bin/ls", ["/bin/ls"], 0x7f4099217fd8 /* 0 vars */ <unfinished ...>
[pid 30482] 19:32:58.559053 wait4(30483,  <unfinished ...>
[pid 30483] 19:32:58.559418 <... execve resumed> ) = 0
a.out  ipc.c
[pid 30483] 19:32:58.567093 exit_group(0) = ?
[pid 30483] 19:32:58.567568 +++ exited with 0 +++
19:32:58.567625 <... wait4 resumed> NULL, 0, NULL) = 30483
19:32:58.567693 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=30483, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
child has terminated
19:32:58.567931 exit_group(0)           = ?
19:32:58.568292 +++ exited with 0 +++

[pid 30483] The duration between exit_group and exited is 567568-567093=475 us

Execute ./a.out NEWIPC with strace.

root@Nugine-Ub2:~/tests/ipc_ns_delay# strace -tt -f -e wait4,execve,exit_group,exit,clone ./a.out NEWIPC
19:33:01.580376 execve("./a.out", ["./a.out", "NEWIPC"], 0x7ffcba451678 /* 42 vars */) = 0
19:33:01.583591 clone(strace: Process 30487 attached
child_stack=0x7f230c3d6000, flags=CLONE_NEWIPC|SIGCHLD) = 30487
[pid 30487] 19:33:01.584165 execve("/bin/ls", ["/bin/ls"], 0x7f230c3d5fd8 /* 0 vars */clone() returned 30487
 <unfinished ...>
[pid 30486] 19:33:01.584685 wait4(30487,  <unfinished ...>
[pid 30487] 19:33:01.584801 <... execve resumed> ) = 0
a.out  ipc.c
[pid 30487] 19:33:01.592219 exit_group(0) = ?
[pid 30487] 19:33:01.617658 +++ exited with 0 +++
19:33:01.617759 <... wait4 resumed> NULL, 0, NULL) = 30487
19:33:01.617898 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=30487, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
child has terminated
19:33:01.618275 exit_group(0)           = ?
19:33:01.618712 +++ exited with 0 +++
root@Nugine-Ub2:~/tests/ipc_ns_delay# 

[pid 30487] The duration between exit_group and exited is 617658-592219=25439 us

It seems that the IPC namespace is what adds the shutdown overhead (about 25 ms).

Is it a kernel bug?
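For what it's worth, the clone() path should not be essential here: under the assumption that the cost lies in tearing down the IPC namespace when its last member exits, a child that merely calls unshare(CLONE_NEWIPC) and then exits should show a similar gap. A hedged Rust sketch (libc crate, run as root; not from this thread):

// Hedged sketch: time the reap of a child that unshares its IPC namespace and
// exits immediately. If the ~25 ms comes from IPC namespace teardown at
// process exit, the reported duration should show a similar gap.
use std::time::Instant;

fn main() {
    unsafe {
        let pid = libc::fork();
        assert!(pid >= 0, "fork failed");
        if pid == 0 {
            // Child: become the only member of a fresh IPC namespace
            // (needs CAP_SYS_ADMIN), then exit, forcing the kernel to
            // destroy the namespace while the parent waits.
            if libc::unshare(libc::CLONE_NEWIPC) != 0 {
                libc::_exit(1);
            }
            libc::_exit(0);
        }
        let start = Instant::now();
        let mut status: libc::c_int = 0;
        libc::waitpid(pid, &mut status, 0);
        println!("child reaped after {:?}", start.elapsed());
    }
}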
