
executor: tune timeouts based on VM/OS/arch/etc #1552

Closed

dvyukov opened this issue Dec 19, 2019 · 6 comments
Comments

@dvyukov
Collaborator

dvyukov commented Dec 19, 2019

We have a number of timeouts hardcoded throughout the codebase, most of them tuned for Linux/amd64/KVM (fast native execution). However, we have a number of contexts where timings can be significantly different, e.g. running in qemu with emulation makes everything 10x slower, or other OSes may need a different per-syscall/program timeout (we can't simply increase them everywhere, as they significantly affect fuzzing performance).
What would be nice to have is some kind of flexible mechanism that would allow tuning these timeouts throughout the codebase based on OS/arch/VM/etc. Perhaps rooted in sys/targets, as we already have a number of such global parameters there; a rough sketch of what this could look like follows the list below.
List of timeouts:

  • executor: per-syscall timeout
  • executor: program finalization timeout
  • executor: program watchdog timeout
  • pkg/ipc: handshake timeout
  • pkg/ipc: program watchdog timeout
  • vm: no output/lost connection timeout
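
For illustration, here is a rough sketch of the kind of mechanism this is asking for. All names and values below are hypothetical, not the actual syzkaller API: per-target base timeouts in sys/targets plus a single slowdown multiplier for emulated or otherwise slow environments.

package targets

import "time"

// TimeoutParams holds per-target base timeouts; hypothetical type, not the
// real sys/targets code.
type TimeoutParams struct {
        Syscall  time.Duration // executor: per-syscall timeout
        Program  time.Duration // executor/pkg/ipc: program watchdog timeout
        NoOutput time.Duration // vm: no output / lost connection timeout
}

// timeoutsFor picks base values for an OS/arch pair and scales them by a
// slowdown factor (e.g. 10 for qemu with emulation).
func timeoutsFor(os, arch string, slowdown int) TimeoutParams {
        p := TimeoutParams{ // defaults, roughly matching fast native execution
                Syscall:  50 * time.Millisecond,
                Program:  5 * time.Second,
                NoOutput: 5 * time.Minute,
        }
        if arch == "arm64" { // example of a per-target override
                p.Syscall = 100 * time.Millisecond
        }
        s := time.Duration(slowdown)
        p.Syscall *= s
        p.Program *= s
        p.NoOutput *= s
        return p
}
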
@kaartine

This patch enabled me to run arm64 tests with an x86 host and arm64 qemu.

From 2cf91285966ca902408feffb346360e32ebd8899 Mon Sep 17 00:00:00 2001
From: Jukka Kaartinen <[email protected]>
Date: Fri, 20 Dec 2019 09:24:17 +0200
Subject: [PATCH 1/1] Extra time for running tests in qemu env without kvm

Signed-off-by: Jukka Kaartinen <[email protected]>
---
 pkg/ipc/ipc.go | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/pkg/ipc/ipc.go b/pkg/ipc/ipc.go
index a238d2ba..67855be8 100644
--- a/pkg/ipc/ipc.go
+++ b/pkg/ipc/ipc.go
@@ -669,7 +669,7 @@ func (c *command) handshake() error {
                read <- nil
        }()
        // Sandbox setup can take significant time.
-       timeout := time.NewTimer(time.Minute)
+       timeout := time.NewTimer(4*time.Minute)
        select {
        case err := <-read:
                timeout.Stop()
@@ -800,8 +800,8 @@ func (c *command) exec(opts *ExecOpts, progData []byte) (output []byte, hanged b
 
 func sanitizeTimeout(config *Config) time.Duration {
        const (
-               executorTimeout = 5 * time.Second
-               minTimeout      = executorTimeout + 2*time.Second
+               executorTimeout = 120 * time.Second
+               minTimeout      = executorTimeout + 30*time.Second
        )
        timeout := config.Timeout
        if timeout == 0 {
-- 
2.17.1
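
As a variant of the same idea, instead of hard-coding larger constants the existing values could be multiplied by a configurable slowdown factor, which is roughly the direction the issue eventually goes. The snippet below is only an illustration with made-up names, not the real pkg/ipc code.

package ipc // illustrative sketch only, not the actual pkg/ipc source

import "time"

// slowdown would come from the manager config or the command line rather
// than being hard-coded; 4 roughly matches the factor used in the patch above.
const slowdown = 4

var (
        handshakeTimeout = slowdown * time.Minute     // was 1 minute
        executorTimeout  = slowdown * 5 * time.Second // was 5 seconds
        minTimeout       = executorTimeout + 2*time.Second
)
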

@kaartine

Also, I think it is worth mentioning that, at least in our setup, we had an annoying issue with very slow sshd start-up times with arm64 qemu running on x86.
I followed the instructions from here.

It turned out that the reason was low entropy, and it was easily "fixed" by installing haveged. It can be found in buildroot via make menuconfig:

    Target packages
        Miscellaneous
            [*] haveged

@xairy
Collaborator

xairy commented Dec 20, 2019

It turned out that the reason was low entropy, and it was easily "fixed" by installing haveged. It can be found in buildroot via make menuconfig

I've added this note into the instructions. Thanks!

@dvyukov
Collaborator Author

dvyukov commented Jul 6, 2020

FTR, also "executor not serving" and "no output" errors for arm64 emulation are mentioned here:
https://groups.google.com/forum/#!topic/syzkaller/x1d7j-Z-kHo

@dvyukov
Collaborator Author

dvyukov commented Dec 14, 2020

FTR: one other example of a slower configuration is gvisor+race+cover.

The slowdown for Go race+cover seems to be insanely huge. For all other combinations of coverage/race the mmap syscall takes ~1-2ms, but for race+coverage it takes ~350ms. I guess this is because the code is sprinkled with atomic increments, which become super expensive under the race detector.
I think the Go compiler should not race-instrument the coverage instrumentation (in both atomic and non-atomic modes), and go tool cover should use non-atomic coverage under the race detector. Otherwise we get an extremely high slowdown, little benefit from race-checking all coverage operations, and a very high false-negative rate from the race detector (executions of the same code synchronize with each other). But this will require changing the Go compiler, go tool and bazel...
cc @dean-deng
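
To make the atomic vs non-atomic difference concrete, here is a minimal standalone Go sketch (not gVisor or syzkaller code) of the two counter styles that coverage instrumentation can emit; when built with -race, the atomic variant is additionally intercepted by the race runtime on every covered block, which is where the large slowdown comes from.

package main

import "sync/atomic"

var counters [2]uint32 // stand-ins for compiler-emitted coverage counters

// covermode=count style: plain, unsynchronized increment.
func coveredNonAtomic() {
        counters[0]++
}

// covermode=atomic style: atomic add; under -race every call also goes
// through the race runtime.
func coveredAtomic() {
        atomic.AddUint32(&counters[1], 1)
}

func main() {
        // Build and run with and without -race to compare the relative cost.
        for i := 0; i < 1000000; i++ {
                coveredNonAtomic()
                coveredAtomic()
        }
}
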

@dvyukov
Collaborator Author

dvyukov commented Jan 15, 2021

I think this can be considered fixed:

// Timeouts structure parametrizes timeouts throughout the system.
// It allows supporting different operating systems, architectures and execution environments
// (emulation, models, etc) without scattering and duplicating knowledge about their execution
// performance everywhere.
// Timeouts calculation consists of 2 parts: base values and scaling.
// Base timeout values consist of a single syscall timeout, program timeout and "no output" timeout
// and are specified by the target (OS/arch), or defaults are used.
// Scaling part is calculated from the execution environment in pkg/mgrconfig based on VM type,
// kernel build type, emulation, etc. Scaling is specifically converged to a single number so that
// it can be specified/overridden for command line tools (e.g. syz-execprog -slowdown=10).
type Timeouts struct {
        // Base scaling factor, used only for a single syscall timeout.
        Slowdown int
        // Capped scaling factor used for timeouts other than syscall timeout.
        // It's already applied to all values in this struct, but can be used for one-off timeout values
        // in the system. This should also be applied to syscall/program timeout attributes in syscall descriptions.
        // Derived from Slowdown and should not be greater than Slowdown.
        // The idea behind capping is that slowdown can be large (10-20) and most timeouts already
        // include some safety margin. If we just multiply them we will get too large timeouts,
        // e.g. program timeout can become 5s*20 = 100s, or "no output" timeout: 5m*20 = 100m.
        Scale time.Duration
        // Timeout for a single syscall, after this time the syscall is considered "blocked".
        Syscall time.Duration
        // Timeout for a single program execution.
        Program time.Duration
        // Timeout for "no output" detection.
        NoOutput time.Duration
        // Limit on a single VM running time, after this time a VM is restarted.
        VMRunningTime time.Duration
        // How long we should test to get "no output" error (derivative of NoOutput, here to avoid duplication).
        NoOutputRunningTime time.Duration
}
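
For context, a rough sketch of how the two factors can relate; the helper name makeTimeouts, the cap and the base values below are illustrative, not the exact numbers used by syzkaller (per the comment above, the real scaling calculation lives in pkg/mgrconfig).

// makeTimeouts derives scaled timeouts from a slowdown factor; hypothetical
// helper complementing the Timeouts struct quoted above.
func makeTimeouts(slowdown int) Timeouts {
        scale := slowdown
        if scale > 3 {
                scale = 3 // cap so that already-generous timeouts don't explode
        }
        return Timeouts{
                Slowdown:            slowdown,
                Scale:               time.Duration(scale),
                Syscall:             time.Duration(slowdown) * 50 * time.Millisecond,
                Program:             time.Duration(scale) * 5 * time.Second,
                NoOutput:            time.Duration(scale) * 5 * time.Minute,
                VMRunningTime:       time.Duration(scale) * time.Hour,
                NoOutputRunningTime: time.Duration(scale)*5*time.Minute + time.Minute,
        }
}
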

dvyukov closed this as completed Jan 15, 2021