
executor: tune timeouts based on VM/OS/arch/etc #1552

Closed

dvyukov opened this issue Dec 19, 2019 · 6 comments
Comments

@dvyukov
Collaborator

dvyukov commented Dec 19, 2019

We have a number of timeouts hardcoded throughout the codebase, most of them tuned for Linux/amd64/KVM (fast native execution). However, we have a number of contexts where timings can be significantly different, e.g. running in qemu with emulation makes everything 10x slower, or other OSes may need a different per-syscall/program timeout (we can't simply increase them everywhere, as they significantly affect fuzzing performance).
What would be nice to have is some kind of flexible mechanism that would allow tuning these timeouts throughout the codebase based on OS/arch/VM/etc. Perhaps rooted in sys/targets, as we already have a number of such global parameters there; a rough sketch of what this could look like follows the list below.
List of timeouts:

  • executor: per-syscall timeout
  • executor: program finalization timeout
  • executor: program watchdog timeout
  • pkg/ipc: handshake timeout
  • pkg/ipc: program watchdog timeout
  • vm: no output/lost connection timeout
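
For illustration, here is a rough sketch of the kind of mechanism this is asking for. All names and values below are hypothetical, not the actual syzkaller API: per-target base timeouts in sys/targets plus a single slowdown multiplier for emulated or otherwise slow environments.

package targets

import "time"

// TimeoutParams holds per-target base timeouts; hypothetical type, not the
// real sys/targets code.
type TimeoutParams struct {
        Syscall  time.Duration // executor: per-syscall timeout
        Program  time.Duration // executor/pkg/ipc: program watchdog timeout
        NoOutput time.Duration // vm: no output / lost connection timeout
}

// timeoutsFor picks base values for an OS/arch pair and scales them by a
// slowdown factor (e.g. 10 for qemu with emulation).
func timeoutsFor(os, arch string, slowdown int) TimeoutParams {
        p := TimeoutParams{ // defaults, roughly matching fast native execution
                Syscall:  50 * time.Millisecond,
                Program:  5 * time.Second,
                NoOutput: 5 * time.Minute,
        }
        if arch == "arm64" { // example of a per-target override
                p.Syscall = 100 * time.Millisecond
        }
        s := time.Duration(slowdown)
        p.Syscall *= s
        p.Program *= s
        p.NoOutput *= s
        return p
}
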
@kaartine

This patch enabled me to run arm64 tests with an x86 host and arm64 qemu.

From 2cf91285966ca902408feffb346360e32ebd8899 Mon Sep 17 00:00:00 2001
From: Jukka Kaartinen <[email protected]>
Date: Fri, 20 Dec 2019 09:24:17 +0200
Subject: [PATCH 1/1] Extra time for running tests in qemu env without kvm

Signed-off-by: Jukka Kaartinen <[email protected]>
---
 pkg/ipc/ipc.go | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/pkg/ipc/ipc.go b/pkg/ipc/ipc.go
index a238d2ba..67855be8 100644
--- a/pkg/ipc/ipc.go
+++ b/pkg/ipc/ipc.go
@@ -669,7 +669,7 @@ func (c *command) handshake() error {
                read <- nil
        }()
        // Sandbox setup can take significant time.
-       timeout := time.NewTimer(time.Minute)
+       timeout := time.NewTimer(4*time.Minute)
        select {
        case err := <-read:
                timeout.Stop()
@@ -800,8 +800,8 @@ func (c *command) exec(opts *ExecOpts, progData []byte) (output []byte, hanged b
 
 func sanitizeTimeout(config *Config) time.Duration {
        const (
-               executorTimeout = 5 * time.Second
-               minTimeout      = executorTimeout + 2*time.Second
+               executorTimeout = 120 * time.Second
+               minTimeout      = executorTimeout + 30*time.Second
        )
        timeout := config.Timeout
        if timeout == 0 {
-- 
2.17.1
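
As a variant of the same idea, instead of hard-coding larger constants the existing values could be multiplied by a configurable slowdown factor, which is roughly the direction the issue eventually goes. The snippet below is only an illustration with made-up names, not the real pkg/ipc code.

package ipc // illustrative sketch only, not the actual pkg/ipc source

import "time"

// slowdown would come from the manager config or the command line rather
// than being hard-coded; 4 roughly matches the factor used in the patch above.
const slowdown = 4

var (
        handshakeTimeout = slowdown * time.Minute     // was 1 minute
        executorTimeout  = slowdown * 5 * time.Second // was 5 seconds
        minTimeout       = executorTimeout + 2*time.Second
)
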

@kaartine

Also, I think it is worth mentioning that, at least in our setup, we had an annoying issue with very slow sshd start-up times with arm64 qemu running on x86.
I followed the instructions from here.

It turned out that the reason was low entropy, and it was easily "fixed" by installing haveged. It can be found in buildroot via make menuconfig:

    Target packages
        Miscellaneous
            [*] haveged

@xairy
Collaborator

xairy commented Dec 20, 2019

It turned out that the reason was low entropy, and it was easily "fixed" by installing haveged. It can be found in buildroot via make menuconfig

I've added this note into the instructions. Thanks!

@dvyukov
Collaborator Author

dvyukov commented Jul 6, 2020

FTR, also "executor not serving" and "no output" errors for arm64 emulation are mentioned here:
https://groups.google.com/forum/#!topic/syzkaller/x1d7j-Z-kHo

@dvyukov
Collaborator Author

dvyukov commented Dec 14, 2020

FTR: one other example of a slower configuration is gvisor+race+cover.

The slowdown for Go race+cover seems to be insanely huge. For all other combinations of coverage/race the mmap syscall takes ~1-2ms, but for race+coverage it takes ~350ms. I guess this is because the code is sprinkled with atomic increments, which become super expensive under the race detector.
I think the Go compiler should not race-instrument the coverage instrumentation (in both atomic and non-atomic modes), and go tool cover should use non-atomic coverage under the race detector. Otherwise we get an extremely high slowdown, little benefit from race-checking all coverage operations, and a very high false-negative rate from the race detector (executions of the same code synchronize with each other). But this will require changing the Go compiler, go tool and bazel...
cc @dean-deng
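
To make the atomic vs non-atomic difference concrete, here is a minimal standalone Go sketch (not gVisor or syzkaller code) of the two counter styles that coverage instrumentation can emit; when built with -race, the atomic variant is additionally intercepted by the race runtime on every covered block, which is where the large slowdown comes from.

package main

import "sync/atomic"

var counters [2]uint32 // stand-ins for compiler-emitted coverage counters

// covermode=count style: plain, unsynchronized increment.
func coveredNonAtomic() {
        counters[0]++
}

// covermode=atomic style: atomic add; under -race every call also goes
// through the race runtime.
func coveredAtomic() {
        atomic.AddUint32(&counters[1], 1)
}

func main() {
        // Build and run with and without -race to compare the relative cost.
        for i := 0; i < 1000000; i++ {
                coveredNonAtomic()
                coveredAtomic()
        }
}
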

@dvyukov
Collaborator Author

dvyukov commented Jan 15, 2021

I think this can be considered fixed:

// Timeouts structure parametrizes timeouts throughout the system.
// It allows supporting different operating systems, architectures and execution environments
// (emulation, models, etc) without scattering and duplicating knowledge about their execution
// performance everywhere.
// Timeouts calculation consists of 2 parts: base values and scaling.
// Base timeout values consist of a single syscall timeout, program timeout and "no output" timeout
// and are specified by the target (OS/arch), or defaults are used.
// Scaling part is calculated from the execution environment in pkg/mgrconfig based on VM type,
// kernel build type, emulation, etc. Scaling is specifically converged to a single number so that
// it can be specified/overridden for command line tools (e.g. syz-execprog -slowdown=10).
type Timeouts struct {
        // Base scaling factor, used only for a single syscall timeout.
        Slowdown int
        // Capped scaling factor used for timeouts other than syscall timeout.
        // It's already applied to all values in this struct, but can be used for one-off timeout values
        // in the system. This should also be applied to syscall/program timeout attributes in syscall descriptions.
        // Derived from Slowdown and should not be greater than Slowdown.
        // The idea behind capping is that slowdown can be large (10-20) and most timeouts already
        // include some safety margin. If we just multiply them we will get too large timeouts,
        // e.g. program timeout can become 5s*20 = 100s, or "no output" timeout: 5m*20 = 100m.
        Scale time.Duration
        // Timeout for a single syscall, after this time the syscall is considered "blocked".
        Syscall time.Duration
        // Timeout for a single program execution.
        Program time.Duration
        // Timeout for "no output" detection.
        NoOutput time.Duration
        // Limit on a single VM running time, after this time a VM is restarted.
        VMRunningTime time.Duration
        // How long we should test to get "no output" error (derivative of NoOutput, here to avoid duplication).
        NoOutputRunningTime time.Duration
}
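
For context, a rough sketch of how the two factors can relate; the helper name makeTimeouts, the cap and the base values below are illustrative, not the exact numbers used by syzkaller (per the comment above, the real scaling calculation lives in pkg/mgrconfig).

// makeTimeouts derives scaled timeouts from a slowdown factor; hypothetical
// helper complementing the Timeouts struct quoted above.
func makeTimeouts(slowdown int) Timeouts {
        scale := slowdown
        if scale > 3 {
                scale = 3 // cap so that already-generous timeouts don't explode
        }
        return Timeouts{
                Slowdown:            slowdown,
                Scale:               time.Duration(scale),
                Syscall:             time.Duration(slowdown) * 50 * time.Millisecond,
                Program:             time.Duration(scale) * 5 * time.Second,
                NoOutput:            time.Duration(scale) * 5 * time.Minute,
                VMRunningTime:       time.Duration(scale) * time.Hour,
                NoOutputRunningTime: time.Duration(scale)*5*time.Minute + time.Minute,
        }
}
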

dvyukov closed this as completed Jan 15, 2021