-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unify Runner dynamic launching and reloading under RunnerList #7172
Conversation
libbeat/cfgfile/list.go
Outdated
return ok | ||
} | ||
|
||
// Start the given runner and add it to the list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment on exported method RunnerList.Add should be of the form "Add ..."
libbeat/cfgfile/list.go
Outdated
return nil | ||
} | ||
|
||
// StopAll runners |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment on exported method RunnerList.Stop should be of the form "Stop ..."
26a84a5
to
d975687
Compare
libbeat/cfgfile/list.go
Outdated
for hash, runner := range stopList { | ||
debugf("Stopping runner: %s", runner) | ||
delete(r.runners, hash) | ||
go runner.Stop() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a change in behaviour from previous implementation where it was synchronous. This could become a problem in Filebeat or at least lead to more errors. Before a new prospector can be started on an existing state of a file, the state must be set to Finish. If the shutdown happens in go routine, this overlap can become bigger. But I think it should still work.
libbeat/cfgfile/list.go
Outdated
func (r *RunnerList) Has(hash uint64) bool { | ||
r.mutex.Lock() | ||
defer r.mutex.Unlock() | ||
_, ok := r.runners[hash] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A read write mutex could be used here.
@@ -1,48 +0,0 @@ | |||
package cfgfile |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any replacement for these tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forgot to add the file to the commit 🤦♂️ pushed
|
||
hash, err := hashstructure.Hash(rawCfg, nil) | ||
if err != nil { | ||
// Make sure the next run also updates because some runners were not properly loaded |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this case still handled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope :/ I'll have a look and find a solution
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See 4ea21f2
libbeat/cfgfile/list.go
Outdated
for h, runner := range r.runners { | ||
debugf("Stopping runner: %s", runner) | ||
delete(r.runners, h) | ||
runner.Stop() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before runners were stopped in parallel. This is important to speed up the shutdown process especially with a large amount of modules or prospectors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense, will update the code to do that
libbeat/cfgfile/list_test.go
Outdated
|
||
func (r *runnerFactory) Create(x beat.Pipeline, c *common.Config, meta *common.MapStrPointer) (Runner, error) { | ||
config := struct { | ||
Id int64 `config:"id"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
struct field Id should be ID
b6727b3
to
18c4989
Compare
Thanks @ruflin for the thorough review, I think this is ready for a 2nd look :) |
87e2dbd
to
0d44b7e
Compare
I've added 7c8ac78 after our last discussion, that should cover cases were a Filebeat input fails to start due to unclosed states (it will keep retrying every 10s). It also unifies how autodiscover uses the cfgfile.List.Reload |
5f9ef83
to
7c8ac78
Compare
The RACE detector does not seem to be happy on Jenkins |
This change adds some overhead to autodiscover, as we need to hash all configs for each new event, but we obtain 2 great benefits: - We use the same code paths (Reload) for autodiscover and config reload. - Autodiscover becomes resilient to module/input initialization errors, specially in the case of Filebeat, failed starts will be retried.
7c8ac78
to
a31e2d6
Compare
fixed |
5d803c1
to
63c350a
Compare
|
||
err := a.runners.Reload(configs) | ||
|
||
// On error, make sure the next run also updates because some runners were not properly loaded |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we create a debug log entry for this err?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reload process logs errors already:
- https://github.com/elastic/beats/pull/7172/files/63c350a7ea841fd6f2b5bea7bea4debe1c0afdf5#diff-ab62770abd1b3493ace2c1a6c5426f0eR59
- https://github.com/elastic/beats/pull/7172/files/63c350a7ea841fd6f2b5bea7bea4debe1c0afdf5#diff-ab62770abd1b3493ace2c1a6c5426f0eR84
As this is a list of errors, I think logging it's better managed there, wdyt?
go runner.Stop() | ||
a.runners.Remove(hash) | ||
if a.runners.Has(hash) { | ||
delete(a.configs, hash) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can only 1 go routine at the time access a.configs
?
Where is the runner stopped now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
configs is only modified by the event loop (handleStart
and handleStop
), so there is only one goroutine using it. Start/Stop is now handled by runners.Reload
: https://github.com/elastic/beats/pull/7172/files/63c350a7ea841fd6f2b5bea7bea4debe1c0afdf5#diff-3b0ae03a54b2e31c58f8dcf308a6a7b8R127. This is the same code path used by configuration reload mechanism
Introduced by elastic#7172, some inputs/modules may modify the passed configuration, this results on a effective change on the hash of the config. Before this change, the reload process was taking a running config as a not running config. Resulting on a stop/start after the first run. Example (with some added debugging): First run (add 1 config): ``` 2018-06-01T02:08:20.184+0200 DEBUG [autodiscover] cfgfile/list.go:53 Starting reload procedure, current runners: 0 2018-06-01T02:08:20.185+0200 INFO cfgfile/list.go:143 map[type:docker containers:map[ids:[56322b70c8a3712494e559381c7b8a6ce62a6495e33630628e6624b75b5a7505]]] 2018-06-01T02:08:20.185+0200 INFO cfgfile/list.go:144 Hash: %!s(uint64=12172188027786936243) 2018-06-01T02:08:20.185+0200 DEBUG [autodiscover] cfgfile/list.go:71 Start list: 1, Stop list: 0 ``` Second run (add another config, the first one gets restarted): ``` 2018-06-01T02:08:20.185+0200 DEBUG [autodiscover] cfgfile/list.go:53 Starting reload procedure, current runners: 1 2018-06-01T02:08:20.185+0200 INFO cfgfile/list.go:143 map[paths:[/var/lib/docker/containers/56322b70c8a3712494e559381c7b8a6ce62a6495e33630628e6624b75b5a7505/*.log] docker-json:map[stream:all partial:true] type:docker containers:map[ids:[56322b70c8a3712494e559381c7b8a6ce62a6495e33630628e6624b75b5a7505]]] 2018-06-01T02:08:20.185+0200 INFO cfgfile/list.go:144 Hash: %!s(uint64=12741034856879725532) 2018-06-01T02:08:20.185+0200 INFO cfgfile/list.go:143 map[type:docker containers:map[ids:[a626e25679abd2b9af161277f1beee96c1bba6b9412771d17da7ebfacca640a7]]] 2018-06-01T02:08:20.185+0200 INFO cfgfile/list.go:144 Hash: %!s(uint64=7080456881540055745) 2018-06-01T02:08:20.185+0200 DEBUG [autodiscover] cfgfile/list.go:71 Start list: 2, Stop list: 1 ```
Introduced by #7172, some inputs/modules may modify the passed configuration, this results on a effective change on the hash of the config. Before this change, the reload process was taking a running config as a not running config. Resulting on a stop/start after the first run. Example (with some added debugging): First run (add 1 config): ``` 2018-06-01T02:08:20.184+0200 DEBUG [autodiscover] cfgfile/list.go:53 Starting reload procedure, current runners: 0 2018-06-01T02:08:20.185+0200 INFO cfgfile/list.go:143 map[type:docker containers:map[ids:[56322b70c8a3712494e559381c7b8a6ce62a6495e33630628e6624b75b5a7505]]] 2018-06-01T02:08:20.185+0200 INFO cfgfile/list.go:144 Hash: %!s(uint64=12172188027786936243) 2018-06-01T02:08:20.185+0200 DEBUG [autodiscover] cfgfile/list.go:71 Start list: 1, Stop list: 0 ``` Second run (add another config, the first one gets restarted): ``` 2018-06-01T02:08:20.185+0200 DEBUG [autodiscover] cfgfile/list.go:53 Starting reload procedure, current runners: 1 2018-06-01T02:08:20.185+0200 INFO cfgfile/list.go:143 map[paths:[/var/lib/docker/containers/56322b70c8a3712494e559381c7b8a6ce62a6495e33630628e6624b75b5a7505/*.log] docker-json:map[stream:all partial:true] type:docker containers:map[ids:[56322b70c8a3712494e559381c7b8a6ce62a6495e33630628e6624b75b5a7505]]] 2018-06-01T02:08:20.185+0200 INFO cfgfile/list.go:144 Hash: %!s(uint64=12741034856879725532) 2018-06-01T02:08:20.185+0200 INFO cfgfile/list.go:143 map[type:docker containers:map[ids:[a626e25679abd2b9af161277f1beee96c1bba6b9412771d17da7ebfacca640a7]]] 2018-06-01T02:08:20.185+0200 INFO cfgfile/list.go:144 Hash: %!s(uint64=7080456881540055745) 2018-06-01T02:08:20.185+0200 DEBUG [autodiscover] cfgfile/list.go:71 Start list: 2, Stop list: 1 ```
This PR unifies all uses of dynamic runner handling under a new
RunnerList
structure. It can receive a list of configs as the desired state and do all the needed changes (start/stopping) to transition to it.I plan to use this as part of #7028