-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure proper shutdown of libbeat #1075
Conversation
Could not reopen previous PR due to forced push |
Jenkins standing by to test this. If you aren't a maintainer, you can ignore this comment. Someone with commit access, please review this and clear it for Jenkins to run; then say 'jenkins, test it'. |
@ruflin No, I doubt it will help fix this race issue :( |
return false | ||
} | ||
ch <- msg | ||
return true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this channel might block. Why remove the Done channel here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually it should have the same behavior with or without this case <-done
because here we make sure to output all events received.
The publisher being closed, balance forwardEvent()
will not be called once we reach WGevent.Wait()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
problem is, if output is unresponsive (e.g. infinite retry + network issues), the publisher will basically hang on shutdown doing no progress.
485186e
to
9cc4581
Compare
…down) Libbeat publisher now exits once all async events are published, all events outputed and ConnectionModes closed. * Move WorkerSignal to libbeat/common: WorkerSignals are used in packages publisher and outputs * Add a second WaitGroup to WorkerSignal. WaitGroups are used to track events and shutdown of Go routines * Add methods in WorkerSignal that wrap WaitGroups methods call * Remove TestMessageWorkerStopQueue and stopQueue function in WorkerSignal: events are not considered failed at shutdown anymore * Add tests for published events after shutdown * Add method Close to outputs/Outputer. The method is called by publisher/outputWorker onStop() * Edit CHANGELOG
9cc4581
to
7f16e11
Compare
Squashed and rebased after review changes |
LGTM. Problem with this approach is, beats won't shutdown if output is unresponsive. But this is already the case, no regression here. In future I'd like to have publisher.Client return a new client instance connecting to the publisher pipeline, which is closable. So beats startup goes like:
On shutdown we can disconnect from pipeline, so publisher pipeline and beats can shutdown in parallel. For filebeat the |
Ensure proper shutdown of libbeat
@urso Thanks for giving your insight. If I understand correctly, with a feature such as a persistent queue, a good solution would be to save unpublished events to disk at shutdown. A persistent queue would certainly help when getting stuck because of unresponsive output. But is it be necessary do like this 100% of the time? What if:
The thing would be to measure the responsiveness of the output to know if we are in situation 1 or 2. |
I think I tried this today and it seems like it could work well :). |
The |
Persistent queue is not about shutdown only, but more about resiliency. Having configured a persistent queue all events must pass the queue and only once ACKed by the queue, events will be published (with send-at-leat-once semantics). Thing is, there is no paticular need to add too much logic for handling shutdown. For most beats it's fine to drop data. Beats not being fine with event loss (e.g. filebeat) remember where they left off and can restart from last location. No problem if queue is dropped. Still it's of interest to signal 'shutting' down events to filebeat to do one last registry update.
I won't even try. Just drop events in pipeline and continue where left off. Filebeat/Winlogbeat or persistent queue remember where they left off and resend events not ACKed by output plugin.
Beats can already provide callback to PublishEvent (type output.Signaler) for getting Success/Fail events. This is used by filebeat async publisher mode:
Current problem is I can not distinguish between failure (which should only be possible during shutdown) or shutdown. Adding a shutdown callback we could do a 'panic' in 'Failed' handler. Will be in vacation for next 2 weeks. Let's continue discussion thereafter. |
Libbeat publisher now exits once all async events are published, all
events outputed and ConnectionModes closed.
failed at shutdown anymore