Re-initialise Netty worker group on plugin restart #289

praseodym
Contributor

This allows the plugin to actually recover from exceptions after a
restart. It also has the side effect of providing nicer error messages
and clearer stack traces to the end user.

Closes #268:

...
[2018-01-24T23:02:17,166][INFO ][org.logstash.beats.Server] Starting server on port: 5044
[2018-01-24T23:02:23,386][ERROR][logstash.pipeline        ] A plugin had an unrecoverable error. Will restart this plugin.
  Pipeline_id:main
  Plugin: <LogStash::Inputs::Beats port=>5044, id=>"930bf638e61e22156ab3a5029e0060c6affa3c8e14b1c217aa4bbfa3c896ec74", enable_metric=>true, codec=><LogStash::Codecs::Plain id=>"plain_a103340e-59bd-4d83-a089-fce6a9404307", enable_metric=>true, charset=>"UTF-8">, host=>"0.0.0.0", ssl=>false, ssl_verify_mode=>"none", include_codec_tag=>true, ssl_handshake_timeout=>10000, tls_min_version=>1, tls_max_version=>1.2, cipher_suites=>["TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384", "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384", "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256", "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256", "TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384", "TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384", "TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256", "TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256"], client_inactivity_timeout=>60, executor_threads=>64>
  Error: Address already in use
  Exception: Java::JavaNet::BindException
  Stack: sun.nio.ch.Net.bind0(Native Method)
sun.nio.ch.Net.bind(sun/nio/ch/Net.java:433)
sun.nio.ch.Net.bind(sun/nio/ch/Net.java:425)
sun.nio.ch.ServerSocketChannelImpl.bind(sun/nio/ch/ServerSocketChannelImpl.java:223)
io.netty.channel.socket.nio.NioServerSocketChannel.doBind(io/netty/channel/socket/nio/NioServerSocketChannel.java:128)
io.netty.channel.AbstractChannel$AbstractUnsafe.bind(io/netty/channel/AbstractChannel.java:558)
io.netty.channel.DefaultChannelPipeline$HeadContext.bind(io/netty/channel/DefaultChannelPipeline.java:1283)
io.netty.channel.AbstractChannelHandlerContext.invokeBind(io/netty/channel/AbstractChannelHandlerContext.java:501)
io.netty.channel.AbstractChannelHandlerContext.bind(io/netty/channel/AbstractChannelHandlerContext.java:486)
io.netty.channel.DefaultChannelPipeline.bind(io/netty/channel/DefaultChannelPipeline.java:989)
io.netty.channel.AbstractChannel.bind(io/netty/channel/AbstractChannel.java:254)
io.netty.bootstrap.AbstractBootstrap$2.run(io/netty/bootstrap/AbstractBootstrap.java:364)
io.netty.util.concurrent.AbstractEventExecutor.safeExecute(io/netty/util/concurrent/AbstractEventExecutor.java:163)
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(io/netty/util/concurrent/SingleThreadEventExecutor.java:403)
io.netty.channel.nio.NioEventLoop.run(io/netty/channel/nio/NioEventLoop.java:463)
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(io/netty/util/concurrent/SingleThreadEventExecutor.java:858)
io.netty.util.concurrent.FastThreadLocalRunnable.run(io/netty/util/concurrent/FastThreadLocalRunnable.java:30)
java.lang.Thread.run(java/lang/Thread.java:748)
[2018-01-24T23:02:24,389][INFO ][org.logstash.beats.Server] Starting server on port: 5044
...

}

public void enableSSL(SslSimpleBuilder builder) {
    sslBuilder = builder;
}

public Server listen() throws InterruptedException {
    workGroup = new NioEventLoopGroup();

Contributor

@praseodym maybe we should check whether workGroup != null before this line and, if so, shut down the "old" worker group first so we don't leak any threads here? I think it's not impossible to get into that situation, since the reason we get to a reload is effectively always an Exception in the code that should shut this down, right?

Contributor Author

@praseodym praseodym Jan 31, 2018

Good point, I've added a non-null check with worker group shutdown and a null check at plugin shutdown. Reloads will mostly be caused by failing to listen, which leaves the worker group in a dead state (which is why reloading never worked), so I presume that the non-null shutdown will be mostly a no-op.
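
For readers following along, the pattern discussed here (shut down any existing worker group before creating a new one, and guard against null at plugin shutdown) can be sketched roughly as below. This is not the code from this PR: it is a minimal illustration that uses a plain `ExecutorService` as a stand-in for Netty's `NioEventLoopGroup` so that it runs without Netty on the classpath. The class `RestartableServer` and the methods `stop()` and `isRunning()` are hypothetical names. With a real Netty group, the shutdown call would presumably be `shutdownGracefully()` instead.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the plugin's Server class. ExecutorService plays
// the role of Netty's NioEventLoopGroup (an assumption made for runnability).
class RestartableServer {
    private ExecutorService workGroup; // null until the first listen()

    // Called on every (re)start: discard any previous group, then create a fresh one.
    void listen() throws InterruptedException {
        if (workGroup != null) {
            shutdown(workGroup); // avoid leaking threads from the previous run
        }
        workGroup = Executors.newFixedThreadPool(2);
        // ... bind the port and serve using workGroup ...
    }

    // Called at plugin shutdown; safe even if listen() was never called.
    void stop() throws InterruptedException {
        if (workGroup != null) {
            shutdown(workGroup);
            workGroup = null;
        }
    }

    boolean isRunning() {
        return workGroup != null && !workGroup.isShutdown();
    }

    private static void shutdown(ExecutorService group) throws InterruptedException {
        group.shutdown();
        group.awaitTermination(5, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        RestartableServer server = new RestartableServer();
        server.stop();    // no-op: nothing was started
        server.listen();  // first start
        server.listen();  // restart: the old group is shut down first
        System.out.println("running after restart: " + server.isRunning());
        server.stop();
        System.out.println("running after stop: " + server.isRunning());
    }
}
```

Creating the group inside `listen()` rather than once in the constructor is what lets a restart begin from a fresh group instead of reusing one left in a dead state by a failed bind.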

@praseodym praseodym force-pushed the reinitialise-worker-group-on-restart branch from deb8225 to 336c2f1 Compare April 2, 2018 18:54
@praseodym
Contributor Author

Rebased. Not sure why one of the Travis build jobs is failing.

@praseodym
Contributor Author

@original-brownbear could you review or merge this PR?

@original-brownbear
Contributor

@praseodym looks good, retriggered Travis and will merge if it goes green. Thanks!

@original-brownbear
Contributor

@robbavey actually now that you're the owner here, can you merge this? :) (don't wanna interfere if I shouldn't :P)

@elasticsearch-bot

Rob Bavey merged this into the following branches!

Branch Commits
master f853ce6

@robbavey
Contributor

robbavey commented Jun 4, 2018

@praseodym LGTM - thanks for the contribution, and apologies for the delay.

@original-brownbear
Contributor

@robbavey thanks!
