Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Event Loop shutdown stuck due to uncompleted connect #192

Closed
franz1981 opened this issue Feb 17, 2023 · 4 comments
Closed

Event Loop shutdown stuck due to uncompleted connect #192

franz1981 opened this issue Feb 17, 2023 · 4 comments
Assignees
Labels
bug Something isn't working
Milestone

Comments

@franz1981
Copy link
Contributor

franz1981 commented Feb 17, 2023

Reproducer:

public class IOUringClientSocketConnectionShortTimeoutTest extends AbstractClientSocketTest {

   @BeforeAll
   public static void loadJNI() {
      assumeTrue(IOUring.isAvailable());
   }

   @Test
   @Timeout(value = 30000, unit = TimeUnit.MILLISECONDS)
   public void testConnectTimeoutAndClose(TestInfo testInfo) throws Throwable {
      run(testInfo, (bootstrap) -> {
         testFailedConnectWithSuperShortTimeout(bootstrap);
         bootstrap.config().group().shutdownGracefully().sync();;
      });
   }

   public void testFailedConnectWithSuperShortTimeout(Bootstrap cb) throws Throwable {
      cb.handler(new SimpleChannelInboundHandler<Object>() {
         @Override
         public void channelRead0(ChannelHandlerContext ctx, Object msgs)  {
            // Nothing will be sent.
         }
      });
      cb.option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 1);
      ChannelFuture future = cb.connect("198.51.100.254", 65535);
      try {
         assertThat(future.await(30000), is(true));
      } finally {
         Assert.assertTrue(future.cause() instanceof ConnectTimeoutException);
      }
   }

   @Override
   protected List<TestsuitePermutation.BootstrapFactory<Bootstrap>> newFactories() {
      return IOUringSocketTestPermutation.INSTANCE.clientSocket();
   }
}

This test will hang forever (30 sec thanks to the test timeout) after requesting the group to shutdown gracefully, due to:

"testsuite-io_uring-worker-11-1@3031" daemon prio=5 tid=0xf nid=NA runnable
  java.lang.Thread.State: RUNNABLE
	  at io.netty.incubator.channel.uring.Native.ioUringEnter(Native.java:-1)
	  at io.netty.incubator.channel.uring.IOUringSubmissionQueue.submit(IOUringSubmissionQueue.java:254)
	  at io.netty.incubator.channel.uring.IOUringSubmissionQueue.submitAndWait(IOUringSubmissionQueue.java:242)
	  at io.netty.incubator.channel.uring.IOUringEventLoop.cleanup(IOUringEventLoop.java:344)
	  at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:1044)
	  at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	  at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	  at java.lang.Thread.run(Thread.java:750)

This will mimic what we're experiencing in the Vert-x test suite due to a very similar test (and few others, related server side, but still due to pending CQEs that never happen).

@cescoffier @vietj

@franz1981 franz1981 added the bug Something isn't working label Feb 17, 2023
@franz1981 franz1981 changed the title Event Loop stuck due to channels not removed/close event completed Event Loop shutdown stuck due to uncompleted CQEs Feb 17, 2023
@franz1981
Copy link
Contributor Author

franz1981 commented Feb 17, 2023

It seems that it cannot get out of the cleanup state because

17:17:30.517 [testsuite-io_uring-worker-11-1] WARN  io.netty.channel.AbstractChannel - Can't invoke task later as EventLoop rejected it
java.util.concurrent.RejectedExecutionException: event executor terminated
	at io.netty.util.concurrent.SingleThreadEventExecutor.reject(SingleThreadEventExecutor.java:934)
	at io.netty.util.concurrent.SingleThreadEventExecutor.offerTask(SingleThreadEventExecutor.java:351)
	at io.netty.util.concurrent.SingleThreadEventExecutor.addTask(SingleThreadEventExecutor.java:344)
	at io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:836)
	at io.netty.util.concurrent.SingleThreadEventExecutor.execute0(SingleThreadEventExecutor.java:827)
	at io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:817)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.invokeLater(AbstractChannel.java:1024)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.deregister(AbstractChannel.java:804)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.fireChannelInactiveAndDeregister(AbstractChannel.java:764)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:747)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:620)
	at io.netty.incubator.channel.uring.AbstractIOUringChannel$AbstractUringUnsafe.forceClose(AbstractIOUringChannel.java:357)
	at io.netty.incubator.channel.uring.AbstractIOUringChannel$AbstractUringUnsafe.processDelayedClose(AbstractIOUringChannel.java:471)
	at io.netty.incubator.channel.uring.IOUringEventLoop.handle(IOUringEventLoop.java:278)
	at io.netty.incubator.channel.uring.UserData.decode(UserData.java:30)
	at io.netty.incubator.channel.uring.IOUringCompletionQueue.process(IOUringCompletionQueue.java:90)
	at io.netty.incubator.channel.uring.IOUringEventLoop.cleanup(IOUringEventLoop.java:345)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:1044)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:750)

a very late/delayed close complete and try to invokeLater a channel deregistration when the event loop won't accept any new tasks anymore.

@franz1981
Copy link
Contributor Author

franz1981 commented Feb 17, 2023

The whole process seems:

  1. a connect request with timeout is issued and a connect SQE is enqueued
  2. a timeout SQE can be enqueued (depends how distant we are from the connection timeout specified)
  3. timeout CQEs completes, if issued (and handled correctly)
  4. the timeout task handling cause a channel close request to start, but cannot forcibly close the channel, because there's an uncompleted connect request. It turns the close request into a delayed close (to be handled in the future, when there are no more write/read/connect in flight)
  5. the timeout handling return back the connect (with timeout) request to the caller, that proceed shutting down the executor service (it receives a connect timeout exception although the channels hasn't been removed yet!)
  6. a graceful shutdown request is issued, and the event loop try to close all channels (one, in this case) without success, again, because there's still an uncompleted connect. It keep on trying in a loop, until the shutdown quiet period expires
  7. when the shutdown quiet period is passed, given that there are no more tasks added, the event loop can exit and initiate the cleanup
  8. event loop state is shutdown
  9. cleanup tries again to close all channels still without success and wait till any (connect) completion happens
  10. connection completion finally arrive after 2 mins 10 secs (still investigating why, but I suppose is a TCP OS setting)
  11. event loop is shutdown and channel deregister's invokeLater fail to submit the actual doDeregister task, rejected
  12. cleanup is now stucked awaiting a CQE that cannot arrive anymore, because already happened, unable to remove the last channel

Now gotta check what happen on epoll and where's the problem here @normanmaurer
@chrisvest as well if he want to check if this happen on Netty 5 as well ;)

@franz1981 franz1981 changed the title Event Loop shutdown stuck due to uncompleted CQEs Event Loop shutdown stuck due to uncompleted connect Feb 18, 2023
@franz1981
Copy link
Contributor Author

By looking how Netty 5 works I think it's not affected, but I'll write a test on it to get used to new logic (that I love, BTW, wow! - sorry for arriving late)

@franz1981
Copy link
Contributor Author

Fixed by #193

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants