Netty direct memory leak from buffer #18324

Closed
jiacheliu3 opened this issue Oct 25, 2023 · 0 comments · Fixed by #18323
Labels
type-bug This issue is about a bug

Comments

jiacheliu3 (Contributor) commented Oct 25, 2023

Alluxio Version:
305

Describe the bug
We observe a leak of direct memory buffers in the worker JVM, for example:

2023-10-23 06:58:15,873 ERROR [data-server-tcp-socket-worker-8](RPCMessageDecoder.java:48) - Error in decoding message.
io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 4194304 byte(s) of direct memory (used: 10733224215, max: 10737418240)
        at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:845)
        at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:774)
        at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:701)
        at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:676)
        at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:215)
        at io.netty.buffer.PoolArena.tcacheAllocateNormal(PoolArena.java:197)
        at io.netty.buffer.PoolArena.allocate(PoolArena.java:139)
        at io.netty.buffer.PoolArena.reallocate(PoolArena.java:302)
        at io.netty.buffer.PooledByteBuf.capacity(PooledByteBuf.java:122)
        at io.netty.buffer.AbstractByteBuf.ensureWritable0(AbstractByteBuf.java:305)
        at io.netty.buffer.AbstractByteBuf.ensureWritable(AbstractByteBuf.java:280)
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1103)
        at io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteToMessageDecoder.java:105)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:288)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at java.lang.Thread.run(Thread.java:750)

The size of the JVM direct memory cap is irrelevant; we observe similar OOMs with different cap sizes. For a JVM with 1G of direct memory, all 1G is used up after reading two 16MB files several times, even though the total amount of data read is probably less than 1G. This is very reproducible.
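As a side note, the counter in the error message above can be watched directly without waiting for the crash. The sketch below is only illustrative and assumes a Netty 4.1.x version where PlatformDependent exposes these accessors (the used counter reports -1 if it is disabled); the probe class itself is hypothetical and not part of the Alluxio code base.

```java
// Illustrative probe only: logs the same direct memory counter that shows up in
// io.netty.util.internal.OutOfDirectMemoryError ("used: ..., max: ...").
import io.netty.util.internal.PlatformDependent;

public class NettyDirectMemoryProbe {
  public static void main(String[] args) throws InterruptedException {
    while (true) {
      long used = PlatformDependent.usedDirectMemory(); // -1 if the internal counter is not enabled
      long max = PlatformDependent.maxDirectMemory();   // the cap Netty enforces
      System.out.printf("netty direct memory: used=%d, max=%d%n", used, max);
      Thread.sleep(5_000); // sample while repeating the 16MB reads
    }
  }
}
```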

The config below is proven to directly lead to the leak:

# Use this mode so that the worker cache is copied into the read buffer to serve the read request.
# The buffer is direct memory, so this reproduces the leak.
alluxio.worker.network.netty.file.transfer=MAPPED

If this MAPPED mode is NOT used, no leak is observed.
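For background on what a fix has to guarantee (this is only a generic Netty sketch, not the actual change in #18323): a pooled direct ByteBuf allocated to hold the copied data must be released exactly once on every code path, otherwise the pooled chunk is never returned and the counter above keeps climbing. The helper name below is hypothetical.

```java
// Generic Netty ref-counting sketch, not the Alluxio fix itself.
import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;

public class MappedCopySketch {
  // Hypothetical helper: copy one chunk of file data into a pooled direct buffer.
  static ByteBuf copyChunk(byte[] chunk) {
    ByteBuf buf = PooledByteBufAllocator.DEFAULT.directBuffer(chunk.length);
    boolean success = false;
    try {
      buf.writeBytes(chunk);
      success = true;
      // Ownership passes to the caller; whoever writes it to the channel
      // (e.g. ctx.writeAndFlush) becomes responsible for the final release.
      return buf;
    } finally {
      if (!success) {
        // Without a release on the failure path, the direct memory backing this
        // buffer is never returned to the pool.
        buf.release();
      }
    }
  }
}
```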

To Reproduce
See above

Expected behavior
Direct memory used to serve read requests should be released after use, instead of accumulating until the worker hits OutOfDirectMemoryError.

Are you planning to fix it
Y

jiacheliu3 added the type-bug label on Oct 25, 2023
alluxio-bot pushed a commit that referenced this issue Oct 27, 2023
### What changes are proposed in this pull request?

resolves #18324

Disclaimer: I might have monkey-typed this fix but I still do not know anything about buffer ref counting. This fix does NOT make me the owner of this state machine.

pr-link: #18323
change-id: cid-eb5bde353c08d3d9bdd39da5b9caf13681bae495
ssz1997 pushed a commit to ssz1997/alluxio that referenced this issue Dec 15, 2023