commandBuffer performs an expensive queue operation in channelInactive? #960

Closed
shrimptttttttttt opened this issue Jan 8, 2019 · 9 comments
Labels
status: waiting-for-feedback We need additional information before we can continue

Comments

@shrimptttttttttt

Hi all,
In branches below 4.5.x, channelInactive adds all queued commands to the command buffer. If the network is very unstable and QPS/traffic is very high, the command buffer grows huge, and those objects cannot be reclaimed by GC, so our client application spends all its time in full GC. I see that you removed the commandBuffer.addAll call in channelInactive. Was it removed for the reason I describe? Thanks for your attention, and I look forward to your reply.

@mp911de
Collaborator

mp911de commented Jan 8, 2019

Lettuce retains commands on disconnect to re-trigger them on reconnect. We applied some optimizations to command re-queueing; you might want to check out Lettuce 5 as well.

If that's what you're talking about, then everything works as designed and you can close this ticket. If you mean something different, please elaborate.

@mp911de mp911de added the status: waiting-for-feedback We need additional information before we can continue label Jan 8, 2019
@shrimptttttttttt
Author

shrimptttttttttt commented Jan 8, 2019

@Override
    public void channelInactive(ChannelHandlerContext ctx) throws Exception {

        if (debugEnabled) {
            logger.debug("{} channelInactive()", logPrefix());
        }

        try {
            writeLock.lock();
            setStateIfNotClosed(LifecycleState.DISCONNECTED);

            if (redisChannelHandler != null) {

                if (debugEnabled) {
                    logger.debug("{} deactivating channel handler", logPrefix());
                }

                setStateIfNotClosed(LifecycleState.DEACTIVATING);
                redisChannelHandler.deactivated();
            }

            setStateIfNotClosed(LifecycleState.DEACTIVATED);

            // Shift all commands to the commandBuffer so the queue is empty.
            // Allows to run onConnect commands before executing buffered commands
            commandBuffer.addAll(queue);
            queue.removeAll(commandBuffer);

        } finally {
            writeLock.unlock();
        }

        if (buffer != null) {
            rsm.reset();
            buffer.clear();
        }

        if (debugEnabled) {
            logger.debug("{} channelInactive() done", logPrefix());
        }
        super.channelInactive(ctx);
    }

Above is the 3.5.x branch's code. commandBuffer.addAll(queue) is called in channelInactive without any queue size check, so the commandBuffer can grow very large if channelInactive keeps being called. In the 4.5.x branch, however, channelInactive no longer calls commandBuffer.addAll(queue), as shown below:

@Override
    public void channelInactive(ChannelHandlerContext ctx) throws Exception {

        if (debugEnabled) {
            logger.debug("{} channelInactive()", logPrefix());
        }

        if (channel != null && ctx.channel() != channel) {
            logger.debug("{} My channel and ctx.channel mismatch. Propagating event to other listeners.", logPrefix());
            super.channelInactive(ctx);
            return;
        }

        synchronized (stateLock) {
            try {
                lockWritersExclusive();
                setState(LifecycleState.DISCONNECTED);

                if (redisChannelHandler != null) {

                    if (debugEnabled) {
                        logger.debug("{} deactivating channel handler", logPrefix());
                    }

                    setState(LifecycleState.DEACTIVATING);
                    redisChannelHandler.deactivated();
                }

                rebuildQueue();
                setState(LifecycleState.DEACTIVATED);

                channel = null;
            } finally {
                unlockWritersExclusive();
            }
        }

        rsm.reset();

        if (debugEnabled) {
            logger.debug("{} channelInactive() done", logPrefix());
        }
        super.channelInactive(ctx);
    }
  1. What is the reason you removed commandBuffer.addAll(queue) from channelInactive?

  2. Will the commandBuffer.addAll(queue) call in channelInactive in versions lower than 4.4.x cause heap-size issues?

@mp911de
Collaborator

mp911de commented Jan 8, 2019

If channelInactive is constantly called and the queue is large, then you have other issues to solve first (e.g. resolve the overload scenario and remove the cause of the frequent disconnects).

Lettuce 4 has moved the queue rebuild into rebuildQueue(). This happens via Queue.poll and Queue.add.
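
For illustration only, here is a minimal sketch of that poll/add style of re-queueing (not the actual Lettuce 4 rebuildQueue() implementation): each command is moved exactly once, whereas with plain ArrayDeque-style queues the old addAll(queue) plus queue.removeAll(commandBuffer) rescans the buffer once per queued element.

import java.util.ArrayDeque;
import java.util.Queue;

// Minimal sketch of poll/add re-queueing; not Lettuce's actual code.
public class RebuildQueueSketch {

    // Drain 'queue' into 'commandBuffer' one element at a time.
    static <T> void rebuild(Queue<T> queue, Queue<T> commandBuffer) {
        T command;
        while ((command = queue.poll()) != null) {
            commandBuffer.add(command);
        }
    }

    public static void main(String[] args) {
        Queue<String> queue = new ArrayDeque<>();
        Queue<String> commandBuffer = new ArrayDeque<>();
        queue.add("GET key1");
        queue.add("SET key2 value2");

        rebuild(queue, commandBuffer);

        System.out.println(queue.size());         // 0
        System.out.println(commandBuffer.size()); // 2
    }
}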

Will commandBuffer.addAll(queue); … cause heap size issue?

Commands are not duplicated; they are retained until they get canceled or executed. If you run out of memory because you have too many queued commands, then you need to solve that issue first. Moving commands between queues requires additional memory, but only a fraction of the memory already allocated for the commands themselves.

What is the actual question? What are you trying to solve?

@shrimptttttttttt
Author

Sorry, I have to correct my question. The issue I actually have is this:
Our client application keeps running full GC all the time. I took a heap dump and found that the ClientResource's transportBuffer holds 900,000+ (90w+) Command objects.

And I am wondering why...

@shrimptttttttttt
Author

Even after we stop the traffic, the client keeps running full GC until it OOMs and cannot recover. It seems like the worker thread has stopped consuming the transportBuffer.

@mp911de
Collaborator

mp911de commented Jan 8, 2019

I suggest limiting the request queue via ClientOptions (see https://github.com/lettuce-io/lettuce-core/wiki/Client-Options). Additionally, you can set disconnectedBehavior to reject commands while disconnected. Both settings will protect you from accepting too many command objects.
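
Assuming Lettuce 5 (io.lettuce.core packages; the same options exist under com.lambdaworks.redis in Lettuce 4), the configuration could look roughly like this:

import io.lettuce.core.ClientOptions;
import io.lettuce.core.RedisClient;

public class BoundedQueueConfig {

    public static void main(String[] args) {
        RedisClient client = RedisClient.create("redis://localhost:6379");

        client.setOptions(ClientOptions.builder()
                .requestQueueSize(4096) // bound the number of queued commands
                .disconnectedBehavior(ClientOptions.DisconnectedBehavior.REJECT_COMMANDS) // fail fast while disconnected
                .build());
    }
}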

@shrimptttttttttt
Author

shrimptttttttttt commented Jan 8, 2019

We have configured the ClientOptions max queue size as 4096, but the transportBuffer still holds 900,000+ (90w+) Commands. We use 3.5.0.Final, by the way.

We also found that we get a ConcurrentModificationException, as below:

java.util.ConcurrentModificationException: null
        at java.util.ArrayDeque.delete(ArrayDeque.java:554) ~[na:1.8.0_172]
        at java.util.ArrayDeque.removeFirstOccurrence(ArrayDeque.java:375) ~[na:1.8.0_172]
        at java.util.ArrayDeque.remove(ArrayDeque.java:752) ~[na:1.8.0_172]
        at com.lambdaworks.redis.protocol.CommandHandler.queueCommand(CommandHandler.java:428) ~[lettuce-3.5.0.Final.jar:na]
        at com.lambdaworks.redis.protocol.CommandHandler.writeSingleCommand(CommandHandler.java:372) ~[lettuce-3.5.0.Final.jar:na]
        at com.lambdaworks.redis.protocol.CommandHandler.write(CommandHandler.java:355) ~[lettuce-3.5.0.Final.jar:na]

We know the problem above is fixed in 3.5.1, but I want to figure out the reasons while still on 3.5.0.Final.

  1. Why does transportBuffer hold 900,000+ Commands when we have already configured the max queue size as 4096? Because the transportBuffer holds so many Commands, our client keeps running full GC without reclaiming enough memory and eventually OOMs.
  2. When we stop the traffic, our client still can't recover. It seems like the thread consuming the transportBuffer has stopped working.
  3. Are 1 and 2 related to the transportBuffer.remove failure?

I have some ideas I'd like to share, and I hope you can give feedback.

  1. The ClientOptions max queue size check below has a precondition:
if (commandBuffer.size() + queue.size() >= clientOptions.getRequestQueueSize()) {
                throw new RedisException("Request queue size exceeded: " + clientOptions.getRequestQueueSize()
                        + ". Commands are not accepted until the queue size drops.");
            }

The precondition is that the command must first be removed from the transportBuffer and added to the queue successfully:

private void writeSingleCommand(ChannelHandlerContext ctx, RedisCommand<K, V, ?> command, ChannelPromise promise)
            throws Exception {

        if (command.isCancelled()) {
            transportBuffer.remove(command);
            return;
        }

        queueCommand(command, promise);
        ctx.write(command, promise);
    }

But what if transportBuffer.remove(command) throws a ConcurrentModificationException? As the code above shows, the exception propagates and the Command is never added to the queue, so it stays only in the transportBuffer.

Thus the queue size check below cannot work properly. Under high load with mGet(1024 keys), the transportBuffer holds more and more Commands until OOM.

if (commandBuffer.size() + queue.size() >= clientOptions.getRequestQueueSize()) {
                throw new RedisException("Request queue size exceeded: " + clientOptions.getRequestQueueSize()
                        + ". Commands are not accepted until the queue size drops.");
            }
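
To make this concrete, here is a toy model of the scenario I am describing (not Lettuce's code; the names and numbers are only illustrative): the size check only counts queue and commandBuffer, so when the handoff out of transportBuffer fails, commands accumulate in a collection the limit never sees.

import java.util.ArrayDeque;
import java.util.Queue;

// Toy model of the suspected failure mode; not Lettuce code.
public class QueueLimitGapSketch {

    static final int REQUEST_QUEUE_SIZE = 4096;

    static Queue<String> transportBuffer = new ArrayDeque<>();
    static Queue<String> queue = new ArrayDeque<>();
    static Queue<String> commandBuffer = new ArrayDeque<>();

    static void writeCommand(String command, boolean handoffFails) {
        // the limit only looks at commandBuffer + queue
        if (commandBuffer.size() + queue.size() >= REQUEST_QUEUE_SIZE) {
            throw new IllegalStateException("Request queue size exceeded");
        }
        transportBuffer.add(command);
        if (handoffFails) {
            // simulates transportBuffer.remove(command) throwing a
            // ConcurrentModificationException: the command stays in transportBuffer
            return;
        }
        transportBuffer.remove(command);
        queue.add(command);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1_000_000; i++) {
            writeCommand("GET key" + i, true); // handoff always fails here
        }
        // The limit never triggers because transportBuffer is not counted.
        System.out.println("transportBuffer size: " + transportBuffer.size()); // 1000000
        System.out.println("queue size: " + queue.size());                     // 0
    }
}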

That is my theory for question 1:

  1. Why does transportBuffer hold 900,000+ Commands when we have already configured the max queue size as 4096? Because the transportBuffer holds so many Commands, our client keeps running full GC without reclaiming enough memory and eventually OOMs.

As for question 2:

  2. When we stop the traffic, our client still can't recover. It seems like the thread consuming the transportBuffer has stopped working.

I think this may be connected to #788:

Lettuce can't account for VM errors. If an Error is thrown, then the whole driver gets into an undefined state.

So the worker thread stops consuming the transportBuffer, and that's why we can't recover even after we stop the traffic.

Am I right?

Thank you for your help; I look forward to your feedback.
Have a nice day.

@mp911de
Collaborator

mp911de commented Jan 9, 2019

Lettuce 3.5 has been out of maintenance for quite a long time and is no longer supported. Given the number of bugs we fixed around queue limiting in Lettuce 4 and 5, it's likely that these bugs still exist in Lettuce 3.5.

The actual problem with Lettuce 3.5 command queueing is that operations happen on different threads and that ArrayDeque is not bounded. This allows for conditions in which commands can still be added although the queue should be bounded.
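
As a rough illustration of that point (a standalone sketch, not Lettuce code): hammering a plain java.util.ArrayDeque from two threads without synchronization can surface a ConcurrentModificationException like the one in the stack trace above, or silently corrupt the deque, and nothing limits its size.

import java.util.ArrayDeque;

// Sketch only: java.util.ArrayDeque is neither thread-safe nor bounded.
// The outcome is timing-dependent; a given run may or may not throw.
public class ArrayDequeRaceSketch {

    public static void main(String[] args) throws InterruptedException {
        ArrayDeque<Integer> deque = new ArrayDeque<>();

        Thread writer = new Thread(() -> {
            try {
                for (int i = 0; i < 1_000_000; i++) {
                    deque.add(i); // grows without any bound
                }
            } catch (RuntimeException e) {
                // the unsynchronized race can break the writer side too
                System.err.println("writer failed: " + e.getClass().getSimpleName());
            }
        });

        Thread remover = new Thread(() -> {
            for (int i = 0; i < 1_000_000; i++) {
                try {
                    deque.removeFirstOccurrence(i); // scans while the writer mutates
                } catch (RuntimeException e) {
                    // typically ConcurrentModificationException, but corrupted
                    // internal state can surface as other runtime exceptions too
                    System.err.println(e.getClass().getSimpleName() + " at i=" + i);
                }
            }
        });

        writer.start();
        remover.start();
        writer.join();
        remover.join();

        System.out.println("elements left in deque: " + deque.size());
    }
}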

I strongly advise upgrading to Lettuce 5.

@mp911de
Collaborator

mp911de commented Feb 7, 2019

Closing as there's nothing left to do here. If you would like us to look at this issue, please provide the requested information and we will re-open the issue.

@mp911de mp911de closed this as completed Feb 7, 2019