Add support for disconnect on timeout to recover early from no RST packet failures #2082
@wpstan You can't …
@wpstan keepalive is only valid for …
Thanks a lot for the detailed analysis and write-up. I wasn't aware of the retransmit vs. keepalive priority relationship. With a blocking I/O client like Jedis it makes a lot of sense to discard connections that have timed out, as the read is initiated by the method being called; otherwise, a late data reception (caused by a slow server response) leads to protocol desynchronization. Lettuce has a command timeout mechanism, too. However, its internal architecture allows us to continue operations even when a command times out (mostly to protect the caller), as command response parsing is event-driven. Disconnecting the connection upon command timeout is a pretty drastic approach that can work in certain scenarios where you cannot customize retransmit parameters. With the current configuration means (connection events, customization of the Netty bootstrap, command listeners) you have all the necessary bits to implement such a behavior within your application. If we learn that a larger part of our community is interested in such a feature out of the box, then we will consider such an enhancement.
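A minimal sketch of the application-level approach described in the comment above, in plain Java (no Lettuce types; `TimeoutGuard`, the threshold, and the reset action are illustrative names, not a Lettuce API): wrap each blocking call, and after a number of consecutive timeouts run a reset action, which in a real Lettuce application could close or reset the connection so the reconnect machinery builds a fresh one.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not a Lettuce API: wraps a blocking call and, after
// `threshold` consecutive timeouts, runs a caller-supplied reset action
// (in a Lettuce app this could close the connection so it gets rebuilt).
class TimeoutGuard {

    private final AtomicInteger consecutiveTimeouts = new AtomicInteger();
    private final int threshold;
    private final Runnable resetAction;

    TimeoutGuard(int threshold, Runnable resetAction) {
        this.threshold = threshold;
        this.resetAction = resetAction;
    }

    <T> T call(Callable<T> command) throws Exception {
        try {
            T result = command.call();
            consecutiveTimeouts.set(0); // any success clears the streak
            return result;
        } catch (TimeoutException e) {
            if (consecutiveTimeouts.incrementAndGet() >= threshold) {
                consecutiveTimeouts.set(0);
                resetAction.run(); // e.g. discard and re-establish the connection
            }
            throw e; // still surface the timeout to the caller
        }
    }
}
```

The success path clearing the counter is what avoids overreacting to a single slow command, which is exactly the concern raised later in the thread.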
Yes, adjusting parameters such as …
For many of our customers, because of this issue, we need documentation explaining why, and …
I personally also really like the … Can you guide me to enhance this if you don't have time to spare? Thanks.
Indeed, …
Agreed, but I think a better strategy is to reconnect after X (1 by default) …
I agree that it might not make sense to disconnect after the first timeout. The tricky part is to detect the right state. Let's assume a disconnect after … One might have a command sequence of … We have in other places a delay where we debounce events and delay activity, such as adaptive topology updates in Redis Cluster, where not every … In any case, such a feature requires a bit more thought.
I mean from X …
@mp911de What do you think of this strategy? If agreed, I will prepare a PR.
@yangbodong22011 How's the PR going :) We need this mechanism badly. |
Waiting for @mp911de to have time to process it, we don't have a firm strategy yet. |
Any update? |
PingConnectionHandler periodically sends PING commands to the Redis server and decides whether to reconnect based on whether they fail (the current strategy is to reconnect after three consecutive failures). This is essentially because KeepAlive that relies only on TCP is unreliable; for details, please refer to: redis#2082
@yangbodong22011 Hi, the PR #2253 is closed, so could you provide a custom Spring Boot starter if you have time? We really need this connection validation mechanism. Compared to switching the driver to Jedis, a Spring Boot starter is easier to use for Lettuce users.
Hi, it now seems that keep-alive is the better measure for Lettuce. See this comment: #2253 (comment)
But keep-alive does not solve the problem raised by this issue, which is why we have kept it open. If #2253 is not what we want, I think we need to continue to communicate and discuss to completely solve this problem. |
@mp911de I think you should consider a resolution seriously; many people are deeply troubled by this problem. I also found that we have had many discussions about this problem, but no further solution has been shipped by the Lettuce team. We really have no choice but to let our customers choose Jedis for better behavior under network failures. @yangbodong22011
Refer to #1428; we resolved this problem by adding keep-alive and TCP_USER_TIMEOUT options in our client configuration. Keep-alive alone wouldn't work in a situation where the client sends requests to the server continuously but the server never ACKs. You need to add netty-transport-native-epoll to your dependencies.

Gradle deps:

```gradle
api 'io.netty:netty-transport-native-epoll:4.1.65.Final:linux-x86_64'
api 'io.netty:netty-transport-native-epoll:4.1.65.Final:linux-aarch_64'
```

Java code example (`tcpUserTimeout` and `connectTimeout` are values you define yourself):

```java
// Customize Netty: set TCP_USER_TIMEOUT when the native epoll transport is available.
ClientResources clientResources = ClientResources.builder()
        .nettyCustomizer(new NettyCustomizer() {
            @Override
            public void afterBootstrapInitialized(Bootstrap bootstrap) {
                if (EpollProvider.isAvailable()) {
                    // TCP_USER_TIMEOUT >= TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT
                    // https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/
                    bootstrap.option(EpollChannelOption.TCP_USER_TIMEOUT, tcpUserTimeout);
                }
            }
        })
        .build();

// Create your socket options (keep-alive: probe after 5 s idle, every 1 s, give up after 3 probes).
SocketOptions socketOptions = SocketOptions.builder()
        .connectTimeout(connectTimeout)
        .keepAlive(SocketOptions.KeepAliveOptions.builder()
                .enable()
                .idle(Duration.ofSeconds(5))
                .interval(Duration.ofSeconds(1))
                .count(3)
                .build())
        .build();
```
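The comment in the snippet above encodes a rule from the linked Cloudflare post: TCP_USER_TIMEOUT should be at least TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT, so the user timeout cannot fire before keepalive has had a full chance to declare the peer dead. A small helper (illustrative, not part of Lettuce) to sanity-check the two settings together:

```java
// Illustrative helper: validate that TCP_USER_TIMEOUT (seconds) is consistent
// with the keepalive settings, i.e. userTimeout >= idle + interval * count.
class TcpTimeouts {

    // Minimum safe TCP_USER_TIMEOUT for the given keepalive parameters.
    static int minUserTimeoutSeconds(int keepIdle, int keepInterval, int keepCount) {
        return keepIdle + keepInterval * keepCount;
    }

    static boolean isConsistent(int userTimeout, int keepIdle, int keepInterval, int keepCount) {
        return userTimeout >= minUserTimeoutSeconds(keepIdle, keepInterval, keepCount);
    }
}
```

With the snippet's values (idle 5 s, interval 1 s, count 3), the minimum TCP_USER_TIMEOUT would be 8 s.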
@richieyan Thanks for your comments and code. I did some tests; here are the results. The tests have the following prerequisites: …
To reproduce this test, you need to pay attention to: …
Summary: 1. TCP_USER_TIMEOUT can indeed solve the problem of this issue on the Linux platform.
Hi @yangbodong22011 @richieyan, hope you are doing well! My test steps: …
After the primary restarts, Lettuce can reconnect to the replica. But I think the reconnect time should be controlled by TCP_USER_TIMEOUT; I don't know why the reconnection time is constantly 10 s.
@chuckz1321 Your reconnection is not due to a missing RST; evidently, your server sent a … If you want to simulate this problem, you need to use the iptables command mentioned in the top comment.
Thanks for your feedback. Once the connection is broken, it seems Netty becomes aware of it first, and the following Redis command will use another connection.

io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection timed out
2023-04-12 08:40:05.927 INFO 44928 --- [nio-9888-exec-9] com.azure.redis4jedis.TestController : start to query from redis1681288805927

Any insights?
@chuckz1321 This issue does not apply if the client machine is on … If the client machine is already on Linux, please confirm the value of the tcp_retries2 parameter.
@yangbodong22011 If I leave all the configuration at defaults, I can reproduce 15 retransmissions.
@huaxne Set TCP_USER_TIMEOUT; see #2082 (comment). @mp911de Would you consider adding a TCP_USER_TIMEOUT config to Lettuce to fix this? I can contribute a PR.
@yangbodong22011 Adding …
okay, I will prepare a PR. |
Snapshots are deployed at https://oss.sonatype.org/content/repositories/snapshots/io/lettuce/lettuce-core/6.3.0.BUILD-SNAPSHOT/lettuce-core-6.3.0.BUILD-20230901.094627-49.jar, also available for …
@mp911de I verified …
My pom.xml is (note: Linux needs the native-epoll transport) …
In debug mode, you will see this log: …
Usually, …
@mp911de Hello, when do you expect to release the official (non-SNAPSHOT) version? We need to notify users to upgrade to it.
I updated the release date to November 14, 2023. |
Hello, has the release time changed? We also need to notify users to upgrade, considering that we have a large number of users waiting for this version.
Yeah, Project Reactor released just last night, so our release has slipped to today.
@mp911de Will this updated Lettuce (6.3.0) be available in any of the upcoming Spring Data Redis 3.1.x releases?
No, because Spring upgrades only to bugfix releases in their bugfix releases. You can in any case upgrade the version yourself, as Lettuce 6.3 can be used as a drop-in replacement. Generally, upgrading to a newer Lettuce version works better than downgrading.
@mp911de @yangbodong22011 Maybe use the way Spring Boot provides:

```java
import org.springframework.beans.factory.InitializingBean;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.connection.RedisConnectionFactory;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;
import org.springframework.stereotype.Component;

import lombok.extern.slf4j.Slf4j;

@Slf4j
@Component
public class LettuceConfig implements InitializingBean {

    @Autowired
    private RedisConnectionFactory redisConnectionFactory;

    @Override
    public void afterPropertiesSet() {
        // Enable connection validation so a dead shared connection is replaced.
        if (redisConnectionFactory instanceof LettuceConnectionFactory c) {
            c.setValidateConnection(true);
        }
    }
}
```
@wayn111 |
@2luckydog Check your log: does it contain “Validation of shared connection failed; Creating a new connection.”?
@wayn111 This content is not included. |
@wayn111 |
The detailed solution is as follows:

```xml
<dependencies>
    <dependency>
        <groupId>io.lettuce</groupId>
        <artifactId>lettuce-core</artifactId>
        <version>6.3.0.RELEASE</version>
    </dependency>
    <dependency>
        <groupId>io.netty</groupId>
        <artifactId>netty-transport-native-epoll</artifactId>
        <version>4.1.100.Final</version>
        <classifier>linux-x86_64</classifier>
    </dependency>
</dependencies>
```

```java
import io.lettuce.core.ClientOptions;
import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.SocketOptions;
import io.lettuce.core.SocketOptions.KeepAliveOptions;
import io.lettuce.core.SocketOptions.TcpUserTimeoutOptions;
import io.lettuce.core.api.StatefulRedisConnection;
import io.lettuce.core.api.sync.RedisCommands;

import java.time.Duration;

public class LettuceExample {

    /**
     * Enable TCP keepalive and configure the following three parameters:
     * TCP_KEEPIDLE = 30
     * TCP_KEEPINTVL = 10
     * TCP_KEEPCNT = 3
     */
    private static final int TCP_KEEPALIVE_IDLE = 30;

    /**
     * The TCP_USER_TIMEOUT parameter avoids situations where Lettuce remains stuck
     * in a continuous timeout loop during a failure or crash event.
     * Refer to: https://github.com/lettuce-io/lettuce-core/issues/2082
     */
    private static final int TCP_USER_TIMEOUT = 30;

    private static RedisClient client = null;
    private static StatefulRedisConnection<String, String> connection = null;

    public static void main(String[] args) {
        // Replace the values of host, user, password, and port with the actual instance information.
        String host = "r-bp1s1bt2tlq3p1****.redis.rds.aliyuncs.com";
        String user = "r-bp1s1bt2tlq3p1****";
        String password = "Da****3";
        int port = 6379;

        // Configure the RedisURI.
        RedisURI uri = RedisURI.Builder
                .redis(host, port)
                .withAuthentication(user, password)
                .build();

        // Configure TCP keepalive and TCP_USER_TIMEOUT.
        SocketOptions socketOptions = SocketOptions.builder()
                .keepAlive(KeepAliveOptions.builder()
                        .enable()
                        .idle(Duration.ofSeconds(TCP_KEEPALIVE_IDLE))
                        .interval(Duration.ofSeconds(TCP_KEEPALIVE_IDLE / 3))
                        .count(3)
                        .build())
                .tcpUserTimeout(TcpUserTimeoutOptions.builder()
                        .enable()
                        .tcpUserTimeout(Duration.ofSeconds(TCP_USER_TIMEOUT))
                        .build())
                .build();

        client = RedisClient.create(uri);
        client.setOptions(ClientOptions.builder()
                .socketOptions(socketOptions)
                .build());

        connection = client.connect();
        RedisCommands<String, String> commands = connection.sync();
        System.out.println(commands.set("foo", "bar"));
        System.out.println(commands.get("foo"));

        // If your application exits and you want to destroy the resources, call these
        // methods. The connection is then closed and the resources are released.
        connection.close();
        client.shutdown();
    }
}
```
Bug Report
I'm one of the Jedis Reviewers and our customers are experiencing unrecoverable issues with Lettuce in production.
Lettuce connects to a Redis host and reads and writes normally. However, if the host fails (a hardware problem directly causes a shutdown, and no RST is sent to the client), the client will continue to time out until TCP retransmission gives up, and only then can it recover. On Linux this takes about 925.6 s (refer to tcp_retries2).
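The ~925 s figure can be sanity-checked with a back-of-the-envelope model of Linux's retransmission backoff (an assumption for illustration: the first RTO is taken as the 200 ms minimum, each retransmission doubles it, capped at the 120 s TCP_RTO_MAX; in reality the first RTO derives from the measured RTT):

```java
// Rough model of the Linux retransmission timer: with tcp_retries2 = 15,
// the RTO sequence 0.2, 0.4, 0.8, ... doubles until capped at 120 s; summing
// the initial send plus 15 retries gives roughly 924.6 s, matching the
// ~925 s stall described above.
class RetransClock {

    static double totalSeconds(int tcpRetries2, double initialRto, double maxRto) {
        double total = 0;
        double rto = initialRto;
        for (int i = 0; i <= tcpRetries2; i++) { // initial send + retries
            total += rto;
            rto = Math.min(rto * 2, maxRto);
        }
        return total;
    }
}
```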
Why KeepAlive doesn't fix this
#1437 (Lettuce supports the option to set KEEPALIVE since version 6.1.0 )
Because retransmitted packets take priority over keepalive probes, the connection keeps retransmitting and never reaches the keepalive stage; it reconnects only after retransmission finally gives up.
In what scenarios does this issue occur?
How to reproduce this issue
Observe that the client starts timing out and cannot recover until about 925.6 s have passed (related to tcp_retries2)
After the test, clear the iptables rules
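The reproduction relies on an iptables rule that silently drops the server's packets, so the client sees pure packet loss with no RST. A sketch, assuming the Redis server listens on port 6379 and the rule is added on the client machine (requires root; adjust the port to your setup):

```shell
# Drop everything arriving from the Redis port: the client's segments go
# unanswered, so TCP starts its retransmission backoff with no RST to stop it.
iptables -A INPUT -p tcp --sport 6379 -j DROP

# ... run the client workload and observe the ~925 s stall ...

# After the test, delete the rule again (as the step above says).
iptables -D INPUT -p tcp --sport 6379 -j DROP
```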
How to fix this
We should provide an application-layer keepalive mechanism: on the underlying Netty channel, periodically send a heartbeat packet; if the heartbeat times out, the client initiates a reconnect to recover quickly.
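This proposal can be sketched in plain Java (names are illustrative; a real implementation would live in the Netty pipeline and the probe would be a Redis PING): a scheduled heartbeat that triggers a reconnect action after a number of consecutive failures, much like the Redisson PingConnectionHandler mentioned earlier in this thread.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Illustrative sketch, not Lettuce code: periodically run a heartbeat probe
// and invoke a reconnect action after `maxFailures` consecutive failures.
class PingWatchdog {

    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();
    private int failures = 0; // only touched from the timer thread

    void start(BooleanSupplier ping, Runnable reconnect,
               int maxFailures, long intervalMillis) {
        timer.scheduleAtFixedRate(() -> {
            if (ping.getAsBoolean()) {
                failures = 0; // healthy, clear the failure streak
            } else if (++failures >= maxFailures) {
                failures = 0;
                reconnect.run(); // e.g. close the channel and re-establish it
            }
        }, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
    }

    void stop() {
        timer.shutdownNow();
    }
}
```

Because the heartbeat bypasses the in-flight retransmission queue at the application layer, the reconnect fires after roughly maxFailures * intervalMillis instead of the ~925 s TCP retransmission limit.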
How Jedis avoids this problem
Jedis uses a connection pool. When an API call times out, Jedis destroys the connection and obtains a new one from the pool, which avoids the problem above.
Environment