
Abnormal Mget latency increase issue #3031

Open
ackerL opened this issue Oct 29, 2024 · 2 comments
Labels: for: team-attention, status: waiting-for-triage

Comments


ackerL commented Oct 29, 2024

Bug Report

We own a service A fleet with 500+ hosts; these hosts use the Lettuce client to access a Redis cluster (around 20 shards, roughly 100 hosts in total). Recently we have observed anomalies caused by service fleet deployments (gradual deployment, ~20 hosts per round, each round taking ~10 minutes). During a deployment, we find that the mget latency (as seen from service A) increases significantly, from 15 ms to 20+ ms.
Figure 1: Service A uses Lettuce to access the Redis cluster; mget latency increases during the fleet deployment

Figure 2: mget latency increases from 15 ms to 20 ms during the fleet deployment

After checking the service A logs, and in particular the Lettuce logs, we do not observe any anomalies. Currently we cannot explain why the service A fleet deployment triggers the mget latency increase; the only variable is the fleet deployment.

Are there any clues that could help with the next troubleshooting steps for this abnormal latency increase? Thanks.

Current Behavior

Stack trace
// your stack trace here;

Input Code

// your code here;

Expected behavior/code

Environment

  • Lettuce version(s): 5.3.X.
  • Redis version: 5.X

Possible Solution

Additional context

@tishun added the for: team-attention and status: waiting-for-triage labels on Oct 30, 2024

tishun (Collaborator) commented Oct 30, 2024

The team will attempt to dig some more into this issue, but from the quick read that I did it would be extremely hard, close to impossible, to answer the question without a lot more information being provided.

A latency spike of 5 ms is an extremely low threshold and could be caused by virtually any of the actors in the chain.

Unless you detect some difference in the way the driver behaves (by profiling it while this issue occurs and monitoring the traffic) we could only play a guessing game, which is not helpful for anyone.
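
For example (a rough sketch, not a complete recipe), Lettuce 5.x already records command latencies by default and periodically publishes them as CommandLatencyEvent on the ClientResources event bus; subscribing to that bus around a deployment window would at least show whether the extra time is spent inside the driver:

import io.lettuce.core.event.metrics.CommandLatencyEvent;
import io.lettuce.core.resource.ClientResources;
import io.lettuce.core.resource.DefaultClientResources;

// Sketch: print the driver's own command latency metrics (first-response and
// completion latencies per command type and node) whenever they are published.
ClientResources resources = DefaultClientResources.create();

resources.eventBus().get()
        .filter(event -> event instanceof CommandLatencyEvent)
        .cast(CommandLatencyEvent.class)
        .subscribe(event -> System.out.println(event.getLatencies()));

// Reuse the same ClientResources when creating the RedisClusterClient so that
// its metrics are routed to this event bus.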

@tishun added the status: waiting-for-feedback label and removed the for: team-attention and status: waiting-for-triage labels on Oct 30, 2024

ackerL (Author) commented Oct 31, 2024

Hi @tishun, thanks for your attention to this issue. Let me try to add more details about the issue and our suspicions.

Our suspicion is that the abnormal mget latency increase may be related to connection problems. In our use case, we use Lettuce as the Redis client and initialize both a read connection and a write connection; see the code snippet below.

When service A starts, it creates a read connection and a write connection. The redisURIs array contains the URIs of all the Redis nodes in the cluster. So in the ideal case, one instance of service A creates two connections to the Redis cluster.

// Step 1. Create the read connection.
this.readClusterClient = RedisClusterClient.create(CLIENT_RESOURCES, redisURIs);
readClusterClient.setOptions(createClusterClientOptions());
this.readClusterConnection = readClusterClient.connect(new ByteArrayCodec());
readClusterConnection.setTimeout(readTimeoutMs, TimeUnit.MILLISECONDS);
// Reads on this connection may be served by the nearest (lowest-latency) node.
readClusterConnection.setReadFrom(ReadFrom.NEAREST);


// Step 2. Create the write connection.
this.writeClusterClient = RedisClusterClient.create(CLIENT_RESOURCES, redisURIs);
writeClusterClient.setOptions(createClusterClientOptions());
this.writeClusterConnection = writeClusterClient.connect(new ByteArrayCodec());
writeClusterConnection.setTimeout(writeTimeoutMs, TimeUnit.MILLISECONDS);
// Reads on this connection always go to the master node.
writeClusterConnection.setReadFrom(ReadFrom.MASTER);
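
For completeness, createClusterClientOptions() is not shown above. A hypothetical version is sketched below (the topology refresh settings are illustrative values, not our exact production configuration); I include it because topology refresh is one of the mechanisms I understand can open additional connections:

import java.time.Duration;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;

// Hypothetical sketch of createClusterClientOptions(); values are illustrative.
private ClusterClientOptions createClusterClientOptions() {
    ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
            // Re-read the cluster topology periodically.
            .enablePeriodicRefresh(Duration.ofMinutes(10))
            // Also refresh when MOVED/ASK redirects or reconnects are observed.
            .enableAllAdaptiveRefreshTriggers()
            .build();

    return ClusterClientOptions.builder()
            .topologyRefreshOptions(refreshOptions)
            .autoReconnect(true)
            .build();
}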

During the deployment of service A, we observe that the connected clients metric on the Redis servers is not stable, and that it follows the same pattern as the mget latency.

T1: the start time of the deployment
T2: the end time of the deployment
During the deployment, when the number of connected clients dropped, the mget latency started to decrease; as the deployment progressed and the number of connected clients rose again, the mget latency increased correspondingly.

There are two things I want to elaborate on.

  1. Why do the connected clients drop? This should be because the deployment restarts part of the service A fleet, around 60 hosts at a time, so when those instances shut down, the number of connections to the Redis servers decreases.
  2. Why do the connected clients increase? This should happen at service startup, when the restarted instances of service A set up their connections again. We monitored one instance of service A (other use cases could differ): its number of connected clients on one Redis node changed from 4 (before deployment) -> 0 (service shutdown) -> 1 (service startup) -> 2 (service running). Currently we are not sure why the count grows from 1 to 2 instead of 2 connections being created directly at startup (see the sketch after this list).
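
A possible explanation for the 1 -> 2 growth (this is my reading of default Lettuce cluster behavior, please correct me if it is wrong) is that a StatefulRedisClusterConnection opens connections to individual nodes lazily, only when a command is first routed to that node, roughly as sketched below:

import java.nio.charset.StandardCharsets;
import io.lettuce.core.api.sync.RedisCommands;
import io.lettuce.core.cluster.api.StatefulRedisClusterConnection;
import io.lettuce.core.codec.ByteArrayCodec;

// Assuming default Lettuce cluster behavior: per-node connections are created
// lazily by the connection provider, not eagerly at connect() time.
StatefulRedisClusterConnection<byte[], byte[]> connection =
        readClusterClient.connect(new ByteArrayCodec());

// Right after connect() the client typically holds only the initial connection.
// The first command whose slot maps to node X opens a connection to node X:
connection.sync().get("some-key".getBytes(StandardCharsets.UTF_8));

// A node-scoped connection can also be requested explicitly (host/port below
// are placeholders), which likewise creates it on first use:
RedisCommands<byte[], byte[]> nodeCommands =
        connection.getConnection("10.0.0.1", 6379).sync();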

I believe something is wrong with the connections in step 1 and step 2 above. An interesting observation is that some clients establish an excessive number of connections to a single Redis server. For instance, the client with IP address 10.117.154.244 has five active connections to one Redis server.
We executed ./redis-cli CLIENT LIST | grep 10.117.154.244 on the Redis node:

id=118437258 addr=10.117.154.244:36328 fd=2093 name= age=166473 idle=453 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=mget
id=124013924 addr=10.117.154.244:38930 fd=524 name= age=111 idle=111 flags=r db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=readonly
id=123908775 addr=10.117.154.244:33272 fd=925 name= age=3389 idle=3389 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=NULL
id=123912325 addr=10.117.154.244:50334 fd=44 name= age=3284 idle=3284 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=NULL
id=123991268 addr=10.117.154.244:38712 fd=1365 name= age=825 idle=825 flags=r db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=readonly

The output shows multiple connections from this client, which is concerning, as we expect a single client machine to have no more than two connections to a Redis server. The high number of connections from certain clients is likely degrading the performance of the Redis server, which in turn is increasing the mget latency for service A.
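
One thing that makes this output hard to interpret is that the name= field is empty, so I cannot tell which logical connection (read, write, or something internal to the driver) each entry belongs to. A small change we are considering (sketched below, not yet deployed) is to set a client name on each RedisURI so that CLIENT LIST becomes attributable:

import io.lettuce.core.RedisURI;

// Sketch: tag connections with a client name. When a client name is configured,
// Lettuce issues CLIENT SETNAME on connect, so the name= field in CLIENT LIST
// identifies the connection's owner. Host and port below are placeholders.
RedisURI readUri = RedisURI.builder()
        .withHost("redis-node.example.internal")
        .withPort(6379)
        .withClientName("serviceA-read")
        .build();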

We would like to get some insight on the following:

  1. In which scenarios might Lettuce create an excessive number of connections to a single Redis server? (See the diagnostic sketch after this list.)
  2. What impact does having too many connections from one client have on a Redis node? I believe more connections reduce the efficiency of command execution.
  3. How can we prevent too many connected clients, i.e. limit each service A instance to 2 connections per Redis node? We are currently using Lettuce 5.3.X, and I am not sure whether there is a hidden bug related to connection handling in this legacy version.
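
To gather evidence for question 1 on our side, the sketch below (using Lettuce's public event bus; the logging is only illustrative) subscribes to connection lifecycle events so we can see, during a deployment window, exactly when connections to which nodes are opened and closed:

import java.time.Instant;
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.event.connection.ConnectionActivatedEvent;
import io.lettuce.core.event.connection.ConnectionDeactivatedEvent;
import io.lettuce.core.resource.ClientResources;
import io.lettuce.core.resource.DefaultClientResources;

// Sketch: log when Lettuce activates or deactivates connections so the events
// can be correlated with the deployment timeline and with CLIENT LIST output.
ClientResources resources = DefaultClientResources.create();

resources.eventBus().get()
        .filter(event -> event instanceof ConnectionActivatedEvent
                || event instanceof ConnectionDeactivatedEvent)
        .subscribe(event -> System.out.println(Instant.now() + " " + event));

// Pass the same ClientResources to the cluster client so its connection events
// are published to the event bus above (redisURIs as in the earlier snippet).
RedisClusterClient client = RedisClusterClient.create(resources, redisURIs);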

Additionally, we have verified key metrics on the Redis server and found no anomalies:

  • CPU usage remains low (<5%).
  • Redis memory utilization is low (no replication occurred, used memory/total allocated memory < 0.3).
  • The traffic pattern for mget remains unchanged.
  • We do not observe any network anomalies (<0.035MB per minute).

@tishun added the for: team-attention and status: waiting-for-triage labels and removed the status: waiting-for-feedback label on Oct 31, 2024