-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cluster - Moved/Ask Redirection Not Working Properly. #1693
Comments
Thanks for the report. I do have logic to invalidate the cache in the event we receive any Hopefully I can replicate the problem. |
Thank you Michael. I tried to replicate locally and was unsuccessful. The issue became apparent when I performed a small test migration on production. Maybe it has to do with the number of connected clients? It was difficult to emulate locally having hundreds of simultaneous clients. |
Turns out the boxes with the issue had 5.0.2 not 5.1.1. Could this be part of the problem? I didn't see anything in the change log that seemed pertinent. Thanks again, |
There were a couple of changes to the caching code between 5.0.2 and 5.1.1 although I don't recall if they would've been likely to prevent cache invalidation. This is good to know though, as I'll make sure to check if invalidation is broken in 5.0.2 and not in 5.1.1. Might get lucky and make investigation simpler! 😄 |
Thanks again. After another try with migrating a few slots, it appeared as if the issue arose on only some of the httpd processes and not all. Does that make any sense? i.e. Most requests had no issues finding the right slots, however a few would have problems contacting the cluster. |
Sure that makes sense. I assume it's a cache invalidation problem that only happens sometimes, which honestly means it might be trickier to track down. It's been a while since I've used apache. Are you using prefork or workers (in case I need to use a VM to replicate the problem). Also a |
It is on an Amazon Linux distribution, using prefork. Is the cluster cache shared among children? When the call fails, it is returning a (boolean) false. Thanks again for your help. |
No I think each fork will have its own Can you provide the output from |
Which sections do you need?
|
That should work. Last question, are you using persistent connections with // e.g. passing `true` to the persistent argument
$obj_cluster = new RedisCluster(NULL, ["host:7000", "host:7001"], 1.5, 1.5, true); |
Yes, persistent connections are being used. |
So, I did a bit of testing and was able to produce a similar result locally. I did it using jMeter and a local VM. It was tried with both 5.0.2 and 5.1.1 and each yielded similar results, with and without slot caching enabled. Most of my trials involved 50 -100 http threads and most often 100% of the threads would return a (boolean)false during the slot migration of a particular key. What I found is in almost every case, that the time during which a slot is being migrated, I am receiving a (boolean)false, when trying to retrieve a key from said slot. Below is a sample. It shows the Redis monitor output from the original node, then the output from my clients during the migration and then the Redis monitor output from the receiving node, once it has migrated completely. Noticed the microtime of the failures seem to align with the timestamps in the Redis monitor output, during migration. All of the failures seem to occur after the receiving node gets the RESTORE-ASKING command. After reviewing the code, it appears there are facilities to deal with ASK redirection, however I am not seeing it work in practice. Hopefully this information is of some use.
|
This is really useful, thanks! So maybe the problem is actually just that the I'm going to set up a test environment to replicate this issue this afternoon after work. Thanks again! |
Excellent... Thanks again, |
I have been doing some more digging. Turns out I am actually receiving (boolean)false back when I receive the MOVED error. If MOVED is received, should it attempt to get the key from the new node before returning false? Thanks, |
More notes: |
I was able to investigate the issue this evening and can replicate the problem. It does appear there is an issue with The good news is I can trigger the error with a simple cli script, which with any luck means the problem will be straightforward to solve. I'll try to get a fix up tomorrow or at the very least provide more information about what's actually happening. |
That sounds fantastic. Thanks again. |
Quick update. I've tracked down the issue and have a preliminary fix working. That said, I want to spend some more time with it to make sure the fix doesn't have breaking side effects. Again, mostly untested but here is the change. diff --git cluster_library.c cluster_library.c
index bce78bc..01e49bd 100644
--- cluster_library.c
+++ cluster_library.c
@@ -1206,8 +1206,7 @@ static int cluster_set_redirection(redisCluster* c, char *msg, int moved)
*
* This function will return -1 on a critical error (e.g. parse/communication
* error, 0 if no redirection was encountered, and 1 if the data was moved. */
-static int cluster_check_response(redisCluster *c, REDIS_REPLY_TYPE *reply_type
- )
+static int cluster_check_response(redisCluster *c, REDIS_REPLY_TYPE *reply_type)
{
size_t sz;
@@ -1232,7 +1231,7 @@ static int cluster_check_response(redisCluster *c, REDIS_REPLY_TYPE *reply_type
}
// Check for MOVED or ASK redirection
- if ((moved = IS_MOVED(inbuf)) || IS_ASK(inbuf)) { // Set our redirection information
+ if ((moved = IS_MOVED(inbuf)) || IS_ASK(inbuf)) {
/* We'll want to invalidate slot cache if we're using one */
c->redirections++;
@@ -1380,8 +1379,7 @@ static int cluster_sock_write(redisCluster *c, const char *cmd, size_t sz,
/* If in ASK redirection, get/create the node for that host:port, otherwise
* just use the command socket. */
if (c->redir_type == REDIR_ASK) {
- redis_sock = cluster_get_asking_sock(c);
- if (cluster_send_asking(redis_sock) < 0) {
+ if (cluster_send_asking(c->cmd_sock) < 0) {
return -1;
}
}
@@ -1627,10 +1625,13 @@ PHP_REDIS_API short cluster_send_command(redisCluster *c, short slot, const char
return -1;
}
- /* Update mapping if the data has MOVED */
if (c->redir_type == REDIR_MOVED) {
+ /* For MOVED redirection we want to update our cached mapping */
cluster_update_slot(c);
c->cmd_sock = SLOT_SOCK(c, slot);
+ } else if (c->redir_type == REDIR_ASK) {
+ /* For ASK redirection we want to redirect but not update slot mapping */
+ c->cmd_sock = cluster_get_asking_sock(c);
}
}
diff --git cluster_library.h cluster_library.h
index af69ad4..784087e 100644
--- cluster_library.h
+++ cluster_library.h
@@ -26,7 +26,7 @@
/* MOVED/ASK comparison macros */
#define IS_MOVED(p) (p[0]=='M' && p[1]=='O' && p[2]=='V' && p[3]=='E' && \
p[4]=='D' && p[5]==' ')
-#define IS_ASK(p) (p[0]=='A' && p[1]=='S' && p[3]=='K' && p[4]==' ')
+#define IS_ASK(p) (p[0]=='A' && p[1]=='S' && p[2]=='K' && p[3]==' ')
/* MOVED/ASK lengths */
#define MOVED_LEN (sizeof("MOVED ")-1) I'll do proper testing tomorrow and get a formal branch up. |
Seems to be working properly and not returning FALSE on MOVED and ASK... Thank you very much, |
Thanks for reporting it! I'm going to mock up a fake redis cluster script so I generate |
Awesome. Not to be a pain, but do you have an ETA? :>) -Ted |
I'm not sure about a specific ETA but it'll certainly be in the next official release. |
OK. As it dealt with a bug in a core feature of the cluster, I wasn't sure if it would warrant a more urgent bug release. I was under the impression its release was imminent. I was hopeful to get a 'blessed' version before performing a needed migration. Has the patch you pasted been tested formally, per your testing protocols? Trying to decide how to proceed. Thanks again, |
I think it warrants moving to create a new release with more urgency than before the bug was identified but I tend to be cautious about releasing a core modification without a decent amount of testing. I'll get the change into the |
Understood. Thank you for the update. Let me know if I can be of any assistance. |
Any updates on a bugfix release? Thanks again, |
Yes, sorry about that. This will be merged into develop today then we can cut a release candidate. |
Awesome... Thank you. |
Sorry about the delay. I've opened a PR with the fix here I also put together a mock redis cluster mechanism that I can instruct to randomly issue Also I think it's probably a decent idea to add perform some tests while actively resharding a cluster. |
Checking in to see if there are any updates on the 5.1.2 Milestone. Thanks, |
Yes, release candidate coming tomorrow. I'm just putting together the changelog. |
5.2.0RC1 is up (both on GitHub and PECL) so I'm going to close this issue. As always, if it's actually not fixed feel free to reopen. |
Expected behaviour
Running a cluster and using the slot cache. When I migrate keys I would expect the slot cache to update if it receives an ASK or MOVE.
Actual behaviour
Upon requesting the data from a key on a migrated slot, the cache does not seem to update and continually requests keys from the wrong node. If I restart apache the slot cache is refreshed and works fine.
Update: It appears ASK and MOVE redirects are not being handled properly.
I'm seeing this behaviour on
Steps to reproduce, backtrace or example script
I've checked
develop
branchThe text was updated successfully, but these errors were encountered: