[NEW] MULTI Idempotency Keys for safely retrying non-idempotent commands #1087
Comments
Thanks @byroot! This is definitely a pain point, and I think your proposal is a good solution: assigning a unique ID to each request (in the form of a MULTI transaction) and caching the result for a short period. The alternative of modifying the RESP protocol that you mentioned would indeed introduce implementation-level complexity. I agree that it's not ideal for two reasons: first, it brings compatibility risks, and second, I would guess that a large portion of the workload, especially commands like SET/GET, is already idempotent. Therefore, making this mitigation opt-in via this MULTI command extension is a smart approach to avoid unnecessary overhead. +@valkey-io/core-team to chime in. |
Cool idea. You would need to wrap all non-idempotent commands in MULTI then? It would be great to solve this problem. I think we should try to come up with more ways to solve it. If a protocol level feature could solve this for all commands, then I would even consider it as RESP4 (client opt-in via HELLO). As a start, we could mark all non-idempotent write commands as such in the COMMAND response. |
It would be purely optional, but if you wanted to be able to retry in all circumstances yes. But it would be easy, even for a "dumb client" like
Of course. I wanted to come up with a concrete proposal, so that it sparks a constructive discussion, but I'm not married to a specific solution. The reason I suggested this one in particular is that it's backward compatible.
Right, baking this in the protocol would indeed be way more elegant. The reason I'm not very enthusiastic about it is that my client has been among the first ones to require
So why not a |
OK, good point with the proxies. I wasn't aware that proxies need to understand the RESP protocol. Are those primarily for clusters? I guess a proxy still can't make a cluster appear like a standalone node, because cross-node commands can't be supported in any sane way. So, maybe you're right that extending MULTI with this new syntax would be the simplest way. A problem I can think of, regardless of how it's done, is that the server can end up using a lot of memory for these cached replies in the worst case. Clients can be evicted, but these replies are not tied to a client ID, because when a client reconnects with a new ID, it may want to retry the command and get the cached reply. I can't see that we can evict them either. If it turns out to be a problem, then it would be good if a client can ACK the reply in some way so the server can delete the reply ASAP. A concrete idea for how to encode this in RESP3+ would be to allow the client to set RESP3 attributes on a command. A command is an array of strings, AKA multibulk, but the client could add an attribute to it if the server supports it. Example:
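A minimal sketch of what such an encoding could look like on the wire, assuming the server accepted a RESP3 attribute frame (`|`) in front of a command array. No server is known to accept this today, and the attribute name `idempotency-key` is purely hypothetical; this is an illustration of the idea, not the commenter's original example.

```python
def bulk(value: str) -> bytes:
    """Encode a string as a RESP bulk string: $<len>\r\n<data>\r\n."""
    data = value.encode()
    return b"$%d\r\n%s\r\n" % (len(data), data)


def encode_command(args, attributes=None) -> bytes:
    """Encode a command as a RESP array, optionally preceded by a RESP3 attribute frame.

    The attribute frame ("|<pairs>\r\n" followed by key/value pairs) is part of
    RESP3 for replies; sending it from client to server is the hypothetical part.
    """
    out = b""
    if attributes:
        out += b"|%d\r\n" % len(attributes)
        for key, value in attributes.items():
            out += bulk(key) + bulk(value)
    out += b"*%d\r\n" % len(args)
    for arg in args:
        out += bulk(arg)
    return out


# Hypothetical attribute name; "abc" stands in for a client-generated random ID.
frame = encode_command(["LPOP", "q"], {"idempotency-key": "abc"})
plain = encode_command(["LPOP", "q"])
```

Without attributes the output is an ordinary multibulk command, so a client could fall back to the plain form when the server does not advertise support.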
Another feature for a future RESP version could be multiplexing, i.e. the possibility to use multiple command-reply streams, possibly with separate client IDs, over the same connection. It was mentioned in #54 and one of the linked redis issues. |
Not as far as I know, they are used for a whole lot of things.
Absolutely, hence the TTL, and also why I think the semantics could be that at most one such reply is saved. I also think it would be fine to clear the saved response on any subsequent command from the same client (whether it's a In addition, we could make saving the response purely opt-in, e.g. Recording the response is useful when otherwise information would be lost (e.g.
That too I'm afraid would likely break some proxies. I think you could achieve the same thing by introducing a command to hold this info:
It is maybe a little bit unusual because it's a stateful command. But it's essentially the same as the |
2 questions for your idea:
|
It's the responsibility of the client to use a sufficiently random source. Typically a UUID would be suggested (but I don't think it needs to be enforced).
That is indeed something that could cause problems if the feature is abused. My reasoning is that in the large majority of cases, non-idempotent commands don't generate large output, or if they do, it's output that has been removed from an existing data structure, so the memory usage should be mostly neutral (e.g. But I can't really think of a way that would guarantee this, especially given extension modules can do whatever they want, and that in the case of a transaction with multiple commands, it may also include read commands that do return a lot of data. If saving the response is really deemed too problematic, as mentioned before it's also possible to just return a specific error and let the client know it has to reconcile. It's more work for the client and doesn't cover some use cases, but would already be a huge improvement. |
Regarding LPOP specifically, it appears that LMOVE was designed to solve exactly this problem of unreliable connections. It puts the responsibility on the user to use this command though, rather than the client lib. From the man page:
...
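To illustrate why LMOVE sidesteps the lost-reply problem for queue consumers, here is a small in-memory sketch of the pattern: the item is moved to a processing list rather than removed, so it can still be found if the reply never arrives. `FakeLists` and the key names are made up for the illustration; it only mimics `LMOVE source destination LEFT RIGHT` and `LREM key 1 value` and does not talk to a real server.

```python
from collections import deque


class FakeLists:
    """In-memory stand-in for server-side lists (hypothetical, for illustration)."""

    def __init__(self):
        self.lists = {}

    def rpush(self, key, *values):
        self.lists.setdefault(key, deque()).extend(values)

    def lmove(self, source, destination):
        """Mimic LMOVE source destination LEFT RIGHT: pop the head, append to the tail."""
        src = self.lists.get(source)
        if not src:
            return None
        item = src.popleft()
        self.lists.setdefault(destination, deque()).append(item)
        return item

    def lrem(self, key, value):
        """Mimic LREM key 1 value: remove the first occurrence, return how many were removed."""
        items = self.lists.get(key, deque())
        if value in items:
            items.remove(value)
            return 1
        return 0


store = FakeLists()
store.rpush("jobs", "job-1", "job-2")

# The consumer claims a job. Even if this reply were lost on the network,
# the job would still be visible in the processing list.
item = store.lmove("jobs", "jobs:processing")
recoverable = list(store.lists["jobs:processing"])

# Once the job is done, the consumer acknowledges it by removing it.
store.lrem("jobs:processing", item)
```

The trade-off noted in the comment still applies: the application has to adopt this pattern explicitly, and the client library cannot do it on the user's behalf.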
|
Yes absolutely. I'm using If you want a real world example that happened to us not long ago, and prompted me to think about this feature again, it was with LPUSH: Shopify/ci-queue#287 |
Yeah, I know the problem is real. I'm just trying to think of related topics to get the bigger picture. It's interesting that this problem was mentioned in LMOVE's docs. It means this is not the first time this problem is discussed. |
Idempotency is a long-standing issue with Redis and Valkey, and is a hard problem generally within computer science. It's why streams have consumer groups, so that you can do acknowledged reads/writes. I think it's definitely worth trying to make it better, but some of my initial thoughts:
@barshaul @avifenesh Any thoughts about this from the client perspective? |
To be clear, I totally see how saving the response can be challenging, so if it's deemed too complicated / risky I'll understand. Just having a specific error that lets the client know the command was already performed would be a huge leap forward. The big benefit of recording the response would be to allow "dumb clients" like mine to retry transparently. |
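The error-only variant could be sketched as follows, assuming the server merely remembers which idempotency keys it has seen within the TTL. `OnceOnlyTransactions` and `AlreadyExecuted` are hypothetical names for an in-memory simulation, not a proposed API.

```python
import time


class AlreadyExecuted(Exception):
    """Hypothetical error telling the client the transaction was already applied."""


class OnceOnlyTransactions:
    """Simulation: remembers idempotency keys only, never the replies."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._seen = {}  # idempotency key -> expiry timestamp

    def execute(self, idempotency_key, ttl, commands):
        now = self._clock()
        # Forget keys whose TTL has elapsed.
        self._seen = {k: exp for k, exp in self._seen.items() if exp > now}
        if idempotency_key in self._seen:
            raise AlreadyExecuted(idempotency_key)
        results = [command() for command in commands]
        # Only record the key once every command has run without raising.
        self._seen[idempotency_key] = now + ttl
        return results


counter = {"value": 0}


def incr():
    counter["value"] += 1
    return counter["value"]


server = OnceOnlyTransactions()
server.execute("random-id", 10, [incr])

# A retry after a lost reply: the write is not applied twice, but the client
# only learns that it happened, not what the original reply was.
try:
    server.execute("random-id", 10, [incr])
    retried = "executed again"
except AlreadyExecuted:
    retried = "already executed"
```

This keeps server memory bounded by the number of live keys, at the cost of the client having to reconcile whenever the reply itself mattered.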
This was probably a key to success. To keep it simple, we could continue this tradition. We could recommend folks to work around it:
|
I've done this a lot. I did a talk at KubeCon where we applied changes from a change stream and used a Lua script + idempotency key to check if something was applied. We were using the non-idempotent commands like |
Speaking from the client side (Glide), and the client side only: if it ends up implemented, I would actually hope that it will be a kind of transparent API, where you call it |
While ignoring correctness may have been acceptable in the past, I don't think we can continue operating that way if we want Valkey to be the default choice for application developers. Idempotency handling is a difficult problem, and it has come up in my experience multiple times too. If we find ourselves discussing workarounds at conferences for handling it client-side, I think it’s time to seriously consider a server-side solution. Offloading idempotency handling to clients shifts complexity from the server to the clients, and there are far more clients out there. That said, while this is a real problem worth addressing, I don't believe it’s an "80% problem" either. Therefore, making significant changes to the RESP protocol feels excessive. Extending MULTI would be a much more natural and backward-compatible solution.
That’s a great point. I think we’d need to track the new states at a per-slot level. It might make sense to sequence this after the RDB-based Atomic Slot Migration work so that we can easily pass along additional metadata. As for the other comments on the issue, I believe they’re mostly tactical, concerns about memory usage efficiency, caching error responses, or the exact command structure. These are important, but I don't think they would invalidate the overall problem statement.
@avifenesh, I agree that this would increase client-side complexity, but do you think it’s still preferable to having application developers implement their own idempotency error handling? There are orders of magnitude more applications than clients, and far more clients than the server, which is just one. |
@PingXie You're right, I opened by saying that it's a good idea. My point is that it is not complete. As a client dev I'd like to have fully baked and ready-to-eat features, which do not require me to create workarounds and special configurations. If it was a fix to some bug, it is what it is, you deliver fast. |
I think 5 is the most important topic to discuss, which is whether we are going to cache the response or just return an error. The remaining points are mostly derivative of this issue. We should probably discuss this in a core meeting. @byroot Let me know if you want to join, it's every Monday at 14:00 UTC. It seems like you're in France, which makes it a pretty nice time for you :) |
That should work for me. |
For context, I'm the maintainer of https://rubygems.org/gems/redis and https://rubygems.org/gems/redis-client, and I was chatting with @stockholmux a few days ago about what I think is a recurrent pain point for users.
The problem/use-case that the feature addresses
When a client performs a command, if it times out or hits a network error, it's pretty much impossible to know whether the command was performed on the server side or never reached it. Because of this, it is only safe to retry idempotent commands; for any other command, retrying is unsafe, which makes it hard to implement clients that are resilient to various network issues.
This isn't specific to Valkey or the RESP protocol; the same sort of issue is common with SQL clients and most datastores.
But this is particularly common in the Ruby community because the historical redis/valkey client, initially developed by Antirez, always defaulted to blindly retrying anything in case of a network error, and it's now very hard to change because users, especially in cloud environments, then notice a lot more network errors that were previously swept under the rug. I'm trying to course-correct that in the lower-level redis-client gem by making retry opt-in and requiring users to be more intentional about whether a retry should happen for a given command or transaction, but it only goes so far.
So I believe it would be very valuable if there was a way to safely retry non-idempotent commands.
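The failure mode described above can be reproduced with a small simulation. `FakeServer` is a made-up in-memory stand-in (not a real client or server) whose reply can be dropped after the command has already been applied, which is exactly the case a client cannot distinguish from the command never arriving.

```python
class FakeServer:
    """In-memory stand-in for a server (hypothetical, for illustration only)."""

    def __init__(self):
        self.counters = {}
        self.drop_next_reply = False

    def incr(self, key):
        # The write is applied first...
        self.counters[key] = self.counters.get(key, 0) + 1
        # ...and only then is the reply lost, as with a network timeout.
        if self.drop_next_reply:
            self.drop_next_reply = False
            raise TimeoutError("reply lost")
        return self.counters[key]


def incr_with_blind_retry(server, key, attempts=2):
    """Retry on any timeout, without knowing whether the command was applied."""
    for attempt in range(attempts):
        try:
            return server.incr(key)
        except TimeoutError:
            if attempt == attempts - 1:
                raise


server = FakeServer()
server.drop_next_reply = True

# The caller intended a single increment, but the blind retry applies it twice.
result = incr_with_blind_retry(server, "jobs:done")
```

The same retry would be harmless for an idempotent command such as SET, which is why only non-idempotent commands need extra help.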
Description of the feature
I think a way this could be achieved would be for the MULTI command to accept an "idempotency key", somewhat inspired by what the Stripe and PayPal APIs are doing, e.g.
The semantics would be that if that transaction succeeds and another transaction is performed with the same random-id within ttl seconds, the repeated transaction isn't actually executed and instead returns the same results as the previous transaction, without any side effects.
One potential issue I can see with this is that if used with commands that return large output, it has the potential to require saving a whole lot more data on the server, but my assumption is that generally speaking, non-idempotent queries for which this is useful tend to return smaller responses. This is in part counteracted by the TTL, which means the server doesn't need to hold on to the result for too long. It might also be possible to specify that only one such result set can be held per connection.
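The semantics described above can be sketched as an in-memory simulation. `IdempotentTransactions` is a hypothetical name, and the code only models the behaviour (cache the results under the idempotency key for ttl seconds, replay them on a repeat); it says nothing about how a server would implement or bound this.

```python
import time


class IdempotentTransactions:
    """Simulation of a transaction executor that replays results for repeated keys."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._saved = {}  # idempotency key -> (expiry timestamp, results)

    def execute(self, idempotency_key, ttl, commands):
        now = self._clock()
        # Drop saved results whose TTL has elapsed.
        self._saved = {k: v for k, v in self._saved.items() if v[0] > now}
        if idempotency_key in self._saved:
            # Repeated transaction: return the saved results, run nothing.
            return self._saved[idempotency_key][1]
        results = [command() for command in commands]
        # Results are saved only if every command ran without raising.
        self._saved[idempotency_key] = (now + ttl, results)
        return results


queue = ["a", "b"]
server = IdempotentTransactions()

# First attempt pops "a"; imagine the reply is lost and the client retries.
first = server.execute("random-id", 10, [lambda: queue.pop(0)])
retry = server.execute("random-id", 10, [lambda: queue.pop(0)])
```

Here the retry receives the popped item again while the queue is only popped once, which is the case where an error-only response would have lost the item.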
Another possibility is to not return the original response on a repeated transaction, and just have MULTI return some error instead of QUEUED, but it is much less convenient for the client, as sometimes you are not just interested in the side effect of the transaction. Typically if you need to retry an LPOP, the popped item is lost. But perhaps that too can be an option.
Alternatives you've considered
Another possibility could be to bake this capability in the protocol itself, so it's not limited to transactions, but I don't think it's desirable to change the protocol.