Nats: Make MaxAckPending and MaxAckWaitTime configurable #10393
Comments
Should we already add those settings to our oCIS chart with the NATS deployment example? https://github.com/owncloud/ocis-charts/blob/main/deployments/ocis-nats/streams/ocis.yaml already exposes all the stream settings. Our Kubernetes deployments could benefit from this a lot!?
Yes, it should. If possible, please set them.
@kobergj do you have proposals for suitable `MaxAckPending` and `MaxAckWaitTime` values? Just to interweave our ticket clusters a little bit: #8949 would maybe take some pressure out of high-message-count situations!? It would also allow more fine-grained redelivery settings, e.g. the postprocessing messages could need a very high `MaxAckWaitTime`.
That sounds reasonable. The requirements of the different use cases could be better fulfilled in separate streams.
Not sure about this. If the postprocessing service is busy, it doesn't matter how many queues it talks to. Maybe we should use regexes to subscribe only to specific events?
No. This is unrelated to the time a step needs to complete. As soon as the message is pulled from the channel, it will be acked on the NATS server. The problem is that pending messages get redelivered, so multiple (identical) events end up waiting on the same channel to be picked up (and acked). The recommendation is to keep `MaxAckPending` low.
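For illustration only (this is not the actual oCIS consumer code): a minimal push-consumer sketch with the nats.go client showing the pattern just described, where the ack happens when the message is pulled from the channel, so anything lagging behind `AckWait` gets redelivered into the same channel. Subject and durable name are made up:

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Push consumer delivering into a buffered channel.
	ch := make(chan *nats.Msg, 1000)
	if _, err := js.ChanSubscribe("ocis.events.>", ch,
		nats.Durable("sketch-consumer"),
		nats.ManualAck(),
		nats.AckWait(30*time.Second), // MaxAckWaitTime
	); err != nil {
		log.Fatal(err)
	}

	for m := range ch {
		// The ack happens only when the message is pulled from the
		// channel. If messages sit in the channel longer than AckWait,
		// the server has already redelivered them: the channel now
		// holds duplicates of the same event.
		m.Ack()
		// ... hand the event to a worker ...
	}
}
```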
For the short term, I think we can make those options configurable to have better control over the problem, which could also help debugging by reducing the wait time so the issue happens more often, but I don't think it will fix the issue. I mean, 30 secs should be enough for the postprocessing service (or any other consumer) to get the event and send the ACK, so if that's a problem we could consider that 5 minutes might not be enough either and we'd need an even greater value.

The idea I have is to include an ever-increasing counter for each event type, basically saying "this is event number 123 of the share_created events", for example. This counter must be part of the event information, either as an event property or as part of the event metadata. Assuming each event type is generated in only one place, having a counter there to know how many events have been sent should be easy. This is basically the only responsibility the event publishers would have (in addition to other responsibilities they might already have).

From the consumer's side, we need to keep track of the last event of each type we've processed. For example, if we've processed "share_created" event number 123, we should expect "share_created" event 124, so any "share_created" event with a counter lower than 124 (including 123) should be considered already processed and can be ignored. This is how we could deal with duplicated events (sketched below). There are some edge cases to consider for this solution, though.

Basically, we'd need to include the following information in the event: the event type and its per-type counter.
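A minimal sketch of that counter idea in Go (not the oCIS event schema; the `Type`/`Seq` fields and the `Deduper` helper are made up for illustration):

```go
package main

import (
	"fmt"
	"sync"
)

// Event carries a per-type, ever-increasing sequence number.
// Field names are hypothetical; the real oCIS event struct differs.
type Event struct {
	Type string
	Seq  uint64
}

// Deduper remembers the highest sequence processed per event type
// and reports whether an incoming event is a duplicate.
type Deduper struct {
	mu       sync.Mutex
	lastSeen map[string]uint64
}

func NewDeduper() *Deduper {
	return &Deduper{lastSeen: make(map[string]uint64)}
}

// Seen returns true if the event was already processed, i.e. its
// sequence number is not greater than the last one seen for its type.
func (d *Deduper) Seen(e Event) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if e.Seq <= d.lastSeen[e.Type] {
		return true // redelivered duplicate, ignore
	}
	d.lastSeen[e.Type] = e.Seq
	return false
}

func main() {
	d := NewDeduper()
	for _, e := range []Event{
		{"share_created", 123},
		{"share_created", 123}, // redelivery
		{"share_created", 124},
	} {
		fmt.Println(e, "duplicate:", d.Seen(e))
	}
}
```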
For NATS, most of the stuff above is very likely taken care of by NATS in some way or another. The question is whether we're fetching the events one by one, so the only event being re-delivered is the last one (waiting in our queue to be ACK'ed). Basically, in case of delays, the expected sequence of events should be something like 1, 2, 3, 3, 3, 4, with only the last unacknowledged event repeating.
Another solution could be switching to the pull client instead of using the push client?
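A minimal sketch of that alternative with the nats.go pull API (subject and durable name are made up): the consumer fetches only what it can process right now and acks explicitly, so nothing sits unacknowledged in a channel:

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Durable pull consumer.
	sub, err := js.PullSubscribe("ocis.events.>", "ocis-postprocessing")
	if err != nil {
		log.Fatal(err)
	}

	for {
		// Fetch only as many messages as we can process right now.
		msgs, err := sub.Fetch(10, nats.MaxWait(2*time.Second))
		if err != nil && err != nats.ErrTimeout {
			log.Fatal(err)
		}
		for _, m := range msgs {
			// ... process the event ...
			m.Ack() // explicit ack only after successful processing
		}
	}
}
```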
Let me explain the problem again, I think there is still some misunderstanding. The default value for `MaxAckPending` is `1000`, so up to 1000 unacknowledged events can be redelivered and pile up. This is a simple configuration problem. With a low `MaxAckPending`, the NATS server stops delivering further events until the pending ones are acked, so duplicates cannot accumulate.

Putting more work on the consumer will only make the problem worse. If we want to improve the performance of the event consumers, we should use regex filtering on the event names. AFAIK that is done on the NATS side and therefore doesn't put pressure on our consumers.
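On the filtering idea: JetStream consumers filter by subject wildcards rather than true regexes, and the filtering happens server-side. A sketch with made-up stream, subject, and durable names:

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Server-side filtered consumer: only matching subjects are
	// delivered, so the filtering costs the consumer nothing.
	_, err = js.AddConsumer("ocis-events", &nats.ConsumerConfig{
		Durable:       "postprocessing",
		FilterSubject: "ocis.events.postprocessing.>",
		AckPolicy:     nats.AckExplicitPolicy,
		AckWait:       2 * time.Minute, // MaxAckWaitTime
		MaxAckPending: 50,              // MaxAckPending
	})
	if err != nil {
		log.Fatal(err)
	}
}
```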
I'm not sure... If we're processing events one by one, is there any benefit to setting `MaxAckPending` to anything higher? I'm worried this will end up being a test in production in order to figure out the right values for the environment, and those values might not be good enough if the system is overloaded.
When putting heavy load on the NATS receivers, the NATS service will possibly redeliver events and therefore duplicate them. This is because we do not configure it correctly.

When pulling events from the queue, the expected behaviour is the following:

The event is pulled from the queue and delivered into a channel. When consumed from this channel, the event will be acknowledged on the NATS server. Under high load this can cause problems. Since `MaxAckWaitTime` is `30s` by default, the NATS server will redeliver the event after 30s, leading to two identical events waiting to be picked up from the channel. Since `MaxAckPending` is set to `1000` by default, this can lead to up to 1000 identical events waiting to be processed.

We already introduced worker groups to be able to handle multiple events at the same time. We now need to make `MaxAckWaitTime` and `MaxAckPending` configurable, so they can be configured individually.

Acceptance Criteria:

- `MaxAckWaitTime` and `MaxAckPending` are configurable
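A sketch of what the acceptance criteria could translate to on the client side with nats.go; the env-var names here are hypothetical, not actual oCIS settings:

```go
package main

import (
	"log"
	"os"
	"strconv"
	"time"

	"github.com/nats-io/nats.go"
)

// Hypothetical env-var helpers; the real oCIS configuration differs.
func envInt(key string, def int) int {
	if v, err := strconv.Atoi(os.Getenv(key)); err == nil {
		return v
	}
	return def
}

func envDuration(key string, def time.Duration) time.Duration {
	if v, err := time.ParseDuration(os.Getenv(key)); err == nil {
		return v
	}
	return def
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Both knobs become per-service configuration instead of the
	// NATS defaults (30s / 1000).
	_, err = js.Subscribe("ocis.events.>", func(m *nats.Msg) {
		// ... process the event ...
		m.Ack()
	},
		nats.Durable("sketch"),
		nats.ManualAck(),
		nats.AckWait(envDuration("EVENTS_MAX_ACK_WAIT", 30*time.Second)),
		nats.MaxAckPending(envInt("EVENTS_MAX_ACK_PENDING", 1000)),
	)
	if err != nil {
		log.Fatal(err)
	}
	select {} // block forever in this sketch
}
```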