Assuming you only use one domain for all the workflows, there is no prioritization. All your activity tasks and decision tasks will effectively go through a FIFO queue on the matching side.
The client will continue to poll for tasks until the Executor is full and a RejectedExecutionException is thrown, at which point it backs off. You can check the logic here.
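To make the poll-until-rejected behavior concrete, here is a minimal sketch using only `java.util.concurrent`. It is not the Cadence client's actual code: the class name `PollBackoffSketch`, the method `runPoller`, the pool sizes and the backoff values are all hypothetical, and the "task" is just a sleep. It only shows the general pattern of a poller that keeps handing work to a bounded executor and backs off once the executor rejects it.

```java
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration, not the Cadence client implementation.
class PollBackoffSketch {

    /** Hands `tasks` tasks to a bounded executor; returns {completedTasks, rejections}. */
    static int[] runPoller(int tasks, int threads, long taskMillis) {
        // SynchronousQueue has no capacity, so once all threads are busy the
        // default AbortPolicy throws RejectedExecutionException.
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS, new SynchronousQueue<>());
        AtomicInteger completed = new AtomicInteger();
        int rejections = 0;
        long backoffMillis = 1;
        try {
            int submitted = 0;
            while (submitted < tasks) {
                try {
                    executor.execute(() -> {
                        pause(taskMillis);            // stand-in for processing a task
                        completed.incrementAndGet();
                    });
                    submitted++;
                    backoffMillis = 1;                // reset after a successful hand-off
                } catch (RejectedExecutionException e) {
                    rejections++;
                    pause(backoffMillis);             // executor is full: back off, then retry
                    backoffMillis = Math.min(backoffMillis * 2, 64);
                }
            }
            executor.shutdown();
            if (!executor.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new int[] {completed.get(), rejections};
    }

    static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int[] result = runPoller(6, 2, 50);
        System.out.println("completed=" + result[0] + " rejections=" + result[1]);
    }
}
```

In this toy model all tasks still complete, but each rejection adds idle time to the poller, which is one way a saturated worker can fall behind the incoming load.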
Start with the server-side matching metrics, and then check the client-side sticky cache utilization metrics.
My guess here is that the delay is caused by stickiness. If your cache utilization is low, a lot of your workflows will have been kicked out of the decider cache by the time a decision task is dispatched. If stickiness is enabled (it is by default), Cadence will try to send the task to the host that last processed the workflow, and the task will fail there because the cache entry has already been cleared. There is then a wait, followed by a retry that re-dispatches the decision task to a different worker.
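The effect of having more open workflows than cache slots can be illustrated with a toy model. The sketch below assumes a plain LRU cache and decision tasks that arrive round-robin across workflows; both are simplifying assumptions, not a description of how the Cadence decider cache actually evicts entries. The class `StickyCacheSketch` and the method `hitRate` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical toy model of a sticky cache, not Cadence's implementation.
class StickyCacheSketch {

    /** Fraction of decision tasks that find their workflow still cached. */
    static double hitRate(int cacheSize, int openWorkflows, int rounds) {
        // Access-ordered LinkedHashMap that evicts the least recently used entry.
        Map<Integer, Boolean> cache = new LinkedHashMap<Integer, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, Boolean> eldest) {
                return size() > cacheSize;
            }
        };
        long hits = 0;
        long total = 0;
        for (int round = 0; round < rounds; round++) {
            for (int workflow = 0; workflow < openWorkflows; workflow++) {
                total++;
                if (cache.get(workflow) != null) {
                    hits++;                           // sticky dispatch would succeed
                } else {
                    cache.put(workflow, Boolean.TRUE); // miss: workflow must be replayed
                }
            }
        }
        return (double) hits / total;
    }

    public static void main(String[] args) {
        System.out.println("500 open workflows, 600 slots: " + hitRate(600, 500, 10));
        System.out.println("900 open workflows, 600 slots: " + hitRate(600, 900, 10));
    }
}
```

Under these assumptions, 500 open workflows fit in 600 slots and only the first round misses, while 900 open workflows evict each entry before it is reused, so every decision task misses. Real traffic is less regular than round-robin, but the model shows why the hit rate can collapse once the number of open workflows exceeds the sticky limit.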
Not necessarily. If it's the stickiness issue, you can (1) increase the worker pool size, (2) reduce the wait before retry, or (3) disable stickiness. If it's not the stickiness issue, we need to look further.
Hi all,
We have the following pattern for some of our workflows in Cadence:
Workflow.await
(We're using the Java client.)
Sometimes we get a large influx of workflows (150/s for several minutes), which saturates our worker fleet: the number of active workflows becomes larger than the sticky thread limit (600) in each worker. At that point some workflow threads start getting destroyed and are later replayed. This part is expected to some extent.
However, we're seeing that our throughput (workflows completed per second, and even activities completed per second) doesn't follow the load increase at all. Only once the number of open workflows drops do we start getting more activities per second (and more workflow completions per second). As far as we can tell from our own logs, we're way below the theoretical maximum throughput based on the number of concurrent activities allowed per worker.
We were wondering about the following things:
Thanks for reading!