Java Client / Server Behavior With a Large Influx of Signaled Workflows #4387

WToma · 2021-08-18T20:18:12Z


Hi all,

We have the following pattern for some of our workflows in Cadence:

1. Start the workflow.
2. Wait for an external event to happen, up to some maximum duration, using Workflow.await (we're using the Java client).
3. The external event signals the workflow, which allows it to proceed.

Sometimes we get a large influx of workflows (150/s for several minutes), which saturates our worker fleet in the sense that the number of active workflows becomes larger than the sticky thread limit (600) in each worker. So at that point some workflow threads start getting destroyed, and then later replayed. This part is expected to some extent.
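As a rough back-of-the-envelope check on the numbers above, the sketch below estimates how quickly the sticky caches fill up. It assumes (and these are assumptions, not facts from this thread) that new workflows are spread evenly across workers and that none of them complete during the burst because they are all waiting for their signal. The worker counts are purely illustrative.

```java
class StickyCacheFillTime {
    // Seconds until every worker's sticky cache is full, assuming an even
    // spread of workflows and no completions during the burst (assumptions).
    static double secondsToFill(int workers, int stickyCacheSize, double startsPerSecond) {
        return (double) workers * stickyCacheSize / startsPerSecond;
    }

    public static void main(String[] args) {
        // 600 is the sticky limit and 150/s the influx rate quoted above;
        // the worker counts are hypothetical.
        for (int workers : new int[] {1, 5, 10, 20}) {
            System.out.printf("%d worker(s): caches full after ~%.0f s%n",
                    workers, secondsToFill(workers, 600, 150.0));
        }
    }
}
```

Under these assumptions even a fleet of 20 workers reaches the limit in well under two minutes, so a burst lasting several minutes would be expected to trigger evictions and replays.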

However we're seeing that our throughput (workflows completed per second, and even activities completed per second) doesn't follow the load increase at all. But once the number of open workflows drops, then we start getting more activities per second (and workflow completions per second). As far as we can tell from our own logs, we're way below the theoretical max throughput based on the number of concurrent activities allowed per worker.

We were wondering about the following things:

1. Does the Java client / the Cadence server itself have any kind of prioritization between existing / new workflows / signals?
2. Is there a way for a client not to pull more workflows than its thread count limit?
3. What client side or server side metrics should we watch (besides the thread count) to understand the behavior? To be a bit more concrete, our theory here is that during replays, the replayed workflows have to fetch their histories, which is happening a lot / taking a long time, and this is contributing to the low throughput. However so far none of the server side metrics we found confirmed this.
4. Would potentially adding more frontend, or history instances help with this situation, assuming the number of client workers stays the same?

Thanks for reading!

meiliang86 · 2021-08-19T00:54:12Z


Does the Java client / the Cadence server itself have any kind of prioritization between existing / new workflows / signals?

Assuming you only use one domain for all the workflows, there is no prioritization. All your activity tasks and decision tasks will effectively go through a FIFO queue on the matching side.

Is there a way for a client not to pull more workflows than its thread count limit?

The client will continue to poll for tasks until the executor is full and a RejectedExecutionException is thrown, in which case it will back off. You can check the logic here.
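The poll-until-rejected behavior described above can be illustrated with plain java.util.concurrent classes. This is only a sketch of the general pattern, not the Cadence client's actual poller code: the class name, the backoff values, and the latch that stands in for "in-flight tasks finishing" are all made up for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class PollerBackoffSketch {
    static final class Result {
        final int accepted;
        final int rejections;
        Result(int accepted, int rejections) {
            this.accepted = accepted;
            this.rejections = rejections;
        }
    }

    // Hands `tasks` simulated tasks to an executor with `maxThreads` threads
    // and no queue, backing off whenever the executor rejects a task.
    static Result run(int maxThreads, int tasks) {
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                maxThreads, maxThreads, 1, TimeUnit.SECONDS, new SynchronousQueue<>());
        CountDownLatch release = new CountDownLatch(1);
        Runnable task = () -> {
            try {
                release.await(); // stays "busy" until the latch is released
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        int accepted = 0;
        int rejections = 0;
        try {
            for (int i = 0; i < tasks; i++) {
                long backoffMs = 10; // hypothetical initial backoff
                while (true) {
                    try {
                        executor.execute(task); // default policy throws when full
                        accepted++;
                        break;
                    } catch (RejectedExecutionException e) {
                        rejections++;
                        release.countDown(); // simulate in-flight tasks finishing
                        Thread.sleep(backoffMs);
                        backoffMs = Math.min(backoffMs * 2, 200);
                    }
                }
            }
            executor.shutdown();
            executor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted", e);
        }
        return new Result(accepted, rejections);
    }

    public static void main(String[] args) {
        Result r = run(2, 5);
        System.out.println("accepted=" + r.accepted + ", rejections=" + r.rejections);
    }
}
```

The point of the sketch is that nothing stops the poller before the executor is saturated: the limit is only felt as a rejection followed by a pause. The exact number of rejections depends on thread timing.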

What client side or server side metrics should we watch (besides the thread count) to understand the behavior? To be a bit more concrete, our theory here is that during replays, the replayed workflows have to fetch their histories, which is happening a lot / taking a long time, and this is contributing to the low throughput. However so far none of the server side metrics we found confirmed this.

Start with the server side matching metrics, i.e. service:cadence-matching name:poll_{errors,success,success_sync,timeouts} operation:tasklistmgr, operation:addactivitytask, operation:adddecisiontask, as well as the pollForActivityTasks and pollForDecisionTasks latency.
If poll latency is low, it means your tasks are dispatched efficiently.
These metrics should also tell you the load of activity vs. decision tasks and which of the two is potentially the bottleneck here.

And then check client side sticky cache utilization metrics name:{cadence-sticky-cache-hit,cadence-sticky-cache-miss,cadence-sticky-cache-stall,cadence-sticky-cache-thread-forced-eviction,cadence-sticky-cache-total-forced-eviction}

My guess here is that the delay is caused by stickiness. If your cache utilization is low, a lot of your workflows will be kicked out of the decider cache when a decision task is dispatched. If stickiness is enabled (it's enabled by default), Cadence will try to distribute the task to the old host, and the task will fail there as the cache entry is already cleared. Then there will be a wait and retry effort to re-dispatch the decision task to a different worker.

Would potentially adding more frontend, or history instances help with this situation, assuming the number of client workers stays the same?

Not necessarily. If it's the stickiness issue, you can (1) increase the worker pool size, (2) reduce the wait before retry, or (3) disable stickiness. If it's not the stickiness issue, we need to look further.

6 replies

meiliang86 Aug 19, 2021

Activity and Decision tasks have their own pollers. The polling logic is the one that I linked above. Only idle deciders can be evicted from cache. There are maxConcurrentActivityExecutionSize and maxConcurrentWorkflowExecutionSize options in WorkerOptions.

WToma Aug 19, 2021
Author

So this makes sense to me for activity pollers, and it's consistent with what we're experiencing.

However for decision tasks Long had the following answer earlier (slack):

From doc it says

Maximum number of parallely executed decision tasks.

It's a bad name because it suggests that it controls the maximum number of workflows running in parallel, which is related to #2403, which we haven't implemented.
If you disable the decision sticky cache (via disableStickyExecution in WorkerFactory), then the threads are always destroyed right after every decision task is finished.
By default decision stickiness is used, so the threads will be kept until the workflow is completed or its cache entry is evicted.
Cache eviction happens when the number of threads being held is greater than stickyCacheSize, which defaults to DEFAULT_STICKY_CACHE_SIZE (600).
Eviction is based on LRU order of workflows starting to run in the worker.

But it doesn't seem like the poller stops at 600, because then we wouldn't see the destroyed workflow threads. Or am I missing something here?
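The eviction behavior described in the quoted answer can be sketched with a small LRU cache. This is a toy model, not the Java client's implementation: it uses LinkedHashMap in access order as a stand-in (the client's real ordering may differ), it ignores the "only idle deciders can be evicted" rule, and the class and method names are hypothetical. The capacity of 3 stands in for the default of 600.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class StickyCacheSketch {
    private final List<String> evicted = new ArrayList<>();
    private final Map<String, Boolean> cache;

    StickyCacheSketch(final int capacity) {
        // accessOrder=true makes the eldest entry the least recently used one
        // (an assumption of this sketch, not a statement about the client).
        this.cache = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                if (size() > capacity) {
                    evicted.add(eldest.getKey()); // this workflow's thread is destroyed
                    return true;
                }
                return false;
            }
        };
    }

    // Returns true on a cache hit (no replay needed) and false on a miss
    // (the workflow has to be replayed from its history).
    boolean onDecisionTask(String workflowId) {
        boolean hit = cache.get(workflowId) != null; // get() also refreshes recency
        if (!hit) {
            cache.put(workflowId, Boolean.TRUE);
        }
        return hit;
    }

    List<String> evicted() {
        return evicted;
    }

    public static void main(String[] args) {
        StickyCacheSketch cache = new StickyCacheSketch(3);
        for (int i = 1; i <= 5; i++) {
            cache.onDecisionTask("wf-" + i); // five workflows start and then wait
        }
        System.out.println("signal wf-1, cache hit: " + cache.onDecisionTask("wf-1"));
        System.out.println("signal wf-5, cache hit: " + cache.onDecisionTask("wf-5"));
        System.out.println("evicted so far: " + cache.evicted());
    }
}
```

In this model new workflows are always admitted and older ones are pushed out, which matches the observation that threads get destroyed rather than the poller stopping at the limit. A signal arriving for an evicted workflow is then a cache miss and requires a replay.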

WToma Aug 20, 2021
Author

And then check client side sticky cache utilization metrics name:{cadence-sticky-cache-hit,cadence-sticky-cache-miss,cadence-sticky-cache-stall,cadence-sticky-cache-thread-forced-eviction,cadence-sticky-cache-total-forced-eviction}

I wasn't able to find these in our client-side metrics. We're using a fairly old version of the Java client (2.7.3); do we need to upgrade to 3.x to have these available?

WToma Aug 20, 2021
Author

So these numbers look OK to me, what do you think? Maybe the one interesting thing is the poll_success_sync, I'm not quite sure how to interpret that, compared to just poll_success. Btw I only found these metrics for TaskListMgr (and didn't find poll errors or timeouts at all).

WToma Aug 20, 2021
Author

Another side question (sorry! let me know if you'd prefer to make this into a separate discussion) came up while I was compiling the above data: I realized I cannot easily differentiate between the case where SignalWithStart started a new workflow and the case where it signaled an existing one. Is there a way to tell these apart, or maybe a metric that shows the number of signals sent?
