I had a project that ran a workflow which handled a long-running activity. The activity was not expected to complete on its own, so it was wrapped inside a cancellation scope with a timeout. However, sticky queue task timeouts led to the workflow handler expiring before the cancellation scope could complete its work. This caused the workflow handler to shut down and restart, switching from the sticky queue to the regular queue to resume execution.
Following the documentation, I used the heartbeat method to handle graceful shutdown of the activity, but this only sort of worked in local development and consistently failed with workflow task timeouts in production.
Here is a blog post I recently wrote on how I tackled this particular problem. I'm posting it here hoping it will help others as well.
Thanks a lot for blogging about Temporal
I didn’t understand the actual problem described in the blog. An activity execution (or failure to execute) doesn’t directly affect workflow task processing.
Unless the activity consumes 100% of the CPU, it shouldn't block workflow task processing or cause sticky queue schedule-to-start timeouts.
hey maxim,
Okay, so the workflow structure was something like this:
```
workflow {
  cancellationScope.withTimeout(timeoutDuration, activity);
  continueAsNew();
}
```
and the activity was basically a while loop:
```
activity {
  while (true) {
    // ...activity operations
    await sleep();
  }
}
```
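Spelled out with the TypeScript SDK APIs, the workflow side of that structure would look roughly like the following. This is only an illustrative sketch; `longPollingActivity`, the timeout values, and other names are placeholders, not the actual code from my project:

```ts
import { CancellationScope, continueAsNew, isCancellation, proxyActivities } from '@temporalio/workflow';
import type * as activities from './activities';

const { longPollingActivity } = proxyActivities<typeof activities>({
  startToCloseTimeout: '2 hours',
  heartbeatTimeout: '30 seconds',
});

export async function myWorkflow(): Promise<void> {
  try {
    // Cancel the (never-completing) activity from the workflow side once the timeout fires
    await CancellationScope.withTimeout(60 * 60 * 1000, () => longPollingActivity());
  } catch (err) {
    // The scope's timeout cancels the activity; treat that cancellation as the normal path
    if (!isCancellation(err)) throw err;
  }
  await continueAsNew<typeof myWorkflow>();
}
```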
I thought that even though the activity doesn't explicitly return anything or complete on its own (since it's an infinite while loop doing a certain job on repeat, sleeping after each iteration), the cancellation scope would take care of shutting it down before resetting the workflow. I had also introduced heartbeats inside the activity.
This worked on my local setup while I was testing, but in production, for some reason, the workflow task timed out after the cancellation scope timer fired and the next workflow task got scheduled.
CPU wasn't maxed out; everything in terms of infrastructure looked normal.
After going through some existing support forum threads, I understood that the initial workflow handler restart caused by the workflow task timeout was due to the sticky queue task timeout setting. As a last attempt, I tried limiting the activity run duration so that it completes about 10 seconds before the cancellation scope times out, and this solved the issue. I was using the TypeScript SDK. The workflow handler error logs were something like: core typescript sdk, workflow task timed out, workflow state changed, state: stopping, state: stopped, state: draining, state: drained, state: started
It would be nice to understand your problem. Not all the messages you posted came from the Temporal SDK.
Would you explain what you mean by “workflow handler”? Temporal doesn’t have such a concept.
Hi @sumukh_upadhya!
Thanks for sharing your experience with Temporal.
> However, due to how Temporal handles sticky queues and workflow task timeouts, this didn’t work as expected.
I’m having difficulties understanding the situation you describe. It looks like you have done some research, but I think you are confusing some of the lower-level concepts, which leads to some incorrect conclusions.
Let me try to clarify some of these notions. I’ll comment on snippets from both this post and your blog post.
> In this case, the workflow handler instance is forced to stop, and a new workflow handler session is started. This new session polls from the regular task queue and replays the entire workflow history to resume execution. Before the old workflow handler completely stops, its cached state is drained and passed back to the Temporal Server and the unique sticky task queue is dismantled.
That description makes it sound like this is a very complex and expensive process. Reality is much simpler, and though it’s true that there’s a performance cost to replaying a workflow execution, it really isn’t that bad.
When a Workflow Task fails or times out, the Worker simply drops the cached state associated with that Workflow; we call that a cache eviction, and that’s basically just removing entries from a few maps locally, which is very cheap.
There’s nothing sent from the Worker to the Server on cache eviction. The Server already knows everything that has happened to that Workflow Execution, as that is written in the Workflow history, which is always under the Server’s control. The Worker doesn’t even inform the Server that it has evicted a Workflow; the Server simply knows that a Workflow Execution will no longer be attached to the sticky task queue after a Workflow Task error or timeout.
You repeatedly use the terms “workflow handler”, “workflow handler instance” and “workflow handler session”, mentioning various life cycle events and responsibilities… There’s no such thing. Workflow Tasks are polled from the Server by the Workflow Worker, and processing/execution of those tasks results in the cached state. The sticky task queue is associated with the Worker (that’s a one-to-one relationship), and doesn’t get drained/dismantled/recreated when a Workflow Execution gets evicted from the cache.
> When a task is polled from the sticky task queue, the Temporal Server waits up to 10 seconds for the workflow handler to report the task’s completion. If the handler fails to respond within this period, the workflow task times out and throws an error: “workflow task timeout”.
That’s kind of true, but not exact. There are two distinct timeouts involved here:
- The `stickyQueueScheduleToStartTimeout` (defined in `WorkerOptions`) controls how long a Workflow Task may remain on the sticky task queue until it gets picked up by its “owner” Workflow Worker. That is a Workflow Task “schedule-to-start” timeout. It is meant to ensure that a Workflow Execution doesn’t get stalled if the Worker that “owned” that Workflow Execution gets shut down or dies unexpectedly, or becomes so overloaded and unresponsive that it is no longer polling for new tasks.
- The `workflowTaskTimeout` (defined in `WorkflowOptions`) controls how long a Worker may take to send back a Workflow Task completion message after picking up that Workflow Task. That is a “start-to-close” timeout, and it applies to Workflow Tasks delivered through both normal and sticky task queues. This timeout is meant to ensure that a Workflow won’t hang if a Worker dies unexpectedly while processing a Workflow Task.
Now, both of those timeouts default to 10 seconds, both will result in a Workflow Task Timeout in that Workflow history, and in both cases, the Workflow Task will be rescheduled on the normal task queue. But they are symptoms of slightly different situations.
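For reference, here’s roughly where each of those two timeouts gets set with the TypeScript SDK. This is only a sketch; the task queue name, workflow id, and `myWorkflow` are made-up placeholders:

```ts
import { Worker } from '@temporalio/worker';
import { Client, Connection } from '@temporalio/client';
import { myWorkflow } from './workflows'; // placeholder workflow

async function run(): Promise<void> {
  // Worker side: schedule-to-start timeout for Workflow Tasks sitting on the sticky queue
  const worker = await Worker.create({
    workflowsPath: require.resolve('./workflows'),
    taskQueue: 'my-task-queue',
    stickyQueueScheduleToStartTimeout: '10 seconds', // the default
  });

  // Client side: workflowTaskTimeout is a per-workflow option
  // (start-to-close timeout of each individual Workflow Task)
  const client = new Client({ connection: await Connection.connect() });
  await client.workflow.start(myWorkflow, {
    taskQueue: 'my-task-queue',
    workflowId: 'my-workflow-id',
    workflowTaskTimeout: '10 seconds', // the default
  });

  await worker.run();
}

run().catch((err) => {
  console.error(err);
  process.exit(1);
});
```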
In fact, sticky queue schedule-to-start timeouts are expected and not a problem at all if they are due to a Workflow Worker getting stopped. Save for that specific scenario, however, Workflow Task timeouts always indicate that something’s wrong with your Worker:
- something is preventing or slowing down responses back to the Temporal Server (e.g. transient network issues, RPS limits, etc.);
- the Worker is configured to do more than it really can, given the resources it has;
- something else in your Node process is blocking the event loop (e.g. CPU-bound activities).
> The activity was not expected to complete on its own, so it was wrapped inside a cancellation scope with a timeout
> One such issue occurred when a long-running activity was wrapped inside a cancellation scope with a timeout.
If the timeout only applies to that single activity, I’d strongly recommend having your workflow set a `scheduleToCloseTimeout` constraint on the activity, instead of using a Cancellation Scope with a timeout. That’s both simpler and more efficient, as your Workflow Worker doesn’t need to get involved in the cancellation of your activity.
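As a rough sketch of what that looks like in the workflow code (the activity name and durations here are just placeholders):

```ts
import { proxyActivities } from '@temporalio/workflow';
import type * as activities from './activities';

// Let the activity's own timeouts bound its duration, rather than a CancellationScope
const { longRunningActivity } = proxyActivities<typeof activities>({
  scheduleToCloseTimeout: '1 hour',  // total budget, including retries and queue time
  startToCloseTimeout: '30 minutes', // budget for a single attempt
  heartbeatTimeout: '30 seconds',    // max interval between two heartbeats
});

export async function myWorkflow(): Promise<void> {
  await longRunningActivity();
}
```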
> The solution is straightforward: First, determine the maximum duration (T) you expect the long-running activity to run for a single session. Then, set the cancellation scope timeout to T + X seconds, where X is the additional time allowed for the activity to shut down and report back, if the activity is meant to be retriable then the cancellation scope should be (T x R) + X, where R is the retry count.
That’s pretty much right, though `X` should be very small, well under 100ms; in the case of a long-running activity, that should be negligible compared to `T`. On the other hand, you might need to account for the time it will take for the activity to get picked up by an Activity Worker, e.g. in case all your Activity Workers are already maxed out. Calling that time `Y`, the formula would then be something like `(T + Y) x R`.
Also, as I said before, I’d set a `scheduleToCloseTimeout` constraint on the activity instead. The math remains the same.
In either case, you should also set the activity’s `startToCloseTimeout` to `T`, and set `heartbeatTimeout` to the maximum acceptable interval between two consecutive activity heartbeats.
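On the activity side, the loop then just needs to heartbeat regularly, stay responsive to cancellation, and bound its own run time to `T`. A minimal sketch, assuming made-up names like `runBudgetMs` and `pollIntervalMs`:

```ts
import { Context } from '@temporalio/activity';

export async function longRunningActivity(runBudgetMs: number, pollIntervalMs: number): Promise<void> {
  const ctx = Context.current();
  const deadline = Date.now() + runBudgetMs; // finish a bit before startToCloseTimeout (T)
  while (Date.now() < deadline) {
    // ...one iteration of the actual work...
    ctx.heartbeat(); // must fire more often than heartbeatTimeout
    // Cancellation-aware sleep: rejects with a CancelledFailure if the activity is cancelled
    await ctx.sleep(pollIntervalMs);
  }
}
```

With `runBudgetMs` set to end a little before the activity’s `startToCloseTimeout`, the activity completes normally instead of having to be cancelled at all.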
> However, because the activity did not shut down on time, the workflow handler failed to capture the shutdown state before the sticky task timeout expired. According to Temporal’s documentation, introducing heartbeats in long-running activities can help the workflow handler know that an activity is still running, so the workflow waits for a graceful shutdown. Unfortunately, in this instance, the process couldn’t complete within the sticky queue timeout, resulting in the workflow handler timing out.
> However, sticky queue task timeouts led to the workflow handler expiring before the cancellation scope could complete its work.
Those conclusions are incorrect, because they are based on incorrect assumptions regarding the execution model of the Temporal SDK.
Also, I’m not sure what you mean by “the activity did not shut down on time” and “the workflow waits for a graceful [activity] shutdown”. An activity is not a process that gets shut down.
> The workflow handler error logs were something like: core typescript sdk, workflow task timed out, workflow state changed, state: stopping, state: stopped, state: draining, state: drained, state: started
Please copy-paste those error messages as they appear in your logs. I really can’t make sense out of those extracts alone.