Hello,
I have a workflow that reaches an activity requiring human input to complete (similar to the asynchronous activity completion in the Go SDK expense example).
The activity (called “ManualTaskActivity”) is set up to return `ErrResultPending`.
Once the workflow is started and calls `ExecuteActivity(...).Get` on the “ManualTaskActivity”, in the Temporal UI & CLI, `workflow describe` shows this activity in the “pendingActivities” list as state “Started” and `workflow show` shows this activity as “Started”.
However, after exactly 1 hour, `workflow describe` shows this activity in the “pendingActivities” list as state “Scheduled” (no longer “Started”), while the `workflow show` output remains unchanged (the activity is still in “Started” state).
As a result, any attempt to call `CompleteActivity` or `CompleteActivityByID` now fails with the error “invalid activityID or activity already timed out or invoking workflow is completed” / “ErrActivityTaskNotFound”, presumably because the “pendingActivities” list no longer contains the activity in state “Started”.
The workflow is now stuck here. I can get it unstuck by calling `docker run --net=host --rm temporalio/tctl:0.26.0 admin workflow refresh_tasks -w <workflowID>`, which starts a new 1 hour clock and results in an additional `ActivityTaskStarted` line in `workflow show`.
The logs reflecting the above are saved at this gist: https://gist.github.com/suejungshin/e391ac0ac02d054b66b7a7a5025630f2
My questions are:
- What is going on here?
- The activity must have gotten cancelled or timed out somehow? How can I troubleshoot where/when/how that happens and prevent it, since we want to give enough time for human input?
- What role, if any, would cascading Go context.Context cancellations play in this?
Theories:
- Something within the Go context.Context is causing a cancellation to cascade and call `RespondActivityTaskCanceled`? But how/where? I thought the Activity function has essentially already been invoked, so why do context.Context cancellations still reach it?
Other behaviors I cannot explain:
- I experimented by removing the optional “context.Context” first argument from the Activity. After that, the “demotion” from “Started” to “Scheduled” happened after 2 hours instead of 1. I cannot replicate this consistently: sometimes the demotion happens in under 2 hours (potentially related to me restarting the Activity server so the previous worker is no longer there?). What happens when I restart the worker that had returned ErrResultPending? Does that manual task then become impossible to complete?
- If I try to “CompleteActivity” using the wrong ActivityID by mistake for a given workflow ID, it mimics the above behavior: the current pending ManualTaskActivity is demoted from state “Started” to state “Scheduled”, and it is then no longer possible to “CompleteActivity” on it.
Any help is very much appreciated. Alternatively, if there is another approach I should use instead of relying on `ErrResultPending` for this use case of waiting for human input, please let me know, since the above seems to result in workflows getting stuck after 1 hour.
Thanks very much for your help!!
P.S. Some assumptions made (please let me know if any are incorrect):
- My understanding is that heartbeating for long-running tasks is not meant to be paired with this async activity `ErrResultPending` pattern. For troubleshooting purposes, I tried inserting a `RecordHeartbeat` in the activity prior to returning `ErrResultPending`, but that didn’t change the above behavior.
- Based on the docs/expense example, the `ScheduleToStart` and `StartToClose` deadlines should be set equal to the amount of time we need to wait for human intervention. I set them to ~3 months.
- My understanding of the various timeouts in play:
  - WorkflowExecutionTimeout: how long to wait for the whole workflow to execute (should always be longer than any of the below); my example is set to 10 years
  - WorkflowRunTimeout: how long to wait for a single run of a workflow to execute; my example is set to 10 years
  - WorkflowTaskTimeout: how long to wait for a Decision Task to be executed (?), should be short? My example is set to 10 seconds
  - Activity scheduleToCloseTimeout: should be at least as long as the sum of scheduleToStart + startToClose; my example is set to ~6 months
  - Activity scheduleToStartTimeout: should be the amount of time to wait for the async human input; my example is set to ~3 months
  - Activity startToCloseTimeout: should be the amount of time to wait for the async human input; my example is set to ~3 months
- What else can cause activity timeouts or cancellations without being reflected in the UI/CLI, and without a Timeout being registered so that the workflow can proceed?
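
To spell out the arithmetic behind my activity timeout assumptions, here is a small plain-Go sketch. It only encodes my own rule of thumb (scheduleToClose at least scheduleToStart + startToClose); `activityTimeouts` and `coversOneAttempt` are hypothetical names, not part of the Temporal SDK, and the month lengths are approximated as 30 days.

```go
package main

import (
	"fmt"
	"time"
)

// day is used to approximate months as multiples of 30 days.
const day = 24 * time.Hour

// activityTimeouts is a hypothetical plain struct for illustration only.
// It is not the SDK's ActivityOptions type.
type activityTimeouts struct {
	ScheduleToStart time.Duration
	StartToClose    time.Duration
	ScheduleToClose time.Duration
}

// coversOneAttempt checks the assumption from the list above: the overall
// schedule-to-close budget is at least the queueing time plus the time
// allowed for one attempt.
func (t activityTimeouts) coversOneAttempt() bool {
	return t.ScheduleToClose >= t.ScheduleToStart+t.StartToClose
}

func main() {
	// Roughly the values described above: ~3 months, ~3 months, ~6 months.
	t := activityTimeouts{
		ScheduleToStart: 90 * day,
		StartToClose:    90 * day,
		ScheduleToClose: 180 * day,
	}
	fmt.Println("scheduleToClose covers one attempt:", t.coversOneAttempt())
}
```

None of these values is anywhere near 1 hour, which is why I don't think the configured activity timeouts alone explain the behavior.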