Activity reverts from state "Started" to "Scheduled" after 1 hour

suejungshin · August 10, 2020, 1:33am

Hello,
I have a workflow that reaches an activity requiring human input to complete (similar to the asynchronous activity completion in the Go SDK expense example).
The activity (called “ManualTaskActivity”) is set up to return ErrResultPending.
Once the workflow is started and calls ExecuteActivity(...).Get on the “ManualTaskActivity”, in the Temporal UI & CLI, workflow describe shows this activity in the “pendingActivities” list as state “Started” and workflow show shows this activity as “Started”.

However, after exactly 1 hour, describe workflow shows this activity in the “pendingActivities” list as state “Scheduled” (no longer “Started”), while the show workflow list remains unchanged (the activity is still in “Started” state).
As a result, any subsequent call to CompleteActivity or CompleteActivityById fails with the error “invalid activityID or activity already timed out or invoking workflow is completed” / “ErrActivityTaskNotFound”, presumably because the “pendingActivities” list no longer shows the activity in state “Started”.

The workflow is now stuck here. I can get it unstuck by calling docker run --net=host --rm temporalio/tctl:0.26.0 admin workflow refresh_tasks -w <workflowID> which starts a new 1 hour clock and results in an additional line in workflow show of ActivityTaskStarted.
The logs reflecting the above are saved at this gist: https://gist.github.com/suejungshin/e391ac0ac02d054b66b7a7a5025630f2

My questions are:

What is going on here?
Did the activity get cancelled or time out somehow? How can I troubleshoot where, when, and how that happens, and prevent it, since we want to allow enough time for human input?
What role, if any, would Go context.Context cascading cancellations play in this?

Theories:

Something within the Go context.Context is causing a cancellation to cascade and call RespondActivityTaskCanceled? But how/where? I thought the Activity function has essentially already been invoked so why do context.Context cancellations still reach it?
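For reference, the context semantics behind this theory can be sketched in plain Go. This is only an illustration of how the standard library behaves, not of Temporal internals; `startManualTask` and `cascade` are hypothetical names that do not come from the SDK. It shows that cancelling a parent context does cascade to derived contexts, but a function that has already returned cannot observe a cancellation that happens afterwards.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// startManualTask stands in for an activity that hands work off to a human
// and returns immediately. Hypothetical helper, not a Temporal API.
// It reports whether ctx was already cancelled at the moment it ran.
func startManualTask(ctx context.Context) error {
	// A function can only observe cancellation while it is still running.
	return ctx.Err()
}

// cascade derives a child context from a parent, runs the task, and only
// then cancels the parent. It returns the error seen by the task and the
// child's error after the parent was cancelled.
func cascade() (taskErr error, childErr error) {
	parent, cancelParent := context.WithCancel(context.Background())
	child, cancelChild := context.WithCancel(parent)
	defer cancelChild()

	taskErr = startManualTask(child) // returns before any cancellation
	cancelParent()                   // cancellation propagates to child
	<-child.Done()                   // child is done because its parent is
	return taskErr, child.Err()
}

func main() {
	taskErr, childErr := cascade()
	fmt.Println("task error:", taskErr)
	fmt.Println("child cancelled:", errors.Is(childErr, context.Canceled))
}
```

In this sketch the task sees no error because it finished before the parent was cancelled, while the derived context still ends up cancelled. Whether anything comparable happens inside the worker after the activity function returns is exactly the open question above.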

Other behaviors I cannot explain:

I experimented by removing the optional “context.Context” first argument from the Activity. After that, the demotion from “Started” to “Scheduled” took 2 hours instead of 1 hour. I cannot replicate this consistently; sometimes the “demotion” happens in under 2 hours (potentially related to me restarting the Activity worker, so that the previous worker is no longer there?). What happens when I restart the worker that had returned ErrResultPending? Does that manual task then become impossible to complete?
If I try to “CompleteActivity” using the wrong ActivityID by mistake for a given workflow ID, it mimics the behavior above: the current pending ManualTaskActivity is demoted from state “Started” to state “Scheduled”, after which it is no longer possible to “CompleteActivity” on it.

Any help is very much appreciated. If there is an alternative to relying on ErrResultPending for this use case of waiting for human input, I would like to hear about it, since the approach above seems to leave workflows stuck after 1 hour.

Thanks very much for your help!!

P.S. Some assumptions made (please let me know if any are incorrect):

My understanding is that heartbeating for long-running tasks is not meant to be paired with asynchronous completion via ErrResultPending. For troubleshooting purposes, I tried inserting a RecordHeartbeat call in the activity before returning ErrResultPending, but that didn’t change the behavior above.
Based on the docs/expense example, the ScheduleToStart and StartToClose deadlines should be set equal to the amount of time we need to wait for human intervention. I set it to ~3 months.
My understanding of the various timeouts in play:

WorkflowExecutionTimeout: how long to wait for whole workflow to execute (should always be longer than any of the below), my example is set to 10 years
WorkflowRunTimeout: how long to wait for a single run of a workflow to execute, my example is set to 10 years
WorkflowTaskTimeout: how long to wait for a Decision Task to be executed (?), should be short? my example is set to 10 seconds
Activity scheduleToCloseTimeout: should be at least as long as the sum of scheduleToStart + startToClose, my example is set to ~6 months
Activity scheduleToStartTimeout: should be the amount of time to wait for the async human input, my example is set to ~3 months
Activity startToCloseTimeout: should be the amount of time to wait for the async human input, my example is set to ~3 months
What else can cause activity timeouts or cancellations and not be reflected in the UI/CLI or trigger a Timeout to be registered and workflow to proceed?
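The relationship between the three activity timeouts listed above can be written down as a small check. This is a minimal sketch in plain Go that only encodes the rule of thumb stated in this post (ScheduleToClose at least ScheduleToStart plus StartToClose); `ActivityTimeouts`, `Validate`, and `days` are hypothetical names, the month values are approximated as 90 and 180 days, and nothing here is claimed to be a check the Temporal server or SDK performs.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ActivityTimeouts is a hypothetical struct mirroring the three activity
// timeouts discussed in the post. It is not an SDK type.
type ActivityTimeouts struct {
	ScheduleToStart time.Duration // time the task may wait for a worker
	StartToClose    time.Duration // time allowed once a worker picked it up
	ScheduleToClose time.Duration // overall limit from scheduling to completion
}

// days converts a number of days to a time.Duration.
func days(n int) time.Duration {
	return time.Duration(n) * 24 * time.Hour
}

// Validate applies the rule of thumb from the post: the overall limit must
// cover both the queue time and the execution time.
func (t ActivityTimeouts) Validate() error {
	if t.ScheduleToStart <= 0 || t.StartToClose <= 0 {
		return errors.New("ScheduleToStart and StartToClose must be positive")
	}
	need := t.ScheduleToStart + t.StartToClose
	if t.ScheduleToClose < need {
		return fmt.Errorf("ScheduleToClose %v is shorter than ScheduleToStart + StartToClose = %v",
			t.ScheduleToClose, need)
	}
	return nil
}

func main() {
	// Values approximating the post: ~3 months, ~3 months, ~6 months.
	fromPost := ActivityTimeouts{
		ScheduleToStart: days(90),
		StartToClose:    days(90),
		ScheduleToClose: days(180),
	}
	fmt.Println("from post:", fromPost.Validate())

	// An overall limit that is too short for the two inner timeouts.
	tooShort := fromPost
	tooShort.ScheduleToClose = days(120)
	fmt.Println("too short:", tooShort.Validate())
}
```

Under these assumptions the values described in the post satisfy the rule, so the timeout configuration by itself would not explain a change in state after 1 hour.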

maxim · August 10, 2020, 2:59pm

The history with two ActivityTaskStarted events in a row looks invalid. What database binding does your service use? Can you reproduce the problem with the v0.28.0 version of the service and the SDK?

suejungshin · August 10, 2020, 6:31pm

Hi Maxim, thanks very much for your help.
The above was using Postgres. I am working on seeing if I can reproduce the issue when using Temporal v0.28.0 (instead of v0.26.0). I will report back.

maxim · August 10, 2020, 6:32pm

Postgres is not ready for production and has known (and possibly a bunch of unknown) bugs. We will invest in its productization in the future, but for now stick with MySQL or Cassandra.

suejungshin · August 13, 2020, 12:51am

Hello, sorry for the late reply. The error did not manifest itself when using MySQL or Cassandra (using either Temporal v0.26.0 or v0.28.0), so we will be changing our choice of database. Thank you for your help.
Do you have any updates on when/whether Scylla might be supported? I saw some discussion in the Slack group that with the latest update to Scylla LWT, it could become an option, but hasn’t been tested fully yet. Any other news on that front?
Thank you!

maxim · August 13, 2020, 1:15am

We certainly plan to evaluate Scylla, but no ETA yet. AFAIK Scylla still had some showstopper bugs around LWT last time we had this conversation.

suejungshin · August 13, 2020, 5:46pm

I see, makes sense. Thank you so much for your help!
