Activity reverts from state "Started" to "Scheduled" after 1 hour

Hello,
I have a workflow that reaches an activity which requires human input to complete (similar to the Go SDK expense asynchronous activity completion example).
The activity (called “ManualTaskActivity”) is set up to return activity.ErrResultPending.
Once the workflow is started and calls ExecuteActivity(...).Get on “ManualTaskActivity”, the Temporal UI and CLI behave as expected: workflow describe lists this activity under “pendingActivities” with state “Started”, and workflow show also shows the activity as “Started”.
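For context, this is roughly the pattern I am using (a minimal sketch; ManualTaskActivity, the saveTaskToken helper, and the timeout values are from my setup, and I am assuming the current go.temporal.io/sdk import paths):

```go
import (
	"context"
	"time"

	"go.temporal.io/sdk/activity"
	"go.temporal.io/sdk/workflow"
)

// ManualTaskActivity stores its task token so an external process can complete
// it later, then returns activity.ErrResultPending to tell the SDK the activity
// will be completed asynchronously.
func ManualTaskActivity(ctx context.Context) (string, error) {
	info := activity.GetInfo(ctx)
	saveTaskToken(info.TaskToken) // hypothetical helper that persists the token
	return "", activity.ErrResultPending
}

// ManualTaskWorkflow blocks on the activity result until it is completed
// asynchronously (or times out).
func ManualTaskWorkflow(ctx workflow.Context) (string, error) {
	ao := workflow.ActivityOptions{
		ScheduleToStartTimeout: 90 * 24 * time.Hour, // ~3 months
		StartToCloseTimeout:    90 * 24 * time.Hour, // ~3 months
	}
	ctx = workflow.WithActivityOptions(ctx, ao)

	var result string
	err := workflow.ExecuteActivity(ctx, ManualTaskActivity).Get(ctx, &result)
	return result, err
}
```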

However, after exactly 1 hour, workflow describe shows this activity in the “pendingActivities” list with state “Scheduled” (no longer “Started”), while the workflow show event history remains unchanged (the activity still appears as “Started”).
As a result, any attempt to CompleteActivity or CompleteActivityByID now fails with “invalid activityID or activity already timed out or invoking workflow is completed” / “ErrActivityTaskNotFound”, presumably because the “pendingActivities” list no longer contains the activity in state “Started”.
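For reference, the completion side looks roughly like this (a sketch; the namespace, result value, and ID parameters are placeholders from my setup, assuming the go.temporal.io/sdk/client API):

```go
import (
	"context"

	"go.temporal.io/sdk/client"
)

// Complete by task token, as persisted by the activity above.
func completeByToken(c client.Client, taskToken []byte) error {
	return c.CompleteActivity(context.Background(), taskToken, "approved", nil)
}

// Complete by IDs; this is the call that now fails with
// ErrActivityTaskNotFound once the activity is back in "Scheduled" state.
func completeByID(c client.Client, workflowID, runID, activityID string) error {
	return c.CompleteActivityByID(context.Background(),
		"default",  // namespace
		workflowID, // workflow ID
		runID,      // run ID ("" targets the latest run)
		activityID, // activity ID
		"approved", // result
		nil)        // error, nil on success
}
```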

The workflow is now stuck. I can get it unstuck by running docker run --net=host --rm temporalio/tctl:0.26.0 admin workflow refresh_tasks -w <workflowID>, which starts a new 1-hour clock and adds another ActivityTaskStarted line to workflow show.
The logs reflecting the above are saved at this gist: https://gist.github.com/suejungshin/e391ac0ac02d054b66b7a7a5025630f2

My questions are:

  1. What is going on here?
  2. The activity must have been cancelled or timed out somehow? How can I troubleshoot where/when/how this happens and prevent it, since we want to allow enough time for the human input?
  3. What role, if any, do Go context.Context cascading cancellations play here?

Theories:

  • Something within the Go context.Context is causing a cancellation to cascade and call RespondActivityTaskCanceled? But how/where? I thought the activity function had essentially already been invoked, so why would context.Context cancellations still reach it?

Other behaviors I cannot explain:

  • I experimented with removing the optional context.Context first argument from the activity. With it removed, the demotion from “Started” to “Scheduled” happened after 2 hours instead of 1. I cannot replicate this consistently, and sometimes the demotion happens in under 2 hours (possibly related to me restarting the activity worker, so the previous worker is no longer there?). What happens when I restart the worker that returned ErrResultPending? Does that manual task then become impossible to complete?
  • If I mistakenly call CompleteActivity with the wrong ActivityID for a given workflow ID, it mimics the behavior above: the currently pending ManualTaskActivity is demoted from state “Started” to state “Scheduled”, and it is then no longer possible to CompleteActivity on it.

Any help is very much appreciated. Alternatively, is there a different approach I should implement for this use case of waiting for human input, instead of relying on ErrResultPending, since the above seems to result in workflows getting stuck after 1 hour?
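For what it’s worth, one alternative I have been considering is having the workflow wait on a signal instead of an asynchronously completed activity; a rough sketch (the signal name and payload are arbitrary placeholders of mine):

```go
import "go.temporal.io/sdk/workflow"

// ManualApprovalWorkflow waits for a human decision delivered as a signal,
// rather than via an asynchronously completed activity.
func ManualApprovalWorkflow(ctx workflow.Context) (string, error) {
	var decision string
	signalCh := workflow.GetSignalChannel(ctx, "manual-task-decision")
	signalCh.Receive(ctx, &decision) // blocks until the signal arrives
	return decision, nil
}

// The human-facing system would then deliver the decision with something like:
//   temporalClient.SignalWorkflow(ctx, workflowID, runID, "manual-task-decision", "approved")
```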

Thanks very much for your help!!

P.S. Some assumptions made (please let me know if any are incorrect):

  • My understanding is that heartbeating for long-running tasks is not meant to be paired with async activity completion via ErrResultPending. For troubleshooting purposes, I tried inserting a RecordHeartbeat in the activity just before returning ErrResultPending, but that didn’t change the behavior above.
  • Based on the docs/expense example, the ScheduleToStart and StartToClose deadlines should be set equal to the amount of time we need to wait for human intervention. I set them to ~3 months.
  • My understanding of the various timeouts in play (a sketch of how I set them follows this list):
  1. WorkflowExecutionTimeout: how long to wait for the whole workflow to execute (should always be longer than any of the below); set to 10 years in my example
  2. WorkflowRunTimeout: how long to wait for a single run of the workflow to execute; set to 10 years in my example
  3. WorkflowTaskTimeout: how long to wait for a single workflow task (decision task) to be processed (?); should be short; set to 10 seconds in my example
  4. Activity scheduleToCloseTimeout: should be at least the sum of scheduleToStartTimeout + startToCloseTimeout; set to ~6 months in my example
  5. Activity scheduleToStartTimeout: should be the amount of time to wait for the async human input; set to ~3 months in my example
  6. Activity startToCloseTimeout: should be the amount of time to wait for the async human input; set to ~3 months in my example
  7. What else can cause activity timeouts or cancellations that are not reflected in the UI/CLI and do not register a Timeout event so the workflow can proceed?
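To make the last assumption concrete, this is roughly how I am setting those values (a sketch; field names assume the current go.temporal.io/sdk client and workflow packages, and the workflow ID / task queue names are placeholders):

```go
import (
	"time"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/workflow"
)

// Workflow-level timeouts, supplied when starting the workflow.
var startOptions = client.StartWorkflowOptions{
	ID:                       "manual-task-workflow",
	TaskQueue:                "manual-task-queue",
	WorkflowExecutionTimeout: 10 * 365 * 24 * time.Hour, // ~10 years
	WorkflowRunTimeout:       10 * 365 * 24 * time.Hour, // ~10 years
	WorkflowTaskTimeout:      10 * time.Second,
}

// Activity-level timeouts, applied inside the workflow before ExecuteActivity
// via workflow.WithActivityOptions.
var activityOptions = workflow.ActivityOptions{
	ScheduleToCloseTimeout: 180 * 24 * time.Hour, // ~6 months
	ScheduleToStartTimeout: 90 * 24 * time.Hour,  // ~3 months
	StartToCloseTimeout:    90 * 24 * time.Hour,  // ~3 months
}
```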

The history with two ActivityTaskStarted events in a row looks invalid. Which database is your service using? Can you reproduce the problem with the v0.28.0 version of the service and the SDK?

Hi Maxim, thanks very much for your help.
The above was using Postgres. I am working on seeing whether I can reproduce the issue using Temporal v0.28.0 (instead of v0.26.0) and will report back.

Postgres is not ready for production and has known (and possibly a bunch of unknown) bugs. We will invest in its productization in the future, but for now stick with MySQL or Cassandra.

Hello, sorry for the late reply. The error did not manifest itself when using MySQL or Cassandra (with either Temporal v0.26.0 or v0.28.0), so we will be changing our choice of database. Thank you for your help.
Do you have any updates on when/whether Scylla might be supported? I saw some discussion in the Slack group that, with the latest update to Scylla LWT, it could become an option, but it hasn’t been fully tested yet. Any other news on that front?
Thank you!

We certainly plan to evaluate Scylla, but there is no ETA yet. AFAIK Scylla still had some showstopper bugs around LWT the last time we had this conversation.

I see, makes sense. Thank you so much for your help!