Scheduled WorkflowTasks occasionally not starting

Over the past few weeks, we’ve noticed that a handful of our Workflows, all running on the same task queue, have had their first WorkflowTask scheduled but never started. We’ve seen it happen five times in one of our namespaces: three times on one particular day, each occurrence about two hours apart, then once three days later, and once more a day after that.

Other workflows running on the same task queue have run successfully around the same time as, and after, these workflows got stuck, and there is a worker listening to that task queue. Can someone explain why this small group is not making progress?

Here is some relevant output from the Temporal CLI, showing one of the stuck workflows and info about the task queue (with some minor edits to the names of things):

temporal@temporal-frontend-65b4df7b7b-ccfdp:/etc/temporal$ temporal workflow show -w ffdf6b5b-aee9-35da-81d2-4b4509384c1b
Progress:
  ID          Time                        Type                
   1  2024-05-20T06:08:20Z  WorkflowExecutionStarted          
   2  2024-05-20T06:08:20Z  WorkflowTaskScheduled             
   3  2024-05-20T07:04:30Z  WorkflowExecutionCancelRequested  

Result:

temporal@temporal-frontend-65b4df7b7b-ccfdp:/etc/temporal$ temporal workflow describe -w ffdf6b5b-aee9-35da-81d2-4b4509384c1b
{
  "executionConfig": {
    "taskQueue": {
      "name": "yyy.zzz.task.queue",
      "kind": "Normal"
    },
    "workflowExecutionTimeout": "0s",
    "workflowRunTimeout": "0s",
    "defaultWorkflowTaskTimeout": "10s"
  },
  "workflowExecutionInfo": {
    "execution": {
      "workflowId": "ffdf6b5b-aee9-35da-81d2-4b4509384c1b",
      "runId": "3951299c-9ec4-47ad-bbb3-927d23754933"
    },
    "type": {
      "name": "GetYYYWorkflow"
    },
    "startTime": "2024-05-20T06:08:20.176719278Z",
    "status": "Running",
    "historyLength": "3",
    "parentNamespaceId": "4a379ce6-7259-4e1d-80c2-bbadbf6bb725",
    "parentExecution": {
      "workflowId": "YYY:j2wSQHCPIy/6ajRhDOoAKuJoq9Qdg59kVTrVAT8SMyg=",
      "runId": "41587bf8-7eb0-4e57-a066-9d14c4bd0004"
    },
    "executionTime": "2024-05-20T06:08:20.176719278Z",
    "memo": {

    },
    "autoResetPoints": {

    },
    "stateTransitionCount": "3",
    "historySizeBytes": "12062"
  },
  "pendingWorkflowTask": {
    "state": "Scheduled",
    "scheduledTime": "2024-05-20T06:08:20.200742918Z",
    "originalScheduledTime": "2024-05-20T06:08:20.200742661Z",
    "attempt": 1
  }
}

temporal@temporal-frontend-65b4df7b7b-ccfdp:/etc/temporal$ temporal task-queue describe -t "yyy.zzz.task.queue"
                  Identity                  LastAccessTime  RatePerSecond  
  1@my-worker-zzzzzzzzzzzz-767486cfc-9rc5z  25 seconds ago         100000  
  
temporal@temporal-frontend-65b4df7b7b-ccfdp:/etc/temporal$ temporal task-queue get-build-ids -t "yyy.zzz.task.queue"
        BuildIds          DefaultForSet    IsDefaultSet  
  [my-service-1.973.0  my-service-1.985.0  true          
  my-service-1.977.0                                                 
  my-service-1.985.0]
  
temporal@temporal-frontend-65b4df7b7b-ccfdp:/etc/temporal$ temporal task-queue get-build-id-reachability -t "yyy.zzz.task.queue" --build-id my-service-1.985.0
       BuildId             TaskQueue               Reachability           
  my-service-1.985.0  yyy.zzz.task.queue  [NewWorkflows                   
                                          ExistingWorkflows]

As a point of comparison, here is the describe output for an execution of the same workflow type, on the same task queue, several days later:

temporal@temporal-frontend-65b4df7b7b-ccfdp:/etc/temporal$ temporal workflow describe -w 7e74a0dc-25cc-314e-aa2c-4f741718f074
{
  "executionConfig": {
    "taskQueue": {
      "name": "yyy.zzz.task.queue",
      "kind": "Normal"
    },
    "workflowExecutionTimeout": "0s",
    "workflowRunTimeout": "0s",
    "defaultWorkflowTaskTimeout": "10s"
  },
  "workflowExecutionInfo": {
    "execution": {
      "workflowId": "7e74a0dc-25cc-314e-aa2c-4f741718f074",
      "runId": "d2df027d-e3f5-4941-9fb5-f621eaf15421"
    },
    "type": {
      "name": "GetYYYWorkflow"
    },
    "startTime": "2024-05-28T11:12:54.087547236Z",
    "closeTime": "2024-05-28T11:12:54.580375549Z",
    "status": "Completed",
    "historyLength": "15",
    "parentNamespaceId": "4a379ce6-7259-4e1d-80c2-bbadbf6bb725",
    "parentExecution": {
      "workflowId": "YYY:f26lz90mlkPuaTXfZebfzsv/fqw++dNc8Lz5ryq00kQ=",
      "runId": "37165d3d-4e25-4f70-8e8e-1af8f984258e"
    },
    "executionTime": "2024-05-28T11:12:54.087547236Z",
    "memo": {

    },
    "searchAttributes": {
      "indexedFields": {
        "BuildIds": "[\"versioned:my-service-1.977.0\"]"
      }
    },
    "autoResetPoints": {
      "points": [
        {
          "binaryChecksum": "my-service-1.977.0",
          "runId": "d2df027d-e3f5-4941-9fb5-f621eaf15421",
          "firstWorkflowTaskCompletedId": "4",
          "createTime": "2024-05-28T11:12:54.224141722Z",
          "resettable": true
        }
      ]
    },
    "stateTransitionCount": "8",
    "historySizeBytes": "27301",
    "mostRecentWorkerVersionStamp": {
      "buildId": "my-service-1.977.0",
      "useVersioning": true
    }
  }
}

We’re using Postgres for persistence, and running Temporal server 1.21.5. Thanks!

Hey dano, do you happen to have your history service’s logs and metrics from around this time? If so, please make a copy so we can help you identify what went wrong :slight_smile:

Do you see either the log message “Task Processor: task event ID >= MS NextEventID, skip” or a non-zero task_skipped metric around that time?
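
If it’s easier to check from Kubernetes directly, here’s a rough sketch (the namespace, label selector, and time window are assumptions about your deployment; adjust to match yours):

# grep the history service logs for the skip message around the incident window
# (namespace, label selector, and --since window are assumptions; adjust as needed)
kubectl -n temporal logs -l app.kubernetes.io/component=history --since=24h --tail=-1 \
  | grep "task event ID >= MS NextEventID"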

I’ll have to get back to you on the metrics data, but the temporal-history logs from the five seconds before and after the workflow got scheduled don’t include that message, or any other message related to the namespace we observed the issue on. I just see a handful of “Critical attempts processing workflow task” messages, but they’re all for workflows running in other namespaces.

This suggests that the transfer task to generate the workflow task was not executed.

What was the utilization level of your DB around that time? Were there any abnormal logs from the DB during that time? What kind of DB setup do you have? I assume there are at least 3 replicas.

From the DB metrics I have access to, I don’t see any unusual spikes in CPU or memory utilization at all on 5/20, nor are there any unusual logs. We’re running a Postgres Replica Set with two nodes, using async replication.

I am told that we currently don’t persist most temporal metrics for more than 24 hours, so the task_skipped metrics from around the time the workflow tasks got stuck are lost at this point.

I’m not a PostgreSQL expert, but couldn’t this cause data inconsistency?
For example, if you write to the primary node and then read from a replica that hasn’t caught up on replication yet?

See more here: SQL replication capability for temporal postgres db - #4 by tihomir

We don’t allow any reads to go to the standby node. We could have data inconsistencies if we needed to do an unexpected failover, but there haven’t been any failovers during the time period we’ve seen these issues.

To clarify, this is happening in our development environment. For our production environment, we have three nodes connected via synchronous replication.

Have you experienced a similar issue in your prod environment?

Would it be possible to change your dev environment to have the same setup as prod and see if that solves the problem?
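
Roughly, on the dev primary that would look something like the following (I’m assuming a standard streaming-replication setup; 'standby1' is just a placeholder for your replica’s application_name):

# promote the existing async standby to a synchronous one; 'standby1' is a placeholder
psql -c "ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (standby1)';"
psql -c "ALTER SYSTEM SET synchronous_commit = 'on';"
psql -c "SELECT pg_reload_conf();"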

Is there something we can look for in the logs or elsewhere that will confirm that this is the case?

Edit:
Other members of my team will get back to you regarding your question about our DB configuration.

I can say that we have not seen this in production. In dev we have over a dozen namespaces, each with many task queues, and we are only seeing this happen to one task queue in one namespace. It is still reproducing - it happened again a few hours ago. This time we were able to look at the Temporal metrics, and there is no task_skipped metric being reported at all.

You can check the current persisted task ack level for the shard where the impacted workflow lives and compare it to the lastEventTaskId of the workflow. You should see that the ack level has already moved ahead of the latest task ID of the workflow.
Steps:

  1. Figure out which shard_id the impacted workflow belongs to:
    tdbg --namespace $YOUR_NAMESPACE workflow describe --workflow-id $YOUR_WORKFLOW_ID
    The last line of the output prints the shard ID. Also pay attention to the “lastEventTaskId”; this indicates the largest task ID generated by this workflow.

  2. tdbg shard d --shard-id $SHARD_ID
    You should see something like this:
    "queueStates": { "1": { "exclusiveReaderHighWatermark": { "fireTime": "1970-01-01T00:00:00Z", "taskId": "1341510" } },
    Notice that “1” represents the TransferQueue, and taskId 1341510 represents the acked task ID. My guess is that this task ID is already ahead of the lastEventTaskId from your workflow.

tdbg is a tool included in the admin-tools docker image. You could also build it from code if you want.

Note:
Temporal requires the DB to be strongly consistent (read-after-write consistency). Tasks can be lost if strong consistency is not guaranteed.
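
Putting the two steps together, roughly (the namespace and shard ID below are placeholders; the workflow ID is the stuck one from your earlier output):

# step 1: note the shard ID (last line of output) and the lastEventTaskId
tdbg --namespace my-namespace workflow describe --workflow-id ffdf6b5b-aee9-35da-81d2-4b4509384c1b

# step 2: describe that shard and look at queueStates["1"] (the transfer queue)
tdbg shard d --shard-id 1234
# if exclusiveReaderHighWatermark.taskId is already ahead of the workflow's lastEventTaskId,
# the transfer queue has moved past the workflow's last task, i.e. the lost-task scenario above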

@Yimin_Chen I work with @dano and have been trying to debug the issue on our end. I ran the tdbg commands you mentioned and your guess is right: the lastEventTaskId is 106954799 and the acked task ID is 106959213, which is ahead of it.

Regarding the DB being consistent: we always point to the primary and don’t do any reads from any other replica. This is a simple Postgres replica set with one async replica, which I would expect to guarantee strong consistency.
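
(For what it’s worth, one quick way to sanity-check that a given connection string really lands on the primary; the DSN variable below is just a placeholder for however the connection string is stored:)

# pg_is_in_recovery() returns 'f' on the primary and 't' on a standby
psql "$TEMPORAL_SQL_DSN" -c 'SELECT pg_is_in_recovery();'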
At the time of our previous run, we saw some error messages in the matching service related to persistent store operations:

“Failed to lock task queue. Error: context canceled”

There was a minor spike in the task_errors metric for the matching service at the same time.
Also, as @dano mentioned, we have >100 namespaces on this cluster (dev), and only this particular namespace and task queue are affected.

I think the bug you’re hitting has to do with worker versioning. It looks similar to a bug I’ve seen before but didn’t track down.

A couple questions:

  1. After seeing “Failed to lock task queue. Error: context canceled” you should have seen a log with “Persistent store operation failure”. And then (if I’m reading correctly) you should not see any lifecycle logs around that time (i.e. the task queue did not get unloaded/reloaded). Does that look right?

  2. It seems like the task probably got lost in matching. It would be helpful to know if it’s sitting in a task queue and if so, which one.

Would you be able to try some commands like this:

taskqueue="yyy.zzz.task.queue"  # replace with real name
namespace="your.namespace"      # replace with real one
flags="--task-queue-type TASK_QUEUE_TYPE_WORKFLOW --min-task-id 0 --max-task-id 1000000000"

echo "unversioned partition:0"
tdbg -n "$namespace" tq list-tasks --task-queue "$taskqueue" $flags
for i in 1 2 3; do
  echo "unversioned partition:$i"
  tdbg -n "$namespace" tq list-tasks --task-queue "/_sys/$taskqueue/$i" $flags
done

for buildid in my-service-1.973.0 my-service-1.977.0 my-service-1.985.0; do   # add any other build IDs here
  setid=$(echo -n "$buildid" | cksum -a sha256 --raw | basenc --base64url | cut -c-20)
  for i in 0 1 2 3; do
    echo "versioned build:$buildid set:$setid partition:$i"
    tdbg -n "$namespace" tq list-tasks --task-queue "/_sys/$taskqueue/$setid:$i" $flags
  done
done

and let me know if something that looks like the missing task shows up there? You can add --print-json if that output is easier to deal with. (If it’s complaining about TASK_QUEUE_TYPE_WORKFLOW, try just workflow.)

@dnr for your first question, should we only be paying attention to logs with that message for the task queue the stuck workflows are assigned to? I’m looking at one particular case where we had a workflow get stuck, and the closest “Persistent store operation failure” log caused by the “Failed to lock task queue” error for the associated task queue is 8 hours prior (for the activity queue); the closest for the workflow queue is ~18 hours prior. Some other task queues in the same namespace hit the failure as well, but the closest of those occurred about 40 minutes before the stuck workflow was scheduled.

The stuck workflow was scheduled at 14:09:13.71, and these are the only logs from the matching service for that task queue close to that time window:

{"level":"info","ts":"2024-06-05T14:09:26.168Z","msg":"none","service":"matching","component":"matching-engine","wf-task-queue-name":"x.y.z.task.queue","wf-task-queue-type":"Workflow","wf-namespace":"my-namespace","lifecycle":"Started","logging-call-at":"taskQueueManager.go:320"}
{"level":"info","ts":"2024-06-05T14:09:25.503Z","msg":"Taskqueue manager context is cancelled, shutting down","service":"matching","component":"matching-engine","wf-task-queue-name":"x.y.z.task.queue","wf-task-queue-type":"Workflow","wf-namespace":"my-namespace","logging-call-at":"taskReader.go:157"}
{"level":"info","ts":"2024-06-05T14:09:25.503Z","msg":"none","service":"matching","component":"matching-engine","wf-task-queue-name":"x.y.z.task.queue","wf-task-queue-type":"Workflow","wf-namespace":"my-namespace","lifecycle":"Stopped","logging-call-at":"taskQueueManager.go:350"}
{"level":"info","ts":"2024-06-05T14:09:25.167Z","msg":"none","service":"matching","component":"matching-engine","wf-task-queue-name":"x.y.z.task.queue","wf-task-queue-type":"Activity","wf-namespace":"my-namespace","lifecycle":"Started","logging-call-at":"taskQueueManager.go:320"}
{"level":"info","ts":"2024-06-05T14:09:13.290Z","msg":"Taskqueue manager context is cancelled, shutting down","service":"matching","component":"matching-engine","wf-task-queue-name":"x.y.z.task.queue","wf-task-queue-type":"Activity","wf-namespace":"my-namespace","logging-call-at":"taskReader.go:157"}
{"level":"info","ts":"2024-06-05T14:09:13.290Z","msg":"none","service":"matching","component":"matching-engine","wf-task-queue-name":"x.y.z.task.queue","wf-task-queue-type":"Activity","wf-namespace":"my-namespace","lifecycle":"Stopped","logging-call-at":"taskQueueManager.go:350"}
{"level":"info","ts":"2024-06-05T14:08:01.255Z","msg":"none","service":"matching","component":"matching-engine","wf-task-queue-name":"/_sys/x.y.z.task.queue/3","wf-task-queue-type":"Workflow","wf-namespace":"my-namespace","lifecycle":"Started","logging-call-at":"taskQueueManager.go:320"}
{"level":"info","ts":"2024-06-05T14:07:11.008Z","msg":"none","service":"matching","component":"matching-engine","wf-task-queue-name":"/_sys/x.y.z.task.queue/1","wf-task-queue-type":"Workflow","wf-namespace":"my-namespace","lifecycle":"Started","logging-call-at":"taskQueueManager.go:320"}

I don’t have access to run the tdbg commands you mentioned, so we’ll have to wait for @KumarPrashant for that.

@dnr Sorry for the delay. I was able to run your test script (with some slight modifications: my system has an older version of cksum, so I used sha256sum | xxd -r -p instead, and --task-queue-type workflow instead of --task-queue-type TASK_QUEUE_TYPE_WORKFLOW). None of the task queues showed any tasks in them.
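
In case anyone else hits the same cksum limitation, the substituted line looked roughly like this (the cut strips sha256sum’s trailing “-” so xxd only sees the hex digest):

setid=$(echo -n "$buildid" | sha256sum | cut -d' ' -f1 | xxd -r -p | basenc --base64url | cut -c-20)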

Hi @dano,

I understand this investigation is still ongoing, but as a potential solution, please consider migrating to the new Worker Versioning API that we just published.

A known bug was present in the old Worker Versioning API that could have caused the stuck-workflow issue you are facing. The new API doesn’t have that bug and has more capabilities.

Please note that you’d need to upgrade to Temporal Server 1.24 for this new API to work.

Here are the docs: temporal/docs/worker-versioning.md at main · temporalio/temporal · GitHub

Thanks, @ShahabT. We do plan to move to the new Worker Versioning API, but we had planned to wait until it was clear the API had reached a point where it wouldn’t be getting additional backwards-incompatible changes. We may adjust that plan if migrating will fix this bug.

That said, I think we’d want to wait for the Java SDK to add support for the new versioning APIs, whether we migrate now or after the API stabilizes. Currently, each of our (Java) microservices programmatically configures its version-set compatibility rules for the task queue it polls as part of startup, and it’d probably be easier for us to migrate that code to configure assignment/redirect rules than to switch to using the CLI to do it.