Workflows completing early

We noticed that some of our workflows were completing early without any error or timeout. Checking our Temporal server logs for the same time period, we see many errors like the one below.

Around this time we were starting on the order of 100K abandoned child workflows, each of which executes an activity that polls for a status by returning an error and retrying until the desired status is received.
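
For context, the pattern looks roughly like the sketch below (Go SDK; names such as PollStatusActivity, ChildWorkflow, ParentWorkflow, and checkExternalStatus are illustrative placeholders, not our real code):

package sample

import (
	"context"
	"errors"
	"time"

	enumspb "go.temporal.io/api/enums/v1"
	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// checkExternalStatus is a hypothetical stand-in for the call to the system we poll.
func checkExternalStatus(ctx context.Context, id string) (string, error) {
	return "PENDING", nil
}

// PollStatusActivity returns an error until the desired status is reached,
// so the activity retry policy drives the polling loop.
func PollStatusActivity(ctx context.Context, id string) error {
	status, err := checkExternalStatus(ctx, id)
	if err != nil {
		return err
	}
	if status != "DONE" {
		return errors.New("status not ready yet") // retried per RetryPolicy
	}
	return nil
}

// ChildWorkflow just runs the polling activity with a retry policy.
func ChildWorkflow(ctx workflow.Context, id string) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Minute,
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval:    10 * time.Second,
			BackoffCoefficient: 1.0,
			MaximumInterval:    time.Minute,
		},
	})
	return workflow.ExecuteActivity(ctx, PollStatusActivity, id).Get(ctx, nil)
}

// ParentWorkflow starts each child with an ABANDON parent-close policy and only
// waits for the child to be started, not for it to complete.
func ParentWorkflow(ctx workflow.Context, ids []string) error {
	for _, id := range ids {
		childCtx := workflow.WithChildOptions(ctx, workflow.ChildWorkflowOptions{
			ParentClosePolicy: enumspb.PARENT_CLOSE_POLICY_ABANDON,
		})
		future := workflow.ExecuteChildWorkflow(childCtx, ChildWorkflow, id)
		if err := future.GetChildWorkflowExecution().Get(ctx, nil); err != nil {
			return err
		}
	}
	return nil
}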

Is this something we can address with a configuration change?
Please let us know if more information is needed.

{
   address: xxx.xxx.xxx.xxx:7234
   cluster-name: active
   component: timer-queue-processor
   error: context deadline exceeded
   level: error
   lifecycle: ProcessingFailed
   logging-call-at: taskProcessor.go:326
   msg: Fail to process task
   queue-task-id: 133055103
   queue-task-type: ActivityRetryTimer
   queue-task-visibility-timestamp: 1625173712064589600
   service: history
   shard-id: 4
   shard-item: 0xc000cf6580
   stacktrace: go.temporal.io/server/common/log/loggerimpl.(*loggerImpl).Error
	/temporal/common/log/loggerimpl/logger.go:138
go.temporal.io/server/service/history.(*taskProcessor).handleTaskError
	/temporal/service/history/taskProcessor.go:326
go.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1
	/temporal/service/history/taskProcessor.go:212
go.temporal.io/server/common/backoff.Retry
	/temporal/common/backoff/retry.go:103
go.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck
	/temporal/service/history/taskProcessor.go:238
go.temporal.io/server/service/history.(*taskProcessor).taskWorker
	/temporal/service/history/taskProcessor.go:161
   ts: 2021-07-01T21:14:05.130Z
   wf-id: 29bc0b91-3547-473e-9844-bd48b70d9745_17
   wf-namespace-id: c27e3385-bb4c-45f2-abc6-e7d20a76028e
   wf-run-id: b6673fab-d36e-4b67-9abe-6502514fb8e7
   wf-timeout-type: Unspecified
   xdc-failover-version: 0
}

Starting a child workflow is a cross-shard operation. It looks like you are getting timeouts, which means the task responsible for starting a child workflow could not be completed, most likely because the other shard is busy. My guess is that your Temporal server is not provisioned with enough shards to handle the workload. Can you confirm the total number of shards in the cluster? Unfortunately, if this number is low, there is currently no way to increase the number of shards; you have to provision a new cluster with a higher shard count.
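
For reference, the shard count comes from the numHistoryShards setting in the server's static configuration and is fixed when the cluster's persistence store is first initialized. A minimal excerpt is shown below; the value 512 is purely illustrative, not a recommendation:

# Excerpt from the Temporal server static config (e.g. config/development.yaml).
# numHistoryShards is read when the cluster is first provisioned and, as noted
# above, cannot be raised in place on an existing cluster.
persistence:
  numHistoryShards: 512
  defaultStore: default
  visibilityStore: visibility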

Thanks for the quick reply. We will look into increasing the number of shards. Is there any documentation available to assist with deciding on a number and applying the update? We are using MySQL.

Here is a good blog post on this topic.

Thanks Samar. Is it expected behavior in this case that the workflows complete without any error exposed to the workflow or activity code?