Workflows completing early

We noticed that some of our workflows were completing early without any error or timeout. Checking our Temporal server logs for the same time period, we see many errors like the one below.

Around this time we were starting on the order of 100K abandoned child workflows, each of which executes an activity that polls for a status by returning an error and retrying until the desired status is received.
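
For context, the pattern looks roughly like the sketch below (Go SDK; names such as PollStatusActivity, ChildWorkflow, ParentWorkflow, and checkExternalStatus are illustrative placeholders, not our real code):

package sample

import (
	"context"
	"errors"
	"time"

	enumspb "go.temporal.io/api/enums/v1"
	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// checkExternalStatus is a hypothetical stand-in for the call to the system we poll.
func checkExternalStatus(ctx context.Context, id string) (string, error) {
	return "PENDING", nil
}

// PollStatusActivity returns an error until the desired status is reached,
// so the activity retry policy drives the polling loop.
func PollStatusActivity(ctx context.Context, id string) error {
	status, err := checkExternalStatus(ctx, id)
	if err != nil {
		return err
	}
	if status != "DONE" {
		return errors.New("status not ready yet") // retried per RetryPolicy
	}
	return nil
}

// ChildWorkflow just runs the polling activity with a retry policy.
func ChildWorkflow(ctx workflow.Context, id string) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Minute,
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval:    10 * time.Second,
			BackoffCoefficient: 1.0,
			MaximumInterval:    time.Minute,
		},
	})
	return workflow.ExecuteActivity(ctx, PollStatusActivity, id).Get(ctx, nil)
}

// ParentWorkflow starts each child with an ABANDON parent-close policy and only
// waits for the child to be started, not for it to complete.
func ParentWorkflow(ctx workflow.Context, ids []string) error {
	for _, id := range ids {
		childCtx := workflow.WithChildOptions(ctx, workflow.ChildWorkflowOptions{
			ParentClosePolicy: enumspb.PARENT_CLOSE_POLICY_ABANDON,
		})
		future := workflow.ExecuteChildWorkflow(childCtx, ChildWorkflow, id)
		if err := future.GetChildWorkflowExecution().Get(ctx, nil); err != nil {
			return err
		}
	}
	return nil
}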

Is this something we can address with a configuration change?
Please let us know if more information is needed.

{
   address: xxx.xxx.xxx.xxx:7234
   cluster-name: active
   component: timer-queue-processor
   error: context deadline exceeded
   level: error
   lifecycle: ProcessingFailed
   logging-call-at: taskProcessor.go:326
   msg: Fail to process task
   queue-task-id: 133055103
   queue-task-type: ActivityRetryTimer
   queue-task-visibility-timestamp: 1625173712064589600
   service: history
   shard-id: 4
   shard-item: 0xc000cf6580
   stacktrace: go.temporal.io/server/common/log/loggerimpl.(*loggerImpl).Error
	/temporal/common/log/loggerimpl/logger.go:138
go.temporal.io/server/service/history.(*taskProcessor).handleTaskError
	/temporal/service/history/taskProcessor.go:326
go.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1
	/temporal/service/history/taskProcessor.go:212
go.temporal.io/server/common/backoff.Retry
	/temporal/common/backoff/retry.go:103
go.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck
	/temporal/service/history/taskProcessor.go:238
go.temporal.io/server/service/history.(*taskProcessor).taskWorker
	/temporal/service/history/taskProcessor.go:161
   ts: 2021-07-01T21:14:05.130Z
   wf-id: 29bc0b91-3547-473e-9844-bd48b70d9745_17
   wf-namespace-id: c27e3385-bb4c-45f2-abc6-e7d20a76028e
   wf-run-id: b6673fab-d36e-4b67-9abe-6502514fb8e7
   wf-timeout-type: Unspecified
   xdc-failover-version: 0
}

Starting a child workflow is a cross-shard operation. It looks like you are getting timeouts, which means the task responsible for starting a child workflow could not be completed, most likely because the other shard is busy. My guess is that your Temporal server is not provisioned with enough shards to handle the workload. Can you confirm the total number of shards in the cluster? Unfortunately, if this number is low, there is currently no way to increase the number of shards; you have to provision a new cluster with a higher shard count.
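
For reference, the shard count comes from the numHistoryShards setting in the server's static configuration and is fixed when the cluster's persistence store is first initialized. A minimal excerpt is shown below; the value 512 is purely illustrative, not a recommendation:

# Excerpt from the Temporal server static config (e.g. config/development.yaml).
# numHistoryShards is read when the cluster is first provisioned and, as noted
# above, cannot be raised in place on an existing cluster.
persistence:
  numHistoryShards: 512
  defaultStore: default
  visibilityStore: visibility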

Thanks for the quick reply. We will look into increasing the number of shards. Is there any documentation available to assist with deciding on a number and applying the update? We are using MySQL.

Here is a good blog post on this topic.

Thanks Samar. Is it expected behavior in this case that the workflows complete without any error exposed to the workflow or activity code?