We noticed that some of our workflows were completing early without any error or timeout. When checking our Temporal server logs for the same time period, we see many errors like the one below.
Around this time we were starting on the order of 100K abandoned child workflows, each of which executes an activity that polls for status by returning an error and retrying until the desired status is received.
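
For context, here is a simplified sketch of that pattern using the Go SDK (names are illustrative, not our actual code):

```go
// Sketch only: the parent fires off abandoned children, and each child runs a
// single polling activity whose retry policy drives the status checks.
package sample

import (
	"context"
	"errors"
	"time"

	enumspb "go.temporal.io/api/enums/v1"
	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// ParentWorkflow starts many abandoned child workflows.
func ParentWorkflow(ctx workflow.Context, ids []string) error {
	for _, id := range ids {
		cctx := workflow.WithChildOptions(ctx, workflow.ChildWorkflowOptions{
			WorkflowID:        "poll-" + id,
			ParentClosePolicy: enumspb.PARENT_CLOSE_POLICY_ABANDON,
		})
		// Wait only until the child is started so it is not lost
		// if the parent completes first.
		if err := workflow.ExecuteChildWorkflow(cctx, ChildWorkflow, id).
			GetChildWorkflowExecution().Get(cctx, nil); err != nil {
			return err
		}
	}
	return nil
}

// ChildWorkflow runs the polling activity; server-side retries do the polling.
func ChildWorkflow(ctx workflow.Context, id string) error {
	actx := workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 10 * time.Second,
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval:    5 * time.Second,
			BackoffCoefficient: 1.0, // poll at a fixed interval
			MaximumAttempts:    0,   // unlimited; retry until the status arrives
		},
	})
	return workflow.ExecuteActivity(actx, PollStatusActivity, id).Get(actx, nil)
}

// PollStatusActivity returns an error until the desired status is reached,
// relying on the activity retry policy to schedule the next attempt
// (each retry produces an ActivityRetryTimer task on the server).
func PollStatusActivity(ctx context.Context, id string) error {
	done, err := checkStatus(ctx, id) // checkStatus stands in for our real status lookup
	if err != nil {
		return err
	}
	if !done {
		return errors.New("status not ready yet")
	}
	return nil
}

func checkStatus(ctx context.Context, id string) (bool, error) {
	// placeholder for the real status API call
	return false, nil
}
```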
Is this something we can address with a configuration change?
Please let us know if more information is needed.
{
address: xxx.xxx.xxx.xxx:7234
cluster-name: active
component: timer-queue-processor
error: context deadline exceeded
level: error
lifecycle: ProcessingFailed
logging-call-at: taskProcessor.go:326
msg: Fail to process task
queue-task-id: 133055103
queue-task-type: ActivityRetryTimer
queue-task-visibility-timestamp: 1625173712064589600
service: history
shard-id: 4
shard-item: 0xc000cf6580
stacktrace: go.temporal.io/server/common/log/loggerimpl.(*loggerImpl).Error
/temporal/common/log/loggerimpl/logger.go:138
go.temporal.io/server/service/history.(*taskProcessor).handleTaskError
/temporal/service/history/taskProcessor.go:326
go.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1
/temporal/service/history/taskProcessor.go:212
go.temporal.io/server/common/backoff.Retry
/temporal/common/backoff/retry.go:103
go.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck
/temporal/service/history/taskProcessor.go:238
go.temporal.io/server/service/history.(*taskProcessor).taskWorker
/temporal/service/history/taskProcessor.go:161
ts: 2021-07-01T21:14:05.130Z
wf-id: 29bc0b91-3547-473e-9844-bd48b70d9745_17
wf-namespace-id: c27e3385-bb4c-45f2-abc6-e7d20a76028e
wf-run-id: b6673fab-d36e-4b67-9abe-6502514fb8e7
wf-timeout-type: Unspecified
xdc-failover-version: 0
}