Fail to process task - shard shut down

tsanydzul · January 3, 2025, 6:24am

I’m encountering an error within Temporal related to a workflow task failing to process. Here’s the relevant portion of the error log:

{
"level":"error",
"ts":"2024-08-30T08:24:00.921Z",
"msg":"Fail to process task",
"service":"history",
...
"error":"shard 32 for host 'XXXXX' is shut down",
...
}

{
"level":"error",
"ts":"2024-08-30T08:24:00.921Z",
"msg":"Critical error processing task, retrying.",
"service":"history",
"shard-id":92,
"address":"XXXXXX",
"shard-item":"0xc01fb3c580",
"component":"transfer-queue-processor",
"cluster-name":"active",
"shard-id":92,
"queue-task-id":31951255164,
"queue-task-visibility-timestamp":"2024-03-27T08:27:48.067Z",
"xdc-failover-version":0,
"queue-task-type":"TransferStartChildExecution",
"wf-namespace-id":"5803c5d0-4457-4741-9548-3cf5715c440d",
"wf-id":"A007240327082717488000000",
"wf-run-id":"1d733773-5f13-4078-bf54-ef28c2d8e167",
"error":"shard 32 for host 'XXXX' is shut down",
"operation-result":"OperationCritical",
"queue-task-type":"TransferStartChildExecution",
"logging-call-at":"taskProcessor.go:227",
"stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:227\ngo.temporal.io/server/common/backoff.Retry.func1\n\t/temporal/common/backoff/retry.go:104\ngo.temporal.io/server/common/backoff.RetryContext\n\t/temporal/common/backoff/retry.go:125\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:105\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:248\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:171"
}
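
To check whether these errors are confined to a single shard or spread across many, the history service log can be tallied by the shard number in the error text. The following is a minimal sketch, not an official Temporal tool. It assumes the log is written as one JSON object per line and that the `error` field has the same wording as in the entries above; the function name `count_shard_shutdowns` is made up for illustration.

```python
import json
import re
from collections import Counter

# Matches the error text seen in the log entries above,
# e.g. "shard 32 for host 'XXXXX' is shut down".
SHUTDOWN_RE = re.compile(r"shard (\d+) for host '.*' is shut down")


def count_shard_shutdowns(lines):
    """Count "shard N ... is shut down" errors per shard number.

    Assumption: each line holds one complete JSON log entry.
    Lines that are not valid JSON objects are skipped.
    """
    counts = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON log line
        if not isinstance(entry, dict):
            continue
        match = SHUTDOWN_RE.search(str(entry.get("error", "")))
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

Passing an open file object (for example `count_shard_shutdowns(open("history.log"))`, where the file name is a placeholder) returns a `Counter` keyed by shard number, which shows quickly whether one shard or many are affected.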

Context:

Temporal cluster name: active
Workflow namespace ID: 5803c5d0-4457-4741-9548-3cf5715c440d
Workflow ID: A007240327082717488000000
Workflow Run ID: 1d733773-5f13-4078-bf54-ef28c2d8e167

Question:

I’m trying to understand the root cause of this error. The log indicates that “shard 32 for host ‘XX.XX.XX’ is shut down”. Does this mean a shard responsible for processing the workflow task became unavailable?

What could cause a shard to shut down unexpectedly?
Are there any steps I can take to troubleshoot this issue further?
How can I prevent similar errors from happening in the future?

I appreciate any insights or suggestions the community can offer!

Thanks,

Tsany
