Fail to process task - shard shut down

I’m seeing errors from the Temporal history service when it fails to process a transfer-queue task (TransferStartChildExecution). Here are the relevant log entries:

{
"level":"error",
"ts":"2024-08-30T08:24:00.921Z",
"msg":"Fail to process task",
"service":"history",
...
"error":"shard 32 for host 'XXXXX' is shut down",
...
}

{
"level":"error",
"ts":"2024-08-30T08:24:00.921Z",
"msg":"Critical error processing task, retrying.",
"service":"history",
"shard-id":92,
"address":"XXXXXX",
"shard-item":"0xc01fb3c580",
"component":"transfer-queue-processor",
"cluster-name":"active",
"shard-id":92,
"queue-task-id":31951255164,
"queue-task-visibility-timestamp":"2024-03-27T08:27:48.067Z",
"xdc-failover-version":0,
"queue-task-type":"TransferStartChildExecution",
"wf-namespace-id":"5803c5d0-4457-4741-9548-3cf5715c440d",
"wf-id":"A007240327082717488000000",
"wf-run-id":"1d733773-5f13-4078-bf54-ef28c2d8e167",
"error":"shard 32 for host 'XXXX' is shut down",
"operation-result":"OperationCritical",
"queue-task-type":"TransferStartChildExecution",
"logging-call-at":"taskProcessor.go:227",
"stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:227\ngo.temporal.io/server/common/backoff.Retry.func1\n\t/temporal/common/backoff/retry.go:104\ngo.temporal.io/server/common/backoff.RetryContext\n\t/temporal/common/backoff/retry.go:125\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:105\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:248\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:171"
}

Context:

  • Temporal cluster name: active
  • Workflow namespace ID: 5803c5d0-4457-4741-9548-3cf5715c440d
  • Workflow ID: A007240327082717488000000
  • Workflow Run ID: 1d733773-5f13-4078-bf54-ef28c2d8e167

Question:

I’m trying to understand the root cause of this error. The log reports "shard 32 for host 'XXXX' is shut down" (host redacted), even though the second entry is tagged with shard-id 92. Does this mean the shard responsible for processing this task became unavailable?

  • What could cause a shard to shut down unexpectedly?
  • Are there any steps I can take to troubleshoot this issue further?
  • How can I prevent similar errors from happening in the future?
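
In case it helps with troubleshooting, here is a minimal sketch of how I could tally these errors from the history service log. It assumes the raw log has one JSON object per line (the entries above are wrapped for readability) and that the error text always follows the pattern shown above. The function name `tally_shard_shutdowns` is my own and not part of Temporal.

```python
import json
import re
from collections import Counter

# Matches the error text from the log entries above,
# e.g. "shard 32 for host 'XXXXX' is shut down".
SHARD_SHUTDOWN = re.compile(r"shard (\d+) for host '([^']*)' is shut down")


def tally_shard_shutdowns(lines):
    """Count shard-shutdown errors per shard named in the error text.

    Also collects (ts, tagged shard-id, shard in error text) for entries
    where the "shard-id" tag differs from the shard named in the error.
    Assumes one JSON object per line; other lines are skipped.
    """
    counts = Counter()
    mismatches = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON log line
        if not isinstance(entry, dict):
            continue
        match = SHARD_SHUTDOWN.search(str(entry.get("error", "")))
        if match is None:
            continue
        shard = int(match.group(1))
        counts[shard] += 1
        tagged = entry.get("shard-id")
        if tagged is not None and tagged != shard:
            mismatches.append((entry.get("ts"), tagged, shard))
    return counts, mismatches
```

Passing an open log file (or any iterable of lines) returns the per-shard counts plus the entries where the tagged shard and the shard in the error message disagree, as in the second entry above (92 versus 32).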

I appreciate any insights or suggestions the community can offer!

Thanks,

Tsany