Temporal Pod Abruptly Restarted

We encountered an unusual issue with our Temporal workers deployed on Kubernetes (3 pods handling the same task queues). One pod restarted abruptly without receiving a SIGTERM signal. There were no error traces or panics in the logs (panic recovery is enabled in the worker). We reviewed the pod’s memory and CPU profile at the time of the restart, but found no anomalies. This only affected one of the three pods; the other two remained healthy. The worker was down for about 2 minutes (from 12:38:53 to 12:41:10).

We’re seeking insights into possible causes for such unexpected restarts and any precautions (related to Temporal configurations or otherwise) to prevent this from happening again. The logs are attached for reference.

The Temporal worker would not shut down your pod on its own. If resource utilization was not the cause, maybe the node became unavailable for some reason and the pod was recreated?

While the pod restart itself is one concern, what we are really trying to understand through this forum is why Temporal didn't mark the activities/workflow as completed even after the pod came back up. Why did it time out? Why wasn't it picked up by other workers in the pool?

Workflow tasks that this worker was processing when it was shut down will hit the workflow task timeout and be retried by the service. Activities this worker was processing at the time it was shut down will hit either their heartbeat timeout (if they heartbeat) or their StartToClose timeout at some point, and will then be retried by the service according to your configured retry policy. If the retry policy allows another attempt, the retry will be processed by another available worker polling activity tasks on that task queue.
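
For illustration, here is a minimal sketch of what that looks like with the Go SDK; the names, timeouts, and heartbeat interval are made up for the example and are not taken from your code. With a HeartbeatTimeout set, a lost worker is detected quickly and the retry lands on another worker polling the same task queue:

package sample

import (
	"context"
	"time"

	"go.temporal.io/sdk/activity"
	"go.temporal.io/sdk/workflow"
)

// LongRunningActivity is a hypothetical activity that reports progress via
// heartbeats. If the worker running it dies, the service notices the missing
// heartbeat after HeartbeatTimeout and schedules a retry on another worker.
func LongRunningActivity(ctx context.Context, items int) error {
	for i := 0; i < items; i++ {
		activity.RecordHeartbeat(ctx, i) // record progress with the service

		select {
		case <-ctx.Done(): // activity was cancelled or timed out
			return ctx.Err()
		case <-time.After(time.Second): // simulate one unit of work
		}
	}
	return nil
}

// CallerWorkflow gives the activity a per-attempt StartToCloseTimeout and a
// HeartbeatTimeout so that a lost worker is detected quickly.
func CallerWorkflow(ctx workflow.Context) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 5 * time.Minute,  // limit for a single attempt
		HeartbeatTimeout:    30 * time.Second, // detect a dead worker fast
	})
	return workflow.ExecuteActivity(ctx, LongRunningActivity, 100).Get(ctx, nil)
}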

Please share what you are seeing that's different from what you expected.

Providing more context for clarity.

These are the retry policy and activity options we are using for the workflow:

retryPolicy = &temporal.RetryPolicy{
	InitialInterval:        0, // 0 falls back to the server default (1s)
	BackoffCoefficient:     1, // 1 means no backoff; every retry waits the same interval
	MaximumInterval:        0, // 0 falls back to the server default
	MaximumAttempts:        3, // 3 attempts in total (the first attempt plus 2 retries)
	NonRetryableErrorTypes: []string{"PanicError"}, // do not retry panics
}

.
.
.

func (w *XYZWorkflow) StartXYZWorkflow(ctx workflow.Context, workflowInput *workflowInput) (*workflowOutput, error) {
	ctx = workflow.WithActivityOptions(ctx,
		workflow.ActivityOptions{
			ScheduleToCloseTimeout: 30000 * time.Second, 
			RetryPolicy:            retryPolicy,
		},
	)

	validationOutput := GetValidationOutput{}
	cwf := workflow.ExecuteActivity(ctx,
		w.wfActivities.ValidatedMandatoryFields,
		workflowInput,
	)
..
..
..
// other activities

And we are triggering the Temporal workflow like this:

c.temporalClient.ExecuteWorkflow(context.Background(),
	client.StartWorkflowOptions{
		TaskQueue:             config.GetTemporalConfig().XYZWorkflowQueue,
		ID:                    workflowId,
		WorkflowIDReusePolicy: enums.WORKFLOW_ID_REUSE_POLICY_REJECT_DUPLICATE,
	},
	c.workflows.StartXYZWorkflow,
	&xyz_workflow.workflowInput{
		Field1: "abc",
		Field2: "yxz",
	},
)

Questions

  • With the above configuration, we expected Temporal to pick up the hanging/failed/timed-out activity, run it to completion, and proceed to the next activities, but that does not seem to have happened. The activity failed with a ScheduleToClose timeout and was not retried.
  • You mentioned above that the service will retry if activities time out. Does it retry the workflow or the activity?

This activity doesn’t have a StartToCloseTimeout specified. That timeout limits the duration of a single activity attempt; when it is exceeded, the activity is retried according to its retry policy. When it is not specified, it defaults to the ScheduleToCloseTimeout, which means the activity is not retried when a worker restarts.

Other than adding the StartToCloseTimeout like below, would anything else be required to enable safe and guaranteed retry of activities? (All our Temporal activities are already idempotent.)
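
For reference, a minimal sketch of what those options could look like with a per-attempt timeout added; the 30000-second ScheduleToCloseTimeout is carried over from the snippet above, and the 10-minute StartToCloseTimeout is just an illustrative value:

ctx = workflow.WithActivityOptions(ctx,
	workflow.ActivityOptions{
		ScheduleToCloseTimeout: 30000 * time.Second, // overall cap across all attempts, including retries
		StartToCloseTimeout:    10 * time.Minute,    // per-attempt limit; lets the service retry if a worker dies
		RetryPolicy:            retryPolicy,
	},
)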

This timeout is enough.

Your retry policy looks weird. Do you really want to fail the activity after three attempts?

We want to cap the number of retries to avoid unexpected resource spikes in case of an intermittent issue. Three retries seems reasonable for now for our use case.
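
For what it's worth, a capped policy can still back off between attempts; the intervals below are illustrative, not taken from the thread:

retryPolicy = &temporal.RetryPolicy{
	InitialInterval:        time.Second, // wait 1s before the first retry
	BackoffCoefficient:     2.0,         // double the wait on each subsequent retry
	MaximumInterval:        time.Minute, // never wait longer than a minute
	MaximumAttempts:        3,           // keep the three-attempt cap
	NonRetryableErrorTypes: []string{"PanicError"},
}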