Temporal Pod Abruptly Restarted

We encountered an unusual issue with our Temporal workers deployed on Kubernetes (3 pods handling the same task queues). One pod restarted abruptly without receiving a SIGTERM signal. There were no error traces or panics in the logs (panic recovery is enabled in the worker). We reviewed the pod’s memory and CPU profile at the time of the restart, but found no anomalies. This only affected one of the three pods; the other two remained healthy. The worker was down for about 2 minutes (from 12:38:53 to 12:41:10).

We’re seeking insights into possible causes for such unexpected restarts and any precautions (related to Temporal configurations or otherwise) to prevent this from happening again. The logs are attached for reference.

The Temporal worker would not shut down your pod on its own. If resource utilization was not the cause, maybe the node became unavailable for some reason and the pod was recreated?

While the pod restart itself is one concern, what we are really trying to understand through this forum is why Temporal didn't mark the activities/workflow as completed even after the pod came back up. Why did it time out? Why wasn't it picked up by other workers in the pool?

Workflow tasks that this worker was processing when it was shut down will hit the workflow task timeout and be retried by the service. Activities this worker was processing at the time it was shut down will hit either their heartbeat timeout (if they heartbeat) or their StartToClose timeout at some point, and will then be retried by the service according to your configured retry policy. If the retry policy allows another attempt, the retry will be processed by another available worker polling activity tasks on that task queue.
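
For illustration, here is a minimal sketch of what that looks like with the Go SDK; the names, timeouts, and heartbeat interval are made up for the example and are not taken from your code. With a HeartbeatTimeout set, a lost worker is detected quickly and the retry lands on another worker polling the same task queue:

package sample

import (
	"context"
	"time"

	"go.temporal.io/sdk/activity"
	"go.temporal.io/sdk/workflow"
)

// LongRunningActivity is a hypothetical activity that reports progress via
// heartbeats. If the worker running it dies, the service notices the missing
// heartbeat after HeartbeatTimeout and schedules a retry on another worker.
func LongRunningActivity(ctx context.Context, items int) error {
	for i := 0; i < items; i++ {
		activity.RecordHeartbeat(ctx, i) // record progress with the service

		select {
		case <-ctx.Done(): // activity was cancelled or timed out
			return ctx.Err()
		case <-time.After(time.Second): // simulate one unit of work
		}
	}
	return nil
}

// CallerWorkflow gives the activity a per-attempt StartToCloseTimeout and a
// HeartbeatTimeout so that a lost worker is detected quickly.
func CallerWorkflow(ctx workflow.Context) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 5 * time.Minute,  // limit for a single attempt
		HeartbeatTimeout:    30 * time.Second, // detect a dead worker fast
	})
	return workflow.ExecuteActivity(ctx, LongRunningActivity, 100).Get(ctx, nil)
}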

Please share what you are seeing that's different from what you expected.

Providing more context for clarity.

These are the retry policy and activity options we are using for the workflow:

retryPolicy = &temporal.RetryPolicy{
	InitialInterval:        0, // 0 falls back to the server default (1s)
	BackoffCoefficient:     1, // 1 means no backoff; every retry waits the same interval
	MaximumInterval:        0, // 0 falls back to the server default
	MaximumAttempts:        3, // 3 attempts in total (the first attempt plus 2 retries)
	NonRetryableErrorTypes: []string{"PanicError"}, // do not retry panics
}

.
.
.

func (w *XYZWorkflow) StartXYZWorkflow(ctx workflow.Context, workflowInput *workflowInput) (*workflowOutput, error) {
	ctx = workflow.WithActivityOptions(ctx,
		workflow.ActivityOptions{
			ScheduleToCloseTimeout: 30000 * time.Second, 
			RetryPolicy:            retryPolicy,
		},
	)

	validationOutput := GetValidationOutput{}
	cwf := workflow.ExecuteActivity(ctx,
		w.wfActivities.ValidatedMandatoryFields,
		workflowInput,
	)
..
..
..
// other activities

And we are triggering the Temporal workflow like this:

c.temporalClient.ExecuteWorkflow(context.Background(),
	client.StartWorkflowOptions{
		TaskQueue:             config.GetTemporalConfig().XYZWorkflowQueue,
		ID:                    workflowId,
		WorkflowIDReusePolicy: enums.WORKFLOW_ID_REUSE_POLICY_REJECT_DUPLICATE,
	},
	c.workflows.StartXYZWorkflow,
	&xyz_workflow.workflowInput{
		Field1: "abc",
		Field2: "yxz",
	},
)

Questions

  • With the above configuration, we expected Temporal to pick up the hanging/failed/timed-out activity, run it to completion, and proceed to the next activities, but that does not seem to have happened. The activity failed with a ScheduleToClose timeout and was not retried.
  • You mentioned above that the service will retry if activities time out. Does it retry the workflow or the activity?

This activity doesn’t have a StartToCloseTimeout specified. That timeout limits the duration of a single activity attempt; when it is exceeded, the activity is retried according to its retry policy. When it is not specified, it defaults to the ScheduleToCloseTimeout, which means the activity is not retried when a worker restarts.

Other than adding the StartToCloseTimeout like below, would anything else be required to enable safe and guaranteed retry of activities? (All our Temporal activities are already idempotent.)
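
For reference, a minimal sketch of what those options could look like with a per-attempt timeout added; the 30000-second ScheduleToCloseTimeout is carried over from the snippet above, and the 10-minute StartToCloseTimeout is just an illustrative value:

ctx = workflow.WithActivityOptions(ctx,
	workflow.ActivityOptions{
		ScheduleToCloseTimeout: 30000 * time.Second, // overall cap across all attempts, including retries
		StartToCloseTimeout:    10 * time.Minute,    // per-attempt limit; lets the service retry if a worker dies
		RetryPolicy:            retryPolicy,
	},
)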

This timeout is enough.

Your retry policy looks weird. Do you really want to fail the activity after three attempts?

We want to cap the number of retries to avoid unexpected resource spikes in case of an intermittent issue. Three retries seems reasonable for now for our use case.
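
For what it's worth, a capped policy can still back off between attempts; the intervals below are illustrative, not taken from the thread:

retryPolicy = &temporal.RetryPolicy{
	InitialInterval:        time.Second, // wait 1s before the first retry
	BackoffCoefficient:     2.0,         // double the wait on each subsequent retry
	MaximumInterval:        time.Minute, // never wait longer than a minute
	MaximumAttempts:        3,           // keep the three-attempt cap
	NonRetryableErrorTypes: []string{"PanicError"},
}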