We encountered an unusual issue with our Temporal workers deployed on Kubernetes (3 pods handling the same task queues). One pod restarted abruptly without receiving a SIGTERM signal. There were no error traces or panics in the logs (panic recovery is enabled in the worker). We reviewed the pod’s memory and CPU profile at the time of the restart, but found no anomalies. This only affected one of the three pods; the other two remained healthy. The worker was down for about 2 minutes (from 12:38:53 to 12:41:10).
We’re seeking insights into possible causes for such unexpected restarts and any precautions (related to Temporal configurations or otherwise) to prevent this from happening again. The logs are attached for reference.
The Temporal worker would not shut down your pod. If resource utilization was not the cause, maybe the node became unavailable for some reason and the pod was recreated?
Through this forum, we are trying to learn why Temporal didn't mark the activities/workflow as completed even after the pod restarted. Why did it time out? Why was it not picked up by other workers in the pool?
Workflow tasks that this worker was processing when it was shut down would hit the workflow task timeout and be retried by the service. Activities this worker was processing at the time it was shut down should hit either their heartbeat timeout (if they heartbeat) or their StartToClose timeout at some point, and then be retried by the service based on your configured retry policy. If the retry policy allows another attempt, the service will retry the activity, and the retry will be processed by another available worker that polls activity tasks on that task queue.
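For reference, here is roughly how those timeouts drive that behavior. This is a minimal sketch assuming the Go SDK (the thread doesn't say which SDK you use), and the activity/workflow names are illustrative:

```go
import (
	"context"
	"time"

	"go.temporal.io/sdk/activity"
	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// Illustrative activity: because it heartbeats, the service notices a dead
// worker after HeartbeatTimeout and reschedules the activity on another
// worker polling the same task queue.
func ProcessItem(ctx context.Context, itemID string) error {
	for i := 0; i < 100; i++ {
		activity.RecordHeartbeat(ctx, i) // progress detail, available to the retry
		// ... do a slice of work ...
	}
	return nil
}

func MyWorkflow(ctx workflow.Context, itemID string) error {
	ao := workflow.ActivityOptions{
		StartToCloseTimeout: 5 * time.Minute,  // caps a single attempt
		HeartbeatTimeout:    30 * time.Second, // detects a dead worker quickly
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval:    time.Second,
			BackoffCoefficient: 2.0,
		},
	}
	ctx = workflow.WithActivityOptions(ctx, ao)
	return workflow.ExecuteActivity(ctx, ProcessItem, itemID).Get(ctx, nil)
}
```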
Please share what you are seeing that's different than expected.
With the above configuration, we expect Temporal to pick up the hanging/failed/timed-out activity, execute it to completion, and proceed to the next activities, but this does not seem to have happened. The activity failed with a ScheduleToClose timeout and was not retried.
You mentioned above that the service will retry if activities time out. Does it retry the workflow or the activity?
This activity doesn't have StartToCloseTimeout specified. This timeout limits the duration of a single activity attempt; when it is exceeded, the activity is retried according to its retry policy. When not specified, it defaults to the ScheduleToClose timeout, which means the activity is not retried on worker restart.
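In other words, with only ScheduleToCloseTimeout set, a single attempt is allowed to run for the entire ScheduleToClose window, so a crashed worker isn't detected until that window expires and there is no budget left for a retry. A sketch of that shape (again assuming the Go SDK):

```go
// With only ScheduleToCloseTimeout set, StartToCloseTimeout defaults to the
// same value: one attempt can consume the entire budget, so a worker restart
// mid-attempt surfaces as a ScheduleToClose timeout with no retry.
ao := workflow.ActivityOptions{
	ScheduleToCloseTimeout: 30 * time.Minute,
	// StartToCloseTimeout not set -> defaults to ScheduleToCloseTimeout
}
ctx = workflow.WithActivityOptions(ctx, ao)
```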
Other than adding the StartToCloseTimeout like below, would anything else be required to enable safe and guaranteed retry of activities? (All our Temporal activities are already idempotent.)
We want to cap the number of retries to avoid unexpected resource spikes in case of any intermittent issue. Three retries seems reasonable for now for our use case.
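For what it's worth, a sketch of what that options block might look like with the Go SDK (values are illustrative): StartToCloseTimeout bounds each attempt, and MaximumAttempts in the retry policy caps the total number of attempts.

```go
ao := workflow.ActivityOptions{
	ScheduleToCloseTimeout: 30 * time.Minute, // overall budget across all attempts
	StartToCloseTimeout:    5 * time.Minute,  // bounds each attempt; a dead worker is detected here
	HeartbeatTimeout:       30 * time.Second, // optional: detect a dead worker even sooner
	RetryPolicy: &temporal.RetryPolicy{
		InitialInterval:    time.Second,
		BackoffCoefficient: 2.0,
		MaximumInterval:    time.Minute,
		// Note: MaximumAttempts counts attempts, not retries, so 3 here means
		// the original attempt plus 2 retries; use 4 if you want 3 retries.
		MaximumAttempts: 3,
	},
}
ctx = workflow.WithActivityOptions(ctx, ao)
```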