Releasing Pending activities due to worker lost

Babak_Ghadiri · October 11, 2023, 3:04am

Hello all.
We run workers in Kubernetes pods, with CI/CD updating the worker pods. When an activity is scheduled on one of the worker pods and the pods are rolled out to new ones, that activity remains in pending status until time_to_close is reached (because it is still assigned to the deleted worker pod). How can we solve this problem?
One way is to add a heartbeat to every activity.
Can we use the worker identity parameter for this case? If so, how?

Chad_Retz · October 12, 2023, 12:47pm

The activity should be picked up immediately when a worker is available. Yes, you should set a heartbeat timeout and heartbeat from every non-immediate activity so that a worker or server crash can be detected.

Babak_Ghadiri · October 12, 2023, 1:03pm

The activity is picked up by a worker, but then that worker pod terminates and the activity is stuck in pending status. Is there something like a worker heartbeat that “releases” or “fails” the activities assigned to that worker?

Chad_Retz · October 12, 2023, 1:25pm

This is what activity heartbeat is for. If you gracefully shut down the worker, the activity will get a cancellation and report itself as failed due to worker shutdown. But you should always heartbeat from activities and set a heartbeat timeout so that the activity fails quickly on worker termination.
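As a rough sketch of the heartbeating pattern described above (not part of the original reply): the helper below processes a batch item by item and reports progress after each one, so that a retried attempt can resume where the previous one stopped. The names `process_with_heartbeat`, `handle`, `heartbeat`, and `resume_from` are hypothetical and chosen only for illustration; the heartbeat function is passed in as a parameter so the logic stays independent of any SDK.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def process_with_heartbeat(
    items: Sequence[T],
    handle: Callable[[T], None],
    heartbeat: Callable[[int], None],
    resume_from: int = 0,
) -> int:
    """Process items in order, heartbeating after each one.

    `heartbeat` receives the index of the next unprocessed item, so a later
    attempt can pass that value back in as `resume_from`.
    Returns the number of items handled in this attempt.
    """
    handled = 0
    for index in range(resume_from, len(items)):
        handle(items[index])
        # Report liveness and progress; if the worker dies, the server stops
        # receiving these and the heartbeat timeout fails the attempt.
        heartbeat(index + 1)
        handled += 1
    return handled
```

In the Python SDK, I would expect this to be wired up by passing `activity.heartbeat` as `heartbeat` inside an `@activity.defn` function, reading the resume point from `activity.info().heartbeat_details`, and setting `heartbeat_timeout` next to `start_to_close_timeout` in `workflow.execute_activity`.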

Babak_Ghadiri · October 12, 2023, 2:04pm

Thanks. What does “gracefully shut down” mean here? Do we need to add specific code to handle it (for example in the Python SDK), or does the library handle the SIGTERM/SIGKILL sent by k8s?

Chad_Retz · October 12, 2023, 2:21pm

Yes, you should call shutdown() on the worker and not end the process until it or run() returns. This will wait a grace period (configurable, default 0) before sending cancellation to all running activities, and will then report that failure to the server so a retry can happen on another worker. If an activity for whatever reason swallows the cancellation and keeps running, shutdown may never complete. Here is a sample of a graceful worker shutdown on keyboard interrupt. (Note: there is a bug whose fix is not yet released. Until it is, the activity failure may not reach the server if the process ends immediately after shutdown returns, so you may want to sleep for a second or so.)
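A minimal sketch of that shutdown sequence for a pod that receives SIGTERM, not taken from the linked sample: it assumes only that `worker` exposes async `run()` and `shutdown()` methods as described above. The helper name `run_until_signalled` and the `flush_delay` parameter are hypothetical, and `loop.add_signal_handler` is available on Unix only.

```python
import asyncio
import signal


async def run_until_signalled(worker, flush_delay: float = 1.0) -> None:
    """Run `worker` until SIGTERM/SIGINT arrives, then shut it down gracefully."""
    loop = asyncio.get_running_loop()
    stop = asyncio.Event()
    signals = (signal.SIGTERM, signal.SIGINT)
    # Kubernetes sends SIGTERM first; SIGKILL follows only after the pod's
    # termination grace period and cannot be caught at all.
    for sig in signals:
        loop.add_signal_handler(sig, stop.set)

    run_task = asyncio.create_task(worker.run())
    stop_task = asyncio.create_task(stop.wait())
    try:
        # Wake up on a signal, or if the worker stops on its own.
        await asyncio.wait({run_task, stop_task}, return_when=asyncio.FIRST_COMPLETED)
    finally:
        stop_task.cancel()
        for sig in signals:
            loop.remove_signal_handler(sig)

    # Do not let the process exit until both shutdown() and run() have returned.
    await worker.shutdown()
    await run_task
    # Short pause as suggested above, so reported failures can reach the server.
    await asyncio.sleep(flush_delay)
```

With the Python SDK this might be called as `await run_until_signalled(Worker(client, task_queue=..., activities=[...], graceful_shutdown_timeout=timedelta(seconds=10)))`. The pod's `terminationGracePeriodSeconds` would need to be longer than the worker's grace period plus the flush delay, otherwise SIGKILL arrives before shutdown completes.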
