We run workers in Kubernetes pods and have CI/CD that updates the worker pods. When an activity is scheduled on one of the worker pods and the pods are rolled out to new ones, that activity remains in pending status until its start-to-close timeout is reached (because it is still assigned to the now-deleted worker pod). How can we solve this problem?
One way is to add a heartbeat to every activity.
Can we use the worker identity parameter for this case? How?
The activity should be picked up immediately when a worker is available. Yes, you should use a heartbeat timeout and activity heartbeating for every non-immediate activity so you can detect a worker or server crash.
The activity is picked up by a worker, but that worker pod is terminated and the activity gets stuck in pending status. Is there something like a worker heartbeat that "releases"/"fails" activities assigned to that worker?
This is what activity heartbeating is for. If you gracefully shut down the worker, the activity will receive a cancellation and report itself as failed due to worker shutdown. But you should always heartbeat from activities and set a heartbeat timeout so the activity quickly fails on abrupt worker termination.
Thanks. What does "gracefully shutdown" mean here? Should we add specific code for handling that (for example, in the Python SDK), or is SIGTERM/SIGKILL from Kubernetes handled by the library?
Yes, you should call shutdown() on the worker and not end the process until it or run() returns. This will wait a grace period (configurable, default 0) before sending a cancellation to all running activities, and will then report that failure to the server so a retry can happen on another worker. If an activity for whatever reason swallows the cancellation and keeps running, shutdown may never complete. Here is a sample of a graceful worker shutdown on keyboard interrupt. (Note: there is a not-yet-released bug fix; without it, the activity failure may not reach the server if the process ends immediately after shutdown returns, so you may want to sleep for a second or so until that is released.)