Proper termination of a worker deployed on Kubernetes?

Hello,

I am deploying my (workflow & activity) workers on Kubernetes. Today, I was wondering what the proper way is to upgrade these workers to a new version? Given Kubernetes wants to stop the old version, how can I align this with the worker process itself? My intent is this:

  1. the old worker can still finish the ongoing activity.
  2. the old worker may not pull for additional messages to process.
  3. the worker process ends gracefully and the pod is cleaned up

Aligned to it, if my workflow uses session support, can I keep my old worker alive until all the required activities for that session are executed?

Ringo

2 Likes

I believe both Java and Go SDK support a clean shutdown that waits for activities to complete without polling for new ones. The ability to wait for the whole session completion is not currently supported. I filed a feature request to get this added.

Go SDK worker.Run waits for TERM signal which pod sends to the process on clean shutdown and then executes the clean shutdown.

In Java the main method of the application has to deal with signal handling and then use worker.shutdown and worker.awaitTermination to wait for activity completions.

1 Like

I’ve put together a simple worker example that registers a couple of Activities and a simple 2 stage workflow that runs as a cron job. When I shutdown the worker I’ve noticed that it takes quite a long time for it to be removed from the list of “Pollers” in the webapp.

I’ve added a listener for when the java main shuts down and call the WorkerFactory.shutdown() method.

If I startup the worker again (to simulate what will happen in k8) I see multiple Pollers in the list, the old one and the new one.

  1. Does calling shutdown() on the WorkerFactory shutdown the worker correctly or do I need to call something else as well?

  2. Is there a reason why the old worker takes so long to get removed. It can take 10-15mins for it to be removed from the list?

  3. While the old worker remains in the list can it have any adverse affects on the running of the workflows?

Part of the reason I’m asking these questions is that the cron job I’m testing is set to run every minute but takes a long time before it actually triggers. I saw in another thread something about the WorkflowExecutionStarted.FirstWorkflowTaskBackoff? This explains why the cron job took so long to trigger but is this a setting that can be changed so the initial job triggers correctly i.e. if the cron is set to every minute it triggers after a minute?

  1. If you care about waiting for task completions you can call WorkerFactory.awaitTermination.

  2. The pollers list in the web app is for human troubleshooting. It keeps the information about the worker for 15 minutes, but it doesn’t mean that the system treats this worker as available. To receive tasks workers have to implicitly call Temporal service API, so remembering worker identity doesn’t affect that behavior at all.

  3. No, see 2.

The cron triggers when the cron expression tells it to trigger. So if the cron says that it should trigger at the beginning of the hour it is going to wait for the next beginning of an hour. To verify this behavior look at WorkflowExecutionStarted.FirstWorkflowTaskBackoff value in your workflow execution history.

Thanks Maxim for clarifying.