Workflow/Activity Getting Stuck on Worker Restart

Whenever I restart the Cadence workflow and activity workers, the workflows and activities that were running before the restart get stuck.
For example, a workflow gets stuck in the “Activity Started” phase.

I have to manually terminate the workflow and submit it again to resume working.

What should be done to let activities resume after workers are restarted?

Both the workflow and the activity code are in Python.

Activities rely on timeouts for retries. Make sure that the StartToClose timeout in the activity options is set to a reasonable value. If the activity is long-running, set a Heartbeat timeout as well, and make sure that the activity heartbeats more frequently than that timeout.

I have set the activity timeout to 5 hours, whereas the activity takes only 1 minute.
If I restart the workers while the workflow is in the “Activity Started” state, the workflow stays stuck in that phase and doesn’t move forward.

It behaves exactly as designed. When a worker goes down, the service waits for the full 5-hour timeout before it restarts the activity. Change the timeout to a couple of minutes, for example, and the activity will be restarted within about two minutes.
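
To make the arithmetic concrete, here is a rough sketch of the worst-case wait before the service notices a dead worker. It assumes the simplified model from this thread (the service only learns about the failure when a timeout fires); the function name and parameters are made up for illustration and are not part of any Cadence API.

```python
from typing import Optional


def worst_case_detection_seconds(
    start_to_close_seconds: int,
    heartbeat_timeout_seconds: Optional[int] = None,
) -> int:
    """Upper bound on how long a dead worker can go unnoticed (simplified model)."""
    if heartbeat_timeout_seconds is None:
        # No heartbeats: only the StartToClose timeout can fire.
        return start_to_close_seconds
    # With heartbeats: whichever timeout fires first ends the wait.
    return min(start_to_close_seconds, heartbeat_timeout_seconds)


five_hours = 5 * 60 * 60
print(worst_case_detection_seconds(five_hours))      # 18000
print(worst_case_detection_seconds(120))             # 120
print(worst_case_detection_seconds(five_hours, 30))  # 30
```

So a long StartToClose timeout is fine for a long-running activity as long as a short Heartbeat timeout is set alongside it.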

In the future we might implement a worker heartbeat. Then all activities running on a worker would be restarted immediately on worker failure.

Thanks @maxim

@maxim what is the best way to periodically record a heartbeat in activity code?
Just putting
Activity.heartbeat("heartbeat")
in different parts of the activity code feels very weird.

I tried to spawn a heartbeat thread from the main activity thread.

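import threading
import time

# Import path assumed from the traceback below (cadence/activity.py)
from cadence.activity import Activity


def heartbeat_fn():
    # Runs in a separate thread and heartbeats every 2 seconds
    while True:
        Activity.heartbeat("heartbeat")
        time.sleep(2)


class ActivityImpl:
    def activity1(self, parameters: dict):
        # Daemon thread, so it does not keep the process alive on exit
        x = threading.Thread(target=heartbeat_fn, daemon=True)
        x.start()
        .... <main_activity_code>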

But it gives the error below:

    Activity.heartbeat("heartbeat")
  File "/usr/local/lib/python3.7/site-packages/cadence/activity.py", line 109, in heartbeat
    ActivityContext.get().heartbeat(details)
  File "/usr/local/lib/python3.7/site-packages/cadence/activity.py", line 73, in get
    return current_activity_context.get()
LookupError: <ContextVar name='current_activity_context' at 0x7fdde93cdd70>
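
For what it's worth, this kind of LookupError can be reproduced without Cadence. The sketch below assumes, as the traceback suggests, that the activity context lives in a ContextVar that is set only on the thread running the activity. The variable here is a stand-in for illustration, not the library's own object.

```python
import contextvars
import threading

# Stand-in for the library's context variable (hypothetical name)
current_context = contextvars.ContextVar("current_context")


def read_context(results: list) -> None:
    """Record the context value, or the string 'LookupError' if none is set."""
    try:
        results.append(current_context.get())
    except LookupError:
        results.append("LookupError")


# Set the value on the main thread only, as a worker would for the activity thread
current_context.set("activity-context")

main_results: list = []
thread_results: list = []

read_context(main_results)  # same thread: the value is visible

t = threading.Thread(target=read_context, args=(thread_results,))
t.start()
t.join()  # new thread: starts with its own, empty context
```

On a standard CPython build, such as the 3.7 in the traceback, a new thread starts with an empty context, so the lookup succeeds on the main thread and fails in the spawned one.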

Having the heartbeat in the actual activity implementation protects against the code getting stuck on something. I’ve seen production outages where an activity was completely deadlocked but kept happily heartbeating from another thread.
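
One way to keep in-line heartbeats from feeling scattered is to tie them to units of progress, for example one heartbeat per processed item. A minimal sketch, where process_items, handle and the heartbeat parameter are hypothetical names; in a real activity the heartbeat argument would presumably be Activity.heartbeat:

```python
from typing import Callable, Iterable, List


def process_items(
    items: Iterable[str],
    handle: Callable[[str], str],
    heartbeat: Callable[[str], None],
) -> List[str]:
    """Handle each item and heartbeat after every completed one."""
    results = []
    for count, item in enumerate(items, start=1):
        results.append(handle(item))
        # Heartbeat only after real progress: if handle() hangs,
        # the heartbeats stop and the Heartbeat timeout can fire.
        heartbeat(f"processed {count}")
    return results


# Stand-in usage with a list collecting the heartbeat details
beats: List[str] = []
output = process_items(["a", "b"], str.upper, beats.append)
```

Because the heartbeat only fires after an item finishes, a deadlocked step shows up as missing heartbeats instead of being masked.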

If you still want to do it, reverse your implementation: heartbeat from the main activity thread, since the heartbeat call relies on thread-local context, and use a different thread to implement the rest of the logic.
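
The reversed arrangement could look roughly like the sketch below. run_with_heartbeat is a hypothetical helper, not part of the Cadence client. The heartbeat callable is passed in so the snippet runs on its own; inside an activity it would presumably be Activity.heartbeat, called from the activity's own thread.

```python
import threading
from typing import Any, Callable


def run_with_heartbeat(
    work: Callable[[], Any],
    heartbeat: Callable[[str], None],
    interval_seconds: float = 2.0,
) -> Any:
    """Run work() in a helper thread while the calling thread heartbeats."""
    outcome = {}

    def runner() -> None:
        try:
            outcome["value"] = work()
        except BaseException as exc:  # hand any failure back to the caller
            outcome["error"] = exc

    worker = threading.Thread(target=runner, daemon=True)
    worker.start()

    # The calling (activity) thread heartbeats until the work finishes.
    while worker.is_alive():
        heartbeat("heartbeat")
        worker.join(timeout=interval_seconds)

    if "error" in outcome:
        raise outcome["error"]
    return outcome.get("value")


# Inside the activity this might be used as (untested against the library):
#     return run_with_heartbeat(lambda: main_activity_code(parameters), Activity.heartbeat)
```

Note that this still has the weakness mentioned above: if the work thread deadlocks, the main thread keeps heartbeating, so the StartToClose timeout remains the only backstop.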