I’m trying out Temporal's failure-recovery behavior. I run a workflow with a long-lived activity that sends a heartbeat every second, with a heartbeat timeout of two seconds, and there is only one worker to execute the workflow. When I kill the worker and immediately restart it, the running activity is not resumed; instead it fails with an activity task timeout whose errType is heartbeat timeout.
Is there any way to resume the activity from where it was interrupted when the worker is recovered? If yes, are there some examples for reference?
Hello @haojie
The activity should be retried after starting the worker.
Can you show the ActivityOptions?
Is there any way to resume the activity from where it was interrupted when the worker is recovered?
Heartbeats let you attach details that you can use to track activity progress. One thing you can do is check whether the activity has details from a previous heartbeat and, if so, resume execution from the last recorded progress.
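// Start from the beginning unless a previous attempt recorded progress.
startIndex := 0
if activity.HasHeartbeatDetails(ctx) {
	var finishedIndex int
	if err := activity.GetHeartbeatDetails(ctx, &finishedIndex); err == nil {
		// Resume right after the last item reported as finished.
		startIndex = finishedIndex + 1
	}
}
for i := startIndex; i < 10; i++ {
	time.Sleep(1 * time.Second)      // placeholder for processing item i
	activity.RecordHeartbeat(ctx, i) // report item i as finished
}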
Hi @antonio.perez, thanks a lot for your reply.
I think I roughly understand Temporal's behavior now: Temporal monitors a running activity through the heartbeat mechanism, and once the heartbeat times out, the activity is retried. In addition, heartbeat details can save the state of the activity as of the last heartbeat.
So, if we want to resume the execution of an activity, we must configure both a heartbeat timeout and a retry policy. Is my understanding correct?
Hello @haojie
For long-running activities, we always recommend heartbeating to let the server know the activity is still alive. If the activity does not heartbeat within the heartbeat timeout, the server fails that activity attempt, and the activity is then retried or not, depending on the retry policy in your activity options.
For short-running activities, you probably don’t need to heartbeat, but that is up to you.
In addition, heartbeat details can save the state of the activity as of the last heartbeat.
Yes, but your worker could crash after doing some work and before sending the next heartbeat, in which case that work is repeated on retry. It is always recommended to make your activities idempotent.
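To make the idempotency point concrete, here is a minimal standard-library sketch (not Temporal SDK code) that simulates two attempts. The `heartbeatStore` type and `runAttempt` function are hypothetical stand-ins for the progress the server keeps and for the activity body. The first attempt crashes after processing item 2 but before its heartbeat is recorded, so the retry processes item 2 a second time.

```go
package main

import "fmt"

// heartbeatStore is a hypothetical stand-in for the heartbeat details
// that survive between activity attempts.
type heartbeatStore struct {
	has  bool
	last int
}

// runAttempt processes items up to total, resuming after the last
// recorded heartbeat. If crashAt >= 0, the attempt stops after doing the
// work for item crashAt but before recording its heartbeat.
// done counts how many times each item was processed.
func runAttempt(hb *heartbeatStore, total, crashAt int, done map[int]int) bool {
	start := 0
	if hb.has {
		start = hb.last + 1
	}
	for i := start; i < total; i++ {
		done[i]++ // the "work" for item i
		if i == crashAt {
			return false // simulated crash: heartbeat for i is never sent
		}
		hb.has, hb.last = true, i
	}
	return true
}

func main() {
	hb := &heartbeatStore{}
	done := map[int]int{}

	ok := runAttempt(hb, 5, 2, done) // first attempt crashes at item 2
	fmt.Println(ok, hb.last, done[2]) // false 1 1

	ok = runAttempt(hb, 5, -1, done) // retry resumes at item 2
	fmt.Println(ok, hb.last, done[2], done[1]) // true 4 2 1
}
```

Item 2 ends up processed twice while item 1 is processed once, which is harmless only if processing an item is idempotent.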
If you don’t set HeartbeatTimeout in your activity options, heartbeating from the activity won’t help the server detect that the activity has died. If your worker crashes and you have not set HeartbeatTimeout, the activity won’t be retried until StartToCloseTimeout is reached.
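As a reference for the options discussed here, this is a sketch of activity options for the Go SDK that set both a heartbeat timeout and a retry policy. The workflow name, the activity name "LongRunningActivity", and all timeout values are assumptions chosen for illustration; pick values that fit your own activity.

```go
package sample

import (
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// LongRunningWorkflow is a hypothetical workflow that runs one
// heartbeating activity with explicit timeouts and a retry policy.
func LongRunningWorkflow(ctx workflow.Context) error {
	ao := workflow.ActivityOptions{
		// Upper bound for a single attempt of the activity.
		StartToCloseTimeout: 10 * time.Minute,
		// The attempt fails if no heartbeat arrives within this window.
		HeartbeatTimeout: 5 * time.Second,
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval:    time.Second,
			BackoffCoefficient: 2.0,
			MaximumInterval:    30 * time.Second,
			MaximumAttempts:    5, // 1 would mean no retries
		},
	}
	ctx = workflow.WithActivityOptions(ctx, ao)

	// "LongRunningActivity" is an assumed name of an activity
	// registered on the worker.
	return workflow.ExecuteActivity(ctx, "LongRunningActivity").Get(ctx, nil)
}
```

A heartbeat timeout somewhat larger than the heartbeat interval leaves room for delays between the activity recording a heartbeat and the server receiving it.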
Retry options let you specify how your activity should behave when it fails for any reason. For example, MaximumAttempts sets the maximum number of attempts; setting it to 1 means no retries. Also, keep in mind that ScheduleToCloseTimeout can limit the number of retries if it is reached before MaximumAttempts is exhausted.
I recommend watching this video from Maxim, which explains these concepts: https://www.youtube.com/watch?v=JK7WLK3ZSu8