Workflow not recovered after crash

roadroller · March 19, 2021, 10:40am

Hello,
I have a simple workflow for testing failover scenarios. I see undesired behaviour in the following simple case (the activity method is configured with a heartbeat timeout of 3 seconds and sends a heartbeat every second):

1. The workflow initiator process starts the workflow asynchronously.
2. The first worker starts the workflow execution. I see the workflow task executed in the thread “workflow-method” and the activity method started in a thread named like “Activity Executor taskQueue=“MAIN_TASK_QUEUE”, namespace=“default”: 1”.
3. I stop the first worker to simulate a crash and see that the second worker tries to execute the activity task but fails with the error "NOT_FOUND: invalid activityID or activity already timed out or invoking workflow is completed". It retries and fails for the same reason, so the execution hangs in these retries.

Please explain what I’m doing wrong.

Maybe a code example will be helpful:

// Create and start workflow
WorkflowOptions workflowOptions = WorkflowOptions.newBuilder()
        .setTaskQueue(IMainTaskQueue.MAIN_TASK_QUEUE)
        .setWorkflowId(UUID.randomUUID().toString())
        .build();
ISimpleWorkflow simpleWorkflow = workflowClient.newWorkflowStub(ISimpleWorkflow.class, workflowOptions);
WorkflowClient.start(simpleWorkflow::doWork, payload);
...
// Activity options
private ActivityOptions ACTIVITY_OPTIONS = ActivityOptions.newBuilder()
        .setScheduleToCloseTimeout(Duration.ofDays(1))
        .setHeartbeatTimeout(Duration.ofSeconds(3))
        .build();

maxim · March 19, 2021, 4:01pm

I would start troubleshooting by looking at the workflow execution history. Could you post it here?

roadroller · March 19, 2021, 4:29pm

@maxim thank you for the reply!
I exported the execution history from the Temporal Web UI, but I can’t attach it here,
so I placed it in Google Docs: https://docs.google.com/document/d/19412wqlrcU159ZtSHZPSLyE2DE89H4RFeKP5rj7XrOQ
Is this what you wanted?

maxim · March 19, 2021, 4:49pm

It looks like the activity is failing or timing out. Do you see the activity information in the summary view of the workflow? It should show how many times the activity was retried and the last error information.

roadroller · March 22, 2021, 2:59am

Yes, I see an attempt count that is constantly increasing, and lastFailure is “activity timeout”.
The activity is configured with ScheduleToCloseTimeout == 1 day and HeartbeatTimeout == 3 seconds, so I assume the activity timed out because the heartbeat timeout was exceeded.
But I can’t find the right way to recover the workflow after the worker has crashed. What is wrong in my configuration?

I have shared my sandbox project, GitHub - RoadRoller/temporal. Could you take a look and suggest what I’m doing wrong?

maxim · March 22, 2021, 4:43am

So the workflow is fine and doesn’t need recovering. Your activity is constantly timing out and being retried according to the default retry policy. Is your activity heartbeating? In your case the heartbeat timeout requires the activity to call the heartbeat method at least once every 3 seconds.
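The heartbeat rule described above can be pictured with a small toy model. This is only a sketch of the idea (a timer that every heartbeat resets), not Temporal's server code; the class `HeartbeatMonitor` and its methods are hypothetical names invented for illustration, and timestamps are passed in explicitly so the behaviour is easy to follow.

```java
import java.time.Duration;

/** Toy model of a heartbeat timeout check. Hypothetical, not part of Temporal. */
final class HeartbeatMonitor {
    private final long timeoutMillis;
    private long lastHeartbeatMillis;

    /** The timer starts counting when the activity attempt starts. */
    HeartbeatMonitor(Duration timeout, long attemptStartMillis) {
        this.timeoutMillis = timeout.toMillis();
        this.lastHeartbeatMillis = attemptStartMillis;
    }

    /** Every heartbeat resets the timer. */
    void heartbeat(long nowMillis) {
        lastHeartbeatMillis = nowMillis;
    }

    /** True once more than the timeout has elapsed since the last heartbeat. */
    boolean isTimedOut(long nowMillis) {
        return nowMillis - lastHeartbeatMillis > timeoutMillis;
    }
}
```

With a 3-second timeout and heartbeats at 1000 ms and 2000 ms, a worker that dies right after the second heartbeat is still considered alive at 4000 ms and is considered timed out at 5001 ms. In this model a new attempt would get a fresh `HeartbeatMonitor`, so an earlier timeout does not carry over.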

roadroller · March 22, 2021, 5:20am

My activity sends heartbeats every second, and the workflow executes successfully when I don’t simulate a worker crash. As I understand it, heartbeating is the mechanism by which the Temporal service determines that an activity has failed on some worker, and this part works fine: the server detects that the activity failed when the worker was stopped. What I expect next is that the workflow continues execution on another worker, but instead the activity fails on the second worker with the exception “NOT_FOUND: invalid activityID or activity already timed out or invoking workflow is completed”.
So I can’t simulate the simplest failover scenario:

1. The workflow execution is started.
2. The worker crashes.
3. The workflow continues execution on another worker.

IMHO this should be key functionality of the Temporal framework, so I suppose I don’t understand something about the basic principles of how it should work.

maxim · March 22, 2021, 3:16pm

For the workflow to continue executing on another worker, the activity has to complete first, as the workflow is blocked on the activity’s completion. As you pointed out earlier, the activity is constantly retried by the framework, which is indicated by the ever-growing attempt number.

Try increasing the heartbeat timeout (let’s say to 20 seconds) to see if the issue is due to the heartbeat not reaching the service in time.
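To give a feel for why the attempt number keeps growing, here is a minimal sketch of an exponential backoff schedule of the kind a retry policy produces. The helper `RetryBackoff` is hypothetical, and the parameter values in the usage note (initial interval 1 second, backoff coefficient 2.0, maximum interval 100 seconds) are an assumption based on the defaults listed in the Temporal documentation; they do not come from this thread.

```java
import java.time.Duration;

/** Hypothetical helper that computes the wait before the n-th retry. */
final class RetryBackoff {
    /**
     * @param retry       1 for the first retry, 2 for the second, and so on
     * @param initial     wait before the first retry
     * @param coefficient multiplier applied for each further retry
     * @param max         upper bound on the wait
     */
    static Duration delayBeforeRetry(int retry, Duration initial, double coefficient, Duration max) {
        if (retry < 1) {
            throw new IllegalArgumentException("retry must be >= 1");
        }
        // initial * coefficient^(retry - 1), capped at max
        double millis = initial.toMillis() * Math.pow(coefficient, retry - 1);
        return Duration.ofMillis((long) Math.min(millis, (double) max.toMillis()));
    }
}
```

Under the assumed values the waits are 1 s, 2 s, 4 s and so on until they reach the 100-second cap at the eighth retry. With no limit on the number of attempts, an activity that always times out keeps being retried until another timeout, such as the 1-day ScheduleToCloseTimeout in this thread, ends it.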

roadroller · March 24, 2021, 3:26am

I tried playing with the heartbeat timeout, but no luck.
As I understand it, after the first worker crashes, the heartbeat timeout is detected and persisted by the server. This first detected heartbeat timeout is not reset for the next execution attempts, which leads to pointless, infinite retries.
I would expect the following behaviour: for the next activity execution attempt the previous heartbeat timeout should be cleared, otherwise a long-running activity never has a chance of completing successfully after a worker crash.

maxim · March 24, 2021, 3:46am

Does your activity heartbeat? If it does not, then changing the heartbeat timeout is not going to help, as it will always time out.

roadroller · March 24, 2021, 3:50am

Yes, my activity sends heartbeats every second with

Activity.getExecutionContext().heartbeat(i);

and it executes successfully when I don’t simulate a worker crash.
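For readers who want to picture the loop being described, here is a minimal sketch of a long-running task that reports progress after every step. It is not the poster's actual activity: `HeartbeatingLoop` and its parameters are hypothetical, and the heartbeat is passed in as a plain callback so the example runs without the Temporal SDK. Inside a real activity the callback could be the call quoted above, `i -> Activity.getExecutionContext().heartbeat(i)`.

```java
import java.util.function.Consumer;

/** Hypothetical stand-in for an activity body that heartbeats after each step. */
final class HeartbeatingLoop {
    /**
     * Runs the given number of steps, pausing between them and reporting the
     * step index through the heartbeat callback. Returns the steps completed.
     */
    static int run(int steps, long pauseMillis, Consumer<Integer> heartbeat) {
        int completed = 0;
        for (int i = 0; i < steps; i++) {
            try {
                Thread.sleep(pauseMillis); // stand-in for one unit of real work
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag set
                throw new RuntimeException("interrupted during step " + i, e);
            }
            heartbeat.accept(i); // report progress once per step
            completed++;
        }
        return completed;
    }
}
```

The pause between heartbeats has to stay comfortably below the heartbeat timeout; with the 3-second timeout from this thread, a pause of 1000 ms leaves some margin for network delay.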

maxim · March 24, 2021, 4:13am

I’m not able to reproduce the problem. Here is my code. After the worker that executes the activity is killed, the activity is retried on a different worker once the heartbeat timeout fires.

roadroller · March 24, 2021, 9:41am

Thank you for the code example! I’ve finally found the reason: it was a stupid mistake in my code. Sorry for the disturbance, everything works as expected in this case.
