Workflow not recovered after crash

Hello,
I have a simple workflow to test failover scenarios. I see undesired behaviour in the following simple case (the activity method is configured with a heartbeat timeout of 3 seconds and sends a heartbeat every second):

  • the workflow initiator process starts the workflow asynchronously
  • the first worker starts the workflow execution; I see the workflow task executed in thread “workflow-method” and the activity method started in a thread like “Activity Executor taskQueue=“MAIN_TASK_QUEUE”, namespace=“default”: 1”
  • I stop the first worker to simulate a crash and see that the second worker tries to execute the activity task but fails with the error "NOT_FOUND: invalid activityID or activity already timed out or invoking workflow is completed". It retries and fails for the same reason, so the execution hangs in these retries.

Please explain what I’m doing wrong.

Maybe a code example will be helpful:

// Create and start workflow
WorkflowOptions workflowOptions = WorkflowOptions.newBuilder()
        .setTaskQueue(IMainTaskQueue.MAIN_TASK_QUEUE)
        .setWorkflowId(UUID.randomUUID().toString())
        .build();
ISimpleWorkflow simpleWorkflow = workflowClient.newWorkflowStub(ISimpleWorkflow.class, workflowOptions);
WorkflowClient.start(simpleWorkflow::doWork, payload);
...
// Activity options
private ActivityOptions ACTIVITY_OPTIONS = ActivityOptions.newBuilder()
        .setScheduleToCloseTimeout(Duration.ofDays(1))
        .setHeartbeatTimeout(Duration.ofSeconds(3))
        .build();

I would start troubleshooting by looking at the workflow execution history. Could you post it here?

@maxim thank you for the reply!
I exported the execution history from the Temporal Web UI but can’t attach it here,
so I placed it in Google Docs: https://docs.google.com/document/d/19412wqlrcU159ZtSHZPSLyE2DE89H4RFeKP5rj7XrOQ
Is this what you want?

It looks like the activity is failing or timing out. Do you see the activity information in the summary view of the workflow? It should show how many times the activity was retried and the last error information.

Yes, I see a number of attempts that is constantly increasing, and lastFailure is “activity timeout”.
The activity is configured with ScheduleToCloseTimeout == 1 day and HeartbeatTimeout == 3 seconds, so I assume the activity timed out because the HeartbeatTimeout was exceeded.
But I can’t find the right way to recover the workflow after the worker crashes. What is wrong in my configuration?

I’ve shared my sandbox project (GitHub - RoadRoller/temporal). Could you take a look and suggest what I’m doing wrong?

So the workflow is fine and doesn’t need recovering. Your activity is constantly timing out and being retried according to the default retry policy. Is your activity heartbeating? The heartbeat timeout requires the activity to call the heartbeat method at least once every 3 seconds in your case.
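
To make the heartbeat timeout rule concrete, here is a minimal sketch in plain Java. It is a toy model of the check described above, not Temporal's actual server implementation; the class name HeartbeatDeadline and its methods are hypothetical and exist only for illustration.

```java
import java.time.Duration;
import java.time.Instant;

/** Toy model of a heartbeat timeout check (hypothetical, not Temporal code). */
final class HeartbeatDeadline {
    private final Duration timeout;
    private Instant lastHeartbeat;

    HeartbeatDeadline(Duration timeout, Instant startedAt) {
        this.timeout = timeout;
        // Until the first heartbeat arrives, the clock runs from the attempt start.
        this.lastHeartbeat = startedAt;
    }

    /** Records a heartbeat, which pushes the deadline forward. */
    void heartbeat(Instant now) {
        lastHeartbeat = now;
    }

    /** Returns true if more than the timeout has elapsed since the last heartbeat. */
    boolean isTimedOut(Instant now) {
        return Duration.between(lastHeartbeat, now).compareTo(timeout) > 0;
    }

    public static void main(String[] args) {
        Instant t0 = Instant.parse("2024-01-01T00:00:00Z");
        HeartbeatDeadline deadline = new HeartbeatDeadline(Duration.ofSeconds(3), t0);

        deadline.heartbeat(t0.plusSeconds(1));
        // 2 seconds since the last heartbeat: still within the 3 second timeout.
        System.out.println(deadline.isTimedOut(t0.plusSeconds(3)));
        // 4 seconds since the last heartbeat (e.g. the worker was stopped): timed out.
        System.out.println(deadline.isTimedOut(t0.plusSeconds(5)));
    }
}
```

In this model, an activity that heartbeats every second never trips a 3 second timeout, while a stopped worker trips it shortly after its last heartbeat.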

My activity sends heartbeats every second, and the workflow executes successfully when I don’t simulate a worker crash. As I understand it, heartbeating is the mechanism by which the Temporal service determines that an activity has failed on some worker, and this part works fine: the server determines that the activity failed when the worker stopped. What I expect next is that the workflow continues execution on another worker, but instead the activity fails on the second worker with the exception “NOT_FOUND: invalid activityID or activity already timed out or invoking workflow is completed”.
So I can’t simulate the simplest failover scenario:

  1. workflow execution started
  2. worker crashed
  3. workflow continues execution on another worker

IMHO this should be key functionality of the Temporal framework, so I suppose I don’t understand something about the basic principles of how it should work.

For the workflow to continue executing on another worker, the activity has to complete first, as the workflow is blocked on the activity completion. As you pointed out earlier, the activity is constantly retried by the framework, which is indicated by the ever-growing attempt number.
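
As a rough illustration of why the attempt number keeps growing, the sketch below computes the delay before each retry under an exponential backoff policy. The parameter values are an assumption based on the defaults commonly documented for Temporal (initial interval 1 second, backoff coefficient 2.0, maximum interval 100 times the initial interval, unlimited attempts); the class RetryBackoffSketch and its method are hypothetical and are not part of the SDK.

```java
import java.time.Duration;

/** Toy exponential backoff calculator (hypothetical, not part of the Temporal SDK). */
final class RetryBackoffSketch {

    /**
     * Delay before the given 1-based attempt. Attempt 1 is the first try and has no delay;
     * each later attempt waits initial * coefficient^(attempt - 2), capped at max.
     */
    static Duration delayBeforeAttempt(int attempt, Duration initial, double coefficient, Duration max) {
        if (attempt <= 1) {
            return Duration.ZERO;
        }
        double millis = initial.toMillis() * Math.pow(coefficient, attempt - 2);
        long capped = (long) Math.min(millis, (double) max.toMillis());
        return Duration.ofMillis(capped);
    }

    public static void main(String[] args) {
        // Assumed defaults: 1s initial interval, coefficient 2.0, 100s maximum interval.
        Duration initial = Duration.ofSeconds(1);
        Duration max = Duration.ofSeconds(100);
        for (int attempt = 1; attempt <= 10; attempt++) {
            Duration delay = delayBeforeAttempt(attempt, initial, 2.0, max);
            System.out.println("attempt " + attempt + " starts after a delay of " + delay.getSeconds() + "s");
        }
    }
}
```

With no limit on attempts, a policy like this keeps scheduling new attempts until one succeeds or the schedule-to-close timeout (1 day in the configuration above) expires, which matches the steadily increasing attempt count seen in the summary view.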

Try increasing the heartbeat timeout (let’s say to 20 seconds) to see if the issue is due to the heartbeat not reaching the service in time.

I tried playing with the heartbeat timeout, but no luck.
As I understand it, after the first worker crashed, the heartbeat timeout was detected and persisted by the server. This first detected heartbeat timeout is not reset for the next execution attempts, which leads to pointless infinite retries.
I suppose the expected behaviour is that the previous heartbeat timeout is cleared for the next activity execution attempt; otherwise a long-running activity never has a chance of completing successfully after a worker crash.

Does your activity heartbeat? If it does not, then changing the heartbeat timeout is not going to help, as it is always going to time out.

Yes, my activity sends heartbeats every second with

Activity.getExecutionContext().heartbeat(i);

and it executes successfully when I don’t simulate a worker crash.

I’m not able to reproduce the problem. Here is my code. After the worker that executes the activity is killed, the activity is retried on a different worker once the heartbeat timeout expires.

Thank you for the code example! Finally, I’ve found the reason: it was a stupid mistake in my code. Sorry for the disturbance; everything works as expected in this case.
