Restarting Terminated, timed out, and canceled workflows representing cron jobs

Hi,

I’m trying to build a feature in my API to restart failed/ended workflows. I’ve noticed that this does not work for cron jobs, whereas other workflows can be restarted.

The following is my reset code:

public String resetWorkflowExecution(String workflowId, String terminationReason) {
    WorkflowStub workflowStub = client.newUntypedWorkflowStub(workflowId, Optional.empty(), Optional.empty());
    WorkflowExecution execution = workflowStub.getExecution();
    ResetWorkflowExecutionRequest request = ResetWorkflowExecutionRequest.newBuilder()
            .setNamespace("default")
            .setReason(terminationReason)
            .setRequestId(UUID.randomUUID().toString())
            .setWorkflowExecution(execution)
            .setWorkflowTaskFinishEventId(4)
            .build();

    ResetWorkflowExecutionResponse response = client.getWorkflowServiceStubs().blockingStub().resetWorkflowExecution(request);

    return "reset id: " + response.getRunId();
}

and here is the error I get when restarting terminated and timed-out workflows:

"status": 500,
    "error": "Internal Server Error",
    "trace": "io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Workflow task finish ID must be > 1 && <= workflow last event ID.\r\n\tat io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:271)\r\n\tat io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:252)\r\n\tat io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:165)\r\n\tat io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.resetWorkflowExecution(WorkflowServiceGrpc.java:3015)

and this goes on and on. I see in the code where the error occurs, but I’m unsure how to fix this. I’m having trouble understanding what setWorkflowTaskFinishEventId does.

Another note: this error is not given when restarting a canceled workflow. My success message is sent, but it fails to restart. It just cancels/fails again.

Any help is appreciated. Thanks

"status": 500,
    "error": "Internal Server Error",
    "trace": "io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Workflow task finish ID must be > 1 && <= workflow last event ID.\r\n\tat io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:271)\r\n\tat io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:252)\r\n\tat io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:165)\r\n\tat io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.resetWorkflowExecution(WorkflowServiceGrpc.java:3015)

This error indicates that the reset event id you specify in your ResetWorkflowExecutionRequest is either <= 1 or > the latest event id in this workflow’s history.

Workflow executions are always created by the server. Typically, when your client code requests a new execution, the server creates it and, once it is created (and persisted), puts a new workflow task on the task queue right away for your workers to pick up and start executing your workflow code.
At that point you have the following workflow history:

  1. WorkflowExecutionStarted (server created the workflow exec and started it)
  2. WorkflowTaskScheduled (server placed the wf task on the task queue)

When your worker picks up the task and completes the first workflow task (runs your workflow code up to the point where it has accumulated commands to send back to the server and the workflow execution blocks), you would also see:

  3. WorkflowTaskStarted
  4. WorkflowTaskCompleted

and so on…

I assume you are resetting to event id 4 because you want to restart this execution right after the first workflow task was completed, which typically should be ok.

With cron workflows it’s a bit different because of your defined cron schedule. The server is still going to create and start the workflow execution, but it is going to wait before placing the first workflow task onto the task queue. The time it’s going to wait is noted in the firstWorkflowTaskBackoff property of the WorkflowExecutionStarted event.
What this means is that until the defined cron “fires”, the only event you will see in this history is:

  1. WorkflowExecutionStarted (server created the workflow exec and started it, with firstWorkflowTaskBackoff set).

Now, cron workflows can time out on two things: WorkflowRunTimeout, which times out the current cron run, and WorkflowExecutionTimeout, which defines the global timeout for all cron runs of this cron execution.

If you have a cron workflow where the cron never “fired” but the workflow timed out due to one of these timeouts, you would see in history:

  1. WorkflowExecutionStarted (server created the workflow exec and started it)
  2. WorkflowExecutionTimedOut

similarly with terminated:

  1. WorkflowExecutionStarted
  2. WorkflowExecutionTerminated

In both cases the event id you are trying to reset to (4) is greater than the latest event id in the history (2), which causes the error you have shown.
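To make the range rule from the error message concrete, here is a minimal sketch. It is not Temporal SDK code: `ResetIdCheck` and `isWithinHistory` are hypothetical names, and the history is modeled as a plain list of event type names whose 1-based positions stand in for event ids.

```java
import java.util.List;

// Hypothetical helper, not part of the Temporal SDK. The history is modeled
// as an ordered list of event type names; event ids are 1-based positions.
final class ResetIdCheck {

    // Mirrors the rule quoted in the error message: the reset id must be
    // > 1 and <= the last event id in the workflow history.
    static boolean isWithinHistory(long resetEventId, List<String> history) {
        long lastEventId = history.size();
        return resetEventId > 1 && resetEventId <= lastEventId;
    }
}
```

With the two-event cron histories above, `isWithinHistory(4, history)` is false, while a history that reached WorkflowTaskCompleted as event 4 passes the check.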

I think currently there is not a way to reset such cron workflows, but you should be able to start a new cron execution and provide the same input if possible.

One thing you could do in your reset code is to get the history via the GetWorkflowExecutionHistory API and check whether its latest event id is at least your reset id.

Also to add, we have improved our cron feature a lot in the 1.17 server release (docs here), but note that the APIs for it will be added to the SDKs soon.

Hope this helps.

Yes, this actually helps my understanding a lot. Would it work to get the length of the workflow execution history, and use this length as the event id to reset to? Or is that not possible with GetWorkflowExecutionHistory?

Would it work to get the length of the workflow execution history, and use this length as the event id to reset to?

Unfortunately no, you need to reset to either a WorkflowTaskStarted or a WorkflowTaskCompleted event in your history. Otherwise you can run into a StatusRuntimeException with the message “this event must be at the boundary” when trying to reset.

You could use GetWorkflowExecutionHistory to check whether the event id you want to reset to would cause the previously mentioned error (that is, whether the workflow history has fewer events than the reset id). You could also check that the history has at least one WorkflowTaskStarted or WorkflowTaskCompleted event if you wanted.
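As a rough sketch of that check, the snippet below scans a history for the first WorkflowTaskCompleted event and returns its id, or nothing if the history never got that far. It is only an illustration: `ResetPointFinder` and `firstWorkflowTaskCompleted` are hypothetical names, and the history is modeled as a list of event type names (1-based position = event id) rather than the events you would actually get back from GetWorkflowExecutionHistory.

```java
import java.util.List;
import java.util.OptionalLong;

// Hypothetical helper, not part of the Temporal SDK. In real code the event
// types would come from the history returned by GetWorkflowExecutionHistory.
final class ResetPointFinder {

    // Returns the event id (1-based position) of the first
    // WorkflowTaskCompleted event, or empty if the history has none.
    static OptionalLong firstWorkflowTaskCompleted(List<String> history) {
        for (int i = 0; i < history.size(); i++) {
            if ("WorkflowTaskCompleted".equals(history.get(i))) {
                return OptionalLong.of(i + 1L);
            }
        }
        return OptionalLong.empty();
    }
}
```

If the result is empty (as it would be for a cron run that timed out or was terminated before firing), the caller could skip the reset and start a new execution instead of hard-coding event id 4.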

Ok, thank you

Let’s say a cron job has fired, its next iteration has started, and we terminate it at this point. Based on the available information, it seems we could restart this workflow by creating a new untyped workflow stub using the workflow id and run id of the completed run (the first one), in which the fourth event will be “WorkflowTaskCompleted”. Is this true?

But then I also understand that this approach only works when a cron job has been terminated/canceled/timed out after it has had a completed run.

You can reset a previously completed cron execution, if that is what you are asking.
If you have a cron execution running and you reset a previously completed execution, the currently running one is going to be terminated, so you might want to be careful if you do that.

I’m not sure about using UntypedWorkflowStub in this scenario; can you give more info?
Also, could you give more info on your use case please? I’m trying to see if there could be alternatives to relying purely on reset.

My use case is to be able to restart workflows that have failed, been terminated, been canceled, or timed out. So I won’t need to restart currently running workflows. My program does not have a producer; it is just an API that gets information about workflows and has the ability to restart them. My use of untyped workflow stubs can be seen in my original post. I used one to find the workflow to restart.