Restart workflow from a failed activity

Hello,

I am evaluating Temporal and have some questions about how I can resume a workflow from a failed activity.

In a given flow, let's say there is a failure, such as a failed risk check, and the flow needs to go into manual review. What is the recommended way to handle this?

Another question concerns an entity that undergoes state changes, where each change is a step of the workflow.
For example, if there is an order entity, should we maintain the order in a separate table and pass a reference to the workflow, or save the order in the Temporal workflow itself?

Looking for recommendations on both.

In a given flow, let's say there is a failure, such as a failed risk check, and the flow needs to go into manual review. What is the recommended way to handle this?

Take a look at forum posts here and here that describe possible implementations for human tasks/approvals. Let me know if you have further questions on this.

For example, if there is an order entity, should we maintain the order in a separate table and pass a reference to the workflow, or save the order in the Temporal workflow itself?

With Temporal, the workflow execution itself can be considered such an entity, one that can be updated (by workflow code or external signals) during execution. In most cases the workflow execution itself, not an external persistence store, should be your “source of truth”.
Temporal also gives you a workflow ID uniqueness guarantee: you cannot have more than one running workflow execution with a particular ID at the same time in the same namespace.
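
To make the "workflow execution as the entity" idea a bit more concrete, here is a minimal plain-Java sketch of an order whose state is changed only by its own steps and by an external review decision. It deliberately does not use the Temporal SDK: OrderEntity, its states, and its methods are hypothetical names chosen for illustration. In a real workflow, reviewDecision would roughly correspond to a signal method and state to a query method.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical order entity; a plain object standing in for one workflow execution.
final class OrderEntity {
    enum State { CREATED, MANUAL_REVIEW, APPROVED, REJECTED }

    private final String orderId;
    private State state = State.CREATED;
    private final List<State> history = new ArrayList<>();

    OrderEntity(String orderId) {
        this.orderId = orderId;
        history.add(state);
    }

    // Workflow step: a failed risk check parks the order for manual review
    // instead of failing the whole flow.
    void applyRiskCheck(boolean passed) {
        requireState(State.CREATED);
        transition(passed ? State.APPROVED : State.MANUAL_REVIEW);
    }

    // Signal-like method: an external reviewer's decision moves the order on.
    void reviewDecision(boolean approved) {
        requireState(State.MANUAL_REVIEW);
        transition(approved ? State.APPROVED : State.REJECTED);
    }

    // Query-like methods: read-only views of the current state.
    String orderId() { return orderId; }
    State state() { return state; }
    List<State> history() { return new ArrayList<>(history); }

    private void requireState(State expected) {
        if (state != expected) {
            throw new IllegalStateException(
                "order " + orderId + " is " + state + ", expected " + expected);
        }
    }

    private void transition(State next) {
        state = next;
        history.add(next);
    }
}
```

With this shape, the order never needs a separate "status" row that could drift from the flow: the only way to change the state is to go through the entity itself.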

Thanks Tihomir. Signals seem to work for me to restart from a particular state.

Hello,

I’m new to Temporal (Java) and looking for help with retrying failed activities through human intervention. I’m adding my question to this existing support thread, as the title describes my case precisely (though I think it differs from akrjohn’s case).

What I want to achieve:

  • my workflow has activities that call external services doing relatively long-running and complex things
  • due to the complexity of the activities we need to anticipate failures; suppose there are issues like network misconfigurations or coding bugs, which are not intermittent and require a manual fix from engineers
  • we want the ability to resume the workflow from the failure point
  • but we don’t want time-based retries that retry the failed activity on a schedule, as that can worsen the effects of the failure (achieving true idempotency in complex tasks is hard, etc.); only we, the engineers, know when it makes sense to retry
  • so ideally we want a human-controlled retry (triggered by a button in our own UI, not the Temporal UI, via a Temporal client for the particular workflow)

I consider this a fairly universal problem, but I wasn’t able to find out exactly how to do it.

The related topics I read so far:

  • retries - RetryOptions seem to offer only schedule-based retry strategies, which are not suitable for the reasons explained
  • replays - these are intended to recreate workflow state, more in the realm of testing and development than a regular way of dealing with failures
  • continue-as-new - continuation of workflows that addresses the history size limitations, again not something intended for failure handling
  • human interactions - there are a few threads on how to model human interactions with signals, but not in failure/retry scenarios (the original post from akrjohn mentions activity failure, but what he describes later is actually an activity detecting application states that are treated as errors)

Any help is greatly appreciated, including any workarounds or patterns that achieve the desired functionality (e.g. some kind of internal exception catch in the activity implementation that works with a signal in a loop).

Thank you!

Take a look at the demo on GitHub: temporalio/temporal-pause-resume-compensate

It uses Spring Boot, but the pause/resume interceptor can be applied in any situation.

FYI, this demo was tested at very large scale, if that helps with confidence.
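
For readers who just want the shape of the "pause on failure, wait for a human" idea discussed in this thread, here is a minimal sketch. It is an assumption-laden model in plain Java, not the demo's interceptor and not Temporal SDK code: ManualRetryGate and its methods are hypothetical names, and the blocking queue merely stands in for signal delivery. In real workflow code the wait would have to use the SDK's own primitives (for example Workflow.await on a flag set by a signal handler) rather than java.util.concurrent.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.LinkedBlockingQueue;

// Plain-Java model of a human-controlled retry gate. All names are hypothetical;
// this is neither Temporal SDK code nor the demo's interceptor.
final class ManualRetryGate {
    enum Decision { RETRY, FAIL }

    // Buffers operator decisions, loosely like signals delivered to a workflow.
    private final BlockingQueue<Decision> decisions = new LinkedBlockingQueue<>();
    private volatile boolean awaitingManualRetry = false;

    void retry() { decisions.add(Decision.RETRY); }              // "retry" signal
    void fail() { decisions.add(Decision.FAIL); }                // "fail" signal
    boolean awaitsManualRetry() { return awaitingManualRetry; }  // "query"

    // Runs the step; on failure, blocks until an operator decides what to do.
    <T> T run(Callable<T> step) throws Exception {
        while (true) {
            try {
                return step.call();
            } catch (Exception failure) {
                awaitingManualRetry = true;
                Decision decision;
                try {
                    decision = decisions.take(); // no timer: waits for a human
                } finally {
                    awaitingManualRetry = false;
                }
                if (decision == Decision.FAIL) {
                    throw failure; // give up and surface the original error
                }
                // RETRY: fall through and call the step again
            }
        }
    }
}

final class ManualRetryGateDemo {
    public static void main(String[] args) throws Exception {
        ManualRetryGate gate = new ManualRetryGate();
        int[] calls = {0};
        gate.retry(); // queued up front so this single-threaded demo does not block
        String result = gate.run(() -> {
            if (++calls[0] == 1) throw new IllegalStateException("simulated outage");
            return "done";
        });
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

The point of the sketch is only the control flow: there is no retry schedule, the step is re-run solely when a retry decision arrives, and a fail decision rethrows the original error.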

Thank you Tihomir. The demo looks interesting; it also covers another aspect we aim to address soon, namely reverts. I adapted the approach to our code but I still have some blockers.

What we needed to do differently is to communicate the retry/fail signals to particular workflow executions, rather than applying it to all workflows, as your demo does. I have a similar listener registered on the interceptor:

public interface PauseResumeInterceptorListener {

    @SignalMethod
    void retry();

    @SignalMethod
    void fail();

    @QueryMethod(name = "awaitsManualRetry")
    Boolean awaitsManualRetry();

}

Communication via signal request is working fine. When I spin up two workflow instances and use a particular workflow ID, this code seems to correlate correctly with the corresponding interceptor instance:

    private void sendSignal(String workflowId, String signalName) {
        WorkflowServiceGrpc.WorkflowServiceBlockingStub workflowServiceBlockingStub =
                workflowClient.getWorkflowServiceStubs().blockingStub();

        SignalWorkflowExecutionRequest req = SignalWorkflowExecutionRequest.newBuilder()
                .setNamespace(workflowClient.getOptions().getNamespace())
                .setWorkflowExecution(WorkflowExecution.newBuilder().setWorkflowId(workflowId))
                .setSignalName(signalName)
                .build();

        workflowServiceBlockingStub.signalWorkflowExecution(req);
    }

Now, the additional questions.

1) Finding out that a retry/fail human intervention is needed
We need to find out that a manual retry/fail decision is required. Again, we need this per workflow execution. The workflow event history does not show any error once the workflow ends up “hanging” in the PauseResumeWorkflowOutboundCallsInterceptor. I guess this is conceptually expected, as the activity hasn’t technically failed (yet). FYI, I’m posting a separate question on activity histories too.

So what I tried instead: I added the awaitsManualRetry query method (see the code snippet above). The documentation on messages does not describe our case, as it demonstrates messages via typed stubs. We do not have a workflow stub available, because we need to interact with a running workflow over multiple UI sessions (so we persist the workflow ID, and the ID is always our starting point). I tried invoking the query like this:

        WorkflowStub ws = workflowClient.newUntypedWorkflowStub(workflowId);
        Boolean b = ws.query("awaitsManualRetry", Boolean.class);

OR

        WorkflowQuery query = WorkflowQuery.newBuilder().setQueryType("awaitsManualRetry").build();
        QueryWorkflowRequest queryWorkflowRequest = QueryWorkflowRequest.newBuilder()
                .setNamespace(workflowClient.getOptions().getNamespace())
                .setExecution(WorkflowExecution.newBuilder().setWorkflowId(workflowId))
                .setQuery(query)
                .build();

        QueryWorkflowResponse resp = workflowServiceBlockingStub.queryWorkflow(queryWorkflowRequest);

But I always get:

2024-11-14T15:06:48.926 [scheduling-1] DEBUG i.t.internal.retryer.GrpcRetryer - Final exception, throwing mdc=[traceId=ba34bd2c5b610a3632e4455890a049e8, spanId=23aff953699b78de]
io.grpc.StatusRuntimeException: NOT_FOUND: sql: no rows in result set
	at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:271)
	at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:252)
	at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:165)
	at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.queryWorkflow(WorkflowServiceGrpc.java:4821)
	at io.temporal.internal.client.external.GenericWorkflowClientImpl.lambda$query$10(GenericWorkflowClientImpl.java:208)
	at io.temporal.internal.retryer.GrpcSyncRetryer.retry(GrpcSyncRetryer.java:69)
	at io.temporal.internal.retryer.GrpcRetryer.retryWithResult(GrpcRetryer.java:60)
	at io.temporal.internal.client.external.GenericWorkflowClientImpl.query(GenericWorkflowClientImpl.java:203)
	at io.temporal.internal.client.RootWorkflowClientInvoker.query(RootWorkflowClientInvoker.java:424)
	at io.temporal.common.interceptors.WorkflowClientCallsInterceptorBase.query(WorkflowClientCallsInterceptorBase.java:68)
	at io.temporal.opentracing.internal.OpenTracingWorkflowClientCallsInterceptor.query(OpenTracingWorkflowClientCallsInterceptor.java:116)
	at io.temporal.client.WorkflowStubImpl.query(WorkflowStubImpl.java:317)
	at io.temporal.client.WorkflowStubImpl.query(WorkflowStubImpl.java:307)

Any idea what is going wrong? Remember, our listener is registered on the interceptor rather than on the workflow (same as in your demo). Is this a factor?

2) Resilience
The human-intervention retry/fail logic is implemented in the interceptor. It waits for the signal via Promises. Is my understanding correct that this state won’t survive a JVM restart? The same applies to the list of saga compensation activities. What will happen if the workflow is paused on the Promise, because the retry/fail signal has not been sent yet, and the JVM is restarted? Will the workflow instance be reloaded, and will the activity that was waiting for the human interaction be re-executed automatically?

Thank you!