Handling failure/success activities scheduled in parallel on reset/new run

Rajendra_Prasad_G · May 6, 2024, 6:26pm

We have a scenario in which two activities are executed in parallel and one of them fails. On a reset, how can we make sure that the successfully completed activity is not re-triggered?

For example, my event history looks like the one below, where two activities are scheduled in parallel at events 5 (failureOnFirstTry) and 6 (SayHello), and one of them fails (event 11). The reset option lets me reset the workflow to either event 4 or event 13 (both WorkflowTaskCompleted).

When I reset the workflow to event 4, both activities are re-scheduled and executed. We want to avoid that and execute only the failed one.

We would like to know how this can be implemented in the worker process so that it identifies the failed activities and executes only those.

Event History

| ID | Date & Time | Event | Details |
|----|-------------|-------|---------|
| 14 | 2024-05-06 UTC 16:48:57.77 | WorkflowExecutionFailed | Failure Message: Activity task failed |
| 13 | 2024-05-06 UTC 16:48:57.77 | WorkflowTaskCompleted | Scheduled Event ID: 9 |
| 12 | 2024-05-06 UTC 16:48:57.64 | WorkflowTaskStarted | Scheduled Event ID: 9 |
| 11 | 2024-05-06 UTC 16:48:57.59 | ActivityTaskFailed | Failure Message: Failed |
| 10 | 2024-05-06 UTC 16:48:57.46 | ActivityTaskStarted | Scheduled Event ID: 5 |
| 9 | 2024-05-06 UTC 16:48:57.53 | WorkflowTaskScheduled | Task Queue Name: gvp-worker-1:a63764ca-ca3e-4b91-87dc-9351426a2a32 |
| 8 | 2024-05-06 UTC 16:48:57.53 | ActivityTaskCompleted | Result: ["Hello Temporal at Mon May 06 11:48:57 CDT 2024"] |
| 7 | 2024-05-06 UTC 16:48:57.42 | ActivityTaskStarted | Scheduled Event ID: 6 |
| 6 | 2024-05-06 UTC 16:48:57.34 | ActivityTaskScheduled | Activity Type: SayHello |
| 5 | 2024-05-06 UTC 16:48:57.34 | ActivityTaskScheduled | Activity Type: failureOnFirstTry |
| 4 | 2024-05-06 UTC 16:48:57.34 | WorkflowTaskCompleted | Scheduled Event ID: 2 |
| 3 | 2024-05-06 UTC 16:48:56.99 | WorkflowTaskStarted | Scheduled Event ID: 2 |
| 2 | 2024-05-06 UTC 16:48:56.89 | WorkflowTaskScheduled | Task Queue Name: demo-queue |
| 1 | 2024-05-06 UTC 16:48:56.88 | WorkflowExecutionStarted | |

Any help on this is much appreciated.

maxim · May 6, 2024, 6:28pm

Currently, this is not supported. We are planning multiple improvements to reset, and your use case will be taken into consideration.

Rajendra_Prasad_G · May 6, 2024, 6:52pm

Thanks @maxim for the quick response.

While we wait for a strategic solution from Temporal, is there any workaround to handle this situation?

One idea I have is below. Let me know if it would work.

1. Identify whether the current activity execution is part of a new run ID or a reset run ID.
2. If it is part of a new run, execute it.
3. If it is part of a reset run, check its status in the previous run and decide whether to execute the activity.

To get this to work, we also need to find out the following in the worker:

1. How to identify the current run type (whether it is new or the result of a reset).
2. If it is a reset, how to get the previous run details (execution status).
3. How to query or identify the execution status of an activity by run ID.

maxim · May 6, 2024, 7:47pm

What is the business problem you are trying to solve? Reset is expected to be used for exceptional situations like bugs. Have you considered implementing your workflow to support your use case without failing?

Rajendra_Prasad_G · May 22, 2024, 10:27am

@maxim Sorry for the late reply.

We’re currently developing an orchestration platform to automate our infrastructure management, overseeing the coordination of more than 50 downstream services crucial to our processes.

Occasionally, some of these services may become temporarily unavailable due to maintenance, resulting in failures in downstream API calls. It’s imperative that our platform can efficiently recover from these failures once the services are operational again. However, we must also ensure that our workflows do not run indefinitely in the event of failures from downstream systems.

If resets are reserved for exceptional cases only, we need a robust strategy to handle failures when a workflow fails for any reason.

Ci-Ci_Thomson · May 22, 2024, 2:12pm

@Rajendra_Prasad_G I am the Solutions Architect assigned to your company account. In general, it is not a good idea to expect your Workflow to fail. Error handling and retries should usually be done at the Activity level. If you would like to email me at cici.thomson@temporal.io, I’m happy to schedule a call and go over your use case in more detail so I can give more specific recommendations.

tihomir · May 22, 2024, 3:31pm

We have a demo that uses interceptors that can “pause” a workflow execution and wait for an event (signal) to retry the last failed activity and then continue, if that helps: GitHub - tsurdilo/temporal-pause-resume-compensate
(note this works for both sync and async activities)

In general, yes, reset can be useful, but being able to detect and act upon a downstream failure and incorporate it into the business logic of your use case is imo equally important. Temporal imo allows you to do this exceptionally well and to monitor and alert on these situations. If you give a bit more info on your specific scenario, I would be happy to provide details.

Rajendra_Prasad_G · May 22, 2024, 4:57pm

@tihomir - my scenario is described in the question itself. The requirement is as follows:

A couple of activities are executed in parallel, and when any of them fails for any reason, we want the workflow to recover and continue by retrying only the failed activities.

Maxim responded that this scenario is not supported at the moment, so we are looking for alternative ways to handle it.

maxim · May 22, 2024, 6:03pm

@Rajendra_Prasad_G My response was about the “reset” functionality, which does not support preserving the state of the other activity.

Your scenario is trivially supported by retrying the activity as long as needed to recover. Keep retrying them until they are complete, and the workflow will continue executing without any manual intervention.
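To illustrate the idea of per-activity retries, here is a minimal plain-Java sketch. It does not use the Temporal SDK: the class `ParallelRetrySketch`, the helper `retry`, and the two stand-in methods are hypothetical names invented for this example (only the activity names are borrowed from the event history above), and in a real workflow the retrying would typically come from the retry policy configured on the activity rather than a hand-written loop. The sketch only shows that when each parallel task is retried on its own, the one that already succeeded is not executed again.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class ParallelRetrySketch {

    // Count how many times each stand-in activity body actually runs.
    static final AtomicInteger sayHelloRuns = new AtomicInteger();
    static final AtomicInteger flakyRuns = new AtomicInteger();

    // Stand-in for the SayHello activity: succeeds on the first try.
    static String sayHello() {
        sayHelloRuns.incrementAndGet();
        return "Hello Temporal";
    }

    // Stand-in for failureOnFirstTry: fails once, then succeeds.
    static String failureOnFirstTry() {
        if (flakyRuns.incrementAndGet() == 1) {
            throw new IllegalStateException("downstream unavailable");
        }
        return "recovered";
    }

    // Hypothetical helper: retries one task until it succeeds or
    // maxAttempts is reached, doubling the delay between attempts.
    // The attempt cap keeps the caller from waiting indefinitely.
    static <T> T retry(Supplier<T> task, int maxAttempts, long initialDelayMs) {
        long delayMs = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                if (attempt >= maxAttempts) {
                    throw e;
                }
                try {
                    Thread.sleep(delayMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting to retry", ie);
                }
                delayMs *= 2;
            }
        }
    }

    // Runs both tasks in parallel; each one is retried independently.
    static String runBoth() {
        sayHelloRuns.set(0);
        flakyRuns.set(0);
        CompletableFuture<String> hello =
            CompletableFuture.supplyAsync(() -> retry(ParallelRetrySketch::sayHello, 5, 10));
        CompletableFuture<String> flaky =
            CompletableFuture.supplyAsync(() -> retry(ParallelRetrySketch::failureOnFirstTry, 5, 10));
        return hello.join() + " / " + flaky.join();
    }

    public static void main(String[] args) {
        System.out.println(runBoth());
        System.out.println("sayHello runs: " + sayHelloRuns.get());
        System.out.println("failureOnFirstTry runs: " + flakyRuns.get());
    }
}
```

In this sketch the attempt limit and the initial delay are arbitrary illustration values. A longer retry window would let a task ride out a maintenance outage, while the cap keeps the overall run bounded.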
