We have a scenario in which two activities are executed in parallel and one of them fails. On a reset, how can we make sure that the successfully completed activity is not re-triggered?
For example, my event history looks like the one below: two activities are scheduled in parallel at events 5 (failureOnFirstTry) and 6 (SayHello), and one of them fails (ActivityTaskFailed, event 11). The reset option now lets me reset the workflow to either event 4 or event 13 (both WorkflowTaskCompleted).
When I reset the workflow to event 4, both activities are re-scheduled and executed. We want to avoid this and re-execute only the failed one.
We would like to know how to implement this in the worker process so that it identifies the failed activities and executes only those.
Event History

| ID | Date & Time | Workflow Event | Details |
| --- | --- | --- | --- |
| 14 | 2024-05-06 UTC 16:48:57.77 | WorkflowExecutionFailed | Failure Message: Activity task failed |
| 13 | 2024-05-06 UTC 16:48:57.77 | WorkflowTaskCompleted | Scheduled Event ID: 9 |
| 12 | 2024-05-06 UTC 16:48:57.64 | WorkflowTaskStarted | Scheduled Event ID: 9 |
| 11 | 2024-05-06 UTC 16:48:57.59 | ActivityTaskFailed | Failure Message: Failed |
| 10 | 2024-05-06 UTC 16:48:57.46 | ActivityTaskStarted | Scheduled Event ID: 5 |
| 9 | 2024-05-06 UTC 16:48:57.53 | WorkflowTaskScheduled | Task Queue Name: gvp-worker-1:a63764ca-ca3e-4b91-87dc-9351426a2a32 |
| 8 | 2024-05-06 UTC 16:48:57.53 | ActivityTaskCompleted | Result: [“Hello Temporal at Mon May 06 11:48:57 CDT 2024”] |
| 7 | 2024-05-06 UTC 16:48:57.42 | ActivityTaskStarted | Scheduled Event ID: 6 |
| 6 | 2024-05-06 UTC 16:48:57.34 | ActivityTaskScheduled | Activity Type: SayHello |
| 5 | 2024-05-06 UTC 16:48:57.34 | ActivityTaskScheduled | Activity Type: failureOnFirstTry |
| 4 | 2024-05-06 UTC 16:48:57.34 | WorkflowTaskCompleted | Scheduled Event ID: 2 |
| 3 | 2024-05-06 UTC 16:48:56.99 | WorkflowTaskStarted | Scheduled Event ID: 2 |
| 2 | 2024-05-06 UTC 16:48:56.89 | WorkflowTaskScheduled | Task Queue Name: demo-queue |
| 1 | 2024-05-06 UTC 16:48:56.88 | WorkflowExecutionStarted | |
What is the business problem you are trying to solve? Reset is expected to be used for exceptional situations like bugs. Have you considered implementing your workflow to support your use case without failing?
We’re currently developing an orchestration platform to automate our infrastructure management, overseeing the coordination of more than 50 downstream services crucial to our processes.
Occasionally, some of these services may become temporarily unavailable due to maintenance, resulting in failures in downstream API calls. It’s imperative that our platform can efficiently recover from these failures once the services are operational again. However, we must also ensure that our workflows do not run indefinitely in the event of failures from downstream systems.
If resets are reserved for exceptional cases only, we need a robust strategy to handle failures when a workflow fails for any reason.
@Rajendra_Prasad_G I am the Solutions Architect assigned to your company account. In general, it is not a good idea to expect your Workflow to fail. Error handling and retries should usually be done at the Activity level. If you would like to email me at cici.thomson@temporal.io, I’m happy to schedule a call and go over your use case in more detail so I can give more specific recommendations.
We have a demo that uses interceptors that can “pause” a workflow execution and wait for an event (signal) to retry the last failed activity and then continue, if that helps: GitHub - tsurdilo/temporal-pause-resume-compensate
(note this works for both sync and async activities)
In general, yes, reset can be useful, but being able to detect and act upon downstream failures, and to incorporate that handling into the business logic of your use case, is imo equally important. Temporal imo allows you to do this exceptionally well and to monitor and alert on these situations. If you give a bit more info on your specific scenario, I would be happy to provide details.
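The pause-and-wait idea mentioned above can be sketched in plain Java, independent of the linked demo and of the Temporal SDK. This is only an illustration under stated assumptions, not the repository's implementation: the `signals` queue is a hypothetical stand-in for a workflow signal, and `runWithPause`, `awaitSignal`, and `callDownstream` are names invented for this example.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

class PauseUntilSignalSketch {
    // Hypothetical stand-in for a signal channel (assumption, not a Temporal API).
    static final BlockingQueue<String> signals = new LinkedBlockingQueue<>();
    // Counts how many times the downstream call was attempted.
    static final AtomicInteger calls = new AtomicInteger();

    // Runs the step; on failure, blocks until a signal arrives, then retries only that step.
    static <T> T runWithPause(Callable<T> step) {
        while (true) {
            try {
                return step.call();
            } catch (Exception e) {
                String signal = awaitSignal();
                if (!"RETRY".equals(signal)) {
                    throw new IllegalStateException("stopped by signal " + signal, e);
                }
            }
        }
    }

    // Blocks until some signal is available.
    static String awaitSignal() {
        try {
            return signals.take();
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while paused", ie);
        }
    }

    // Hypothetical downstream call that fails on its first attempt only.
    static String callDownstream() {
        if (calls.incrementAndGet() == 1) {
            throw new IllegalStateException("service under maintenance");
        }
        return "done";
    }

    public static void main(String[] args) {
        signals.add("RETRY"); // in practice sent once the downstream service is back
        String result = runWithPause(PauseUntilSignalSketch::callDownstream);
        System.out.println(result + " after " + calls.get() + " attempts");
    }
}
```

Any signal other than `RETRY` ends the wait with an exception, which is one possible way to keep a paused execution from waiting forever.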
@tihomir - my scenario is listed in the question itself, the requirement I have here is as follows:
A couple of activities are executed in parallel, and when any of them fails for any reason, we want the workflow to recover and continue by retrying only the failed activities.
Maxim responded that this scenario is not supported at the moment, so I am looking for alternative ways to handle it.
@Rajendra_Prasad_G My response was about the “reset” functionality, which does not support preserving the state of the other activity.
Your scenario is trivially supported by retrying the failed activities for as long as needed to recover. Keep retrying them until they complete, and the workflow will continue executing without any manual intervention.
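As a rough illustration of why per-activity retries avoid the problem in the question, here is a minimal plain-Java sketch. It does not use the Temporal SDK: `CompletableFuture` stands in for the parallel activity invocations, the method names `sayHello` and `failureOnFirstTry` are borrowed from the event history, and `retry`, `pause`, `runBoth`, and the two call counters are hypothetical names invented for this example. In a real workflow, the hand-written loop would presumably be replaced by the retry policy configured on the activity, and a bound such as a maximum attempt count keeps the workflow from running indefinitely.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

class ParallelRetrySketch {
    // Hypothetical counters, used only to show how often each task body runs.
    static final AtomicInteger sayHelloCalls = new AtomicInteger();
    static final AtomicInteger flakyCalls = new AtomicInteger();

    // Retries a single task up to maxAttempts times; other tasks are unaffected.
    static <T> T retry(Callable<T> task, int maxAttempts, long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    pause(backoffMillis * attempt); // simple linear backoff
                }
            }
        }
        throw new IllegalStateException("gave up after " + maxAttempts + " attempts", last);
    }

    static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during backoff", ie);
        }
    }

    // Succeeds on the first attempt.
    static String sayHello(String name) {
        sayHelloCalls.incrementAndGet();
        return "Hello " + name;
    }

    // Fails on its first attempt only, simulating a temporarily unavailable service.
    static String failureOnFirstTry() {
        if (flakyCalls.incrementAndGet() == 1) {
            throw new IllegalStateException("downstream unavailable");
        }
        return "recovered";
    }

    // Starts both tasks in parallel; each has its own independent retry loop.
    static String runBoth() {
        CompletableFuture<String> hello =
            CompletableFuture.supplyAsync(() -> retry(() -> sayHello("Temporal"), 5, 10));
        CompletableFuture<String> flaky =
            CompletableFuture.supplyAsync(() -> retry(() -> failureOnFirstTry(), 5, 10));
        return hello.join() + " / " + flaky.join();
    }

    public static void main(String[] args) {
        String result = runBoth();
        System.out.println(result);
        System.out.println("sayHello calls: " + sayHelloCalls.get()
            + ", failureOnFirstTry calls: " + flakyCalls.get());
    }
}
```

Running `main` should report one `sayHello` call and two `failureOnFirstTry` calls: only the failed task is repeated, and the completed result is kept without any reset.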