In my project, we generate millions of long-running workflows. For each activity, there’s always a chance of prolonged failure, and we don’t want to retry indefinitely. Instead, we want to let the activity fail and then manually trigger a reset once the problem is resolved. The catch is that we cannot reset workflows one by one manually. We need a way to trigger a mass reset with one click.
One thing to note is that we also don’t want to reset ALL failed workflows at once and produce a thundering herd. Hence, my end goal is to allow resetting X% of the failed workflows at a time, where X is given by the user.
At the moment, in workflow code, I can upsert a search attribute that lets me search specifically for workflows that failed because of an activity failure. Then, using ListWorkflowExecutionsRequest, I can count and search for all failed workflows matching my filter. With this count, I can write logic to handle only X% of the failed workflows.
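To make the X% step concrete, here is a minimal sketch of how the batch size could be derived from the count. The class and method names (`ResetBatch`, `batchSize`) are made up for illustration and are not part of any Temporal API. Rounding up is an assumption, chosen so that a small non-zero percentage still resets at least one workflow.

```java
final class ResetBatch {

    /**
     * Returns how many failed workflows to reset in one batch.
     *
     * @param failedCount total number of failed workflows matching the filter
     * @param percent     the user-supplied X, in the range (0, 100]
     */
    static long batchSize(long failedCount, double percent) {
        if (failedCount < 0) {
            throw new IllegalArgumentException("failedCount must not be negative");
        }
        if (percent <= 0 || percent > 100) {
            throw new IllegalArgumentException("percent must be in (0, 100]");
        }
        // Round up (an assumption) so that any non-empty result set yields
        // at least one workflow per batch.
        long size = (long) Math.ceil(failedCount * percent / 100.0);
        // Guard against rounding pushing the batch past the total.
        return Math.min(size, failedCount);
    }

    public static void main(String[] args) {
        System.out.println(batchSize(1_000_000, 5));
    }
}
```

The paging loop would then stop once it has processed `batchSize(count, X)` executions instead of walking the whole result set.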
At this point, I’m wondering if there’s a way I can find the last point of failure without reading history for each and every workflow execution from the search result (which can be in the millions).
I thought WorkflowExecutionInfo.getAutoResetPoints() would return the entire list of potential reset points that show up when we click the Reset button in the Temporal Web UI. However, when I inspected the network response, this field didn’t contain what I’m looking for.
Another question I have: what field in WorkflowExecutionInfo or WorkflowExecution can we use to detect that a workflow has been reset? My plan is to execute the mass reset in an activity by searching page by page and using the heartbeat to store the last processed page. When this activity is retried, it may re-process a page, so I need a mechanism to check whether I’ve already reset a given workflow execution and skip it.
We finally started work on improving reset. We are going to take your requirements into account.
But until it is directly supported, I recommend an alternative approach that doesn’t rely on reset. The idea is that you can write a workflow interceptor that, after receiving an activity failure, pauses until a signal is received. This signal tells it either to retry the activity or to fail it.
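Setting the Temporal SDK aside, the control flow of that idea can be modeled in plain Java. The sketch below is only an illustration: `RetryOnSignalModel`, `Action`, and `callWithManualRetry` are hypothetical names, and a `BlockingQueue` stands in for the workflow signal. A real interceptor would use the SDK's workflow-safe primitives rather than blocking a thread.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class RetryOnSignalModel {

    /** What the operator's signal asks for after a failure. */
    enum Action { RETRY, FAIL }

    /**
     * Runs the activity. On failure, waits for the next signal and then
     * either re-runs the activity (RETRY) or rethrows the failure (FAIL).
     */
    static <T> T callWithManualRetry(Supplier<T> activity, BlockingQueue<Action> signals) {
        while (true) {
            try {
                return activity.get();
            } catch (RuntimeException failure) {
                Action action;
                try {
                    // Pause here until someone sends a signal.
                    action = signals.take();
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting for a signal", ie);
                }
                if (action == Action.FAIL) {
                    throw failure;
                }
                // Action.RETRY: loop around and invoke the activity again.
            }
        }
    }

    public static void main(String[] args) {
        BlockingQueue<Action> signals = new LinkedBlockingQueue<>();
        signals.add(Action.RETRY); // the operator has decided to retry
        AtomicInteger attempts = new AtomicInteger();
        String result = callWithManualRetry(() -> {
            if (attempts.incrementAndGet() == 1) {
                throw new IllegalStateException("downstream unavailable");
            }
            return "done";
        }, signals);
        System.out.println(result + " after " + attempts.get() + " attempts");
    }
}
```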
Regarding improvements, I have another suggestion. At the moment, because ListWorkflowExecutionsRequest is paged using nextPageToken, I can only query page by page, sequentially, on a single machine. Hence, my implementation for sending signals to, cancelling, or resetting all workflows matching a filter may take a long time when the number of matching workflows is large.
If possible, please allow concurrent search by multiple machines using offset and limit. For example, assuming I have 3 machines and the limit is 5,
Machine 1 can search for offset 0, 15, 30, … sequentially
Machine 2 can search for offset 5, 20, 35, … sequentially
Machine 3 can search for offset 10, 25, 40, … sequentially
Using Async.procedure, I can easily spawn X activity tasks that would be picked up by different machines. Each task can work independently from the initial offset assigned to it and calculate subsequent offsets based on X and the limit.
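The offset arithmetic for this proposal can be sketched as follows. Note that offset/limit search is the feature being requested here, not an existing API, and `OffsetPlan` and `offsetsFor` are hypothetical names. Machines are indexed from 0 in the code, so "Machine 1" above is index 0.

```java
import java.util.ArrayList;
import java.util.List;

final class OffsetPlan {

    /**
     * Lists the offsets one machine would query, in order.
     *
     * @param machineIndex zero-based index of this machine
     * @param machineCount total number of machines (X in the text)
     * @param limit        page size used by every machine
     * @param total        total number of matching workflows
     */
    static List<Long> offsetsFor(int machineIndex, int machineCount, int limit, long total) {
        if (machineCount <= 0 || limit <= 0) {
            throw new IllegalArgumentException("machineCount and limit must be positive");
        }
        if (machineIndex < 0 || machineIndex >= machineCount) {
            throw new IllegalArgumentException("machineIndex out of range");
        }
        List<Long> offsets = new ArrayList<>();
        // Each machine starts at its own page and then skips over the
        // pages owned by the other machines.
        long stride = (long) machineCount * limit;
        for (long offset = (long) machineIndex * limit; offset < total; offset += stride) {
            offsets.add(offset);
        }
        return offsets;
    }

    public static void main(String[] args) {
        for (int machine = 0; machine < 3; machine++) {
            System.out.println("machine " + (machine + 1) + ": " + offsetsFor(machine, 3, 5, 45));
        }
    }
}
```

One caveat with any offset-based scheme: if the result set changes while the machines are paging (for example, because reset workflows drop out of the filter), the offsets shift and pages may be skipped or repeated.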
Regarding the sample code for retry on signal, I was considering building a similar mechanism directly inside workflow code, with some generic try-catch logic around every activity. Thanks for pointing out WorkerInterceptor, it’s very helpful!
I have a few questions:
1. In RetryOnSignalWorkflowOutboundCallsInterceptor, we have this List<ActivityRetryState<?>> pendingActivities, which is a simple ArrayList. Are we going to have a race condition when a workflow instance asynchronously executes multiple activities at the same time?
In addition, pendingActivities.remove() is only called on successful activity invocation. Does that mean the list will keep growing if some activities fail but the workflow ignores the errors and keeps running?
2. Regarding the comment below, let’s say I want to retry only activity type X. Where can I get X? ActivityInput doesn’t seem to contain much information other than ActivityOptions.
/**
* For the example brevity the interceptor fails or retries all activities that are waiting for an
* action. The production version might implement retry and failure of specific activities by
* their type.
*/