Recover (Fallback) pattern for failing activity?

I’m processing a batch of actions and the result of all actions should be combined in a single final artifact. Individual actions are allowed to fail technically until max-attempts and be reported back separately while the rest of the batch continues.

How do I handle the failing activity?

At first I thought of a SAGA compensation, but that exists to undo previously successful actions. I could define a compensation before attempting the activity, but that would be a workaround for something that warrants first-class recovery (in generally accepted resilience theory). An alternative, even worse in my opinion, is to make the activity aware of the retryOptions and have it explicitly recover with a different value on the final attempt. Yet another workaround would be to put a try…catch around the activity invocation in the workflow code, catch ActivityFailure, and then decide what to do, but in my opinion this is the wrong abstraction level.

The recovery logic should live at the activity level: as part of the retryOptions, as a method marked with a @RecoveryMethod annotation, or as an interface an activity can implement to define the recovery.

So, is there a neat way to define recovery logic for when an activity fails definitively as per retryOptions? As I said I don’t need the whole flow to fail.

Something like this would be ideal:

@ActivityInterface
public interface MyActivity {
	
	@ActivityMethod(name = "My description")
	MyResult foo(Object input);
	
	@RecoveryMethod(name = "Do this instead", includeExceptions = { SomeException.class }) // or whatever would be reasonable
	MyResult bar(Object input, String type, String msg);
}

As a workaround, to achieve the same result transparently for all recoverable activities I might have, I’m currently employing the following setup. It requires more interface boilerplate and generics than I like, but the end usage is very clean and works across activities:

Generic workaround
interface InvokeActivity<I,O> {
	O invoke(I input);
}

interface RecoverableActivity<I,O> {
	O recover(I input, String type, String msg);
}

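static <I,O> O tryOrRecover(RecoverableActivity<I, O> r, InvokeActivity<I, O> ia, I i) {
	try {
		return ia.invoke(i);
	} catch (ActivityFailure e) {
		// The cause is not guaranteed to be an ApplicationFailure (it can be e.g. a TimeoutFailure),
		// so check before casting instead of risking a ClassCastException.
		Throwable cause = e.getCause();
		if (cause instanceof ApplicationFailure) {
			ApplicationFailure af = (ApplicationFailure) cause;
			return r.recover(i, af.getType(), af.getOriginalMessage());
		}
		// Assumption: for any other cause, report its class name as the type.
		String type = cause == null ? "unknown" : cause.getClass().getSimpleName();
		return r.recover(i, type, e.getMessage());
	}
}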

Then my activity is as follows:

@ActivityInterface
public interface MyActivity extends RecoverableActivity<MyInput, MyOutput> {
	
	@ActivityMethod(name = "Do foo")
	MyOutput foo(MyInput input);
}

public class MyActivityImpl implements MyActivity {
	
	@Override
	public MyOutput foo(MyInput input) {
		return ...result;
	}
	
	@Override
	public MyOutput recover(MyInput input, String type, String msg) {
		return ...fallback;
	}
}

This makes invocation from the workflow very clean:

MyOutput result = tryOrRecover(myActivity, myActivity::foo, inputValue);

Unfortunately, because RecoverableActivity uses generics, Temporal concludes that the first parameter of the recover method is Object rather than the concrete type argument. As a result, Temporal tries to pass the original input object as a LinkedHashMap, which ends in a ClassCastException.
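The erasure behind this can be seen with plain reflection, without Temporal on the classpath. The following is a minimal sketch, not Temporal's actual lookup code: MyInput, MyOutput and the ErasureDemo helper are stand-ins invented for illustration. It reports the declared type of the first parameter of foo versus the inherited generic recover.

```java
import java.lang.reflect.Method;

// Stand-in types for illustration only; none of these come from the Temporal SDK.
interface RecoverableActivity<I, O> {
    O recover(I input, String type, String msg);
}

final class MyInput {}

final class MyOutput {}

interface MyActivity extends RecoverableActivity<MyInput, MyOutput> {
    MyOutput foo(MyInput input);
}

final class ErasureDemo {
    // Returns the erased type of the first parameter of the named method,
    // as seen through reflection on the given interface.
    static Class<?> firstParameterType(Class<?> iface, String methodName) {
        for (Method m : iface.getMethods()) {
            if (m.getName().equals(methodName)) {
                return m.getParameterTypes()[0];
            }
        }
        throw new IllegalArgumentException("No method named " + methodName);
    }

    public static void main(String[] args) {
        // foo is declared with a concrete parameter type.
        System.out.println(firstParameterType(MyActivity.class, "foo"));
        // recover is inherited from the generic interface, so its parameter is erased.
        System.out.println(firstParameterType(MyActivity.class, "recover"));
    }
}
```

For foo the reflected parameter type is MyInput, while for recover it is java.lang.Object. With only Object to go on, a JSON-based converter has no target class to deserialize into, which would be consistent with the LinkedHashMap described above.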

To fix this, the activity should not extend this interface directly. It should just declare a method that matches the signature, so that a method reference to it can be coerced to the functional interface:

MyOutput result = tryOrRecover(myActivity::recover, myActivity::foo, inputValue);

This actually works and satisfies my need. However, I wish this was built in.
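For readers who want to try the method-reference variant in isolation, here is a minimal sketch that runs without the Temporal SDK. Everything in it is an assumption made for illustration: StubActivityFailure stands in for ActivityFailure, and Recovery, PriceActivity, lookup and fallback are hypothetical names.

```java
import java.util.function.Function;

final class FallbackSketch {
    // Hypothetical stand-in for Temporal's ActivityFailure so the sketch runs without the SDK.
    static final class StubActivityFailure extends RuntimeException {
        final String type;

        StubActivityFailure(String type, String message) {
            super(message);
            this.type = type;
        }
    }

    // Functional interface matching the recover(input, type, msg) signature.
    @FunctionalInterface
    interface Recovery<I, O> {
        O recover(I input, String type, String msg);
    }

    // Invokes the activity; on failure, hands the input and failure details to the recovery.
    static <I, O> O tryOrRecover(Recovery<I, O> recovery, Function<I, O> activity, I input) {
        try {
            return activity.apply(input);
        } catch (StubActivityFailure e) {
            return recovery.recover(input, e.type, e.getMessage());
        }
    }

    // The "activity" declares concrete methods and extends no generic interface,
    // so its parameter types stay concrete.
    static final class PriceActivity {
        Integer lookup(String sku) {
            if (sku.isEmpty()) {
                throw new StubActivityFailure("InvalidSku", "empty sku");
            }
            return sku.length() * 10;
        }

        Integer fallback(String sku, String type, String msg) {
            return -1; // default value used when lookup fails
        }
    }

    public static void main(String[] args) {
        PriceActivity a = new PriceActivity();
        System.out.println(tryOrRecover(a::fallback, a::lookup, "abc"));
        System.out.println(tryOrRecover(a::fallback, a::lookup, ""));
    }
}
```

The generic types exist only on the helper, in workflow-side code, so the activity's own method signatures keep their concrete parameter types.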

Indeed, generics are not RPC friendly.

It is really cool that you were able to satisfy your requirements through a simple utility class. I’m not yet convinced that it should be added to the SDK itself, but anyone can create their own based on your post.

No thoughts on the general concept of recoverable activities? Or on which level this should be defined? What about the suggested annotation of @RecoveryMethod as an optional companion method to @ActivityMethod?

In the use cases I was involved in, the arguments of the recovery method were frequently different from those of the original method. The recovery logic also had to update the workflow state. That is why the Saga implementation registers a callback instead of using method annotations.

My view is that the Saga approach makes more sense, as it relies on activity retries to complete the operation for intermittent problems and registers a compensation in case of the activity’s success.

I respectfully disagree. The SAGA pattern is for dealing with partial rollbacks to fail gracefully, but finally resulting in a failed workflow. Recovery/Fallback on the other hand is about moving forward with an alternative/default value and/or workflow branch, finally resulting in a successful workflow. The term recovery is a little loaded though, since some libraries use it to mean fallback/default while others mean retry/rollback.

Regarding the use cases you mentioned, I agree this was my experience as well… until now. Regardless of our personal experiences though, Recovery/Fallback is a general principle in resilience theory. I am by no means an expert on the matter, far from it, but for example Uwe Friedrichsen goes into this at length in Patterns of Resilience and Functional Service Design and Observability and gave us this interesting chart where I highlighted some supported basics by Temporal and the missing recovery (actually mitigation) aspect:

This chart makes it very clear that SAGA is about recovery while Fallback (local recovery or forward recovery, yes I just made that up) is about mitigation. As far as I know, Temporal offers nothing to mitigate failing activities. My utility class solves this for me but I think it would benefit Temporal at large to offer mitigation facilities as well.

It’s not just Uwe though, if you look at general purpose libraries for resilience, like resilience4j, they generally support a fallback/recovery mitigation facility:

When it comes to making processes resilient, I think mitigation by fallback is an essential technique and Temporal should have first-class support for it. Take the following trivial generalized workflow for example, which is currently impossible with the SDK:

I would request you reconsider your position on this topic, @maxim.

Thanks for such a thorough reply. Let me go over the links and think about this to give you an educated opinion.

As far as I know, Temporal offers nothing to mitigate failing activities. My utility class solves this for me but I think it would benefit Temporal at large to offer mitigation facilities as well.

The good news is that Temporal offers the full power of Java (and other languages) to implement whatever requirements you have. The fact that you were able to write your utility class without any changes to the SDK or the service contradicts your statement of “Temporal offers nothing to mitigate failing activities” :).

Hi @maxim, have you been able to form more of an opinion on this?

I didn’t get much time. But currently I think that try-catch with the compensation or a default value is the way to go for the fallback scenarios.
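Applied to the batch scenario from the original question, the try-catch with a default value could look roughly like the sketch below. It is plain Java so that it runs on its own: BatchFallbackSketch, BatchResult and runBatch are hypothetical names, and the catch of RuntimeException stands in for catching ActivityFailure in real workflow code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

final class BatchFallbackSketch {
    // Collects one result per input plus a separate report of the inputs that failed.
    static final class BatchResult<O> {
        final List<O> results = new ArrayList<>();
        final List<String> failures = new ArrayList<>();
    }

    // Runs the activity for every input. A failing item is reported and replaced
    // by the default value, and the rest of the batch continues.
    static <I, O> BatchResult<O> runBatch(List<I> inputs, Function<I, O> activity, O defaultValue) {
        BatchResult<O> batch = new BatchResult<>();
        for (I input : inputs) {
            try {
                batch.results.add(activity.apply(input));
            } catch (RuntimeException e) {
                // Assumption: in a workflow this would catch ActivityFailure,
                // thrown once the retries allowed by the retry options are exhausted.
                batch.failures.add(input + ": " + e.getMessage());
                batch.results.add(defaultValue);
            }
        }
        return batch;
    }

    public static void main(String[] args) {
        List<String> inputs = Arrays.asList("1", "x", "3");
        BatchResult<Integer> batch = runBatch(inputs, s -> Integer.parseInt(s), 0);
        System.out.println(batch.results);
        System.out.println(batch.failures);
    }
}
```

The final artifact is built from results, and failures carries the separate report of the items that fell back to the default.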