Local workflow/activity fallback mode during backend failure scenarios

I’m curious if there’s ever been any discussion around supporting a fallback mode for Temporal clients in the case of a failure/outage of the Temporal backend.

Say we have a scenario where a new workflow request is initiated in the client, but for whatever reason the gRPC connection is broken, or the backend is unresponsive.

The primary recovery benefits of a Temporal workflow can only be obtained if the workflow was registered with the backend before the outage, so new workflow requests are effectively hard down during this time.

This leads me to wonder whether it would be possible to allow the client to enter a failure mode where workflows and activities could still be carried out, but contained entirely on the host receiving the request. In this mode the typical orchestration and communication with the backend service would be skipped.

This obviously comes with the tradeoff that the single host is now at risk of its own failure with no recovery options, but at the same time it would allow workflows to still be processed and would turn the outage from hard down into a functioning but degraded state.
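To make the idea concrete, here is a rough sketch of what I’m imagining on the client side, using the Go SDK. Everything specific to my example is made up: `ProcessOrderWorkflow`, the `orders` task queue, and `runLocally` (a plain, non-durable implementation of the same business logic) are placeholders, and the gRPC error check is just one way such a fallback trigger might look.

```go
package orderstart

import (
	"context"
	"log"

	"go.temporal.io/sdk/client"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// startOrFallback is a hypothetical helper: it tries to start the workflow
// through the Temporal frontend and, if the backend looks unreachable, runs
// the same business logic locally on this host, giving up durability.
func startOrFallback(ctx context.Context, c client.Client, orderID string) error {
	opts := client.StartWorkflowOptions{
		ID:        "order-" + orderID,
		TaskQueue: "orders", // assumed task queue name
	}
	_, err := c.ExecuteWorkflow(ctx, opts, "ProcessOrderWorkflow", orderID)
	if err == nil {
		return nil
	}
	// Only fall back when the backend appears unreachable, not on ordinary
	// request failures (e.g. "workflow execution already started").
	if code := status.Code(err); code == codes.Unavailable || code == codes.DeadlineExceeded {
		log.Printf("temporal backend unreachable (%v); running order %s locally in degraded mode", err, orderID)
		return runLocally(ctx, orderID)
	}
	return err
}

// runLocally is a placeholder for a non-durable, in-process version of the
// steps the workflow would normally orchestrate.
func runLocally(ctx context.Context, orderID string) error {
	// ... call the activities' underlying functions directly ...
	return nil
}
```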

Local fallback is not something we have considered, as it can lead to data loss in the case of single-node failures.

Temporal does support multi-cluster setups, so a client that can seamlessly switch between clusters to avoid service disruption is something we eventually plan to support.

Yeah, the data loss is definitely a downside, but part of me wonders whether the overall benefit of partial successes during the outage window is worth that risk.

Granted, I’d imagine this would not be an encouraged configuration, but for non-mission-critical workflows I personally would love to have this behavior in our systems.

I think it is an interesting idea. I’ll keep it in mind when thinking about future work related to high availability.