Local workflow/activity fallback mode during backend failure scenario

Shanan · April 14, 2021, 12:56am

I’m curious if there’s ever been any discussion around supporting a fallback mode for Temporal clients in the case of a failure or outage of the Temporal backend?

Say we have a scenario where a new workflow request is initiated in the client, but for whatever reason the gRPC connection is broken, or the backend is unresponsive.

The primary recovery benefits of a Temporal workflow can only be obtained if the workflow was registered before the outage, so new workflow requests are effectively hard down during this time.

This leads me to wonder whether it would be possible to let the client enter a failure mode where the workflows and activities are still carried out, but contained entirely on the host receiving the request. In this mode the typical orchestration and communication with the backend service would be skipped.

This obviously comes with the tradeoff that the single host is now at risk of its own failure with no recovery options, but at the same time it would allow workflows to possibly be processed, moving the outage from hard down to a functioning but degraded state.
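A minimal sketch of the fallback idea described above, written in plain Python rather than against any Temporal SDK. Every name here (`BackendUnavailable`, `run_with_local_fallback`, `start_on_backend`, `charge_order_locally`) is hypothetical and only illustrates the control flow: try the orchestrated path first, and if the backend cannot be reached, run the same logic in-process on the receiving host.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


class BackendUnavailable(Exception):
    """Hypothetical error standing in for a broken gRPC connection
    or an unresponsive backend."""


def run_with_local_fallback(
    start_remote: Callable[[], T],
    run_local: Callable[[], T],
) -> Tuple[T, str]:
    """Try the orchestrated path; fall back to in-process execution.

    Returns the result together with the mode that produced it.
    """
    try:
        return start_remote(), "orchestrated"
    except BackendUnavailable:
        # Degraded mode: nothing is persisted outside this process, so a
        # crash of this host loses the work with no recovery option.
        return run_local(), "degraded"


# Usage with stand-in callables (assumptions for illustration only).
def start_on_backend() -> str:
    raise BackendUnavailable("connection refused")


def charge_order_locally() -> str:
    return "charged"


result, mode = run_with_local_fallback(start_on_backend, charge_order_locally)
```

Returning the mode alongside the result lets the caller log or flag work that ran without durability, which matters given the data loss risk discussed in this thread.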

maxim · April 14, 2021, 12:59am

Local fallback is not something we considered as it can lead to data loss in case of single node failures.

Temporal does support multi-cluster setups. So having a client that can seamlessly switch among clusters to avoid service disruption is something we eventually plan to support.
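As a rough illustration of what a client that switches among clusters could look like, here is a sketch in plain Python. It is not an existing Temporal client feature or API; `ClusterUnavailable`, `start_on_first_healthy`, `fake_start`, and the addresses are all made up for the example, and it says nothing about how replication or failover works on the server side.

```python
from typing import Callable, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


class ClusterUnavailable(Exception):
    """Hypothetical error raised when a cluster cannot be reached."""


def start_on_first_healthy(
    clusters: Sequence[str],
    start: Callable[[str], T],
) -> Tuple[str, T]:
    """Try each cluster address in order and return the first success."""
    last_error: Optional[Exception] = None
    for address in clusters:
        try:
            return address, start(address)
        except ClusterUnavailable as err:
            # Remember the failure and move on to the next cluster.
            last_error = err
    raise ClusterUnavailable(
        f"no cluster reachable: {list(clusters)}"
    ) from last_error


# Usage with a stand-in start function (assumption for illustration only).
def fake_start(address: str) -> str:
    if address == "primary.example:7233":
        raise ClusterUnavailable(address)
    return f"started on {address}"


used, outcome = start_on_first_healthy(
    ["primary.example:7233", "standby.example:7233"], fake_start
)
```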

Shanan · April 14, 2021, 1:17am

Yeah, the data loss is definitely a downside to it, but part of me wonders if that risk is worth the overall benefit of partial successes in the outage window.

Granted, I’d imagine this would not be an encouraged configuration, but for non-mission-critical workflows, I personally would love to have this behavior in our systems.

maxim · April 14, 2021, 1:22am

I think it is an interesting idea. I’ll keep it in mind when thinking about future work related to high availability.
