Catch exception for dotnet temporal hosting

I used the .NET NuGet package Temporalio.Extensions.Hosting to create a hosted Temporal worker with:
services.AddHostedTemporalWorker()

With this in place, the app crashes when the Temporal service is stopped. Is it possible to make the app continue to run normally, just without using Temporal?

Event: BackgroundServiceFaulted
Exception: System.InvalidOperationException
Stack trace:

   at Temporalio.Bridge.Client.ConnectAsync(Runtime runtime, TemporalConnectionOptions options)
   at Temporalio.Client.TemporalConnection.GetBridgeClientAsync()
   at Temporalio.Client.TemporalConnection.ConnectAsync(TemporalConnectionOptions options)
   at Temporalio.Client.TemporalClient.ConnectAsync(TemporalClientConnectOptions options)
   at Temporalio.Extensions.Hosting.TemporalWorkerService.ExecuteAsync(CancellationToken stoppingToken)
   at Microsoft.Extensions.Hosting.Internal.Host.TryExecuteBackgroundServiceAsync(BackgroundService backgroundService)

In this case, it seems the connection is failing. This is a hosted/background service like any other, so it can fail like any other. If you don’t want a single background service to bring down the rest of the host, you can use the same approach you would with any other .NET service; it is not Temporal-specific. This would probably mean providing your own implementation of BackgroundService/IHostedService that wraps or extends TemporalWorkerService and captures the exception, and then registering that with the service collection instead of using AddHostedTemporalWorker.
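A minimal sketch of the wrapping approach described above. FaultTolerantHostedService and the run delegate are hypothetical names for illustration and are not part of Temporalio.Extensions.Hosting; the delegate is assumed to connect the client and run the worker until cancelled.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

// Hypothetical wrapper: runs a delegate as a background service and logs
// failures instead of letting them propagate to the host.
public sealed class FaultTolerantHostedService : BackgroundService
{
    private readonly Func<CancellationToken, Task> _run;
    private readonly ILogger _logger;

    public FaultTolerantHostedService(Func<CancellationToken, Task> run, ILogger logger)
    {
        _run = run ?? throw new ArgumentNullException(nameof(run));
        _logger = logger ?? throw new ArgumentNullException(nameof(logger));
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        try
        {
            await _run(stoppingToken);
        }
        catch (OperationCanceledException) when (stoppingToken.IsCancellationRequested)
        {
            // Normal shutdown: the host asked the service to stop.
        }
        catch (Exception ex)
        {
            // Swallow the fault so the host never sees a faulted background service.
            _logger.LogError(ex, "Background work failed; the app continues without it.");
        }
    }
}

// Registration sketch (RunTemporalWorkerAsync is a hypothetical method of yours):
// services.AddHostedService(sp => new FaultTolerantHostedService(
//     RunTemporalWorkerAsync,
//     sp.GetRequiredService<ILogger<FaultTolerantHostedService>>()));
```

On .NET 6 and later there is also a host-wide alternative: setting HostOptions.BackgroundServiceExceptionBehavior to BackgroundServiceExceptionBehavior.Ignore stops any faulted background service from stopping the host. That setting applies to every background service, not just the Temporal worker, so the wrapper is the narrower choice.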

Same problem here. I think the question is: is there an easy way, or any guidance on how, to implement a TemporalWorkerService as a resilient, fault-tolerant (hosted) BackgroundService?

It should handle the following cases:

  1. The Temporal server is not (yet) available when the worker starts up. In this case, some retry behavior for reconnecting and registering the workflows/activities is needed.
  2. The Temporal server goes down while the worker is running. Not sure what needs to be done in this case. Do we need to re-connect and re-register the workflows/activities? If yes, how can this situation be detected and handled by the worker?

It already is reasonably resilient. We make sure to retry, but of course we fail fast on start because you should not start something that may never work.

You can connect a separate client, but you will want to make this separate client connection yourself and wait for it yourself before continuing the initialization of your application. By design, the worker fails at startup if its initial connection fails. For many users a failed connection means a mistyped server address, not just a server that is temporarily unavailable, and of course the code can’t tell the two apart.

The worker already handles a server that goes down while it is running. If it hits a known non-retryable, non-transient service error, the worker can fail after a minute of retrying, but in most cases it will retry forever. You should test your server-down situation to confirm the behavior, as there are different forms of a server being “down”.

Thanks Chad for clarifying things.

I respectfully disagree that you should not start something that may never work. IMHO it can make sense to start something that is expected to work in the future. It would be great to have a simple option in AddHostedTemporalWorker to control the initial connect behavior.

Note this is about client connectivity and is unrelated to the worker. I am not sure it would be a simple option; it would probably be a full retry policy. Sometimes you want to fail (e.g. the namespace doesn’t exist) rather than keep retrying (e.g. the connection failed), and there would have to be something like a configurable maximum number of attempts so it doesn’t just keep failing forever.

If your use case is tolerant of repeated connection failures on initial start, you can wrap the existing service with one that doesn’t create the TemporalWorkerService or call its ExecuteAsync until a client connection succeeds, retrying up to whatever threshold works best for your use case. It’s basically just adding an IHostedService singleton to the container. We do offer a “lazy” client that doesn’t try to connect until first needed, but of course a worker needs a connection right away so it can start polling, which effectively makes the client eager. We also check that the namespace exists, among other things, as part of worker startup.
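The "try until a client connection succeeds" idea above could be sketched as a small retry helper. ConnectRetry, its parameters, and the isTransient predicate are hypothetical and not part of the Temporal SDK. The predicate is there because, as noted earlier in the thread, some failures (such as a missing namespace) should fail immediately rather than be retried.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: retries a connect delegate with capped exponential backoff.
public static class ConnectRetry
{
    public static async Task<T> UntilConnectedAsync<T>(
        Func<Task<T>> connect,
        Func<Exception, bool> isTransient,
        int maxAttempts,
        TimeSpan initialDelay,
        TimeSpan maxDelay,
        CancellationToken cancellationToken)
    {
        var delay = initialDelay;
        for (var attempt = 1; ; attempt++)
        {
            cancellationToken.ThrowIfCancellationRequested();
            try
            {
                return await connect();
            }
            catch (Exception ex) when (attempt < maxAttempts && isTransient(ex))
            {
                // Transient failure with attempts left: wait, then double the delay up to the cap.
                await Task.Delay(delay, cancellationToken);
                delay = TimeSpan.FromTicks(Math.Min(delay.Ticks * 2, maxDelay.Ticks));
            }
            // Non-transient errors, or the final failed attempt, propagate to the caller.
        }
    }
}

// Usage sketch inside your own hosted service. The address and the choice of
// which exceptions count as transient are assumptions to adapt:
// var client = await ConnectRetry.UntilConnectedAsync(
//     () => TemporalClient.ConnectAsync(new TemporalClientConnectOptions("localhost:7233")),
//     ex => ex is InvalidOperationException,
//     maxAttempts: 20,
//     initialDelay: TimeSpan.FromSeconds(1),
//     maxDelay: TimeSpan.FromSeconds(30),
//     cancellationToken: stoppingToken);
```

Only once the call returns a connected client would the wrapping service create and run the worker. Treating InvalidOperationException as transient simply mirrors the exception type in the stack trace above; check which exceptions your SDK version actually throws for unreachable servers before relying on it.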