Temporal client failure when server restarts

Hi everyone.
I’m using the temporalio Python SDK, version 0.1b2, and I see that if the Temporal Server restarts, the error below appears on the client, and pending workflows are no longer executed by this client:

So, how can I restart the client/application, or retry the connection automatically, if the connection to the server is lost at any moment?

The SDK is at 1.4.0. You should upgrade!

I’d be interested to know the recommended way of auto-reconnecting with the latest client SDK, because if the Temporal server is restarted the client does not auto-reconnect.

After some trial and error, I found that calling check_health() on the Temporal service client revives the connection after the Temporal server restarts, so I am using this:

import asyncio
import logging
from typing import Optional
from temporalio.client import Client

TEMPORAL_SERVER = "localhost:7233"
TEMPORAL_NAMESPACE = "default"

logger = logging.getLogger(__name__)

class TemporalClientManager:
    def __init__(self) -> None:
        self.client: Optional[Client] = None
        self.client_health_check_period = 10

    async def connect(self) -> None:
        # Make initial connection
        logger.info(f"Connecting to Temporal server {TEMPORAL_SERVER}")
        self.client = await Client.connect(
            target_host=TEMPORAL_SERVER,
            namespace=TEMPORAL_NAMESPACE,
        )
        # Start task to periodically reconnect to Temporal if connection drops.
        # Started after the client exists, and a reference is kept so the
        # task is not garbage-collected while it is still running.
        self._keep_alive_task = asyncio.create_task(self._keep_alive())

    async def _keep_alive(self) -> None:
        while True:
            # If disconnected, act of checking health appears to reconnect client
            await asyncio.sleep(self.client_health_check_period)
            if await self.is_connected():
                logger.debug(
                    f"Connection to Temporal server '{TEMPORAL_SERVER}' is alive"
                )
            else:
                logger.error(
                    f"Connection to Temporal server '{TEMPORAL_SERVER}' failed"
                )

    async def is_connected(self) -> bool:
        # No client yet means connect() has not completed
        if self.client is None:
            return False
        try:
            return await self.client.service_client.check_health()
        except Exception as e:
            logger.warning(f"Failed to check Temporal health: {e}")
            return False

temporal_client_mgr = TemporalClientManager()

I am not confident this is a good solution as I periodically get a “transport error” when running the health check:

17:51:39.903 | DEBUG | _keep_alive:95 | :lady_beetle: Connection to Temporal server ‘localhost:7233’ is alive
17:51:49.914 | WARNING | is_connected:108 | :warning: Failed to check Temporal health: transport error
17:51:49.915 | ERROR | _keep_alive:100 | :x: Connection to Temporal server ‘localhost:7233’ failed
17:51:59.927 | DEBUG | _keep_alive:95 | :lady_beetle: Connection to Temporal server ‘localhost:7233’ is alive
.
.
.
17:57:10.360 | DEBUG | _keep_alive:95 | :lady_beetle: Connection to Temporal server ‘localhost:7233’ is alive
17:57:20.364 | WARNING | is_connected:108 | :warning: Failed to check Temporal health: transport error
17:57:20.364 | ERROR | _keep_alive:100 | :x: Connection to Temporal server ‘localhost:7233’ failed
17:57:30.378 | DEBUG | _keep_alive:95 | :lady_beetle: Connection to Temporal server ‘localhost:7233’ is alive
17:57:40.401 | DEBUG | _keep_alive:95 | :lady_beetle: Connection to Temporal server ‘localhost:7233’ is alive
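
Since each failed check in the log above is followed by a successful one 10 seconds later, one option is to treat the connection as down only after several failures in a row. Below is a minimal sketch of that idea. It does not explain the transport error itself, and the class name `FailureTracker` and the threshold of 3 are hypothetical choices for illustration, not part of the Temporal SDK.

```python
class FailureTracker:
    """Counts consecutive failed health checks (hypothetical helper)."""

    def __init__(self, threshold: int = 3) -> None:
        # Number of failures in a row before the connection counts as down.
        # The value 3 is an arbitrary assumption; tune it to the check period.
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, healthy: bool) -> bool:
        """Record one check result; return True if the connection looks down."""
        if healthy:
            # Any successful check resets the streak
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold
```

In the `_keep_alive` loop, the result of `is_connected()` would be passed to `record()`, and the error would be logged only when it returns True; isolated failures could be logged at warning level instead.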

It looks like there is an open issue for the client not auto-reconnecting:

High-level client calls should automatically retry if they fail because a connection is no longer available (as opposed to low-level calls on one of the service fields, but you can set retry on those too). 1.4.0 now has keep-alive built in by default (30s interval, 15s timeout). In which situations are you making a call after a server restart that fails? Is it a high-level call like start workflow?

EDIT: I just updated and closed the GH issue with a test I performed, confirming that the worker and client both recover.

@Chad_Retz - please accept my apologies. I added my workaround when I was running 1.3.0 but didn’t retest on 1.4.0 because I found #397 open.

I only require auto-reconnect to work for high-level calls, so after reading your reply I removed my workaround code and found that version 1.4.0 does reconnect automatically.

Thanks for the quick response.