Replication crashes history service

Hi there,

We had a working replication setup between two clusters, A (active) and B (standby). We then rolled out a change to our mTLS configuration which meant that the hostname for the gRPC address changed. Prior to rolling out the change, we stopped all ongoing namespace replication and also removed the cluster connection. After the migration we tried to re-enable the cluster connection; however, the moment we enabled the connection, cluster A crashed. There seems to be a problem with the history shards. The error logs fill up with errors of the form "shard status unknown", "context deadline exceeded", and "Queue reader unable to retrieve tasks". The underlying Cassandra instance seems to get flooded with requests and crashes. Within a minute service is lost.
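
For reference, re-enabling the connection was done via cluster upsert, roughly like this (exact flags from memory, so treat this as a sketch rather than the literal invocation):

temporal operator cluster upsert \
--address <address-cluster-A>:443 \
--frontend-address <address-cluster-B>:443 \
--enable-connection=true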

We can recover by scaling down cluster B. Cluster A will then recover almost immediately.

We assume there is some stale, inconsistent replication data which is causing this. We tried starting over by connecting a new cluster with a different name and failover increment number, but no luck. The moment the connection gets enabled, the active cluster starts crashing.

We still assume some stale data is involved (most likely somewhere in the reader states persisted as part of the shard information). Any idea what’s going on and how we could clean this up?

Full disclosure: part of the mTLS setup change was to route the replication traffic through the internal frontend. In our staging environments, where we started replication from scratch, this seems to work fine. We only ran into this issue in a cluster where we had replication ongoing and then switched the setup.

Thanks for any help,
Hardy

Is there a stack trace that you can capture prior to the crash?

Hi,

thanks for the response.

Unfortunately not. By “crashing” I meant that the CPU idle of the Cassandra nodes went to 0%, which caused the nodes to be down/unresponsive.

There are thousands of errors. Most are likely a consequence of the system becoming overloaded: context timeouts and unknown shard status. So far I have not been able to isolate anything that looks like the root error.

Are rate limits applied when going through the internal frontend? I am wondering whether this could be a problem. That said, this happens even before any namespace gets replicated, just from connecting the clusters.

If there is any stale data from the previously configured replication, I have so far not been able to isolate it.

–Hardy

Did you remove the remote cluster from all your namespace config prior to disconnecting the clusters?

Yes. There was only one namespace configured for replication.

First we removed cluster B from the list of clusters for the replicated namespace:

temporal operator namespace update -n foo \
--address <address-cluster-A>:443 \
--cluster A

Then we removed the cluster connection:

temporal operator cluster remove \
--address <address-cluster-A>:443 \
--name B

I suspected that some of the reader queue states which are stored as part of the shard information could be causing the issue, but I am not able to confirm that. I used tdbg shard describe hoping I could see some replication-related state somewhere, but no luck. It is just a guess anyway.
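
For reference, I ran it roughly like this against a handful of shard ids (flag names from memory):

tdbg shard describe --shard-id 1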

Interesting. So the mental model here is that the active cluster will generate replication tasks for global namespaces, and the standby cluster will poll for these tasks.
As soon as you remove the standby cluster from the namespace record (you should verify this from both sides), the active side will stop generating replication tasks. There could be some existing replication tasks in the queue, but eventually they will be drained. If not, these will be the only tasks remaining in the system upon re-connection.
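
As a rough sketch (exact flags may differ with your CLI version), you can check the namespace record from each side with something like:

temporal operator namespace describe -n foo \
--address <address-cluster-A>:443

temporal operator namespace describe -n foo \
--address <address-cluster-B>:443

Cluster B should no longer show up in the namespace’s cluster list on either side before you disconnect.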

I assume this change does not change the cluster name. Also, have you verified that the clusters can communicate with each other? Or is it too unhealthy to verify at the moment?

So the mental model here is that the active cluster will generate replication tasks for global namespaces, and the standby cluster will poll for these tasks.
As soon as you remove the standby cluster from the namespace record (you should verify this from both sides), the active side will stop generating replication tasks. There could be some existing replication tasks in the queue, but eventually they will be drained. If not, these will be the only tasks remaining in the system upon re-connection.

Right. That’s my mental model as well. Thanks for confirming.
When initially trying to re-enable replication, I was thinking the errors might be transient and related to this “draining”, but then the active cluster became completely unresponsive.

I assume this change does not change the cluster name.

Not initially. But after the first failure, we completely rebuilt the passive cluster (it only contained the passively replicated namespace). We kept the name but changed the failover increment version. Once this did not work either, we even changed the name of the passive cluster. So in my mental model, when connecting this “new” cluster, it appears as a completely new remote for the active cluster. There is nothing which needs to be sent to this cluster until actual namespace replication is enabled.

Also, have you verified that the clusters can communicate with each other?

Yes, the clusters can talk to each other. We can, for example, see GetReplicationMessages being called. Also, adding the cluster connection does not work if the remote cluster is not reachable.

Or is it too unhealthy to verify at the moment?

The cluster is healthy at the moment and performs as usual. It becomes unhealthy the moment I add and enable a remote connection.

You should not change the failover increment version. It should be a large number and the same number across all the cells you manage in your environment.
Other than that, your understanding is correct.

You should not change the failover increment version.

My bad, I meant the initial failover version. We wanted to make sure that when we changed the cluster name, we also changed the initial failover version.

It should be a large number and the same number across all the cells you manage in your environment.

Right, if the failover increment versions between clusters don’t match, you cannot even connect them.
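
For completeness, this is roughly how we check the registered clusters and their versions (flags and output fields from memory):

temporal operator cluster list \
--address <address-cluster-A>:443

That should list each registered remote cluster together with its initial failover version and whether the connection is enabled.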

Sorry we didn’t provide much help for this case. It might be easier if we could see the relevant history service logs.

Sure, thanks for your help so far. I understand this is a tricky one, since I don’t have a simple error/panic to point to. I thought you might be able to point out something generally wrong. Not sure whether we will ever be able to recover this cluster for replication.