Running into problems testing multi-cluster replication locally via docker

Hello :wave:
I was hoping to try out the multi-cluster replication feature of Temporal locally in docker to test a few scenarios and answer a few questions, but as far as I can tell I have been unable to get this feature working.

Some observations:

  • When I create a global namespace in clusterA, I don’t see it replicated to clusterB
  • When I execute a workflow on clusterA, I don’t see any replication taking place (checking ES documents, clusterB’s UI, etc.)
  • When I shut off the temporal_a container, I see connection errors in temporal_b’s logs indicating “Failed to get replication tasks”, which suggests that replication is set up to some degree (see the log-grep sketch just below this list). Shutting down temporal_b, I see the same errors in temporal_a.
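
For reference, nothing fancy was needed to surface those errors; a quick log grep along these lines (service names from my compose file) is enough to see them:

docker-compose logs temporal_b | grep -i "replication"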

I have a branch available here with my changes to the docker-compose repository. I’ve updated the docker-compose file to run two instances of Temporal (A and B) along with their dependencies (ES, postgres, etc.), and I’ve updated the development.yaml in each environment to define clusterMetadata as required, following the guide.
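
For context, the topology is roughly the sketch below. This is abridged and illustrative rather than the exact compose contents (image tag inferred from the server version shown further down; volumes, environment, and the dependency services are omitted); the key point is that cluster A’s frontend is published on host port 7233 and cluster B’s on host port 8233:

services:
  temporal_a:
    image: temporalio/auto-setup:1.17.3   # illustrative tag
    ports:
      - "7233:7233"   # cluster A frontend on host port 7233
  temporal_b:
    image: temporalio/auto-setup:1.17.3
    ports:
      - "8233:7233"   # cluster B frontend on host port 8233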

Steps to reproduce:

  1. Run docker-compose up -d in the root of the repo
  2. Add connections:
# Add cluster B connection into cluster A
tctl -address 127.0.0.1:7233 admin cluster upsert-remote-cluster --frontend_address "temporal_b:7233"
# Add cluster A connection into cluster B
tctl -address 127.0.0.1:8233 admin cluster upsert-remote-cluster --frontend_address "temporal_a:7233"
  3. Create the namespace:
tctl --address "localhost:7233" --namespace "workflow-engine-sandbox" namespace register --description "<description>" --retention "1" --global_namespace "true"

Below is the clusterMetadata section for clusterA:

clusterMetadata:
  enableGlobalNamespace: true
  failoverVersionIncrement: 100
  masterClusterName: "clusterA"
  currentClusterName: "clusterA"
  clusterInformation:
    clusterA:
      enabled: true
      initialFailoverVersion: 1
      rpcAddress: "temporal_a:7233"
    clusterB:
      enabled: true
      initialFailoverVersion: 2
      rpcAddress: "temporal_b:7233"

And the clusterMetadata section for clusterB:

clusterMetadata:
  enableGlobalNamespace: true
  failoverVersionIncrement: 100
  masterClusterName: "clusterA"
  currentClusterName: "clusterB"
  clusterInformation:
    clusterA:
      enabled: true
      initialFailoverVersion: 1
      rpcAddress: "temporal_a:7233"
    clusterB:
      enabled: true
      initialFailoverVersion: 2
      rpcAddress: "temporal_b:7233"

I don’t see anything obvious in the container logs that indicates a problem. I’ve also tested each cluster independently to confirm that each is operational when multi-cluster replication is disabled, and I’m able to run workflows on each in that scenario.

Output of a few tctl commands:

$ tctl --address localhost:7233 admin cluster list
[
  {
    "cluster_name": "clusterA",
    "history_shard_count": 4,
    "cluster_id": "a4cec10d-e45f-4b85-9ba5-9d2588f452b6",
    "cluster_address": "temporal_a:7233",
    "failover_version_increment": 100,
    "initial_failover_version": 1,
    "is_global_namespace_enabled": true,
    "is_connection_enabled": true
  },
  {
    "cluster_name": "clusterB",
    "history_shard_count": 4,
    "cluster_id": "6421c3c4-88a7-4f09-8815-a18db6e4ec4a",
    "cluster_address": "temporal_b:7233",
    "failover_version_increment": 100,
    "initial_failover_version": 2,
    "is_global_namespace_enabled": true,
    "is_connection_enabled": true
  }
]

$ tctl --address localhost:8233 admin cluster list
[
  {
    "cluster_name": "clusterA",
    "history_shard_count": 4,
    "cluster_id": "a4cec10d-e45f-4b85-9ba5-9d2588f452b6",
    "cluster_address": "temporal_a:7233",
    "failover_version_increment": 100,
    "initial_failover_version": 1,
    "is_global_namespace_enabled": true,
    "is_connection_enabled": true
  },
  {
    "cluster_name": "clusterB",
    "history_shard_count": 4,
    "cluster_id": "6421c3c4-88a7-4f09-8815-a18db6e4ec4a",
    "cluster_address": "temporal_b:7233",
    "failover_version_increment": 100,
    "initial_failover_version": 2,
    "is_global_namespace_enabled": true,
    "is_connection_enabled": true
  }
]

$ tctl --address localhost:7233 admin cluster describe
{
  "supportedClients": {
    "temporal-cli": "\u003c2.0.0",
    "temporal-go": "\u003c2.0.0",
    "temporal-java": "\u003c2.0.0",
    "temporal-php": "\u003c2.0.0",
    "temporal-server": "\u003c2.0.0",
    "temporal-typescript": "\u003c2.0.0",
    "temporal-ui": "\u003c3.0.0"
  },
  "serverVersion": "1.17.3",
  "membershipInfo": {
    "currentHost": {
      "identity": "172.20.0.6:7233"
    },
    "reachableMembers": [
      "172.20.0.6:6934",
      "172.20.0.6:6935",
      "172.20.0.6:6933",
      "172.20.0.6:6939"
    ],
    "rings": [
      {
        "role": "frontend",
        "memberCount": 1,
        "members": [
          {
            "identity": "172.20.0.6:7233"
          }
        ]
      },
      {
        "role": "history",
        "memberCount": 1,
        "members": [
          {
            "identity": "172.20.0.6:7234"
          }
        ]
      },
      {
        "role": "matching",
        "memberCount": 1,
        "members": [
          {
            "identity": "172.20.0.6:7235"
          }
        ]
      },
      {
        "role": "worker",
        "memberCount": 1,
        "members": [
          {
            "identity": "172.20.0.6:7239"
          }
        ]
      }
    ]
  },
  "clusterId": "a4cec10d-e45f-4b85-9ba5-9d2588f452b6",
  "clusterName": "clusterA",
  "historyShardCount": 4,
  "persistenceStore": "postgres",
  "visibilityStore": "postgres,elasticsearch",
  "failoverVersionIncrement": "100",
  "initialFailoverVersion": "1",
  "isGlobalNamespaceEnabled": true
}

$ tctl --address localhost:8233 admin cluster describe
{
  "supportedClients": {
    "temporal-cli": "\u003c2.0.0",
    "temporal-go": "\u003c2.0.0",
    "temporal-java": "\u003c2.0.0",
    "temporal-php": "\u003c2.0.0",
    "temporal-server": "\u003c2.0.0",
    "temporal-typescript": "\u003c2.0.0",
    "temporal-ui": "\u003c3.0.0"
  },
  "serverVersion": "1.17.3",
  "membershipInfo": {
    "currentHost": {
      "identity": "172.20.0.7:7233"
    },
    "reachableMembers": [
      "172.20.0.7:6933",
      "172.20.0.7:6934",
      "172.20.0.7:6935",
      "172.20.0.7:6939"
    ],
    "rings": [
      {
        "role": "frontend",
        "memberCount": 1,
        "members": [
          {
            "identity": "172.20.0.7:7233"
          }
        ]
      },
      {
        "role": "history",
        "memberCount": 1,
        "members": [
          {
            "identity": "172.20.0.7:7234"
          }
        ]
      },
      {
        "role": "matching",
        "memberCount": 1,
        "members": [
          {
            "identity": "172.20.0.7:7235"
          }
        ]
      },
      {
        "role": "worker",
        "memberCount": 1,
        "members": [
          {
            "identity": "172.20.0.7:7239"
          }
        ]
      }
    ]
  },
  "clusterId": "6421c3c4-88a7-4f09-8815-a18db6e4ec4a",
  "clusterName": "clusterB",
  "historyShardCount": 4,
  "persistenceStore": "postgres",
  "visibilityStore": "postgres,elasticsearch",
  "failoverVersionIncrement": "100",
  "initialFailoverVersion": "2",
  "isGlobalNamespaceEnabled": true
}

When I try to failover the namespace to the B cluster, I get the following error:

$ tctl --namespace workflow-engine-sandbox --address localhost:7233 namespace update --active_cluster clusterB
Will set active cluster name to: clusterB, other flag will be omitted.
Error: Operation UpdateNamespace failed.
Error Details: rpc error: code = InvalidArgument desc = Active cluster is not contained in all clusters.
('export TEMPORAL_CLI_SHOW_STACKS=1' to see stack traces)

I’m not 100% sure what this error means, but it seems to indicate that the active cluster (clusterA) is not contained in the standby cluster (clusterB)? However, when I list the clusters on B (shown above), I see both clusters listed.
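
In case it helps with diagnosing this, the namespace’s own view of its clusters (as opposed to the cluster metadata above) should be visible via namespace describe; something like:

$ tctl --address localhost:7233 --namespace workflow-engine-sandbox namespace describe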

Some other ideas:

  • My local docker setup is using the auto-setup image; could this be disabling replication somehow?
  • I’m not using TLS; maybe that’s required for replication to work?

Thank you! :slight_smile:

Thanks for sharing your updates and results. I’m testing it out and will report back as soon as I have some info.

The command is not correct. It should be (for local):

# Add cluster B connection into cluster A
tctl -address 127.0.0.1:7233 admin cluster upsert-remote-cluster --frontend_address "127.0.0.1:8233"
# Add cluster A connection into cluster B
tctl -address 127.0.0.1:8233 admin cluster upsert-remote-cluster --frontend_address "127.0.0.1:7233"

@yux I tried your command and am getting the following error(s):

$ tctl -address 127.0.0.1:7233 admin cluster upsert-remote-cluster --frontend_address "127.0.0.1:8233"
Error: Operation AddOrUpdateRemoteCluster failed.
Error Details: rpc error: code = Unavailable desc = last connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:8233: connect: connection refused"
('export TEMPORAL_CLI_SHOW_STACKS=1' to see stack traces)

$ tctl -address 127.0.0.1:8233 admin cluster upsert-remote-cluster --frontend_address "127.0.0.1:7233"
Error: Operation AddOrUpdateRemoteCluster failed.
Error Details: rpc error: code = Unavailable desc = last connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:7233: connect: connection refused"
('export TEMPORAL_CLI_SHOW_STACKS=1' to see stack traces)

I think this is because, within docker, each Temporal instance is available at temporal_a:7233 and temporal_b:7233, but on my local machine each one is available at localhost:7233 (for A) and localhost:8233 (for B).

If it helps, maybe you could try running these commands from the admin-tools container inside docker; then you should be able to use temporal_a:7233 and temporal_b:7233?

@tihomir I’m able to upsert the clusters without issue, either from within the admin-tools docker container or from my local host. After successfully executing the two upsert-remote-cluster commands, I can issue a list call against each cluster and observe that both clusters are present in each environment.

If I run the upsert commands within the admin-tools container, they look like:

# tctl --address temporal_a:7233 admin cluster upsert-remote-cluster --frontend_address "temporal_b:7233"
# tctl --address temporal_b:7233 admin cluster upsert-remote-cluster --frontend_address "temporal_a:7233"

If I run the upsert commands from my local machine, they look like:

$ tctl --address 127.0.0.1:7233 admin cluster upsert-remote-cluster --frontend_address "temporal_b:7233"
$ tctl --address 127.0.0.1:8233 admin cluster upsert-remote-cluster --frontend_address "temporal_a:7233"

However, after I create the global namespace in cluster A and execute a few workflows, I would expect to see the namespace in cluster B’s UI along with any workflow executions, but that doesn’t seem to be happening.
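
For what it’s worth, a UI-independent way I can check whether the namespace record made it over to cluster B is something along these lines:

$ tctl --address localhost:8233 namespace list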

What’s strange to me is that if I stop either temporal_a or temporal_b, I begin to see connection errors in the opposing cluster, which suggests to me that the clusters are configured properly but are just not replicating anything.

I think I got to the bottom of it! :slight_smile:

So replication was working all along. The issue was with how I was registering the namespace after upserting the clusters: I wasn’t aware that when registering a global namespace, you need to specify the list of clusters associated with the namespace using the --clusters argument.

When I register my test namespace like so, everything appears to work as expected and I see the namespace and workflow executions replicate:

tctl --address "localhost:7233" --namespace "workflow-engine-sandbox" namespace register --description "<description>" --retention "1" --global_namespace "true" --ac "clusterA" --clusters clusterA clusterB

Previously, I was registering the namespace without the --ac or --clusters flags. Also, the tctl help output doesn’t really specify how to format the list of clusters for the --clusters flag, which threw me off as well (the cluster names are simply passed space-separated, as shown above).

I happened to stumble across the tctl documentation for registering a namespace and saw these additional arguments, which pointed me in the right direction. I think it would be helpful if the multi-cluster (XDC) replication documentation referenced these necessary flags.

Hopefully this helps someone else. Thanks for the help, everyone!
