Hello
I was hoping to try out the multi-cluster replication feature of Temporal locally in Docker so I could test a few scenarios and answer a few questions, but I have been unable to get this feature working as far as I can tell.
Some observations:
- When I create a global namespace in clusterA, I don’t see it replicate to clusterB
- When I execute a workflow on clusterA, I don’t see any replication taking place (checking ES documents, clusterB’s UI, etc.; roughly how I’m checking is sketched below)
- When I shut off the temporal_a container, I see connection errors in temporal_b indicating “Failed to get replication tasks”, which suggests that replication is set up to some degree. Shutting down temporal_b, I see errors in temporal_a as well.
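For reference, this is roughly the check I’m running (the task queue and workflow type are placeholders registered by my own test worker, not anything that ships with the setup):

# Start a workflow on cluster A; test-tq / TestWorkflow are placeholders for my test worker
tctl --address localhost:7233 --namespace workflow-engine-sandbox workflow start \
  --taskqueue test-tq --workflow_type TestWorkflow --execution_timeout 60

# Then check whether the execution shows up on cluster B
tctl --address localhost:8233 --namespace workflow-engine-sandbox workflow list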
I have a branch available here which has my changes to the docker-compose repository. I’ve updated the docker-compose file to have two instances of Temporal (A and B) plus their related dependencies (ES, Postgres, etc.). I’ve also updated the development.yaml in each env to define clusterMetadata as required, following the guide.
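For context, the relevant shape of the compose change looks roughly like this (heavily trimmed and illustrative only; the full service definitions are in the branch, and the image tag is inferred from the server version shown below):

# Illustrative excerpt: two Temporal services, with cluster B's frontend
# published on host port 8233 so both frontends are reachable from the host
services:
  temporal_a:
    image: temporalio/auto-setup:1.17.3
    ports:
      - "7233:7233"
  temporal_b:
    image: temporalio/auto-setup:1.17.3
    ports:
      - "8233:7233"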
Steps to reproduce:
- Run docker-compose up -d in the root of the repo
- Add connections
# Add cluster B connection into cluster A
tctl -address 127.0.0.1:7233 admin cluster upsert-remote-cluster --frontend_address "temporal_b:7233"
# Add cluster A connection into cluster B
tctl -address 127.0.0.1:8233 admin cluster upsert-remote-cluster --frontend_address "temporal_a:7233"
- Create namespace
tctl --address "localhost:7233" --namespace "workflow-engine-sandbox" namespace register --description "<description>" --retention "1" --global_namespace "true"
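To check where things stand, I describe the namespace on each cluster; if I understand the output correctly, it should include IsGlobalNamespace, ActiveClusterName, and the namespace’s cluster list:

# Inspect the namespace's replication config on cluster A
tctl --address localhost:7233 --namespace workflow-engine-sandbox namespace describe

# The same namespace never shows up on cluster B
tctl --address localhost:8233 --namespace workflow-engine-sandbox namespace describe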
Below is the clusterMetadata section for clusterA:
clusterMetadata:
  enableGlobalNamespace: true
  failoverVersionIncrement: 100
  masterClusterName: "clusterA"
  currentClusterName: "clusterA"
  clusterInformation:
    clusterA:
      enabled: true
      initialFailoverVersion: 1
      rpcAddress: "temporal_a:7233"
    clusterB:
      enabled: true
      initialFailoverVersion: 2
      rpcAddress: "temporal_b:7233"
And here is the clusterMetadata section for clusterB:
clusterMetadata:
  enableGlobalNamespace: true
  failoverVersionIncrement: 100
  masterClusterName: "clusterA"
  currentClusterName: "clusterB"
  clusterInformation:
    clusterA:
      enabled: true
      initialFailoverVersion: 1
      rpcAddress: "temporal_a:7233"
    clusterB:
      enabled: true
      initialFailoverVersion: 2
      rpcAddress: "temporal_b:7233"
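(One sanity check on these values, as I understand the versioning: each cluster owns failover versions of the form initialFailoverVersion + n * failoverVersionIncrement, so the two sets can never collide, and the two files intentionally differ only in currentClusterName.)

# Failover versions each cluster can produce, as I understand it:
#   clusterA: 1, 101, 201, ...  (1 + n*100)
#   clusterB: 2, 102, 202, ...  (2 + n*100)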
I don’t see anything obvious in the container logs that indicates a problem. I’ve also tested each cluster independently with multi-cluster replication disabled to confirm each is operational, and I’m able to run workflows on each in that scenario.
Output of a few tctl commands:
$ tctl --address localhost:7233 admin cluster list
[
{
"cluster_name": "clusterA",
"history_shard_count": 4,
"cluster_id": "a4cec10d-e45f-4b85-9ba5-9d2588f452b6",
"cluster_address": "temporal_a:7233",
"failover_version_increment": 100,
"initial_failover_version": 1,
"is_global_namespace_enabled": true,
"is_connection_enabled": true
},
{
"cluster_name": "clusterB",
"history_shard_count": 4,
"cluster_id": "6421c3c4-88a7-4f09-8815-a18db6e4ec4a",
"cluster_address": "temporal_b:7233",
"failover_version_increment": 100,
"initial_failover_version": 2,
"is_global_namespace_enabled": true,
"is_connection_enabled": true
}
]
$ tctl --address localhost:8233 admin cluster list
[
{
"cluster_name": "clusterA",
"history_shard_count": 4,
"cluster_id": "a4cec10d-e45f-4b85-9ba5-9d2588f452b6",
"cluster_address": "temporal_a:7233",
"failover_version_increment": 100,
"initial_failover_version": 1,
"is_global_namespace_enabled": true,
"is_connection_enabled": true
},
{
"cluster_name": "clusterB",
"history_shard_count": 4,
"cluster_id": "6421c3c4-88a7-4f09-8815-a18db6e4ec4a",
"cluster_address": "temporal_b:7233",
"failover_version_increment": 100,
"initial_failover_version": 2,
"is_global_namespace_enabled": true,
"is_connection_enabled": true
}
]
$ tctl --address localhost:7233 admin cluster describe
{
"supportedClients": {
"temporal-cli": "\u003c2.0.0",
"temporal-go": "\u003c2.0.0",
"temporal-java": "\u003c2.0.0",
"temporal-php": "\u003c2.0.0",
"temporal-server": "\u003c2.0.0",
"temporal-typescript": "\u003c2.0.0",
"temporal-ui": "\u003c3.0.0"
},
"serverVersion": "1.17.3",
"membershipInfo": {
"currentHost": {
"identity": "172.20.0.6:7233"
},
"reachableMembers": [
"172.20.0.6:6934",
"172.20.0.6:6935",
"172.20.0.6:6933",
"172.20.0.6:6939"
],
"rings": [
{
"role": "frontend",
"memberCount": 1,
"members": [
{
"identity": "172.20.0.6:7233"
}
]
},
{
"role": "history",
"memberCount": 1,
"members": [
{
"identity": "172.20.0.6:7234"
}
]
},
{
"role": "matching",
"memberCount": 1,
"members": [
{
"identity": "172.20.0.6:7235"
}
]
},
{
"role": "worker",
"memberCount": 1,
"members": [
{
"identity": "172.20.0.6:7239"
}
]
}
]
},
"clusterId": "a4cec10d-e45f-4b85-9ba5-9d2588f452b6",
"clusterName": "clusterA",
"historyShardCount": 4,
"persistenceStore": "postgres",
"visibilityStore": "postgres,elasticsearch",
"failoverVersionIncrement": "100",
"initialFailoverVersion": "1",
"isGlobalNamespaceEnabled": true
}
$ tctl --address localhost:8233 admin cluster describe
{
"supportedClients": {
"temporal-cli": "\u003c2.0.0",
"temporal-go": "\u003c2.0.0",
"temporal-java": "\u003c2.0.0",
"temporal-php": "\u003c2.0.0",
"temporal-server": "\u003c2.0.0",
"temporal-typescript": "\u003c2.0.0",
"temporal-ui": "\u003c3.0.0"
},
"serverVersion": "1.17.3",
"membershipInfo": {
"currentHost": {
"identity": "172.20.0.7:7233"
},
"reachableMembers": [
"172.20.0.7:6933",
"172.20.0.7:6934",
"172.20.0.7:6935",
"172.20.0.7:6939"
],
"rings": [
{
"role": "frontend",
"memberCount": 1,
"members": [
{
"identity": "172.20.0.7:7233"
}
]
},
{
"role": "history",
"memberCount": 1,
"members": [
{
"identity": "172.20.0.7:7234"
}
]
},
{
"role": "matching",
"memberCount": 1,
"members": [
{
"identity": "172.20.0.7:7235"
}
]
},
{
"role": "worker",
"memberCount": 1,
"members": [
{
"identity": "172.20.0.7:7239"
}
]
}
]
},
"clusterId": "6421c3c4-88a7-4f09-8815-a18db6e4ec4a",
"clusterName": "clusterB",
"historyShardCount": 4,
"persistenceStore": "postgres",
"visibilityStore": "postgres,elasticsearch",
"failoverVersionIncrement": "100",
"initialFailoverVersion": "2",
"isGlobalNamespaceEnabled": true
}
When I try to fail over the namespace to the B cluster, I get the following error:
$ tctl --namespace workflow-engine-sandbox --address localhost:7233 namespace update --active_cluster clusterB
Will set active cluster name to: clusterB, other flag will be omitted.
Error: Operation UpdateNamespace failed.
Error Details: rpc error: code = InvalidArgument desc = Active cluster is not contained in all clusters.
('export TEMPORAL_CLI_SHOW_STACKS=1' to see stack traces)
I’m not 100% sure what this error means, but it seems to indicate that the target active cluster (clusterB) is not contained in the namespace’s list of replication clusters? However, when I run admin cluster list against either cluster (shown above), I see both clusters listed.
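Based on that reading, I’m wondering whether I need to explicitly add clusterB to the namespace’s cluster list before failing over, something like the following (taken from my reading of the multi-cluster docs, so possibly not exactly right):

# Possibly required first? Extend the namespace's cluster list to include clusterB
tctl --address localhost:7233 --namespace workflow-engine-sandbox namespace update --clusters clusterA clusterB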
Some other ideas:
- My local Docker setup is using the auto-setup image; could this be disabling replication somehow?
- I’m not using TLS; maybe that’s required for replication to work?
Thank you!