Has anybody tried Temporal multi-cluster replication across different regions?

I am currently running a single-cluster Temporal deployment. For DR reasons, I am planning to upgrade my setup to a multi-cluster one running in two different AWS regions (us-west-2 and us-east-2).

Has anybody tried this already? If so, please share your learnings.

What are the best practices for XDC (cross-DC) replication on AWS?

Do I need to set up VPC peering between the regions?

Is TLS mandatory for a multi-cluster setup (assuming I have set up VPC peering)?

Is the replication lag of a VPC peering setup (without TLS) comparable to that of replication over the internet with TLS, or will the lag be many times higher in the latter case?

Hi @madhu

I can take a stab at answering some of your questions:

For DR reasons, I am planning to upgrade my setup to a multi-cluster one running in two different AWS regions (us-west-2 and us-east-2)

Currently a Temporal namespace is either local to a DC or global. Once it's set, it is immutable.
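
To make that concrete: since the local/global choice is fixed at registration time, a DR namespace has to be created as global up front. A sketch of what that registration might look like with tctl (cluster names are placeholders matching a two-cluster config; check `tctl namespace register --help` for the authoritative flags):

```shell
# Register a namespace as global at creation time (hypothetical names).
tctl --namespace my-namespace namespace register \
  --global_namespace true \
  --clusters primary secondary \
  --active_cluster primary
```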

Is TLS mandatory for a multi-cluster setup (assuming I have set up VPC peering)?

No, but since your data might be transmitted through public networks, TLS should probably be required in most cases.
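
If you do enable TLS for cross-cluster traffic, the server config has dedicated internode and frontend TLS sections. A minimal sketch, assuming file-based certs (all paths and hostnames below are placeholders; see the Temporal TLS configuration docs for the authoritative shape):

```yaml
global:
  tls:
    internode:
      server:
        certFile: /certs/internode.pem
        keyFile: /certs/internode.key
        requireClientAuth: true
        clientCaFiles:
          - /certs/ca.pem
      client:
        serverName: internode.temporal.example.com
        rootCaFiles:
          - /certs/ca.pem
    frontend:
      server:
        certFile: /certs/frontend.pem
        keyFile: /certs/frontend.key
```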


Thanks @tihomir, can you elaborate on global vs. local namespaces?
For a DR/replica scenario, should I set my namespace as global?
My namespace is currently local.

Basically, when should I configure my namespace as global, and when should it be local?

So, follow-up on this question. I am unable to get multi-region connectivity working.

I have two clusters set up in two different AWS regions. Each of the clusters works fine, and my local Java SDK app can connect to and use either cluster. I have ALBs fronting the frontend services. I’ve configured the clusterInformation section as described in the Temporal server configuration docs, using the dns:/// option pointing at my frontend ALBs. I’ve set up VPC peering between the two regions and have confirmed connectivity from the pods in one cluster to the frontend ALB in the other cluster. I even deployed a gRPC CLI image to the clusters to verify that I could invoke the gRPC frontend service on the other cluster.

However, I’m seeing i/o timeout errors in the history logs of one cluster when connecting to the frontend of the other cluster:

    {
        "level": "error",
        "ts": "2021-08-25T18:51:27.133Z",
        "msg": "Failed to get replication tasks",
        "service": "history",
        "error": "last connection error: connection error: desc = \"transport: Error while dialing dial tcp x.x.46.188:433: i/o timeout\"",
        "logging-call-at": "replicationTaskFetcher.go:395",
        "stacktrace": "go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/service/history.(*replicationTaskFetcherWorker).getMessages\n\t/temporal/service/history/replicationTaskFetcher.go:395\ngo.temporal.io/server/service/history.(*replicationTaskFetcherWorker).fetchTasks\n\t/temporal/service/history/replicationTaskFetcher.go:338"
    }

I’m wondering if it has something to do with the ALB configuration as the ALB health checks for the frontend ingress are being reported as unhealthy. I am unable to get the health checks to pass.

However, I’m also wondering why the dial is using an IP address to connect when I’m using dns:///hostname:port in the cluster information (rpcAddress: dns:///alb-ingress.example.com:443). When I try the same gRPC service call by IP using a gRPC CLI deployed to a pod, it fails.

I was able to set up the same thing with a single docker-compose setup using this docker file.

As far as the VPC setup goes, can you check whether your routing table entries are correctly configured?

Are you using Kubernetes? If so, you may want to create an internal load balancer for your Temporal clusters' XDC communication.

Also check your security groups (timeouts in VPC peering scenarios are generally due to security groups or missing routing table entries).

I would agree that network timeout issues are frequently security-group related. However, I’ve confirmed that the frontend ALB for each cluster correctly routes to the frontend service pod. I did this from a purpose-built pod with grpcurl included, deployed to one cluster for the sole purpose of verifying that gRPC services in the other cluster are available (i.e. so I can kubectl exec into the pod and run: grpcurl --insecure frontend.othercluster.example.com:443 list). Since the grpcurl command to the other cluster works from within the pod, I’m fairly certain I can eliminate security groups, VPC peering, and routing table issues as the cause.

Can someone confirm whether Temporal uses hostnames or IPs when connecting to other clusters in a multi-cluster setup? It appears to be using IPs instead of hostnames even though I am providing a hostname in the clusterInformation. I can confirm that IP-based routing does not work, as my ALB has a routing rule that routes to target groups specifically based on hostname (i.e. grpcurl --insecure x.x.23.123:443 list does not work).

As far as DNS is concerned, I tried it and it works for me (when connecting two clusters), though of course I have only tried it through docker-compose (within the same network).

clusterMetadata:
  enableGlobalNamespace: true
  failoverVersionIncrement: 10
  masterClusterName: "primary"
  currentClusterName: "primary"
  clusterInformation:
    primary:
      enabled: true
      initialFailoverVersion: 1
      rpcName: "frontend"
      rpcAddress: "dns:///temporal-primary:7233"
    secondary:
      enabled: true
      initialFailoverVersion: 2
      rpcName: "frontend"
      rpcAddress: "dns:///temporal-secondary:7233"
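
As an aside on the numbers in that config: my understanding is that each cluster's failover versions form the arithmetic sequence initialFailoverVersion + k * failoverVersionIncrement, so with an increment of 10 the version modulo 10 identifies the owning cluster. A toy illustration of that arithmetic (not Temporal source code, just the idea):

```python
# Toy model of XDC failover versions (illustration only, not Temporal source).
FAILOVER_VERSION_INCREMENT = 10
INITIAL_FAILOVER_VERSIONS = {"primary": 1, "secondary": 2}

def owning_cluster(version: int) -> str:
    """The cluster a failover version belongs to is given by version mod increment."""
    rem = version % FAILOVER_VERSION_INCREMENT
    for name, initial in INITIAL_FAILOVER_VERSIONS.items():
        if initial % FAILOVER_VERSION_INCREMENT == rem:
            return name
    raise ValueError(f"no cluster owns version {version}")

def next_failover_version(current: int, target: str) -> int:
    """Smallest version greater than `current` that belongs to `target`."""
    initial = INITIAL_FAILOVER_VERSIONS[target]
    if current < initial:
        return initial
    k = (current - initial) // FAILOVER_VERSION_INCREMENT + 1
    return initial + k * FAILOVER_VERSION_INCREMENT

print(owning_cluster(1))                      # primary starts at version 1
print(next_failover_version(1, "secondary"))  # failover to secondary -> 2
print(next_failover_version(2, "primary"))    # fail back to primary -> 11
```

This is also why the two clusters must be given distinct initialFailoverVersion values: the remainders are what distinguish their writes during conflict resolution.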

So, I think I have finally figured out the correct EKS ingress annotations to configure the ALB ingress’s default rule to forward all requests to the target group, instead of sending a 404 for every request that doesn’t match the host/path. This change finally allows IP-based gRPC requests to actually make it to the ingress/kube layer.

Now that I’m sure all requests are making it to the kube layer, I am pretty sure that Temporal is using IPs (without an authority header) when connecting to the rpcAddress configured in the clusterInformation, even when a DNS name is specified. Using IPs instead of hostnames works fine for cluster-local resolution, as the rpcAddress is usually the service name, which ultimately resolves to the cluster-local service IP; traffic routes correctly because it does not rely on the hostname at all.

However, when a kubernetes ingress is involved (i.e. cluster in east1 connecting to cluster in east2), using an IP to connect to the other cluster (without providing an authority header) causes the ingress rules in the kube layer to not route the request to the service as it will not be able to match the request to any of the hosts specified in the ingress rule (and an ingress rule host record cannot have an IP address).
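
To make that failure mode concrete, here is a toy model of host-based ingress matching (an illustration of the behavior I’m describing, not actual controller code; the hostname and backend are placeholders):

```python
from typing import Optional

# Toy model of host-based ingress routing (illustration only, not real
# controller code). Ingress rule hosts must be DNS names, never IPs.
INGRESS_RULES = {
    "frontend-east2.temporal.example.com": "temporal-frontend:7233",
}

def route(authority: Optional[str]) -> Optional[str]:
    """Return the backend for a request's :authority value, or None if unroutable."""
    if authority is None:
        # A bare-IP dial with no hostname in :authority matches nothing.
        return None
    return INGRESS_RULES.get(authority)

print(route("frontend-east2.temporal.example.com"))  # -> temporal-frontend:7233
print(route("10.0.23.144"))                          # -> None (request dropped)
```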

I believe I have replicated what the pods are seeing (i/o timeout) using a separately deployed pod containing the grpcurl CLI tool, which I am using to test gRPC connectivity to the other cluster’s ingress. What I’ve found using grpcurl in this pod is that when I try to connect to the ingress using just the IP address:

grpcurl --insecure x.x.23.144:443 list

I get a connection timeout

Failed to dial target host "x.x.23.144:443": context deadline exceeded

However, if I provide an authority header when connecting via the IP, I get the same successful response as when I use the hostname:

## From an east1 cluster pod
bash-5.1$ grpcurl --insecure -authority myfrontend.temporal-eks-east2.example.com x.x.23.144:443 list

Still not sure what the resolution is though.

I haven’t tried it on Kubernetes yet, but I was thinking of using a non-internet-facing NLB with the following annotations to make the two clusters talk to each other:

    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: tcp
    service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: '60' 
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: 'true'
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
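
For completeness, those annotations would sit on a LoadBalancer Service in front of the frontend pods. A sketch of what that might look like (the service name, internal-scheme annotation, and selector labels are assumptions based on the Helm chart labels elsewhere in this thread):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: temporal-frontend-nlb
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: tcp
    service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: '60'
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: 'true'
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: temporal
    app.kubernetes.io/component: frontend
  ports:
  - name: grpc
    port: 7233
    targetPort: 7233
    protocol: TCP
```

Because an NLB is a plain TCP pass-through, there is no host-based routing rule to match, which would sidestep the authority-header problem described above.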

I have already verified that hostname-based gRPC calls work fine from the east1 cluster to the east2 cluster, so I’m fairly certain that the ALB in east2 is routing all traffic from east1 correctly. I think my problem is with IP-only gRPC requests: the ingress controller in the Kubernetes layer inspects the hostname on the request to try to match it to a routing rule defined in the ingress YAML, and since no hostname is included on the request, it doesn’t find a match. That makes the request unroutable, so it is dropped, causing the i/o timeout. I can’t find another logical explanation for the issue I’m seeing. If Temporal were using a hostname, I’m fairly certain it would be working.

Also, I find it odd that the Temporal Helm charts do not include a frontend ingress definition; I had to add one manually to my local fork. For anyone looking for a (mostly) working AWS frontend ingress definition, mine looks something like this (note: the health check is still failing and I’m not sure why; also, this config isn’t using SSL from the ALB to the frontend pod):

apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  annotations:
    alb.ingress.kubernetes.io/actions.default-rule: |
    alb.ingress.kubernetes.io/actions.ssl-redirect: '{"Type": "redirect", "RedirectConfig":
      { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}'
    alb.ingress.kubernetes.io/backend-protocol: HTTP
    alb.ingress.kubernetes.io/backend-protocol-version: GRPC
    alb.ingress.kubernetes.io/certificate-arn: my-arn
    alb.ingress.kubernetes.io/healthcheck-path: /grpc.health.v1.Health/Check
    alb.ingress.kubernetes.io/healthcheck-protocol: HTTP
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    alb.ingress.kubernetes.io/load-balancer-attributes: routing.http2.enabled=true
    alb.ingress.kubernetes.io/scheme: internal
    alb.ingress.kubernetes.io/security-groups: my-sec-group
    alb.ingress.kubernetes.io/tags: env=dev
    alb.ingress.kubernetes.io/target-type: ip
    kubernetes.io/ingress.class: alb
  labels:
    app.kubernetes.io/component: frontend
    app.kubernetes.io/instance: temporal
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: temporal
    app.kubernetes.io/part-of: temporal
    app.kubernetes.io/version: 1.11.3
    helm.sh/chart: temporal-0.11.3
  name: temporal-frontend-ingress
spec:
  backend:
    serviceName: default-rule
    servicePort: use-annotation
  rules:
  - host: frontend-east2.temporal.example.com
    http:
      paths:
      - backend:
          serviceName: temporal-frontend
          servicePort: 7233
        path: /*