Failed to retrieve replication message and Failed to get replication tasks from client errors

I see my history and frontend logs full of these messages.

Frontend:

{"level":"warn","ts":"2022-03-10T13:55:36.891Z","msg":"Failed to get replication tasks from client","service":"frontend","error":"context deadline exceeded","logging-call-at":"client.go:916"}

History:
{"level":"error","ts":"2022-03-10T13:58:47.971Z","msg":"Failed to retrieve replication messages.","shard-id":78,"address":"10.x.yyy.zzz:7234","component":"history-engine","error":"context deadline exceeded","logging-call-at":"historyEngine.go:3000","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/service/history.(*historyEngineImpl).GetReplicationMessages\n\t/temporal/service/history/historyEngine.go:3000\ngo.temporal.io/server/service/history.(*Handler).GetReplicationMessages.func1\n\t/temporal/service/history/handler.go:1212"}

This is happening on Temporal server 15.0.0.
Please note, I recently attempted upgrading to 15.0.2 and it failed (so my schema could be on 15.0.2 whereas frontend/history etc. could still be on 15.0.0)… can that be the issue?

Schema changes apply to minor version releases, not patch versions, so updating from 15.0.0 to 15.0.2 should not have included any schema changes.
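
If you want to rule the schema out, it may help to check what version the database itself reports. This is only a sketch: it assumes a MySQL-backed deployment, placeholder credentials, and the schema_version bookkeeping table used by Temporal's SQL schemas (verify the table and column names against your own schema before relying on it).

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // MySQL driver; swap for your database
)

func main() {
	// Placeholder DSN: point it at the Temporal default store database.
	db, err := sql.Open("mysql", "temporal:temporal@tcp(127.0.0.1:3306)/temporal")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Temporal's SQL schemas record the applied version in schema_version.
	rows, err := db.Query("SELECT db_name, curr_version, min_compatible_version FROM schema_version")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var dbName, curr, minCompat string
		if err := rows.Scan(&dbName, &curr, &minCompat); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s: schema %s (min compatible %s)\n", dbName, curr, minCompat)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}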

I recently attempted upgrading to 15.0.2 and it failed

What was the failure? I checked with the server team, and they mentioned that workflow lock contention could cause issues fetching replication tasks, resulting in context deadline exceeded errors.

Correct, that's my assumption as well.

Well, I have an XDC setup, and my primary/secondary cluster URLs are configured as:

clusterMetadata:
  enableGlobalNamespace: false
  failoverVersionIncrement: 100
  masterClusterName: "clusterA"
  currentClusterName: "clusterA"
  clusterInformation:
    clusterA:
      enabled: true
      initialFailoverVersion: 1
      rpcAddress: "dns:///myprimarycluster:7233"
    clusterB:
      enabled: true
      initialFailoverVersion: 1
      rpcAddress: "dns:///mystandbycluster:7233"

{"level":"fatal","ts":"2022-03-07T17:30:34.201Z","msg":"Invalid rpcAddress for remote cluster","error":"address dns:///tmystandbycluster:7233: too many colons in address "

and the cluster fails to start up…

If I change the DNS address to a plain IP, it starts…
At times, clearing the cluster_info and cluster_info_metadata tables helps…

I am not able to get hold of a definitive configuration guide for setting up and running XDC clusters.

The check is done here via ipsock.go → net.SplitHostPort.
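
That matches the error text: net.SplitHostPort splits on the last colon and rejects a host portion that still contains one, which is exactly what the dns:/// prefix produces. A small standalone sketch, reusing the standby address from the config above:

package main

import (
	"fmt"
	"net"
)

func main() {
	// With the grpc-style scheme prefix, the "host" part still contains
	// colons ("dns://"), so SplitHostPort rejects the whole address.
	_, _, err := net.SplitHostPort("dns:///mystandbycluster:7233")
	fmt.Println(err)
	// Output: address dns:///mystandbycluster:7233: too many colons in address

	// A plain host:port value parses cleanly.
	host, port, err := net.SplitHostPort("mystandbycluster:7233")
	fmt.Println(host, port, err)
	// Output: mystandbycluster 7233 <nil>
}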

I am not able to get hold of a definitive configuration guide for setting up and running XDC clusters.

Checking with the server team on this and will get back.

address dns:///tmystandbycluster:7233: too many colons in address

Wondering what is attaching the extra ":" after the port, as well as the "t"?

The "t" is a typo… the extra ":" caught my attention too… I thought it's possibly from some logger, or it's getting added in somewhere.

I think the problem here is a bug in Temporal and the fact that the dns:/// prefix is not supported (anymore). See also Unable to use dns:// prefix for clusterInformation.rpcAddress configuration · Issue #5979 · temporalio/temporal · GitHub.

The documentation says that dns:/// can be used and it would be the only way to force grpc-go to use the DNS resolver, but given the current code and the use of net.SplitHostPort I don’t think this is possible.
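
For what it's worth, here is what the scheme changes on the grpc-go side. This is only an illustrative sketch, assuming insecure transport credentials and the placeholder standby address from above; it is not how Temporal itself dials remote clusters:

package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// "dns:///host:port" forces grpc-go's DNS resolver, which re-resolves the
	// name and can spread connections across all returned A records.
	conn, err := grpc.Dial("dns:///mystandbycluster:7233",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// A bare "host:port" target uses the passthrough resolver: the address is
	// handed to the dialer as-is. This is the only form that Temporal's
	// rpcAddress validation (net.SplitHostPort) currently accepts.
	conn2, err := grpc.Dial("mystandbycluster:7233",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn2.Close()
}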