"Failed to retrieve replication messages" and "Failed to get replication tasks from client" errors

I see my history and frontend logs full of these messages.

Frontend:

{"level":"warn","ts":"2022-03-10T13:55:36.891Z","msg":"Failed to get replication tasks from client","service":"frontend","error":"context deadline exceeded","logging-call-at":"client.go:916"}

History:
{"level":"error","ts":"2022-03-10T13:58:47.971Z","msg":"Failed to retrieve replication messages.","shard-id":78,"address":"10.x.yyy.zzz:7234","component":"history-engine","error":"context deadline exceeded","logging-call-at":"historyEngine.go:3000","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/service/history.(*historyEngineImpl).GetReplicationMessages\n\t/temporal/service/history/historyEngine.go:3000\ngo.temporal.io/server/service/history.(*Handler).GetReplicationMessages.func1\n\t/temporal/service/history/handler.go:1212"}

This is happening on Temporal server 1.15.0.
Please note, I recently attempted upgrading to 1.15.2 and it failed (so my schema could be on 1.15.2 whereas frontend/history etc. could be on 1.15.0)… can that be the issue?

Schema changes apply to minor version releases, not patch versions, so updating from 1.15.0 to 1.15.2 should not have included any schema changes.

I recently attempted upgrading to 1.15.2 and it failed

What was the failure? I checked with the server team, and they mentioned that workflow lock contention could cause context deadline exceeded errors when fetching replication tasks.

Correct, that's my assumption as well.

Well, I have an XDC setup, and my primary/secondary cluster URLs are configured as:

clusterMetadata:
  enableGlobalNamespace: false
  failoverVersionIncrement: 100
  masterClusterName: "clusterA"
  currentClusterName: "clusterA"
  clusterInformation:
    clusterA:
      enabled: true
      initialFailoverVersion: 1
      rpcAddress: "dns:///myprimarycluster:7233"
    clusterB:
      enabled: true
      initialFailoverVersion: 1
      rpcAddress: "dns:///mystandbycluster:7233"

{"level":"fatal","ts":"2022-03-07T17:30:34.201Z","msg":"Invalid rpcAddress for remote cluster","error":"address dns:///tmystandbycluster:7233: too many colons in address "

and the cluster fails to start up…

If I change the DNS address to a plain IP, it starts…
At times, clearing the cluster_info and cluster_info_metadata tables helps…

I am not able to get hold of a definitive configuration guide for configuring and running XDC clusters.

The check is done here via ipsock → SplitHostPort
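
For reference, here is a minimal standalone sketch (not Temporal's actual validation code) that reproduces the same error with the Go standard library. net.SplitHostPort treats everything before the last colon as the host, and a host that itself contains colons is rejected, which is why the dns:/// form fails while a plain host:port or ip:port passes:

package main

import (
	"fmt"
	"net"
)

func main() {
	// A gRPC-style target: the "dns:///" scheme prefix puts extra colons
	// before the port, so "dns:///mystandbycluster" is taken as the host
	// and SplitHostPort rejects it.
	_, _, err := net.SplitHostPort("dns:///mystandbycluster:7233")
	fmt.Println(err) // address dns:///mystandbycluster:7233: too many colons in address

	// A plain host:port (or ip:port) passes the same check.
	host, port, err := net.SplitHostPort("mystandbycluster:7233")
	fmt.Println(host, port, err) // mystandbycluster 7233 <nil>
}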

I am not able to get hold of a definitive configuration guide for configuring and running XDC clusters.

Checking with the server team on this and will get back.

address dns:///tmystandbycluster:7233: too many colons in address "

Wondering what is attaching the extra ":" after the port, as well as the "t"?

The "t" is a typo… the extra ":" caught my attention too… I thought it's possibly some logger, or it's getting added in somewhere.
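
If it helps, my guess (based only on the error text, not on anything specific in Temporal's code) is that nothing is appending a colon to the address: the colon after the port is just the separator Go's net.AddrError uses when formatting its message as "address <addr>: <reason>". A tiny sketch:

package main

import (
	"fmt"
	"net"
)

func main() {
	// net.AddrError prints as "address <Addr>: <Err>", so the ":" after
	// the port is the separator between the address and the reason.
	err := &net.AddrError{Err: "too many colons in address", Addr: "dns:///mystandbycluster:7233"}
	fmt.Println(err) // address dns:///mystandbycluster:7233: too many colons in address
}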