{"level":"error","ts":"2022-03-14T07:20:31.078Z","msg":"history client encountered error","service":"frontend","error":"sql: no rows in result set","service-error-type":"serviceerror.NotFound","logging-call-at":"metricClient.go:620","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/common/log.(*throttledLogger).Error.func1\n\t/temporal/common/log/throttle_logger.go:79\ngo.temporal.io/server/common/log.(*throttledLogger).rateLimit\n\t/temporal/common/log/throttle_logger.go:100\ngo.temporal.io/server/common/log.(*throttledLogger).Error\n\t/temporal/common/log/throttle_logger.go:78\ngo.temporal.io/server/client/history.(*metricClient).finishMetricsRecording\n\t/temporal/client/history/metricClient.go:620\ngo.temporal.io/server/client/history.(*metricClient).DescribeWorkflowExecution.func1\n\t/temporal/client/history/metricClient.go:176\ngo.temporal.io/server/client/history.(*metricClient).DescribeWorkflowExecution\n\t/temporal/client/history/metricClient.go:179\ngo.temporal.io/server/client/history.(*retryableClient).DescribeWorkflowExecution.func1\n\t/temporal/client/history/retryableClient.go:205\ngo.temporal.io/server/common/backoff.Retry.func1\n\t/temporal/common/backoff/retry.go:104\ngo.temporal.io/server/common/backoff.RetryContext\n\t/temporal/common/backoff/retry.go:125\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:105\ngo.temporal.io/server/client/history.(*retryableClient).DescribeWorkflowExecution\n\t/temporal/client/history/retryableClient.go:209\ngo.temporal.io/server/service/frontend.(*WorkflowHandler).DescribeWorkflowExecution\n\t/temporal/service/frontend/workflowHandler.go:2693\ngo.temporal.io/server/service/frontend.(*DCRedirectionHandlerImpl).DescribeWorkflowExecution.func2\n\t/temporal/service/frontend/dcRedirectionHandler.go:255\ngo.temporal.io/server/service/frontend.(*NoopRedirectionPolicy).WithNamespaceRedirect\n\t/temporal/service/frontend/dcRedirectionPolicy.go:125\ngo.temporal.io/server/service/frontend.(*DCRedirectionHandlerImpl).DescribeWorkflowExecution\n\t/temporal/service/frontend/dcRedirectionHandler.go:251\ngo.temporal.io/api/workflowservice/v1._WorkflowService_DescribeWorkflowExecution_Handler.func1\n\t/go/pkg/mod/go.temporal.io/api@v1.7.1-0.20220131203817-08fe71b1361d/workflowservice/v1/service.pb.go:1643\ngo.temporal.io/server/common/rpc/interceptor.(*SDKVersionInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/sdk_version.go:63\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1116\ngo.temporal.io/server/common/authorization.(*interceptor).Interceptor\n\t/temporal/common/authorization/interceptor.go:152\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1119\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceCountLimitInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/namespace_count_limit.go:99\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1119\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceRateLimitInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/namespace_rate_limit.go:89\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1119\ngo.temporal.io/server/common/rpc/interceptor.(*RateLimitInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/rate_limit.go:84
\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1119\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceValidatorInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/namespace_validator.go:113\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1119\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/telemetry.go:108\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1119\ngo.temporal.io/server/common/metrics.NewServerMetricsContextInjectorInterceptor.func1\n\t/temporal/common/metrics/grpc.go:66\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1119\ngo.temporal.io/server/common/rpc.ServiceErrorInterceptor\n\t/temporal/common/rpc/grpc.go:131\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1119\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceLogInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/namespace_logger.go:84\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1119\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1121\ngo.temporal.io/api/workflowservice/v1._WorkflowService_DescribeWorkflowExecution_Handler\n\t/go/pkg/mod/go.temporal.io/api@v1.7.1-0.20220131203817-08fe71b1361d/workflowservice/v1/service.pb.go:1645\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1282\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1616\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:921"}
In my worker I see:
{"level":"error","ts":"2022-03-14T07:25:52.265Z","msg":"Failed to retrieve replication messages.","shard-id":421,"address":"10.aaa.xxx.yyy:7234","component":"history-engine","error":"context deadline exceeded","logging-call-at":"historyEngine.go:3000","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/service/history.(*historyEngineImpl).GetReplicationMessages\n\t/temporal/service/history/historyEngine.go:3000\ngo.temporal.io/server/service/history.(*Handler).GetReplicationMessages.func1\n\t/temporal/service/history/handler.go:1212"}
{"level":"warn","ts":"2022-03-14T07:25:52.265Z","msg":"Failed to get replication tasks for shard","service":"history","error":"context deadline exceeded","logging-call-at":"handler.go:1220"}
I also see that my database's CPU usage is consistently at 99% and above.
Just a wild theory:
Could it be some rogue workflow(s) with a very deep history?
Loading it could be driving up CPU usage, and since it might be big, the replication could time out…
And could the Temporal history service keep retrying the same thing again and again?
Is there any tctl command / metric / DB query to get hold of the workflow history size for all workflows?
Will check if there is a metric. History size is stored in visibility via the HistoryLength default search attribute. With that you could run a query in the Web UI "Advanced Search" such as: HistoryLength > X
or do the same thing via the SDK API ListWorkflowExecutions,
or via tctl wf list with a query provided.
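For the SDK route, here is a minimal Go sketch using ListWorkflowExecutions. It assumes advanced visibility is enabled, a frontend reachable at the default address, a namespace named your_namespace, and a hypothetical threshold of 10000 events:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"go.temporal.io/api/workflowservice/v1"
	"go.temporal.io/sdk/client"
)

func main() {
	// Assumes the frontend is reachable at the default localhost:7233.
	c, err := client.NewClient(client.Options{Namespace: "your_namespace"})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// Same query you would type into the Web UI "Advanced Search".
	req := &workflowservice.ListWorkflowExecutionsRequest{
		Namespace: "your_namespace",
		PageSize:  100,
		Query:     "HistoryLength > 10000", // hypothetical threshold
	}
	for {
		resp, err := c.ListWorkflow(context.Background(), req)
		if err != nil {
			log.Fatalln("ListWorkflowExecutions failed:", err)
		}
		for _, e := range resp.Executions {
			fmt.Printf("%s (run %s): %d history events\n",
				e.Execution.GetWorkflowId(), e.Execution.GetRunId(), e.GetHistoryLength())
		}
		if len(resp.NextPageToken) == 0 {
			break
		}
		req.NextPageToken = resp.NextPageToken
	}
}
```

The tctl route would be something along the lines of tctl --ns your_namespace wf list -q 'HistoryLength > 10000' (flag names may differ slightly across tctl versions).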
replication_tasks stores replication tasks, which are used for cross-DC replication. If your namespace is configured to replicate to multiple clusters but the remote cluster never comes to fetch the replication tasks, then those tasks will not be cleaned up.
history_node stores all your workflows’ history data. Depending on your workflow history sizes and their retention times, it is possible that it could accumulate that much data.
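If you want to see where those rows are piling up per shard, here is a rough sketch against the default Temporal MySQL schema (table and column names are assumed from the standard schema, and the DSN is a placeholder, so adjust for your setup):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // MySQL driver; swap for your database
)

func main() {
	// Placeholder DSN pointing at the Temporal default persistence store.
	db, err := sql.Open("mysql", "user:pass@tcp(db-host:3306)/temporal")
	if err != nil {
		log.Fatalln(err)
	}
	defer db.Close()

	// Row counts per shard for the two tables discussed above.
	for _, table := range []string{"replication_tasks", "history_node"} {
		rows, err := db.Query(
			"SELECT shard_id, COUNT(*) FROM " + table +
				" GROUP BY shard_id ORDER BY COUNT(*) DESC LIMIT 10")
		if err != nil {
			log.Fatalln(err)
		}
		fmt.Println("top shards in", table)
		for rows.Next() {
			var shardID, count int64
			if err := rows.Scan(&shardID, &count); err != nil {
				log.Fatalln(err)
			}
			fmt.Printf("  shard %d: %d rows\n", shardID, count)
		}
		rows.Close()
	}
}
```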
The thing is, I see the replication happening and my remote has an exact copy of what's available in the primary… that's what makes me scratch my head!
By the way, if I keep switching between primary and standby… and at some point both have a copy of a workflow, and it then receives more signals/events… can the replication stop and get queued up?
Could you run tctl --ns your_namespace n desc to see how many clusters are configured for your namespace? Is it possible that you have more than 2 clusters configured?
Usually, replication_tasks should be small as it only stores metadata. Could you check if this metric has any datapoints? Operation: ReplicationTaskCleanup, name: replication_task_cleanup_count
Secondly, could you pick a shard and run tctl admin shard describe --shard_id?
I just tried cleaning up my standby cluster and allowed it to replicate data afresh…
I see the replication is too slow (however, I do not see any connection drops, network issues, etc.)…
This time around only 3 of my workflows got replicated to the standby cluster…
I even tried sending some fake signals to the active cluster to force replication (see the sketch after this post), but that too did not help; refer here.
Could it be related to Adding a cluster using dns fails - #2 by yux? There dns:/// is breaking, hence no replication happens, and over a period the replication DB queue/table grows and causes DB CPU to go high?
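For reference, the "fake signal" nudge mentioned above can be done from the Go SDK roughly like this (the workflow ID and signal name are placeholders; signaling an open workflow just appends a history event, which should generate a fresh replication task):

```go
package main

import (
	"context"
	"log"

	"go.temporal.io/sdk/client"
)

func main() {
	// Assumes the active cluster's frontend is reachable at the default localhost:7233.
	c, err := client.NewClient(client.Options{Namespace: "your_namespace"})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// Placeholder workflow ID and signal name; the target workflow just needs to be open.
	err = c.SignalWorkflow(context.Background(), "some-open-workflow-id", "", "noop-signal", nil)
	if err != nil {
		log.Fatalln("SignalWorkflow failed:", err)
	}
}
```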