Errors in frontend and history

I see my frontend filled with these logs:

{"level":"error","ts":"2022-03-14T07:20:31.078Z","msg":"history client encountered error","service":"frontend","error":"sql: no rows in result set","service-error-type":"serviceerror.NotFound","logging-call-at":"metricClient.go:620","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/common/log.(*throttledLogger).Error.func1\n\t/temporal/common/log/throttle_logger.go:79\ngo.temporal.io/server/common/log.(*throttledLogger).rateLimit\n\t/temporal/common/log/throttle_logger.go:100\ngo.temporal.io/server/common/log.(*throttledLogger).Error\n\t/temporal/common/log/throttle_logger.go:78\ngo.temporal.io/server/client/history.(*metricClient).finishMetricsRecording\n\t/temporal/client/history/metricClient.go:620\ngo.temporal.io/server/client/history.(*metricClient).DescribeWorkflowExecution.func1\n\t/temporal/client/history/metricClient.go:176\ngo.temporal.io/server/client/history.(*metricClient).DescribeWorkflowExecution\n\t/temporal/client/history/metricClient.go:179\ngo.temporal.io/server/client/history.(*retryableClient).DescribeWorkflowExecution.func1\n\t/temporal/client/history/retryableClient.go:205\ngo.temporal.io/server/common/backoff.Retry.func1\n\t/temporal/common/backoff/retry.go:104\ngo.temporal.io/server/common/backoff.RetryContext\n\t/temporal/common/backoff/retry.go:125\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:105\ngo.temporal.io/server/client/history.(*retryableClient).DescribeWorkflowExecution\n\t/temporal/client/history/retryableClient.go:209\ngo.temporal.io/server/service/frontend.(*WorkflowHandler).DescribeWorkflowExecution\n\t/temporal/service/frontend/workflowHandler.go:2693\ngo.temporal.io/server/service/frontend.(*DCRedirectionHandlerImpl).DescribeWorkflowExecution.func2\n\t/temporal/service/frontend/dcRedirectionHandler.go:255\ngo.temporal.io/server/service/frontend.(*NoopRedirectionPolicy).WithNamespaceRedirect\n\t/temporal/service/frontend/dcRedirectionPolicy.go:125\ngo.temporal.io/server/service/frontend.(*DCRedirectionHandlerImpl).DescribeWorkflowExecution\n\t/temporal/service/frontend/dcRedirectionHandler.go:251\ngo.temporal.io/api/workflowservice/v1._WorkflowService_DescribeWorkflowExecution_Handler.func1\n\t/go/pkg/mod/go.temporal.io/api@v1.7.1-0.20220131203817-08fe71b1361d/workflowservice/v1/service.pb.go:1643\ngo.temporal.io/server/common/rpc/interceptor.(*SDKVersionInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/sdk_version.go:63\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1116\ngo.temporal.io/server/common/authorization.(*interceptor).Interceptor\n\t/temporal/common/authorization/interceptor.go:152\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1119\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceCountLimitInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/namespace_count_limit.go:99\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1119\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceRateLimitInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/namespace_rate_limit.go:89\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1119\ngo.temporal.io/server/common/rpc/interceptor.(*RateLimitInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/rate_limit.go:84
\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1119\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceValidatorInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/namespace_validator.go:113\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1119\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/telemetry.go:108\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1119\ngo.temporal.io/server/common/metrics.NewServerMetricsContextInjectorInterceptor.func1\n\t/temporal/common/metrics/grpc.go:66\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1119\ngo.temporal.io/server/common/rpc.ServiceErrorInterceptor\n\t/temporal/common/rpc/grpc.go:131\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1119\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceLogInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/namespace_logger.go:84\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1119\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1121\ngo.temporal.io/api/workflowservice/v1._WorkflowService_DescribeWorkflowExecution_Handler\n\t/go/pkg/mod/go.temporal.io/api@v1.7.1-0.20220131203817-08fe71b1361d/workflowservice/v1/service.pb.go:1645\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1282\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1616\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:921"}

In my worker I see:


{"level":"error","ts":"2022-03-14T07:25:52.265Z","msg":"Failed to retrieve replication messages.","shard-id":421,"address":"10.aaa.xxx.yyy:7234","component":"history-engine","error":"context deadline exceeded","logging-call-at":"historyEngine.go:3000","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/service/history.(*historyEngineImpl).GetReplicationMessages\n\t/temporal/service/history/historyEngine.go:3000\ngo.temporal.io/server/service/history.(*Handler).GetReplicationMessages.func1\n\t/temporal/service/history/handler.go:1212"}
{"level":"warn","ts":"2022-03-14T07:25:52.265Z","msg":"Failed to get replication tasks for shard","service":"history","error":"context deadline exceeded","logging-call-at":"handler.go:1220"}

I also see that my database's CPU usage is consistently at 99% and above.

Hi @madhu, has this issue been resolved via Slack?

Hi Tihomir,
No, I am still struggling with this; no reply in the Slack channel either.

Just a wild theory:
Could it be some rogue workflow(s) with a deep history?
Loading it could be taking high CPU, and since it might be big, the replication could time out…
And could the Temporal history service keep retrying the same thing again and again?

Is there any tctl command / metric / DB query to get hold of the workflow history size for all workflows?

From RDS Performance Insights I see the query below taking up most of the CPU (about 1.5+ cores of the 2 cores).

SET transaction_isolation='READ-COMMITTED'

Not sure when this gets called though!

And if I switch off my XDC replication, the CPU usage comes down to about 8%.

Is there any tctl command / metric / DB query to get hold of the workflow history size for all workflows?

Will check if there is a metric. History size is stored in visibility via the
HistoryLength default search attribute. With that you could run a query in the Web UI "Advanced Search", such as:
HistoryLength > X
or do the same thing via the SDK API ListWorkflowExecutions,
or via tctl wf list with a query.
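For example, a minimal tctl sketch, assuming advanced visibility is enabled (replace myns-dev with your namespace; the 10000 threshold is just a placeholder):

tctl --ns myns-dev workflow list --query 'HistoryLength > 10000'

Keep in mind that HistoryLength is only populated in visibility once a workflow closes, so this mainly surfaces completed executions.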

Tagging @yux as well if that’s ok :slight_smile:

Thanks @tihomir

By the way, I see my replication_tasks and history_node tables have too much data, check this out:

SELECT
  TABLE_NAME AS `Table`,
  ROUND((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024) AS `Size (MB)`
FROM
  information_schema.TABLES
WHERE
  TABLE_SCHEMA = "temporal"
ORDER BY
  (DATA_LENGTH + INDEX_LENGTH) DESC;
+---------------------------+-----------+
| Table                     | Size (MB) |
+---------------------------+-----------+
| replication_tasks         |      3354 |
| history_node              |      3004 |
| history_tree              |        40 |
| executions                |        21 |
| timer_tasks               |        12 |
| activity_info_maps        |        10 |
| replication_tasks_dlq     |         8 |
| signals_requested_sets    |         6 |
| timer_info_maps           |         5 |
| current_executions        |         1 |
| cluster_membership        |         0 |
| shards                    |         0 |
| task_queues               |         0 |
| buffered_events           |         0 |
| namespaces                |         0 |
| schema_update_history     |         0 |
| transfer_tasks            |         0 |
| request_cancel_info_maps  |         0 |
| cluster_metadata_info     |         0 |
| cluster_metadata          |         0 |
| tiered_storage_tasks      |         0 |
| queue_metadata            |         0 |
| child_execution_info_maps |         0 |
| tasks                     |         0 |
| queue                     |         0 |
| visibility_tasks          |         0 |
| namespace_metadata        |         0 |
| signal_info_maps          |         0 |
| schema_version            |         0 |
+---------------------------+-----------+
29 rows in set (0.049 sec)

replication_tasks stores replication tasks, which are used for cross-DC replication. If your namespace is configured to replicate to multiple clusters, but the remote cluster never comes to fetch the replication tasks, then the replication tasks will not be cleaned up.
history_node stores all your workflows’ history data. Depending on your workflow history sizes and their retention time, it is possible that it could accumulate that much data.
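If you want a rough idea of which workflows contribute most to history_node, a minimal MySQL sketch (assuming the v1.14 schema, where history_node has tree_id and data columns and each tree_id maps to one workflow's history tree) could be:

SELECT
  HEX(tree_id) AS `Tree`,
  COUNT(*) AS `Nodes`,
  ROUND(SUM(LENGTH(data)) / 1024 / 1024) AS `Size (MB)`
FROM history_node
GROUP BY tree_id
ORDER BY SUM(LENGTH(data)) DESC
LIMIT 10;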

The thing is, I see the replication happening and my remote has an exact copy of what's available in the primary… that's what makes me scratch my head!

By the way, if I keep switching between primary and standby… and at some point both have a copy of a workflow, and upon receiving more signals/events… can the replication stop and get queued up?

Could you do tctl --ns your_namespace n desc to see how many clusters are configured for your namespace? Is it possible that you have more than 2 clusters configured?

Usually, replication_tasks should be small as it only stores metadata. Could you check if the metric has any data points?
Operation: ReplicationTaskCleanup, name: replication_task_cleanup_count

Secondly, could you pick a shard and run tctl admin shard describe --shard_id?

This shows:

tctl admin shard describe --shard_id 278
{
  "shardId": 278,
  "rangeId": "317",
  "owner": "10.xx.yyy.zz:7234",
  "transferAckLevel": "332398619",
  "updateTime": "2022-03-16T17:37:10.514845009Z",
  "timerAckLevelTime": "2022-03-16T16:26:12.390669141Z",
  "namespaceNotificationVersion": "16",
  "clusterTransferAckLevel": {
    "active": "332398619",
    "standby": "332398619"
  },
  "clusterTimerAckLevel": {
    "active": "2022-03-16T16:26:12.390669141Z",
    "standby": "2022-03-16T16:26:12.390669141Z"
  },
  "clusterReplicationLevel": {
    "standby": "332398591"
  },
  "visibilityAckLevel": "324010108"
}

 tctl admin shard describe --shard_id 128
{
  "shardId": 128,
  "rangeId": "327",
  "owner": "10.xx.yy.zz:7234",
  "transferAckLevel": "342884429",
  "updateTime": "2022-03-16T17:37:53.573187203Z",
  "timerAckLevelTime": "2022-03-16T16:32:39.752014201Z",
  "namespaceNotificationVersion": "16",
  "clusterTransferAckLevel": {
    "active": "342884429",
    "standby": "342884429"
  },
  "clusterTimerAckLevel": {
    "active": "2022-03-16T16:32:39.752014201Z",
    "standby": "2022-03-16T16:32:39.752014201Z"
  },
  "clusterReplicationLevel": {
    "standby": "342884351"
  },
  "visibilityAckLevel": "336593049"
}



The current cleanup count is 936.

|replication_task_cleanup_count{app_kubernetes_io_component="history",app_kubernetes_io_instance="xxxal",app_kubernetes_io_managed_by="Helm",app_kubernetes_io_name="temporal",app_kubernetes_io_part_of="temporal",app_kubernetes_io_version="1.14.0",helm_sh_chart="temporal-0.14.0",instance="10.xx.yyy.zz:9090",job="kubernetes-pods",kubernetes_namespace="temporal",kubernetes_pod_name="my-temporal-history-969f44c48-fsxnd",namespace="all",operation="ReplicationTaskCleanup",pod_template_hash="969f44c48",type="history"}|936|

Historically, too, I see the cleanup count in the order of 400 to 1500.

tctl --ns myns-dev  n desc
Name: myns-dev
Id: ff24a0e9-f4eb-4590-8c52-0c560dc87155
Description: my temporal namespace
OwnerEmail: me@mydomain.com
NamespaceData: map[string]string(nil)
State: Registered
Retention: 24h0m0s
ActiveClusterName: active
Clusters: active, standby
HistoryArchivalState: Enabled
HistoryArchivalURI: s3://mytemporalbuckets/myenvironment
VisibilityArchivalState: Enabled
VisibilityArchivalURI: s3://mytemporalbuckets/myenvironment
Bad binaries to reset:
+-----------------+----------+------------+--------+
| BINARY CHECKSUM | OPERATOR | START TIME | REASON |
+-----------------+----------+------------+--------+
+-----------------+----------+------------+--------+

I just tried cleaning up my standby cluster and allowed it to replicate data afresh…
I see the replication is very slow (however, I do not see any connection drops/network issues etc.)…
This time around only 3 of my workflows got replicated to the standby cluster…

I even tried sending some fake signals to the active cluster to force replication, but that too did not help.
Refer here.

Hi @Yimin_Chen @yux, any suggestions based on the details I have provided?

Could it be related to Adding a cluster using dns fails - #2 by yux? There, dns:/// is breaking and hence no replication is happening, and over time the replication DB queue/table grows and causes the DB CPU to go high?

@yux @tihomir any suggestions?