{"level":"error","ts":"2022-03-14T07:20:31.078Z","msg":"history client encountered error","service":"frontend","error":"sql: no rows in result set","service-error-type":"serviceerror.NotFound","logging-call-at":"metricClient.go:620","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/common/log.(*throttledLogger).Error.func1\n\t/temporal/common/log/throttle_logger.go:79\ngo.temporal.io/server/common/log.(*throttledLogger).rateLimit\n\t/temporal/common/log/throttle_logger.go:100\ngo.temporal.io/server/common/log.(*throttledLogger).Error\n\t/temporal/common/log/throttle_logger.go:78\ngo.temporal.io/server/client/history.(*metricClient).finishMetricsRecording\n\t/temporal/client/history/metricClient.go:620\ngo.temporal.io/server/client/history.(*metricClient).DescribeWorkflowExecution.func1\n\t/temporal/client/history/metricClient.go:176\ngo.temporal.io/server/client/history.(*metricClient).DescribeWorkflowExecution\n\t/temporal/client/history/metricClient.go:179\ngo.temporal.io/server/client/history.(*retryableClient).DescribeWorkflowExecution.func1\n\t/temporal/client/history/retryableClient.go:205\ngo.temporal.io/server/common/backoff.Retry.func1\n\t/temporal/common/backoff/retry.go:104\ngo.temporal.io/server/common/backoff.RetryContext\n\t/temporal/common/backoff/retry.go:125\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:105\ngo.temporal.io/server/client/history.(*retryableClient).DescribeWorkflowExecution\n\t/temporal/client/history/retryableClient.go:209\ngo.temporal.io/server/service/frontend.(*WorkflowHandler).DescribeWorkflowExecution\n\t/temporal/service/frontend/workflowHandler.go:2693\ngo.temporal.io/server/service/frontend.(*DCRedirectionHandlerImpl).DescribeWorkflowExecution.func2\n\t/temporal/service/frontend/dcRedirectionHandler.go:255\ngo.temporal.io/server/service/frontend.(*NoopRedirectionPolicy).WithNamespaceRedirect\n\t/temporal/service/frontend/dcRedirectionPolicy.go:125\ngo.temporal.io/server/service/frontend.(*DCRedirectionHandlerImpl).DescribeWorkflowExecution\n\t/temporal/service/frontend/dcRedirectionHandler.go:251\ngo.temporal.io/api/workflowservice/v1._WorkflowService_DescribeWorkflowExecution_Handler.func1\n\t/go/pkg/mod/go.temporal.io/api@v1.7.1-0.20220131203817-08fe71b1361d/workflowservice/v1/service.pb.go:1643\ngo.temporal.io/server/common/rpc/interceptor.(*SDKVersionInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/sdk_version.go:63\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1116\ngo.temporal.io/server/common/authorization.(*interceptor).Interceptor\n\t/temporal/common/authorization/interceptor.go:152\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1119\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceCountLimitInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/namespace_count_limit.go:99\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1119\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceRateLimitInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/namespace_rate_limit.go:89\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1119\ngo.temporal.io/server/common/rpc/interceptor.(*RateLimitInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/rate_limit.go:84
\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1119\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceValidatorInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/namespace_validator.go:113\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1119\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/telemetry.go:108\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1119\ngo.temporal.io/server/common/metrics.NewServerMetricsContextInjectorInterceptor.func1\n\t/temporal/common/metrics/grpc.go:66\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1119\ngo.temporal.io/server/common/rpc.ServiceErrorInterceptor\n\t/temporal/common/rpc/grpc.go:131\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1119\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceLogInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/namespace_logger.go:84\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1119\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1121\ngo.temporal.io/api/workflowservice/v1._WorkflowService_DescribeWorkflowExecution_Handler\n\t/go/pkg/mod/go.temporal.io/api@v1.7.1-0.20220131203817-08fe71b1361d/workflowservice/v1/service.pb.go:1645\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1282\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1616\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:921"}
In my worker I see:
{"level":"error","ts":"2022-03-14T07:25:52.265Z","msg":"Failed to retrieve replication messages.","shard-id":421,"address":"10.aaa.xxx.yyy:7234","component":"history-engine","error":"context deadline exceeded","logging-call-at":"historyEngine.go:3000","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/service/history.(*historyEngineImpl).GetReplicationMessages\n\t/temporal/service/history/historyEngine.go:3000\ngo.temporal.io/server/service/history.(*Handler).GetReplicationMessages.func1\n\t/temporal/service/history/handler.go:1212"}
{"level":"warn","ts":"2022-03-14T07:25:52.265Z","msg":"Failed to get replication tasks for shard","service":"history","error":"context deadline exceeded","logging-call-at":"handler.go:1220"}
I also see that my database's CPU usage is consistently at 99% and above.
Just a wild theory:
Could it be some rogue workflow(s) with a very deep history?
Loading it could be driving up CPU usage, and since it might be big, the replication could time out…
And could the Temporal history service keep retrying the same thing again and again?
Is there any tctl command / metric / DB query to get hold of the workflow history size for all workflows?
Will check if there is a metric. History size is stored in visibility via the HistoryLength default search attribute. With that you could run a query in the Web UI "Advanced Search" such as: HistoryLength > X
or do the same thing via the SDK API ListWorkflowExecutions,
or via tctl wf list with a query provided.
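For the SDK route, here is a minimal Go sketch using ListWorkflowExecutions. It assumes advanced visibility is enabled, a frontend reachable at the default address, a namespace named your_namespace, and a hypothetical threshold of 10000 events:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"go.temporal.io/api/workflowservice/v1"
	"go.temporal.io/sdk/client"
)

func main() {
	// Assumes the frontend is reachable at the default localhost:7233.
	c, err := client.NewClient(client.Options{Namespace: "your_namespace"})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// Same query you would type into the Web UI "Advanced Search".
	req := &workflowservice.ListWorkflowExecutionsRequest{
		Namespace: "your_namespace",
		PageSize:  100,
		Query:     "HistoryLength > 10000", // hypothetical threshold
	}
	for {
		resp, err := c.ListWorkflow(context.Background(), req)
		if err != nil {
			log.Fatalln("ListWorkflowExecutions failed:", err)
		}
		for _, e := range resp.Executions {
			fmt.Printf("%s (run %s): %d history events\n",
				e.Execution.GetWorkflowId(), e.Execution.GetRunId(), e.GetHistoryLength())
		}
		if len(resp.NextPageToken) == 0 {
			break
		}
		req.NextPageToken = resp.NextPageToken
	}
}
```

The tctl route would be something along the lines of tctl --ns your_namespace wf list -q 'HistoryLength > 10000' (flag names may differ slightly across tctl versions).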
replication_tasks stores replication tasks, which are used for cross-DC replication. If your namespace is configured to replicate to multiple clusters but the remote cluster never comes to fetch the replication tasks, then those tasks will not be cleaned up.
history_node stores all your workflows’ history data. Depending on your workflow history sizes and their retention times, it is possible that it could accumulate that much data.
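If you want to see where those rows are piling up per shard, here is a rough sketch against the default Temporal MySQL schema (table and column names are assumed from the standard schema, and the DSN is a placeholder, so adjust for your setup):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // MySQL driver; swap for your database
)

func main() {
	// Placeholder DSN pointing at the Temporal default persistence store.
	db, err := sql.Open("mysql", "user:pass@tcp(db-host:3306)/temporal")
	if err != nil {
		log.Fatalln(err)
	}
	defer db.Close()

	// Row counts per shard for the two tables discussed above.
	for _, table := range []string{"replication_tasks", "history_node"} {
		rows, err := db.Query(
			"SELECT shard_id, COUNT(*) FROM " + table +
				" GROUP BY shard_id ORDER BY COUNT(*) DESC LIMIT 10")
		if err != nil {
			log.Fatalln(err)
		}
		fmt.Println("top shards in", table)
		for rows.Next() {
			var shardID, count int64
			if err := rows.Scan(&shardID, &count); err != nil {
				log.Fatalln(err)
			}
			fmt.Printf("  shard %d: %d rows\n", shardID, count)
		}
		rows.Close()
	}
}
```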
The thing is, I see the replication happening and my remote has an exact copy of what's available in the primary… that's what makes me scratch my head!
By the way, if I keep switching between primary and standby… and at some point both have a copy of a workflow, and it then receives more signals/events… can the replication stop and get queued up?
Could you run tctl --ns your_namespace n desc to see how many clusters are configured for your namespace? Is it possible that you have more than 2 clusters configured?
Usually, replication_tasks should be small as it only stores metadata. Could you check if this metric has any datapoints? Operation: ReplicationTaskCleanup, name: replication_task_cleanup_count
Secondly, could you pick a shard and run tctl admin shard describe --shard_id?
I just tried cleaning up my standby cluster and allowed it to replicate data afresh…
I see the replication is too slow (however, I do not see any connection drops, network issues, etc.)…
This time around only 3 of my workflows got replicated to the standby cluster…
I even tried sending some fake signals to the active cluster to force replication (see the sketch after this post), but that too did not help; refer here.
Could it be related to Adding a cluster using dns fails - #2 by yux? There dns:/// is breaking, hence no replication happens, and over a period the replication DB queue/table grows and causes DB CPU to go high?
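For reference, the "fake signal" nudge mentioned above can be done from the Go SDK roughly like this (the workflow ID and signal name are placeholders; signaling an open workflow just appends a history event, which should generate a fresh replication task):

```go
package main

import (
	"context"
	"log"

	"go.temporal.io/sdk/client"
)

func main() {
	// Assumes the active cluster's frontend is reachable at the default localhost:7233.
	c, err := client.NewClient(client.Options{Namespace: "your_namespace"})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// Placeholder workflow ID and signal name; the target workflow just needs to be open.
	err = c.SignalWorkflow(context.Background(), "some-open-workflow-id", "", "noop-signal", nil)
	if err != nil {
		log.Fatalln("SignalWorkflow failed:", err)
	}
}
```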