Errors in frontend and history

Thanks @tihomir

By the way, I see my replication_tasks and history_node tables have too much data. Check this out:

SELECT
  TABLE_NAME AS `Table`,
  ROUND((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024) AS `Size (MB)`
FROM
  information_schema.TABLES
WHERE
  TABLE_SCHEMA = "temporal"
ORDER BY
  (DATA_LENGTH + INDEX_LENGTH) DESC;
+---------------------------+-----------+
| Table                     | Size (MB) |
+---------------------------+-----------+
| replication_tasks         |      3354 |
| history_node              |      3004 |
| history_tree              |        40 |
| executions                |        21 |
| timer_tasks               |        12 |
| activity_info_maps        |        10 |
| replication_tasks_dlq     |         8 |
| signals_requested_sets    |         6 |
| timer_info_maps           |         5 |
| current_executions        |         1 |
| cluster_membership        |         0 |
| shards                    |         0 |
| task_queues               |         0 |
| buffered_events           |         0 |
| namespaces                |         0 |
| schema_update_history     |         0 |
| transfer_tasks            |         0 |
| request_cancel_info_maps  |         0 |
| cluster_metadata_info     |         0 |
| cluster_metadata          |         0 |
| tiered_storage_tasks      |         0 |
| queue_metadata            |         0 |
| child_execution_info_maps |         0 |
| tasks                     |         0 |
| queue                     |         0 |
| visibility_tasks          |         0 |
| namespace_metadata        |         0 |
| signal_info_maps          |         0 |
| schema_version            |         0 |
+---------------------------+-----------+
29 rows in set (0.049 sec)

replication_tasks stores the tasks used for cross-DC replication. If your namespace is configured to replicate to multiple clusters, but the remote cluster never comes to fetch the replication tasks, then the replication tasks will not be cleaned up.
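
If you want to see whether that backlog is concentrated in a few shards, something like this can help (a sketch only, assuming the stock Temporal MySQL schema, where replication_tasks is keyed by shard_id and task_id):

-- Sketch: shard_id/task_id per the stock Temporal MySQL schema
SELECT
  shard_id,
  COUNT(*) AS pending_tasks,
  MIN(task_id) AS oldest_task_id,
  MAX(task_id) AS newest_task_id
FROM temporal.replication_tasks
GROUP BY shard_id
ORDER BY pending_tasks DESC
LIMIT 10;
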
history_node stores all your workflows’ history data. Depending on your workflow history sizes and retention time, it is possible to accumulate that much data.
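
To see which workflows contribute the most, a query like this works (again a sketch against the stock schema, where each history tree corresponds to a workflow execution and data holds the serialized events):

-- Sketch: each tree_id is one execution's history tree,
-- data is the serialized event blob (stock schema assumed)
SELECT
  tree_id,
  COUNT(*) AS nodes,
  ROUND(SUM(LENGTH(data)) / 1024 / 1024) AS approx_size_mb
FROM temporal.history_node
GROUP BY tree_id
ORDER BY approx_size_mb DESC
LIMIT 10;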

The thing is, I see the replication happening and my remote has an exact copy of what’s available in the primary… that’s what makes me scratch my head!

By the way, if I keep switching between primary and standby… and at some point both have a copy of a workflow, and upon receiving more signals/events… can the replication stop and get queued up?

Could you run tctl --ns your_namespace n desc to see how many clusters are configured for your namespace? Is it possible that you have more than 2 clusters configured?

Usually the replication_tasks table should be small, as it only stores metadata. Could you check whether this metric has any data points?
Operation: ReplicationTaskCleanup name: replication_task_cleanup_count
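
If these metrics land in Prometheus, a query along these lines shows the cleanup rate over time (a sketch on my part: it assumes replication_task_cleanup_count is a counter, which the sample later in the thread suggests, and the label matcher should be adjusted to your setup):

sum by (operation) (rate(replication_task_cleanup_count{operation="ReplicationTaskCleanup"}[5m]))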

Secondly, could you pick a shard and run tctl admin shard describe --shard_id?

This shows:

tctl admin shard describe --shard_id 278
{
  "shardId": 278,
  "rangeId": "317",
  "owner": "10.xx.yyy.zz:7234",
  "transferAckLevel": "332398619",
  "updateTime": "2022-03-16T17:37:10.514845009Z",
  "timerAckLevelTime": "2022-03-16T16:26:12.390669141Z",
  "namespaceNotificationVersion": "16",
  "clusterTransferAckLevel": {
    "active": "332398619",
    "standby": "332398619"
  },
  "clusterTimerAckLevel": {
    "active": "2022-03-16T16:26:12.390669141Z",
    "standby": "2022-03-16T16:26:12.390669141Z"
  },
  "clusterReplicationLevel": {
    "standby": "332398591"
  },
  "visibilityAckLevel": "324010108"
}

 tctl admin shard describe --shard_id 128
{
  "shardId": 128,
  "rangeId": "327",
  "owner": "10.xx.yy.zz:7234",
  "transferAckLevel": "342884429",
  "updateTime": "2022-03-16T17:37:53.573187203Z",
  "timerAckLevelTime": "2022-03-16T16:32:39.752014201Z",
  "namespaceNotificationVersion": "16",
  "clusterTransferAckLevel": {
    "active": "342884429",
    "standby": "342884429"
  },
  "clusterTimerAckLevel": {
    "active": "2022-03-16T16:32:39.752014201Z",
    "standby": "2022-03-16T16:32:39.752014201Z"
  },
  "clusterReplicationLevel": {
    "standby": "342884351"
  },
  "visibilityAckLevel": "336593049"
}
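
Reading the two shards side by side (assuming transfer and replication task IDs come from the same per-shard sequence, which is my assumption):

shard 278: transferAckLevel 332398619 - clusterReplicationLevel.standby 332398591 = 28 tasks
shard 128: transferAckLevel 342884429 - clusterReplicationLevel.standby 342884351 = 78 tasks

So the standby ack level trails by only a few dozen tasks on both shards, which would be consistent with the standby looking fully caught up.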



The current cleanup count is 936.

|replication_task_cleanup_count{app_kubernetes_io_component="history",app_kubernetes_io_instance="xxxal",app_kubernetes_io_managed_by="Helm",app_kubernetes_io_name="temporal",app_kubernetes_io_part_of="temporal",app_kubernetes_io_version="1.14.0",helm_sh_chart="temporal-0.14.0",instance="10.xx.yyy.zz:9090",job="kubernetes-pods",kubernetes_namespace="temporal",kubernetes_pod_name="my-temporal-history-969f44c48-fsxnd",namespace="all",operation="ReplicationTaskCleanup",pod_template_hash="969f44c48",type="history"}|936|

Historically too, I see the cleanup count in the order of 400 to 1500.

tctl --ns myns-dev  n desc
Name: myns-dev
Id: ff24a0e9-f4eb-4590-8c52-0c560dc87155
Description: my temporal namespace
OwnerEmail: me@mydomain.com
NamespaceData: map[string]string(nil)
State: Registered
Retention: 24h0m0s
ActiveClusterName: active
Clusters: active, standby
HistoryArchivalState: Enabled
HistoryArchivalURI: s3://mytemporalbuckets/myenvironment
VisibilityArchivalState: Enabled
VisibilityArchivalURI: s3://mytemporalbuckets/myenvironment
Bad binaries to reset:
+-----------------+----------+------------+--------+
| BINARY CHECKSUM | OPERATOR | START TIME | REASON |
+-----------------+----------+------------+--------+
+-----------------+----------+------------+--------+

I just tried cleaning up my standby cluster and allowed it to replicate data afresh…
I see the replication is too slow (however, I do not see any connection drops/network issues etc.)…
This time around only 3 of my workflows got replicated to the standby cluster…

I even tried sending some fake signals to the active cluster to force replication, but that too did not help.
Refer here.

Hi @Yimin_Chen @yux, any suggestions based on the details I have provided?

Could it be related to Adding a cluster using dns fails - #2 by yux? There, dns:/// breaks, so no replication happens; over time the replication queue table grows and causes DB CPU to go high?

@yux @tihomir any suggestions?

@madhu I don’t at this point, but will ping the guys and get back to you with more info.

Given that your DB’s CPU usage is consistently above 99%, that could be the root cause of all kinds of weird issues. How many historyShards do you have? And how many temporal history service instances do you have? Is it possible to scale up your DB to have more CPU available?

Not an expert on RDS. Have you tried to debug the CPU issue? Troubleshoot high CPU usage on Amazon RDS for MySQL

As I said, this does not happen normally. If I run my cluster standalone, the CPU usage is hardly 8%.
If I configure replication using dns:///, CPU goes to 99%.
If I configure replication using the IP directly, the replication happens and CPU usage is back to 8 or 9%.
I have no reason to believe it is an RDS issue.
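
For context, the only difference between the two setups is the form of the frontend address in the cluster metadata. Roughly like this (hostnames and failover versions are placeholders, and the exact layout may differ per version and Helm chart):

clusterMetadata:
  enableGlobalNamespace: true
  failoverVersionIncrement: 10
  masterClusterName: "active"
  currentClusterName: "active"
  clusterInformation:
    active:
      enabled: true
      initialFailoverVersion: 1
      rpcName: "frontend"
      rpcAddress: "10.xx.yyy.zz:7233"                       # direct IP: replication works
    standby:
      enabled: true
      initialFailoverVersion: 2
      rpcName: "frontend"
      rpcAddress: "dns:///standby-frontend.example:7233"    # gRPC DNS target: replication stalls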

I see the replication table clogged only when I have my replication configured using dns:///,
and the same set of configurations was working just fine with 1.14.x…

When using dns:///, the replication stops. Is my understanding correct?

Sorry for the delayed reply, I was on vacation :slight_smile:

Yes, with dns:/// the clusters are not able to resolve each other and replication stops.
I tried without dns:/// and that too did not help; using the IP address I am able to replicate.
dns:/// was working in 1.4.x
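
For anyone debugging the same thing: dns:/// is gRPC’s DNS target scheme, so the name after it must be resolvable from inside the pod that opens the connection. A quick check (hostname is a placeholder):

nslookup standby-frontend.example
# or
dig +short standby-frontend.example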

Things started working after I cleaned up the cluster_metadata and cluster_metadata_info tables, and also after promoting my namespace to a global namespace again.
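
For anyone landing here later, the rough sequence was (a sketch only: back up the tables first; the column name is from the stock MySQL schema and the tctl flag is from the v1 docs, so verify both on your version):

# inspect which clusters are registered before deleting stale rows
mysql temporal -e "SELECT cluster_name FROM cluster_metadata_info;"

# after cleaning up, promote the namespace to global again
tctl --ns myns-dev namespace update --promote_global true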

Refer to Adding a cluster using dns fails - #10 by madhu