Temporal not writing to Cassandra?

Hello,

We ran into a weird situation with Temporal/Cassandra today. I was hoping you could point me in a direction to look for the cause. My own guess is that it's somehow related to Cassandra, but I'm not sure: I can connect to it and it seems operational.

The information I was able to collect:

We had two incidents: the first recovered without anyone noticing it, while the second resulted in us disabling Temporal and switching back to our old system.

I will show information from the second incident, since the two look similar.
During these incidents we saw a lot of errors, but all of them were ones we're familiar with; I had always assumed they are connected to shard rebalancing.

Mostly from the history service (one way to tally these messages is sketched after the list):

  • processor pump failed with error
  • Error updating ack level for shard
  • Error updating timer ack level for shard
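For context, here is a rough sketch of how one might count how often these messages show up in the history-service logs. It assumes the service emits JSON log lines with a `msg` field; the field name and the exact message wording are assumptions on my part, so adjust to whatever your log output actually looks like.

```go
// log_scan.go — hedged sketch: count occurrences of the shard-related error
// messages above in JSON-structured history-service logs read from stdin,
// e.g. `kubectl logs <history-pod> | go run log_scan.go`.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

func main() {
	wanted := []string{
		"processor pump failed with error",
		"Error updating ack level for shard",
		"Error updating timer ack level for shard",
	}
	counts := map[string]int{}

	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // allow long log lines

	for sc.Scan() {
		var entry map[string]any
		if err := json.Unmarshal(sc.Bytes(), &entry); err != nil {
			continue // skip lines that are not JSON
		}
		msg, _ := entry["msg"].(string) // "msg" field name is an assumption
		for _, w := range wanted {
			if strings.Contains(msg, w) {
				counts[w]++
			}
		}
	}

	for w, n := range counts {
		fmt.Printf("%6d  %s\n", n, w)
	}
}
```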

What an average workflow looked like during the incident:

Cassandra network in/out dropped to baseline levels ~30 minutes before the incident:

Our rpc calls during the incident:

Our persistence latencies during the incident:

Our history service CPU and memory during the incident:

Our cluster information:

{
  "supportedClients": {
    "temporal-cli": "\u003c2.0.0",
    "temporal-go": "\u003c2.0.0",
    "temporal-java": "\u003c2.0.0",
    "temporal-php": "\u003c2.0.0",
    "temporal-server": "\u003c2.0.0",
    "temporal-typescript": "\u003c2.0.0"
  },
  "serverVersion": "1.14.0",
  "membershipInfo": {
    "currentHost": {
      "identity": "172.20.102.53:7233"
    },
    "reachableMembers": [
      "172.20.93.245:6934",
      "172.20.102.53:6933",
      "172.20.83.145:6934",
      "172.20.53.107:6934",
      "172.20.121.25:6934",
      "172.20.39.233:6934",
      "172.20.99.131:6935",
      "172.20.100.35:6939",
      "172.20.108.131:6934",
      "172.20.33.38:6933",
      "172.20.73.81:6939",
      "172.20.94.48:6934",
      "172.20.43.30:6935",
      "172.20.39.234:6934"
    ],
    "rings": [
      {
        "role": "frontend",
        "memberCount": 2,
        "members": [
          {
            "identity": "172.20.33.38:7233"
          },
          {
            "identity": "172.20.102.53:7233"
          }
        ]
      },
      {
        "role": "history",
        "memberCount": 8,
        "members": [
          {
            "identity": "172.20.108.131:7234"
          },
          {
            "identity": "172.20.93.245:7234"
          },
          {
            "identity": "172.20.121.25:7234"
          },
          {
            "identity": "172.20.83.145:7234"
          },
          {
            "identity": "172.20.53.107:7234"
          },
          {
            "identity": "172.20.39.234:7234"
          },
          {
            "identity": "172.20.39.233:7234"
          },
          {
            "identity": "172.20.94.48:7234"
          }
        ]
      },
      {
        "role": "matching",
        "memberCount": 2,
        "members": [
          {
            "identity": "172.20.99.131:7235"
          },
          {
            "identity": "172.20.43.30:7235"
          }
        ]
      },
      {
        "role": "worker",
        "memberCount": 2,
        "members": [
          {
            "identity": "172.20.73.81:7239"
          },
          {
            "identity": "172.20.100.35:7239"
          }
        ]
      }
    ]
  },
  "clusterId": "f2c3c77c-9e5b-4bdb-8ef7-38557729adcc",
  "clusterName": "active",
  "historyShardCount": 12288,
  "persistenceStore": "cassandra",
  "visibilityStore": "elasticsearch",
  "failoverVersionIncrement": "10",
  "initialFailoverVersion": "1"
}

Any ideas? I'm still trying to figure this out. I'm not expecting a straight answer, just a general direction I should look into.

Can you check and report the errors returned from your DB? (They could be wrapped errors that contain the underlying errors returned from the DB.)

Looking at the persistence latencies board you posted, they seem very high. My best guess, without seeing more concrete errors, is that your DB was having issues during the incident.
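If it helps, a minimal probe along these lines (using the gocql driver) can confirm whether Cassandra is reachable from the Temporal hosts and how long a trivial query takes. The contact point below is a placeholder, and auth/TLS settings would need to match your cluster; it only queries `system.local`, which exists on every Cassandra node.

```go
// cass_probe.go — hedged sketch: check Cassandra reachability and round-trip
// latency from a Temporal host. Contact point and settings are placeholders.
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("172.20.0.10") // placeholder contact point
	cluster.Keyspace = "system"                // system.local is present on every node
	cluster.Timeout = 5 * time.Second
	cluster.Consistency = gocql.LocalQuorum

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("connect failed: %v", err)
	}
	defer session.Close()

	var version string
	start := time.Now()
	if err := session.Query(`SELECT release_version FROM system.local`).Scan(&version); err != nil {
		log.Fatalf("query failed: %v", err)
	}
	fmt.Printf("Cassandra %s reachable, round trip %v\n", version, time.Since(start))
}
```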

We diagnosed it as network issues in our AWS cluster.