Temporal not writing to Cassandra?

Hello,

We ran into a weird situation with Temporal/Cassandra today. I was hoping you could point me in a direction to look for the cause. My own guess is that it's somehow related to Cassandra, but I'm not sure: I can connect to it and it seems operational.

The information I was able to collect:

We had two incidents: the first recovered without anyone noticing it, while the second resulted in us disabling Temporal and switching back to our old system.

I will show information from the second incident, since the two look similar.
During these incidents we saw a lot of errors, but all of them were ones we're familiar with; I had always assumed they are connected to shard rebalancing.

Mostly from the history service (one way to tally these messages is sketched after the list):

  • processor pump failed with error
  • Error updating ack level for shard
  • Error updating timer ack level for shard
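For context, here is a rough sketch of how one might count how often these messages show up in the history-service logs. It assumes the service emits JSON log lines with a `msg` field; the field name and the exact message wording are assumptions on my part, so adjust to whatever your log output actually looks like.

```go
// log_scan.go — hedged sketch: count occurrences of the shard-related error
// messages above in JSON-structured history-service logs read from stdin,
// e.g. `kubectl logs <history-pod> | go run log_scan.go`.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

func main() {
	wanted := []string{
		"processor pump failed with error",
		"Error updating ack level for shard",
		"Error updating timer ack level for shard",
	}
	counts := map[string]int{}

	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // allow long log lines

	for sc.Scan() {
		var entry map[string]any
		if err := json.Unmarshal(sc.Bytes(), &entry); err != nil {
			continue // skip lines that are not JSON
		}
		msg, _ := entry["msg"].(string) // "msg" field name is an assumption
		for _, w := range wanted {
			if strings.Contains(msg, w) {
				counts[w]++
			}
		}
	}

	for w, n := range counts {
		fmt.Printf("%6d  %s\n", n, w)
	}
}
```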

What an average workflow looked like during the incident:

Cassandra network in/out dropped to baseline levels ~30 minutes before the incident:

Our rpc calls during the incident:

Our persistence latencies during the incident:

Our history service CPU and memory during the incident:

Our cluster information:

{
  "supportedClients": {
    "temporal-cli": "\u003c2.0.0",
    "temporal-go": "\u003c2.0.0",
    "temporal-java": "\u003c2.0.0",
    "temporal-php": "\u003c2.0.0",
    "temporal-server": "\u003c2.0.0",
    "temporal-typescript": "\u003c2.0.0"
  },
  "serverVersion": "1.14.0",
  "membershipInfo": {
    "currentHost": {
      "identity": "172.20.102.53:7233"
    },
    "reachableMembers": [
      "172.20.93.245:6934",
      "172.20.102.53:6933",
      "172.20.83.145:6934",
      "172.20.53.107:6934",
      "172.20.121.25:6934",
      "172.20.39.233:6934",
      "172.20.99.131:6935",
      "172.20.100.35:6939",
      "172.20.108.131:6934",
      "172.20.33.38:6933",
      "172.20.73.81:6939",
      "172.20.94.48:6934",
      "172.20.43.30:6935",
      "172.20.39.234:6934"
    ],
    "rings": [
      {
        "role": "frontend",
        "memberCount": 2,
        "members": [
          {
            "identity": "172.20.33.38:7233"
          },
          {
            "identity": "172.20.102.53:7233"
          }
        ]
      },
      {
        "role": "history",
        "memberCount": 8,
        "members": [
          {
            "identity": "172.20.108.131:7234"
          },
          {
            "identity": "172.20.93.245:7234"
          },
          {
            "identity": "172.20.121.25:7234"
          },
          {
            "identity": "172.20.83.145:7234"
          },
          {
            "identity": "172.20.53.107:7234"
          },
          {
            "identity": "172.20.39.234:7234"
          },
          {
            "identity": "172.20.39.233:7234"
          },
          {
            "identity": "172.20.94.48:7234"
          }
        ]
      },
      {
        "role": "matching",
        "memberCount": 2,
        "members": [
          {
            "identity": "172.20.99.131:7235"
          },
          {
            "identity": "172.20.43.30:7235"
          }
        ]
      },
      {
        "role": "worker",
        "memberCount": 2,
        "members": [
          {
            "identity": "172.20.73.81:7239"
          },
          {
            "identity": "172.20.100.35:7239"
          }
        ]
      }
    ]
  },
  "clusterId": "f2c3c77c-9e5b-4bdb-8ef7-38557729adcc",
  "clusterName": "active",
  "historyShardCount": 12288,
  "persistenceStore": "cassandra",
  "visibilityStore": "elasticsearch",
  "failoverVersionIncrement": "10",
  "initialFailoverVersion": "1"
}

Any ideas? I'm still trying to figure this out. I'm not expecting a straight answer, just a general direction I should look into.

Can you check and report the errors returned from your DB? (They could be wrapped errors that contain the underlying errors returned from the DB.)

Looking at the persistence latencies board you posted, they seem very high. My best guess, without seeing more concrete errors, is that your DB was having issues during the incident.
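If it helps, a minimal probe along these lines (using the gocql driver) can confirm whether Cassandra is reachable from the Temporal hosts and how long a trivial query takes. The contact point below is a placeholder, and auth/TLS settings would need to match your cluster; it only queries `system.local`, which exists on every Cassandra node.

```go
// cass_probe.go — hedged sketch: check Cassandra reachability and round-trip
// latency from a Temporal host. Contact point and settings are placeholders.
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("172.20.0.10") // placeholder contact point
	cluster.Keyspace = "system"                // system.local is present on every node
	cluster.Timeout = 5 * time.Second
	cluster.Consistency = gocql.LocalQuorum

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("connect failed: %v", err)
	}
	defer session.Close()

	var version string
	start := time.Now()
	if err := session.Query(`SELECT release_version FROM system.local`).Scan(&version); err != nil {
		log.Fatalf("query failed: %v", err)
	}
	fmt.Printf("Cassandra %s reachable, round trip %v\n", version, time.Since(start))
}
```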

We diagnosed it as network issues in our AWS cluster.