Hello,
We had a weird situation concerning Temporal/Cassandra today. I was hoping you could point me in a direction where I could look for a cause. My own guess is that it’s somehow related to Cassandra, but I’m not sure: I can connect to it and it seems operational.
The information I was able to collect:
We had two incidents. The first one recovered without anyone noticing it; the second one resulted in us disabling Temporal and switching back to our old system.
I will be showing info from the second incident, since the two seem similar.
During these incidents we had a lot of errors, but all of them are ones we’re familiar with. I always assumed they were connected to shard rebalancing.
They came mostly from the history service:
- processor pump failed with error
- Error updating ack level for shard
- Error updating timer ack level for shard
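To check whether these errors spike around the incident or are just the usual background noise, something like the following could tally them from the history service logs. This is only a minimal sketch: it matches the three messages by plain substring and assumes nothing about the log format, and the function name `tally_errors` and the sample lines are made up for illustration.

```python
from collections import Counter

# The three messages quoted above; matching is by plain substring.
KNOWN_ERRORS = (
    "processor pump failed with error",
    "Error updating ack level for shard",
    "Error updating timer ack level for shard",
)


def tally_errors(lines):
    """Count how many log lines contain each known error message."""
    counts = Counter({msg: 0 for msg in KNOWN_ERRORS})
    for line in lines:
        lowered = line.lower()
        for msg in KNOWN_ERRORS:
            if msg.lower() in lowered:
                counts[msg] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines, not real output from our cluster.
    sample = [
        "history: processor pump failed with error",
        "history: Error updating timer ack level for shard",
        "history: processor pump failed with error",
        "frontend: request completed",
    ]
    for msg, n in tally_errors(sample).items():
        print(n, msg)
```

Running the same tally over a quiet period and over the incident window would show whether the error rate actually changed.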
What an average workflow looked like during the incident:
Cassandra network in/out dropped to baseline levels ~30 minutes before the incident:
Our RPC calls during the incident:
Our persistence latencies during the incident:
Our history service CPU and memory during the incident:
Our cluster information:
{
  "supportedClients": {
    "temporal-cli": "\u003c2.0.0",
    "temporal-go": "\u003c2.0.0",
    "temporal-java": "\u003c2.0.0",
    "temporal-php": "\u003c2.0.0",
    "temporal-server": "\u003c2.0.0",
    "temporal-typescript": "\u003c2.0.0"
  },
  "serverVersion": "1.14.0",
  "membershipInfo": {
    "currentHost": {
      "identity": "172.20.102.53:7233"
    },
    "reachableMembers": [
      "172.20.93.245:6934",
      "172.20.102.53:6933",
      "172.20.83.145:6934",
      "172.20.53.107:6934",
      "172.20.121.25:6934",
      "172.20.39.233:6934",
      "172.20.99.131:6935",
      "172.20.100.35:6939",
      "172.20.108.131:6934",
      "172.20.33.38:6933",
      "172.20.73.81:6939",
      "172.20.94.48:6934",
      "172.20.43.30:6935",
      "172.20.39.234:6934"
    ],
    "rings": [
      {
        "role": "frontend",
        "memberCount": 2,
        "members": [
          {
            "identity": "172.20.33.38:7233"
          },
          {
            "identity": "172.20.102.53:7233"
          }
        ]
      },
      {
        "role": "history",
        "memberCount": 8,
        "members": [
          {
            "identity": "172.20.108.131:7234"
          },
          {
            "identity": "172.20.93.245:7234"
          },
          {
            "identity": "172.20.121.25:7234"
          },
          {
            "identity": "172.20.83.145:7234"
          },
          {
            "identity": "172.20.53.107:7234"
          },
          {
            "identity": "172.20.39.234:7234"
          },
          {
            "identity": "172.20.39.233:7234"
          },
          {
            "identity": "172.20.94.48:7234"
          }
        ]
      },
      {
        "role": "matching",
        "memberCount": 2,
        "members": [
          {
            "identity": "172.20.99.131:7235"
          },
          {
            "identity": "172.20.43.30:7235"
          }
        ]
      },
      {
        "role": "worker",
        "memberCount": 2,
        "members": [
          {
            "identity": "172.20.73.81:7239"
          },
          {
            "identity": "172.20.100.35:7239"
          }
        ]
      }
    ]
  },
  "clusterId": "f2c3c77c-9e5b-4bdb-8ef7-38557729adcc",
  "clusterName": "active",
  "historyShardCount": 12288,
  "persistenceStore": "cassandra",
  "visibilityStore": "elasticsearch",
  "failoverVersionIncrement": "10",
  "initialFailoverVersion": "1"
}
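For scale: 12288 history shards across 8 history hosts works out to roughly 1536 shards per host, assuming the shards are spread evenly. A minimal sketch of how that could be derived from the output above is below. The function name `summarize_cluster` is made up for illustration, and the input is a trimmed copy of the JSON that keeps only the fields the calculation needs.

```python
import json


def summarize_cluster(info):
    """Return ring sizes by role and the average shards per history host."""
    rings = {
        ring["role"]: ring["memberCount"]
        for ring in info["membershipInfo"]["rings"]
    }
    history_hosts = rings.get("history", 0)
    # Assumes an even spread of shards; None if there are no history hosts.
    per_host = info["historyShardCount"] / history_hosts if history_hosts else None
    return rings, per_host


if __name__ == "__main__":
    # Trimmed copy of the cluster description above (member lists omitted).
    raw = """
    {
      "membershipInfo": {
        "rings": [
          {"role": "frontend", "memberCount": 2},
          {"role": "history", "memberCount": 8},
          {"role": "matching", "memberCount": 2},
          {"role": "worker", "memberCount": 2}
        ]
      },
      "historyShardCount": 12288
    }
    """
    rings, per_host = summarize_cluster(json.loads(raw))
    print(rings)
    print("shards per history host:", per_host)
```

If a history host restarts or drops out of the ring, its share of shards has to be picked up by the remaining hosts, which is why I suspected rebalancing in the first place.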
Any ideas? Trying to figure this out. I’m not expecting a straight answer, just a general direction I should look into.