Hello!
We're setting up a Temporal cluster in our staging Kubernetes environment for some prototyping, and I'm running into some issues. I've found similar topics and implemented config changes based on them, but the issue persists.
To summarize the questions (details below):
- Ring membership isn't working. Connectivity between pods has been verified and the cluster_membership table is being updated with reachable IPs, so I'm not sure where to look next. Thoughts?
- Are the persistence max QPS errors related, or a separate issue?
- If they're not related to the above, why is the history service in a restart loop?
First, our setup:
- MySQL database hosted in RDS
- 3 pods, each running the history, matching, frontend, and worker services, launched from the temporalio/server:1.8.2 Docker image via the ./start.sh command (see the launch sketch after this list)
- BIND_ON_IP: 0.0.0.0
- TEMPORAL_BROADCAST_IP: IP of the pod within our cluster
- NUM_HISTORY_SHARDS: 4096
- Ports exposed from each pod: 6933, 6934, 6935, 6939, 7233, 7234, 7235, 7239
- Temporal Web deployed to a single pod, using the temporalio/web:1.8.1 image via npm run start
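In other words, each pod's launch boils down to something like this (a sketch; POD_IP stands in for however the pod's own IP is injected into the container, e.g. via the Kubernetes downward API):

# Sketch of the per-pod launch, assuming the pod's own IP is available as POD_IP.
export BIND_ON_IP=0.0.0.0                 # listen on all interfaces
export TEMPORAL_BROADCAST_IP="${POD_IP}"  # the pod's own routable IP, advertised to ring members
export NUM_HISTORY_SHARDS=4096
exec ./start.sh                           # start command from the temporalio/server:1.8.2 image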
We’ve verified connectivity on these ports from other pods in the cluster using telnet.
Temporal Web loads and doesn't show any errors (so presumably it can communicate with the frontend).
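For reference, a scripted version of the telnet check, run from another pod against one of the pod IPs in the logs below (assuming nc is available in the image), looks like this:

for port in 6933 6934 6935 6939 7233 7234 7235 7239; do
  # -z: only test that the port accepts connections; -w 2: two-second timeout
  nc -z -w 2 100.99.49.83 "$port" && echo "port $port open" || echo "port $port closed"
done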
First problem: the history service is in a restart loop.
We are seeing the following errors:
"msg":"unable to bootstrap ringpop. retrying","service":"history","error":"join duration of 41.600341725s exceeded max 30s",
Second problem (probably related): even though the bootstrap hosts for the cluster are fetched correctly, each pod only sees itself. This happens for all services, which is likely a problem too; only history enters a restart loop, though.
"bootstrap hosts fetched","service":"worker","bootstrap-hostports":"100.98.189.48:6935,100.99.41.111:6934,...etc"
"msg":"Current reachable members","service":"worker","component":"service-resolver","service":"worker","addresses":["100.99.41.111:7239"]
"msg":"Current reachable members","service":"worker","component":"service-resolver","service":"matching","addresses":["100.99.41.111:7235"]
"msg":"Current reachable members","service":"matching","component":"service-resolver","service":"worker","addresses":["100.99.16.33:7239"]
"msg":"Current reachable members","service":"frontend","component":"service-resolver","service":"worker","addresses":["100.99.16.33:7239"]
...etc
Looking in our database, I see the cluster_membership table getting updated with heartbeats frequently, as expected.
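That check is just a direct query against the table (the endpoint, user, and database name here are placeholders for our RDS settings):

mysql -h <rds-endpoint> -u <user> -p temporal -e 'SELECT * FROM cluster_membership;'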
Jumping onto the pod and describing the cluster, I see similar:
bash-5.0# /usr/local/bin/tctl --ad localhost:7233 adm cl describe
{
"supportedClients": {
"temporal-cli": "\u003c2.0.0",
"temporal-go": "\u003c2.0.0",
"temporal-java": "\u003c2.0.0",
"temporal-server": "\u003c2.0.0"
},
"serverVersion": "1.8.2",
"membershipInfo": {
"currentHost": {
"identity": "100.99.49.83:7233"
},
"reachableMembers": [
"100.99.49.83:6933",
"100.99.49.83:6939",
"100.99.49.83:6934",
"100.99.49.83:6935"
],
"rings": [
{
"role": "frontend",
"memberCount": 1,
"members": [
{
"identity": "100.99.49.83:7233"
}
]
},
{
"role": "history",
"memberCount": 1,
"members": [
{
"identity": "100.99.49.83:7234"
}
]
},
{
"role": "matching",
"memberCount": 1,
"members": [
{
"identity": "100.99.49.83:7235"
}
]
},
{
"role": "worker",
"memberCount": 1,
"members": [
{
"identity": "100.99.49.83:7239"
}
]
}
]
}
}
I believe the following errors are what cause the restart loop.
"msg":"Error updating ack level for shard","service":"history","shard-id":959,"address":"100.99.16.33:7234","shard-item":"0xc037562f00","component":"visibility-queue-processor","error":"Failed to update shard. Previous range ID: 471; new range ID: 473","operation-result":"OperationFailed"
...
"msg":"Error updating timer ack level for shard","service":"history","shard-id":3076,"address":"100.99.16.33:7234","shard-item":"0xc04bbc2580","component":"timer-queue-processor","cluster-name":"active","error":"Failed to update shard. Previous range ID: 488; new range ID: 490"
...
"msg":"Error updating timer ack level for shard","service":"history","shard-id":3122,"address":"100.99.41.113:7234","shard-item":"0xc0327b0c00","component":"timer-queue-processor","cluster-name":"active","error":"Failed to update shard. Previous range ID: 487; new range ID: 489"
...
"msg":"Unable to create history shard engine","service":"history","component":"shard-controller","address":"100.99.49.83:7234","error":"Persistence Max QPS Reached.","operation-result":"OperationFailed","shard-id":3375,"
"msg":"Persistent store operation failure","service":"history","shard-id":3384,"address":"100.99.49.83:7234","shard-item":"0xc03dc35a80","store-operation":"update-shard","error":"Persistence Max QPS Reached.","shard-range-id":492,"previous-shard-range-id":491
Note: the address field in these logs is the address of the pod emitting the log. Does that mean it's trying to communicate with itself and failing?
I saw in another thread that the "Persistence Max QPS" error could indicate the DB is overloaded, but we're running an R5 large instance and it has zero traffic besides the baseline from these services starting up. CPU sits at 18-25% with plenty of free memory. Based on similar threads about Cadence, I'm guessing this is caused by history running in a restart loop and is a secondary problem.
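For context, my understanding (possibly wrong) is that "Persistence Max QPS Reached" comes from Temporal's own per-host persistence rate limiter, which is a dynamic config value rather than anything on the RDS side. If it did need raising, I'd expect the override to look roughly like this; the key name, value, and file path below are assumptions I haven't verified against 1.8.2:

# Append an override to whatever dynamic config file the server is pointed at
# (path and value here are placeholders for illustration only).
cat >> config/dynamicconfig/development.yaml <<'EOF'
history.persistenceMaxQPS:
- value: 12000
  constraints: {}
EOF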
Any suggestions are greatly appreciated.
Thanks!