Crash loop of history service in K8s cluster

Hello!
We're getting a Temporal cluster set up in our staging Kubernetes environment for some prototyping, and I'm running into some issues. I've found similar topics and implemented config changes based on them, but the issue persists.
Prior art:

To summarize the questions up front (details below):

  • Ring membership isn't working. Connectivity between pods has been verified, and the cluster_membership table is being updated with reachable IPs, so I'm not sure where to look next. Thoughts?
  • Are the "Persistence Max QPS" errors related, or a separate issue?
  • If they're not related to the above, why is history in a restart loop?

First, our setup:

  • MySQL database hosted in RDS
  • 3 pods, each running the history, matching, frontend, and worker services, launched from the Docker image temporalio/server:1.8.2 via the ./start.sh command.
  • BIND_ON_IP: 0.0.0.0
  • TEMPORAL_BROADCAST_IP: IP of the pod within our cluster
  • NUM_HISTORY_SHARDS: 4096
  • Ports exposed from each pod: 6933, 6934, 6935, 6939, 7233, 7234, 7235, 7239
  • Temporal web deployed to a single pod, using the temporalio/web:1.8.1 image via npm run start

We’ve verified connectivity on these ports from other pods in the cluster using telnet.
Temporal web loads and doesn't show any errors (so presumably it can communicate with the frontend).
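
For reference, the relevant part of our Deployment spec looks roughly like the sketch below. This is a trimmed illustration - the Deployment/label names and the downward-API wiring for the pod IP are my shorthand, but the env vars and ports match the list above.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: temporal-server              # illustrative name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: temporal-server
  template:
    metadata:
      labels:
        app: temporal-server
    spec:
      containers:
        - name: temporal
          image: temporalio/server:1.8.2
          env:
            - name: BIND_ON_IP
              value: "0.0.0.0"
            - name: TEMPORAL_BROADCAST_IP   # pod IP, injected via the downward API
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: NUM_HISTORY_SHARDS
              value: "4096"
          ports:
            - containerPort: 7233           # frontend gRPC
            - containerPort: 6933           # frontend membership
            - containerPort: 7234           # history gRPC
            - containerPort: 6934           # history membership
            - containerPort: 7235           # matching gRPC
            - containerPort: 6935           # matching membership
            - containerPort: 7239           # worker gRPC
            - containerPort: 6939           # worker membership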

One problem: history service is in a restart loop.

We are seeing the following errors:

"msg":"unable to bootstrap ringpop. retrying","service":"history","error":"join duration of 41.600341725s exceeded max 30s",

Second problem (probably related): even though the bootstrap hosts are fetched correctly, each pod only sees itself as a reachable member. This happens for all services, which is likely a problem too, though only history enters a restart loop.

"bootstrap hosts fetched","service":"worker","bootstrap-hostports":"100.98.189.48:6935,100.99.41.111:6934,...etc"
"msg":"Current reachable members","service":"worker","component":"service-resolver","service":"worker","addresses":["100.99.41.111:7239"]
"msg":"Current reachable members","service":"worker","component":"service-resolver","service":"matching","addresses":["100.99.41.111:7235"]
"msg":"Current reachable members","service":"matching","component":"service-resolver","service":"worker","addresses":["100.99.16.33:7239"]
"msg":"Current reachable members","service":"frontend","component":"service-resolver","service":"worker","addresses":["100.99.16.33:7239"]
...etc

Looking in our database, I see the cluster_membership table getting updated with heartbeats frequently, as expected.

Jumping onto a pod and describing the cluster, I see something similar:

bash-5.0# /usr/local/bin/tctl --ad localhost:7233 adm cl describe
{
  "supportedClients": {
    "temporal-cli": "\u003c2.0.0",
    "temporal-go": "\u003c2.0.0",
    "temporal-java": "\u003c2.0.0",
    "temporal-server": "\u003c2.0.0"
  },
  "serverVersion": "1.8.2",
  "membershipInfo": {
    "currentHost": {
      "identity": "100.99.49.83:7233"
    },
    "reachableMembers": [
      "100.99.49.83:6933",
      "100.99.49.83:6939",
      "100.99.49.83:6934",
      "100.99.49.83:6935"
    ],
    "rings": [
      {
        "role": "frontend",
        "memberCount": 1,
        "members": [
          {
            "identity": "100.99.49.83:7233"
          }
        ]
      },
      {
        "role": "history",
        "memberCount": 1,
        "members": [
          {
            "identity": "100.99.49.83:7234"
          }
        ]
      },
      {
        "role": "matching",
        "memberCount": 1,
        "members": [
          {
            "identity": "100.99.49.83:7235"
          }
        ]
      },
      {
        "role": "worker",
        "memberCount": 1,
        "members": [
          {
            "identity": "100.99.49.83:7239"
          }
        ]
      }
    ]
  }
}

I believe the following errors are what cause the restart loop.

"msg":"Error updating ack level for shard","service":"history","shard-id":959,"address":"100.99.16.33:7234","shard-item":"0xc037562f00","component":"visibility-queue-processor","error":"Failed to update shard. Previous range ID: 471; new range ID: 473","operation-result":"OperationFailed"
...
"msg":"Error updating timer ack level for shard","service":"history","shard-id":3076,"address":"100.99.16.33:7234","shard-item":"0xc04bbc2580","component":"timer-queue-processor","cluster-name":"active","error":"Failed to update shard. Previous range ID: 488; new range ID: 490"
...
"msg":"Error updating timer ack level for shard","service":"history","shard-id":3122,"address":"100.99.41.113:7234","shard-item":"0xc0327b0c00","component":"timer-queue-processor","cluster-name":"active","error":"Failed to update shard. Previous range ID: 487; new range ID: 489"
...
"msg":"Unable to create history shard engine","service":"history","component":"shard-controller","address":"100.99.49.83:7234","error":"Persistence Max QPS Reached.","operation-result":"OperationFailed","shard-id":3375,"
"msg":"Persistent store operation failure","service":"history","shard-id":3384,"address":"100.99.49.83:7234","shard-item":"0xc03dc35a80","store-operation":"update-shard","error":"Persistence Max QPS Reached.","shard-range-id":492,"previous-shard-range-id":491

Note: the address field in these logs is the address of the pod emitting the log. Does that mean it's trying to communicate with itself and failing?
I saw in another thread that the "Persistence Max QPS" error could indicate the DB is overloaded, but we're running an R5 large instance with zero traffic besides the baseline from these services starting up. CPU is at 18-25%, with plenty of free memory. Based on similar threads about Cadence, I'm guessing this is caused by history running in a restart loop and is a secondary problem.

Any suggestions are greatly appreciated.

Thanks!

hi, can you try deploying Temporal with one service per pod, e.g.
pod 1 deployed with the history service
pod 2 deployed with the matching service
pod 3 deployed with the frontend service

by
./start.sh --service=history
./start.sh --service=matching
./start.sh --service=frontend

?

Hi Wenquan_Xing,
Thanks for the reply. Would that not defeat the purpose of having a ring / cluster membership? I know that in a more ideal setup we would have separate pods for history, matching, and frontend, but I would still want multiple instances of those pods to provide redundancy. Would deploying a single service per pod, one pod each, help diagnose the issue?

Thanks,
Mike

probably a miscommunication.

please deploy Temporal with one service per pod.

there can be multiple history pods, matching pods, frontend pods and worker pods (Temporal server worker pods, not talking about customer’s SDK / poller)

saying the above since, without specifying a service, the start script will run all 4 services (frontend / matching / history / worker) within a single pod.
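
for example, in k8s terms the history Deployment just passes the flag to the start script - roughly like this sketch (names are illustrative, adjust to your setup):

      containers:
        - name: temporal-history
          image: temporalio/server:1.8.2
          command: ["./start.sh", "--service=history"]   # run only the history service in this pod

and likewise --service=matching / --service=frontend / --service=worker in their own Deployments.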

Gotcha. Is the theory that the services can't coexist in the same pod when there are multiple pods? Since they operate on separate ports, I assumed negative interactions were unlikely.

Edit: As I understood it, the recommendation to deploy a single service per pod is motivated by scaling and clarity, not by issues with joining the cluster. Is that not correct?

this is correct.


for now, try NOT to deploy 4 services within a pod, since I suspect there are some weird issues, like what you have seen

Ok. While I agree that it's certainly best practice to have a single service per pod, this seems tangential to the ability to join a cluster. I'll go ahead and make these changes tomorrow and report back.

Just a quick update - I have the services separated and running in our environment; I'm just waiting for our infra team to open the proper ports on each pod. I'll report back.

We have all the services running independently now. Each service is configured with the pod IP as the broadcastAddress (as per the prior art: Unable to bootstrap ringpop).

Logs for each look like

{"level":"info","ts":"2021-04-28T20:54:22.043Z","msg":"Membership heartbeat upserted successfully","service":"history","address":"100.99.199.16","port":6934,"hostId":"e8783778-a863-11eb-bffb-f611ef047b3b","logging-call-at":"rpMonitor.go:222"}
{"level":"info","ts":"2021-04-28T20:54:22.045Z","msg":"bootstrap hosts fetched","service":"history","bootstrap-hostports":"100.99.198.10:6933,100.99.199.6:6939,100.99.155.32:6933,100.99.199.16:6934","logging-call-at":"rpMonitor.go:263"}
{"level":"error","ts":"2021-04-28T20:54:59.323Z","msg":"unable to bootstrap ringpop. retrying","service":"history","error":"join duration of 37.278092306s exceeded max 30s","logging-call-at":"ringpop.go:114","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/common/membership.(*RingPop).bootstrap\n\t/temporal/common/membership/ringpop.go:114\ngo.temporal.io/server/common/membership.(*RingPop).Start\n\t/temporal/common/membership/ringpop.go:83\ngo.temporal.io/server/common/membership.(*ringpopMonitor).Start\n\t/temporal/common/membership/rpMonitor.go:120\ngo.temporal.io/server/common/resource.(*Impl).Start\n\t/temporal/common/resource/resourceImpl.go:371\ngo.temporal.io/server/service/history.(*Service).Start\n\t/temporal/service/history/service.go:149\ngo.temporal.io/server/temporal.(*Server).Start.func1\n\t/temporal/temporal/server.go:192"}
{"level":"error","ts":"2021-04-28T20:40:21.544Z","msg":"unable to bootstrap ringpop. retrying","service":"matching","error":"join duration of 38.590264357s exceeded max 30s","logging-call-at":"ringpop.go:114","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/common/membership.(*RingPop).bootstrap\n\t/temporal/common/membership/ringpop.go:114\ngo.temporal.io/server/common/membership.(*RingPop).Start\n\t/temporal/common/membership/ringpop.go:83\ngo.temporal.io/server/common/membership.(*ringpopMonitor).Start\n\t/temporal/common/membership/rpMonitor.go:120\ngo.temporal.io/server/common/resource.(*Impl).Start\n\t/temporal/common/resource/resourceImpl.go:371\ngo.temporal.io/server/service/matching.(*Service).Start\n\t/temporal/service/matching/service.go:102\ngo.temporal.io/server/temporal.(*Server).Start.func1\n\t/temporal/temporal/server.go:192"}
{"level":"error","ts":"2021-04-28T20:40:28.418Z","msg":"unable to bootstrap ringpop. retrying","service":"matching","error":"join duration of 42.19803248s exceeded max 30s","logging-call-at":"ringpop.go:114","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/common/membership.(*RingPop).bootstrap\n\t/temporal/common/membership/ringpop.go:114\ngo.temporal.io/server/common/membership.(*RingPop).Start\n\t/temporal/common/membership/ringpop.go:83\ngo.temporal.io/server/common/membership.(*ringpopMonitor).Start\n\t/temporal/common/membership/rpMonitor.go:120\ngo.temporal.io/server/common/resource.(*Impl).Start\n\t/temporal/common/resource/resourceImpl.go:371\ngo.temporal.io/server/service/matching.(*Service).Start\n\t/temporal/service/matching/service.go:102\ngo.temporal.io/server/temporal.(*Server).Start.func1\n\t/temporal/temporal/server.go:192"}

etc.

We can telnet from one host to another using the IPs and membership ports. Not sure what to try next. Our outbound requests go through istio, but since we're using IP addresses directly, nothing there should interfere.

Note: the destination does shut down the connection after telnet connects…

root@lmtest-5f8cb5c779-6gkzj:/# telnet 100.99.230.43 6934
Trying 100.99.230.43...
Connected to 100.99.230.43.
Escape character is '^]'.
Connection closed by foreign host.
root@lmtest-5f8cb5c779-6gkzj:/# telnet 100.99.230.43 6934
Trying 100.99.230.43...
Connected to 100.99.230.43.
Escape character is '^]'.
Connection closed by foreign host.

looks like from ringpop's point of view, other hosts are not reachable
ref: temporal/ringpop.go at 30c89b9f91dafc13d2d75fda266d81ce8bc72f3e · temporalio/temporal · GitHub

can you provide the membership and rpc config sections?

Hey,
Sure thing!

global:
    membership:
        maxJoinDuration: 30s
        broadcastAddress: "100.99.230" # this is the pod IP
....
services:
    frontend:
        rpc:
            grpcPort: 7233
            membershipPort: 6933
            bindOnIP: 0.0.0.0

    matching:
        rpc:
            grpcPort: 7235
            membershipPort: 6935
            bindOnIP: 0.0.0.0

    history:
        rpc:
            grpcPort: 7234
            membershipPort: 6934
            bindOnIP: 0.0.0.0

    worker:
        rpc:
            grpcPort: 7239
            membershipPort: 6939
            bindOnIP: 0.0.0.0

First, make sure this is a valid IP.

I actually checked our prod cluster, the config looks the same.

so not sure what is happening…

how do you do the deployment? one pod at a time? one pod per service at a time?

Hi,
It's valid - the telnet check was just to show the IP and port were open. We do rolling deployments: one new pod comes up at a time, then one old pod goes down, until the target is reached. I tested both destroying all the pods and starting from zero, and redeploying alongside existing pods; same results. Because these are in a slow crash loop, our k8s system also periodically replaces pods that have exited too many times. No difference in behavior in any of these scenarios. I suspect it must be something in our network that's blocking ringpop, but I see no evidence in our istio/etc. logs…

one thing maybe worth trying: shut down all Temporal hosts, delete all data from the cluster_membership table, and re-do the deployment?

Ok, I'll try that in the morning. When you say all Temporal hosts, you mean all components (history, matching, etc.), right?

yes, basically shut down the entire Temporal cluster and clean out that cluster_membership table

Same results. Since our configuration is the same as yours, I assume the likeliest answer is something different about our networking. Do you have any suggested debugging tools for ringpop? Any connectivity-checker CLIs or something?

something worth trying

npm install -g tcurl@4.22.0
cat hosts.json                      
[
    "127.0.0.1:6939"
]
tcurl ringpop -P hosts.json /admin/lookup '{"key": "frontend"}'
tcurl ringpop -P hosts.json /admin/lookup '{"key": "matching"}'
tcurl ringpop -P hosts.json /admin/lookup '{"key": "history"}' 

example output

tcurl ringpop -P hosts.json /admin/lookup '{"key": "history"}' 
{"ok":true,"head":null,"body":{"dest":"127.0.0.1:6833"},"headers":{"as":"json"},"trace":"264ffa2a2d6a2bd9"}

That was helpful. We have everything working - thanks for the assistance and tips.

For future readers: it turns out we did have an issue with our network cutting off direct IP:port connections. We needed to add some headless Kubernetes Services to get around istio.
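
A headless Service is just a regular Service with clusterIP: None, so DNS resolves directly to the pod IPs; as we understand it, declaring the membership and gRPC ports this way lets the istio sidecars pass the direct pod-IP traffic through instead of resetting it. Roughly what we added for history - treat the names, labels, and port names as illustrative, only the port numbers match our config above:

apiVersion: v1
kind: Service
metadata:
  name: temporal-history-headless   # illustrative name
spec:
  clusterIP: None                   # headless: no virtual IP, DNS returns the pod IPs
  selector:
    app: temporal-history           # must match the history Deployment's pod labels
  ports:
    - name: tcp-membership
      port: 6934
      targetPort: 6934
    - name: grpc-rpc
      port: 7234
      targetPort: 7234

Plus similar Services for frontend (6933/7233), matching (6935/7235), and worker (6939/7239).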
