Matching service start/stop loop in production deployment

Hi all,

I’m having trouble getting Temporal running in our production environment. I’m currently running all 4 services in a single container per k8s pod, with 2 pods running. I’ve opened the exact same ports as the Docker container, and I’m using RDS as the backing datastore.

The temporal-server containers simply loop, seemingly stuck with the following error message:

{"level":"error","ts":"2020-11-03T23:41:56.067Z","msg":"Error updating timer ack level for shard","service":"history","shard-id":4,"address":"","shard-item":"0xc0013dcd00","component":"timer-queue-processor","cluster-name":"active","error":"Failed to update shard. Previous range ID: 5717; new range ID: 5718","logging-call-at":"timerQueueAckMgr.go:391","stacktrace":"*loggerImpl).Error\n\t/temporal/common/log/loggerimpl/logger.go:138\*timerQueueAckMgrImpl).updateAckLevel\n\t/temporal/service/history/timerQueueAckMgr.go:391\*timerQueueProcessorBase).internalProcessor\n\t/temporal/service/history/timerQueueProcessorBase.go:319\*timerQueueProcessorBase).processorPump\n\t/temporal/service/history/timerQueueProcessorBase.go:194"}

I also see persistence failures:

{"level":"error","ts":"2020-11-03T23:41:10.869Z","msg":"Persistent store operation failure","service":"matching","component":"matching-engine","wf-task-queue-name":"/_sys/temporal-sys-tq-scanner-taskqueue-0/2","wf-task-queue-type":"Activity","store-operation":"update-task-queue","error":"Persistence Max QPS Reached.","logging-call-at":"taskReader.go:166","stacktrace":"*loggerImpl).Error\n\t/temporal/common/log/loggerimpl/logger.go:138\*taskReader).getTasksPump\n\t/temporal/service/matching/taskReader.go:166"}

I’m also seeing the following error, though only rarely:

{"level":"error","ts":"2020-11-03T23:18:01.139Z","msg":"Error looking up host for shardID","service":"history","component":"shard-controller","address":"","error":"Not enough hosts to serve the request","operation-result":"OperationFailed","shard-id":1,"logging-call-at":"shardController.go:343","stacktrace":"*loggerImpl).Error\n\t/temporal/common/log/loggerimpl/logger.go:138\*shardController).acquireShards.func1\n\t/temporal/service/history/shardController.go:343"}

The frontend API isn’t responding to tctl requests, so I don’t have much to debug with right now. The frontend does show the gRPC listener coming up, though.

Not sure how to proceed without digging into the code; any ideas? I’d also be curious whether there is a way to inspect the cluster/gossip state directly from the database.
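Edit: one thing I noticed is that the v1.x default SQL schema appears to include a `cluster_membership` table that records host heartbeats, which would let you inspect the ring straight from RDS. Treat the table/column names below as assumptions and verify them against your schema version:

```sql
-- Inspect membership heartbeats directly in the persistence DB.
-- Column names are from the v1.x default SQL schema and may differ
-- in other versions; check your schema before relying on this.
SELECT host_id, rpc_address, rpc_port, role, last_heartbeat
FROM cluster_membership
ORDER BY last_heartbeat DESC;
```

Hosts that have stopped heartbeating recently should stand out by their stale `last_heartbeat` values.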

This huge range ID may indicate that your membership ring (frontend / matching / history) is not stable: one host may be stealing the shard from another, like a ping-pong game.

This means that the persistence layer is under high load.
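Note that “Persistence Max QPS Reached” is the server’s own rate limiter tripping, not RDS itself rejecting queries; the per-service limits can be raised through dynamic config. A sketch, assuming the v1.x dynamic config file format and key names (verify both against your server version):

```yaml
# dynamic config file: raise the per-service persistence rate limits.
# Key names assumed from Temporal v1.x; check your version's defaults
# before changing these, and size them against what RDS can sustain.
matching.persistenceMaxQPS:
  - value: 6000
    constraints: {}
history.persistenceMaxQPS:
  - value: 9000
    constraints: {}
```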

Can you share the service config? Also make sure the services (frontend / matching / history) can talk to each other.
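A quick way to check that from inside one pod, using only bash’s built-in `/dev/tcp` (the peer IP `10.0.1.23` below is a hypothetical placeholder; substitute the other pod’s address):

```shell
# check_port: report whether host:port accepts a TCP connection,
# using bash's /dev/tcp redirection so no extra tools are needed.
check_port() {
  local host=$1 port=$2
  if timeout 1 bash -c "echo > /dev/tcp/$host/$port" 2>/dev/null; then
    echo "port $port open"
  else
    echo "port $port CLOSED"
  fi
}

# Probe a peer pod's gRPC and membership ports (default port numbers
# from the Docker image; replace 10.0.1.23 with the real peer IP).
for port in 7233 7234 7235 7239 6933 6934 6935 6939; do
  check_port 10.0.1.23 "$port"
done
```

Any membership port showing CLOSED between pods will produce exactly the unstable-ring symptoms above.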

Thanks for the quick turnaround! It looks like the issue was that my temporal-server peers couldn’t reach each other. I was able to resolve it by setting the broadcast address to each host’s public IP and adjusting the bind address, and by opening up additional ports for peer-to-peer membership communication.
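For anyone hitting the same thing, here’s roughly the shape of the config change (key names from memory of the v1.x server config, and the IP is a hypothetical placeholder, so double-check against the config docs):

```yaml
# Server config sketch: advertise a reachable address and open the
# membership port so peer pods can join the ring.
global:
  membership:
    broadcastAddress: "10.0.1.23"  # assumption: this pod's reachable IP
services:
  frontend:
    rpc:
      grpcPort: 7233
      membershipPort: 6933   # must be reachable by peer pods
      bindOnIP: "0.0.0.0"
  # history / matching / worker follow the same pattern
  # with their own grpc and membership ports.
```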

I found some useful info deep in the support tickets; linking it here for others: