Hi all,
I’m having trouble getting Temporal to run in our production environment. I’m currently running all four services in a single container per Kubernetes pod, with 2 pods running, and I’ve exposed exactly the same ports as the Docker container. I’m using RDS as the backing datastore.
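For reference, here’s a simplified sketch of the Deployment (the image tag, resource names, and the secret holding the RDS connection settings are placeholders, not our actual values):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: temporal-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: temporal-server
  template:
    metadata:
      labels:
        app: temporal-server
    spec:
      containers:
        - name: temporal-server
          image: temporalio/server:1.3.0        # placeholder tag
          ports:
            - containerPort: 7233               # frontend gRPC
            - containerPort: 7234               # history gRPC
            - containerPort: 7235               # matching gRPC
            - containerPort: 7239               # worker gRPC
            - containerPort: 6933               # frontend membership
            - containerPort: 6934               # history membership
            - containerPort: 6935               # matching membership
            - containerPort: 6939               # worker membership
          envFrom:
            - secretRef:
                name: temporal-rds-credentials  # placeholder; holds the DB connection env vars
```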
The temporal-server containers appear to be stuck in a loop, repeatedly logging the following error:
{"level":"error","ts":"2020-11-03T23:41:56.067Z","msg":"Error updating timer ack level for shard","service":"history","shard-id":4,"address":"127.0.0.1:7234","shard-item":"0xc0013dcd00","component":"timer-queue-processor","cluster-name":"active","error":"Failed to update shard. Previous range ID: 5717; new range ID: 5718","logging-call-at":"timerQueueAckMgr.go:391","stacktrace":"go.temporal.io/server/common/log/loggerimpl.(*loggerImpl).Error\n\t/temporal/common/log/loggerimpl/logger.go:138\ngo.temporal.io/server/service/history.(*timerQueueAckMgrImpl).updateAckLevel\n\t/temporal/service/history/timerQueueAckMgr.go:391\ngo.temporal.io/server/service/history.(*timerQueueProcessorBase).internalProcessor\n\t/temporal/service/history/timerQueueProcessorBase.go:319\ngo.temporal.io/server/service/history.(*timerQueueProcessorBase).processorPump\n\t/temporal/service/history/timerQueueProcessorBase.go:194"}
I also see persistence failures:
{"level":"error","ts":"2020-11-03T23:41:10.869Z","msg":"Persistent store operation failure","service":"matching","component":"matching-engine","wf-task-queue-name":"/_sys/temporal-sys-tq-scanner-taskqueue-0/2","wf-task-queue-type":"Activity","store-operation":"update-task-queue","error":"Persistence Max QPS Reached.","logging-call-at":"taskReader.go:166","stacktrace":"go.temporal.io/server/common/log/loggerimpl.(*loggerImpl).Error\n\t/temporal/common/log/loggerimpl/logger.go:138\ngo.temporal.io/server/service/matching.(*taskReader).getTasksPump\n\t/temporal/service/matching/taskReader.go:166"}
More rarely, I also see the following error:
{"level":"error","ts":"2020-11-03T23:18:01.139Z","msg":"Error looking up host for shardID","service":"history","component":"shard-controller","address":"127.0.0.1:7234","error":"Not enough hosts to serve the request","operation-result":"OperationFailed","shard-id":1,"logging-call-at":"shardController.go:343","stacktrace":"go.temporal.io/server/common/log/loggerimpl.(*loggerImpl).Error\n\t/temporal/common/log/loggerimpl/logger.go:138\ngo.temporal.io/server/service/history.(*shardController).acquireShards.func1\n\t/temporal/service/history/shardController.go:343"}
The frontend API isn’t responding to tctl requests, so I don’t have much to debug with right now. The frontend logs do show the gRPC listener coming up, though.
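For reference, this is the kind of call I’m making (the frontend address is a placeholder for our actual service name), and it never gets a response:

```
$ tctl --address temporal-frontend:7233 namespace list   # hangs, no response
```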
I’m not sure how to proceed without digging into the code; any ideas? I’d also be curious whether there’s a way to directly inspect the cluster membership/gossip state from the database.
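For example, assuming the SQL persistence schema keeps membership heartbeats in the cluster_membership table (I’m going off the schema files here and may be looking at the wrong place), would something like this be a sane sanity check of what each host has registered?

```sql
-- Assumption: membership heartbeats live in the cluster_membership table of
-- the SQL persistence schema; the column layout may differ from what I expect.
SELECT * FROM cluster_membership;
```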