Matching service start/stop loop in production deployment

Hi all,

I’m having trouble getting Temporal running in our production environment. I’m currently running all 4 services in a single container per k8s pod, with 2 pods running. I’ve opened the exact same ports as the Docker container, and I’m using RDS as the backing datastore.

The temporal-server containers simply loop, seemingly stuck with the following error message:

{"level":"error","ts":"2020-11-03T23:41:56.067Z","msg":"Error updating timer ack level for shard","service":"history","shard-id":4,"address":"","shard-item":"0xc0013dcd00","component":"timer-queue-processor","cluster-name":"active","error":"Failed to update shard. Previous range ID: 5717; new range ID: 5718","logging-call-at":"timerQueueAckMgr.go:391","stacktrace":"*loggerImpl).Error\n\t/temporal/common/log/loggerimpl/logger.go:138\*timerQueueAckMgrImpl).updateAckLevel\n\t/temporal/service/history/timerQueueAckMgr.go:391\*timerQueueProcessorBase).internalProcessor\n\t/temporal/service/history/timerQueueProcessorBase.go:319\*timerQueueProcessorBase).processorPump\n\t/temporal/service/history/timerQueueProcessorBase.go:194"}

I also see persistence failures:

{"level":"error","ts":"2020-11-03T23:41:10.869Z","msg":"Persistent store operation failure","service":"matching","component":"matching-engine","wf-task-queue-name":"/_sys/temporal-sys-tq-scanner-taskqueue-0/2","wf-task-queue-type":"Activity","store-operation":"update-task-queue","error":"Persistence Max QPS Reached.","logging-call-at":"taskReader.go:166","stacktrace":"*loggerImpl).Error\n\t/temporal/common/log/loggerimpl/logger.go:138\*taskReader).getTasksPump\n\t/temporal/service/matching/taskReader.go:166"}

I’m also seeing the following error, though only rarely:

{"level":"error","ts":"2020-11-03T23:18:01.139Z","msg":"Error looking up host for shardID","service":"history","component":"shard-controller","address":"","error":"Not enough hosts to serve the request","operation-result":"OperationFailed","shard-id":1,"logging-call-at":"shardController.go:343","stacktrace":"*loggerImpl).Error\n\t/temporal/common/log/loggerimpl/logger.go:138\*shardController).acquireShards.func1\n\t/temporal/service/history/shardController.go:343"}

The frontend API isn’t responding to tctl requests, so I don’t have much to debug with right now. The frontend does show the gRPC listener coming up, though.

Not sure how to proceed without digging into the code; any ideas? I’d also be curious whether there is a way to inspect the cluster/gossip state directly from the database.
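Edit: one thing I noticed is that the v1.x default SQL schema appears to include a `cluster_membership` table that records host heartbeats, which would let you inspect the ring straight from RDS. Treat the table/column names below as assumptions and verify them against your schema version:

```sql
-- Inspect membership heartbeats directly in the persistence DB.
-- Column names are from the v1.x default SQL schema and may differ
-- in other versions; check your schema before relying on this.
SELECT host_id, rpc_address, rpc_port, role, last_heartbeat
FROM cluster_membership
ORDER BY last_heartbeat DESC;
```

Hosts that have stopped heartbeating recently should stand out by their stale `last_heartbeat` values.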

This huge range ID may indicate that your membership ring (frontend / matching / history) is not stable: one host may be stealing the shard from another, like a ping-pong game.

This means that the persistence layer is under high load.
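Note that “Persistence Max QPS Reached” is the server’s own rate limiter tripping, not RDS itself rejecting queries; the per-service limits can be raised through dynamic config. A sketch, assuming the v1.x dynamic config file format and key names (verify both against your server version):

```yaml
# dynamic config file: raise the per-service persistence rate limits.
# Key names assumed from Temporal v1.x; check your version's defaults
# before changing these, and size them against what RDS can sustain.
matching.persistenceMaxQPS:
  - value: 6000
    constraints: {}
history.persistenceMaxQPS:
  - value: 9000
    constraints: {}
```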

Can you share the service config? Also make sure the services (frontend / matching / history) can talk to each other.
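A quick way to check that from inside one pod, using only bash’s built-in `/dev/tcp` (the peer IP `10.0.1.23` below is a hypothetical placeholder; substitute the other pod’s address):

```shell
# check_port: report whether host:port accepts a TCP connection,
# using bash's /dev/tcp redirection so no extra tools are needed.
check_port() {
  local host=$1 port=$2
  if timeout 1 bash -c "echo > /dev/tcp/$host/$port" 2>/dev/null; then
    echo "port $port open"
  else
    echo "port $port CLOSED"
  fi
}

# Probe a peer pod's gRPC and membership ports (default port numbers
# from the Docker image; replace 10.0.1.23 with the real peer IP).
for port in 7233 7234 7235 7239 6933 6934 6935 6939; do
  check_port 10.0.1.23 "$port"
done
```

Any membership port showing CLOSED between pods will produce exactly the unstable-ring symptoms above.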

Thanks for the quick turnaround! It looks like the issue was that my temporal-server peers couldn’t reach each other. I was able to resolve it by setting the broadcast address to each host’s public IP and adjusting the bind address, and by opening up additional ports for peer-to-peer membership communication.
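For anyone hitting the same thing, here’s roughly the shape of the config change (key names from memory of the v1.x server config, and the IP is a hypothetical placeholder, so double-check against the config docs):

```yaml
# Server config sketch: advertise a reachable address and open the
# membership port so peer pods can join the ring.
global:
  membership:
    broadcastAddress: "10.0.1.23"  # assumption: this pod's reachable IP
services:
  frontend:
    rpc:
      grpcPort: 7233
      membershipPort: 6933   # must be reachable by peer pods
      bindOnIP: "0.0.0.0"
  # history / matching / worker follow the same pattern
  # with their own grpc and membership ports.
```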

I found some useful info deep in the support tickets; linking it here for others: