Errors in temporal history and matching service logs

Ruchir · July 4, 2022, 9:17am

Hi,

My Temporal server is deployed on a GKE cluster. We use Cassandra for its DB layer, which is deployed on another GKE cluster. We continuously see the following errors in the Temporal history service logs, although our workflows execute successfully:

"cluster-name":"active", "component":"timer-queue-processor", "error":"Failed to update shard.  previous_range_id: 9, columns: (range_id=11)", "level":"error", "logging-call-at":"timerQueueAckMgr.go:402", "msg":"Error updating timer ack level for shard", "service":"history", "shard-id":555,

"component":"visibility-queue-processor", "error":"Failed to update shard.  previous_range_id: 14, columns: (range_id=16)", "level":"error", "logging-call-at":"queueAckMgr.go:225", "msg":"Error updating ack level for shard", "operation-result":"OperationFailed", "service":"history", "shard-id":441

"cluster-name":"active", "component":"transfer-queue-processor", "error":"Failed to update shard.  previous_range_id: 11, columns: (range_id=15)", "level":"error", "logging-call-at":"queueAckMgr.go:225", "msg":"Error updating ack level for shard", "operation-result":"OperationFailed", "service":"history", "shard-id":378,
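
To make excerpts like these easier to compare, here is a minimal Python sketch that pulls the shard ID, the range ID the writer expected (previous_range_id), and the range ID found in the store (range_id) out of each line. The helper name parse_shard_errors and the "drift" field are hypothetical, and the regular expression assumes the fields appear in the same order as in the lines above. Reading previous_range_id as "expected" and range_id as "stored" is my interpretation of the error text, not something confirmed in this thread.

```python
import re

# Matches the "Failed to update shard" fragments quoted above.
# Assumes "component" precedes the error text and "shard-id" follows it.
SHARD_ERROR = re.compile(
    r'"component":"(?P<component>[^"]+)".*?'
    r'previous_range_id: (?P<expected>\d+), '
    r'columns: \(range_id=(?P<stored>\d+)\).*?'
    r'"shard-id":(?P<shard>\d+)'
)


def parse_shard_errors(lines):
    """Return one dict per log line that matches the shard-update error."""
    rows = []
    for line in lines:
        m = SHARD_ERROR.search(line)
        if m is None:
            continue  # not a shard-update error, skip it
        expected, stored = int(m["expected"]), int(m["stored"])
        rows.append({
            "shard": int(m["shard"]),
            "component": m["component"],
            "expected": expected,
            "stored": stored,
            # How far the stored range ID is ahead of the expected one.
            "drift": stored - expected,
        })
    return rows
```

Passing an open log file (any iterable of lines) to parse_shard_errors returns one row per matching error, which can then be grouped by shard to see whether the same shards keep failing. The matching-service error uses a different message format and is not covered by this pattern.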

We have set numHistoryShards to 4096. We tried installing Temporal with a fresh schema, but we still see these errors in our logs.

We also observed one error in the matching service:

"component":"matching-engine", "error":"Failed to update task queue. name: /_sys/DebitCardIntegrationActivityTaskQueue/1, type: Activity, rangeID: 4, columns: (range_id=5)", "level":"error", "logging-call-at":"taskReader.go:187", "msg":"Persistent store operation failure", "service":"matching"

What could be the issue? Please help.

tihomir · July 4, 2022, 2:59pm

What’s the server version you are using?

Are these errors transient, meaning do they happen only during pod restarts and then go away? If so they can be ignored.

If not, here are some things you could check:

- possible network issues on your side (for example, between the cluster and Cassandra)
- whether you are using a different membership port for each service (history/matching/frontend/worker)
- Cassandra logs/errors, if this happens during high load. You can “protect” Cassandra via the dynamic config values below so that it is not pushed over its limit:

frontend.persistenceMaxQPS - frontend persistence max qps, default 2000
history.persistenceMaxQPS - history persistence max qps, default 9000
matching.persistenceMaxQPS - matching persistence max qps, default 3000
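
As a rough sketch, these keys could be set in a file-based dynamic config like the fragment below. It assumes the server is configured to read a dynamic config YAML file. The numbers are illustrative placeholders chosen to sit below the defaults listed above, not recommendations, so tune them against what your Cassandra cluster can sustain.

```yaml
# Hypothetical values for illustration only; defaults are 2000 / 9000 / 3000.
frontend.persistenceMaxQPS:
  - value: 1000
    constraints: {}
history.persistenceMaxQPS:
  - value: 4000
    constraints: {}
matching.persistenceMaxQPS:
  - value: 1500
    constraints: {}
```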

Also check the asyncmatch_latency server metric, which measures the time from when an async-matched task is created to when it is delivered. The larger this latency, the longer tasks sit in the queue waiting for your workers to pick them up.

Ruchir · July 7, 2022, 12:04pm

We are using version 1.14.0 of Temporal.

Thanks @tihomir, will check the other points.
