Errors in temporal history and matching service logs

Hi,

My Temporal server is deployed on a GKE cluster. We are using Cassandra for its DB layer, which is deployed on another GKE cluster. We are continuously observing the following errors in the Temporal history service logs, although our workflows are executing successfully:

"cluster-name":"active", "component":"timer-queue-processor", "error":"Failed to update shard.  previous_range_id: 9, columns: (range_id=11)", "level":"error", "logging-call-at":"timerQueueAckMgr.go:402", "msg":"Error updating timer ack level for shard", "service":"history", "shard-id":555,
"component":"visibility-queue-processor", "error":"Failed to update shard.  previous_range_id: 14, columns: (range_id=16)", "level":"error", "logging-call-at":"queueAckMgr.go:225", "msg":"Error updating ack level for shard", "operation-result":"OperationFailed", "service":"history", "shard-id":441
"cluster-name":"active", "component":"transfer-queue-processor", "error":"Failed to update shard.  previous_range_id: 11, columns: (range_id=15)", "level":"error", "logging-call-at":"queueAckMgr.go:225", "msg":"Error updating ack level for shard", "operation-result":"OperationFailed", "service":"history", "shard-id":378,

We have set numHistoryShards to 4096. We tried installing Temporal with a fresh schema, but we still see these errors in our logs.

We also observed one error in the matching service:

"component":"matching-engine", "error":"Failed to update task queue. name: /_sys/DebitCardIntegrationActivityTaskQueue/1, type: Activity, rangeID: 4, columns: (range_id=5)", "level":"error", "logging-call-at":"taskReader.go:187", "msg":"Persistent store operation failure", "service":"matching"

What could be the issue? Please help.

What’s the server version you are using?

Are these errors transient, meaning do they happen only during pod restarts and then go away? If so, they can be ignored. The range_id mismatch in those errors usually means another history host has acquired the shard, which is expected while shard ownership is being rebalanced (for example after a pod restart).

If not, here are some things you could check:

  • possible network issues on your side (for example, between your GKE cluster and Cassandra)
  • check whether you are using a different membership port for each service (history/matching/frontend/worker); see the config sketch after this list
  • Cassandra logs/errors if this happens during high load. You can “protect” Cassandra via the dynamic config values below so it is not pushed over its limit (a sample dynamic config file is sketched after this list):

frontend.persistenceMaxQPS - frontend persistence max qps, default 2000
history.persistenceMaxQPS - history persistence max qps, default 9000
matching.persistenceMaxQPS - matching persistence max qps, default 3000

  • Check the asyncmatch_latency server metric; it measures async matched tasks from the time they are created until they are delivered. The larger this latency, the longer tasks are sitting in the queue waiting for your workers to pick them up.
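
On the membership ports: each service gets its own membershipPort in the rpc section of the static server config (the Helm chart exposes the same settings per service). A rough sketch using the default ports — adjust to your own setup:

    services:
      frontend:
        rpc:
          grpcPort: 7233
          membershipPort: 6933
      history:
        rpc:
          grpcPort: 7234
          membershipPort: 6934
      matching:
        rpc:
          grpcPort: 7235
          membershipPort: 6935
      worker:
        rpc:
          grpcPort: 7239
          membershipPort: 6939

Each service needs a distinct membership port so the services can form their membership rings correctly; sharing a port across services can lead to ownership churn like the errors above.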
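For the persistence QPS limits, the dynamic config is a YAML file mapping each setting name to a list of values (with optional constraints). A minimal sketch — the numbers shown are the defaults listed above, so lower them if Cassandra is the bottleneck:

    frontend.persistenceMaxQPS:
      - value: 2000
        constraints: {}
    history.persistenceMaxQPS:
      - value: 9000
        constraints: {}
    matching.persistenceMaxQPS:
      - value: 3000
        constraints: {}

The server reads this file from the path set in dynamicConfigClient.filepath in the static config (the Helm chart has a corresponding dynamicConfig values section).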

We are using Temporal server version 1.14.0.

Thanks @tihomir, we will check the other points.