Large number of exceptions in Temporal Server

Hi,

Currently we get a large number of exceptions in Temporal Server, clustered at certain intervals in time.
All exceptions seem database-related. The exceptions with the highest occurrence counts are:

| mb.error.keyword: Descending | Count | Count percentage |
| --- | --- | --- |
| UpdateShard failed. Failed to start transaction. Error: context deadline exceeded | 1,714 | 51.027% |
| shard status unknown | 464 | 13.814% |
| context deadline exceeded | 427 | 12.712% |

Also see the Kibana statistics screenshot attached to this topic.

Currently we do not have a clue where to start in order to find the cause of these exceptions. We do not have any known production issues related to them; however, the sheer number of exceptions, combined with their type, does not feel right.

Can you help with a clue for the next step we could take to analyze this? Do you have any idea what this could be related to?

Extra information:

Our Temporal version:
version: "1.18.5"
web_version: "1.15.0"

We store our data in RDS Postgres.

Kind regards,

Mathijs ter Braak

UpdateShard failed. Failed to start transaction. Error: context deadline exceeded

Trying to start a transaction against your DB hit a timeout, which seems to be causing Temporal shard stability problems. Would check if your DB is overloaded.

From the Temporal service side you could check these metrics:

sum(rate(service_errors_resource_exhausted{}[1m])) by (operation, resource_exhausted_cause)

and look for the SystemOverloaded cause. To protect your DB, dynamic config has these knobs:

frontend.persistenceMaxQPS
history.persistenceMaxQPS
matching.persistenceMaxQPS
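
For reference, a minimal sketch of how these could look in a dynamic config file; the QPS values below are placeholders to tune against your own database, not recommendations:

```yaml
# Hypothetical values - start from your current defaults and raise/lower
# them based on what the RDS instance can sustain.
frontend.persistenceMaxQPS:
  - value: 2000
    constraints: {}
history.persistenceMaxQPS:
  - value: 3000
    constraints: {}
matching.persistenceMaxQPS:
  - value: 3000
    constraints: {}
```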

shard status unknown

This is an issue with a shard not being in a ready state (shard ownership resolution). Would check your history service restarts (server metrics provide a "restarts" counter metric) and also check persistence latencies:

histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))

From these errors, I think you can first start by looking at your persistence store (try to scale it up or adjust QPS limits if it's overloaded), then look into your cluster stability too (for example, history pods restarting).
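
If you want to watch these continuously, a rough sketch of Prometheus alerting rules built from the two queries above could look like the following; the group name, alert names, thresholds, durations, and the assumption that persistence latency is recorded in seconds are mine, so adjust them to your setup:

```yaml
groups:
  - name: temporal-persistence  # assumed group name
    rules:
      # Sustained resource-exhausted errors; check the cause label
      # for SystemOverloaded.
      - alert: TemporalResourceExhausted
        expr: sum(rate(service_errors_resource_exhausted{}[1m])) by (operation, resource_exhausted_cause) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Resource exhausted ({{ $labels.resource_exhausted_cause }}) on {{ $labels.operation }}"
      # p95 persistence latency staying above 1s (threshold is a guess).
      - alert: TemporalPersistenceLatencyHigh
        expr: histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le)) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p95 persistence latency above 1s for {{ $labels.operation }}"
```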

Thanks for the suggestions.

We only see these errors logged in production. In other environments (using the same RDS master/replica setup), Temporal Server does not log these errors. I tried to reproduce them by generating load in those environments (using Maru), but that did not work.

Without a reproduction I am hesitant to apply your suggestions in production. Instead, I had a closer look at the RDS metrics in the AWS console. What I found is that the errors are logged about 5-6 seconds after the WALWriteLock load on our RDS instance has been high for several seconds:
[WALWriteLock chart]

These are the top 2 SQL statements:

I did not investigate whether the WALWriteLock contention comes only from deletion of timer tasks; it could very well come from other deletes as well.

Note that we have a master/replica RDS setup using Postgres version 11.
Could this increase in WALWriteLock load be an explanation for the error logs?