We are currently seeing a large number of exceptions in Temporal server, clustered at certain intervals in time. All of them appear to be database related. The exceptions with the most occurrences are:
| mb.error.keyword: Descending | Count | Count percentage |
| --- | --- | --- |
| UpdateShard failed. Failed to start transaction. Error: context deadline exceeded | 1,714 | 51.027% |
| shard status unknown | 464 | 13.814% |
| context deadline exceeded | 427 | 12.712% |
Also see the screenshot of the Kibana statistics attached to this topic.
Currently we do not have a clue where to start looking for the cause of these exceptions. We do not have any known production issues related to them; however, the sheer volume combined with the type of exceptions does not sit well with us.
Can you give us a pointer for the next step we could take to analyze this? Do you have any idea what this could be related to?
UpdateShard failed. Failed to start transaction. Error: context deadline exceeded
Trying to start a transaction against your DB hit a timeout, which seems to be causing Temporal shard stability problems. I would check whether your DB is overloaded.
From the Temporal service side, you could check the metrics:
```
sum(rate(service_errors_resource_exhausted{}[1m])) by (operation, resource_exhausted_cause)
```
and check for the SystemOverloaded cause. To protect your DB, dynamic config has knobs:
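For example, the per-service persistence QPS limits can be set in the dynamic config file. This is only a sketch: verify the key names against your Temporal server version and treat the values as placeholders, not recommendations.

```yaml
# Sketch only: verify key names against your Temporal server version;
# values are placeholders, not tuning advice.
frontend.persistenceMaxQPS:
  - value: 2000
    constraints: {}
history.persistenceMaxQPS:
  - value: 3000
    constraints: {}
matching.persistenceMaxQPS:
  - value: 3000
    constraints: {}
```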
The "shard status unknown" errors point to shards not being in a ready state (shard ownership resolution). I would check your history service restarts (the server metrics provide a "restarts" counter metric) and also check persistence latencies:
```
histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))
```
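To look at the restarts counter, a query along these lines may help. This is a sketch: the "restarts" counter comes from the server metrics mentioned above, but the `service_name` label is an assumption and may differ depending on how you scrape and relabel the metrics.

```
sum(increase(restarts[5m])) by (service_name)
```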
Given these errors, I think you can first start looking at your persistence store (try to scale it up or adjust the QPS limits if it is overloaded) and then at your cluster stability as well (for example, history pods restarting).
We only see these errors logged in production. In other environments (using the same RDS master/replica setup) Temporal server does not log these errors. I tried hard to generate load in other environments to reproduce the same errors (using Maru), but that did not work.
Without a reproduction I am hesitant to apply your suggestions in production. Instead, I had a closer look at the RDS metrics in the AWS console. What I found is that the errors are logged about 5-6 seconds after the WALWriteLock load on our RDS instance has been high for several seconds:
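To correlate this further, a query along these lines could be run against the database while the errors occur. This is a sketch: it only gives a point-in-time view of `pg_stat_activity`, and the exact wait event name depends on the PostgreSQL version.

```sql
-- Sample what active backends are currently waiting on; run this repeatedly
-- around the time the Temporal errors appear. The WAL write lock shows up as
-- an LWLock wait event (named WALWriteLock or WALWrite depending on the
-- PostgreSQL version).
SELECT wait_event_type, wait_event, count(*) AS backends
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY wait_event_type, wait_event
ORDER BY backends DESC;
```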