We are currently seeing a large number of exceptions in Temporal server, clustered at certain intervals in time. All of them appear to be database related. The exceptions with the most occurrences are:
| mb.error.keyword: Descending | Count | Count percentage |
| --- | --- | --- |
| UpdateShard failed. Failed to start transaction. Error: context deadline exceeded | 1,714 | 51.027% |
| shard status unknown | 464 | 13.814% |
| context deadline exceeded | 427 | 12.712% |
Also see the screenshot of the Kibana statistics attached to this topic.
Currently we do not have a clue where to start looking for the cause of these exceptions. We do not have any known production issues related to them; however, the sheer volume combined with the type of exceptions does not sit well with us.
Can you give us a pointer for the next step we could take to analyze this? Do you have any idea what this could be related to?
UpdateShard failed. Failed to start transaction. Error: context deadline exceeded
Trying to start a transaction against your DB hit a timeout, which seems to be causing Temporal shard stability problems. I would check whether your DB is overloaded.
From the Temporal service side, you could check the metrics:
```
sum(rate(service_errors_resource_exhausted{}[1m])) by (operation, resource_exhausted_cause)
```
and check for the SystemOverloaded cause. To protect your DB, dynamic config has knobs:
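For example, the per-service persistence QPS limits can be set in the dynamic config file. This is only a sketch: verify the key names against your Temporal server version and treat the values as placeholders, not recommendations.

```yaml
# Sketch only: verify key names against your Temporal server version;
# values are placeholders, not tuning advice.
frontend.persistenceMaxQPS:
  - value: 2000
    constraints: {}
history.persistenceMaxQPS:
  - value: 3000
    constraints: {}
matching.persistenceMaxQPS:
  - value: 3000
    constraints: {}
```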
The "shard status unknown" errors point to shards not being in a ready state (shard ownership resolution). I would check your history service restarts (the server metrics provide a "restarts" counter metric) and also check persistence latencies:
```
histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))
```
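To look at the restarts counter, a query along these lines may help. This is a sketch: the "restarts" counter comes from the server metrics mentioned above, but the `service_name` label is an assumption and may differ depending on how you scrape and relabel the metrics.

```
sum(increase(restarts[5m])) by (service_name)
```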
Given these errors, I think you can first start looking at your persistence store (try to scale it up or adjust the QPS limits if it is overloaded) and then at your cluster stability as well (for example, history pods restarting).
We only see these errors logged in production. In other environments (using the same RDS master/replica setup) Temporal server does not log these errors. I tried hard to generate load in other environments to reproduce the same errors (using Maru), but that did not work.
Without a reproduction I am hesitant to apply your suggestions in production. Instead, I had a closer look at the RDS metrics in the AWS console. What I found is that the errors are logged about 5-6 seconds after the WALWriteLock load on our RDS instance has been high for several seconds:
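To correlate this further, a query along these lines could be run against the database while the errors occur. This is a sketch: it only gives a point-in-time view of `pg_stat_activity`, and the exact wait event name depends on the PostgreSQL version.

```sql
-- Sample what active backends are currently waiting on; run this repeatedly
-- around the time the Temporal errors appear. The WAL write lock shows up as
-- an LWLock wait event (named WALWriteLock or WALWrite depending on the
-- PostgreSQL version).
SELECT wait_event_type, wait_event, count(*) AS backends
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY wait_event_type, wait_event
ORDER BY backends DESC;
```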