Cluster Info
I have deployed Temporal using the Helm chart in an EKS cluster. The cluster has 1024 shards; 2 history, 2 matching, and 2 frontend pods; the Docker image is temporalio/server:1.18.4. The persistence store is AWS RDS MySQL on an r6g.large instance (2 vCPU, 16 GB RAM).
Problem
While running the 12k maru run, workflows execute at a good rate until we hit a short period during which no workflows are executed at all. After this period, the backlog of workflows is processed at a slow workflow-closing rate. What could be the possible reason for this behaviour? Any pointers for avoiding it are much appreciated.
histogram csv
maru config
{
"steps": [{
"count": 12000,
"ratePerSecond": 100,
"concurrency": 10
}],
"workflow": {
"name": "basic-workflow",
"args": {
"sequenceCount": 3
}
},
"report": {
"intervalInSeconds": 10
}
}
Hi, sorry for the late reply, were you able to move forward with this issue?
In these situations it would help to look at some of the following metrics.
Server:
Persistence latencies:
histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))
Sync match rate
sum(rate(poll_success_sync{}[1m])) / sum(rate(poll_success{}[1m]))
and
sum(rate(persistence_requests{operation="CreateTask"}[1m]))
which can both be a good indication of not having enough workers on your end.
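To make the sync match rate concrete, here is a small sketch (plain Go, with hypothetical counter rates) of what the PromQL ratio above computes. A value near 1.0 means most tasks were handed directly to a waiting poller; a low value means tasks had to be persisted first (hence the CreateTask rate), which usually points to too few workers or pollers:

```go
package main

import "fmt"

// syncMatchRate mirrors the PromQL expression
// sum(rate(poll_success_sync)) / sum(rate(poll_success)).
func syncMatchRate(pollSuccessSync, pollSuccess float64) float64 {
	if pollSuccess == 0 {
		return 0 // no successful polls at all in the window
	}
	return pollSuccessSync / pollSuccess
}

func main() {
	// Hypothetical per-second rates over a 1m window.
	fmt.Printf("%.2f\n", syncMatchRate(95, 100)) // healthy, mostly sync-matched
	fmt.Printf("%.2f\n", syncMatchRate(40, 100)) // backlog building in the DB
}
```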
Lock contention:
histogram_quantile(0.99, sum(rate(cache_latency_bucket{operation="HistoryCacheGetOrCreate"}[1m])) by (le))
which can be a good indication that you should restructure your workflow code (for example, not starting too many activities/child workflows in a very short amount of time).
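One common way to avoid that kind of spike is to fan out work in bounded batches instead of all at once. A minimal sketch of the pattern in plain Go (a buffered-channel semaphore; note this is not Temporal SDK code — inside a workflow you would express the same idea with the SDK's futures/selectors):

```go
package main

import (
	"fmt"
	"sync"
)

// runBounded executes work over jobs with at most maxParallel
// running concurrently, so a large fan-out does not land on the
// backend as one burst.
func runBounded(jobs []int, maxParallel int, work func(int) int) []int {
	results := make([]int, len(jobs))
	sem := make(chan struct{}, maxParallel) // counting semaphore
	var wg sync.WaitGroup
	for i, j := range jobs {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot (blocks when full)
		go func(i, j int) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			results[i] = work(j)
		}(i, j)
	}
	wg.Wait()
	return results
}

func main() {
	out := runBounded([]int{1, 2, 3, 4}, 2, func(n int) int { return n * n })
	fmt.Println(out) // [1 4 9 16]
}
```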
You could start from these, and if you can share the results we can go from there.
@tihomir I see a lot of these errors, could this be the reason?
Most of the errors are of this type
component:transfer-queue-processor
error:"context deadline exceeded"
Then there are the following errors in moderate numbers:
"service failures",
error:"GetWorkflowExecution: failed to get timer info. Error: Failed to get timer info. Error: context deadline exceeded"
----
"Update workflow execution operation failed.",
error: "context deadline exceeded"
----
"Operation failed with internal error."
error: "GetWorkflowExecution: failed to get timer info. Error: Failed to get timer info. Error: context deadline exceeded"
So it looks like there is a timeout ("context deadline exceeded") when trying to get data from the timer_info_maps table. If it is transient, I think it can be ignored.
Were you able to run the persistence latencies Grafana query by chance?