Workflow backlog while running maru 12k test in Kubernetes cluster

Cluster Info
I have deployed Temporal using the Helm chart in an EKS cluster. The cluster has 1024 shards, 2 history, 2 matching, and 2 frontend pods; the Docker image is temporalio/server:1.18.4. The persistence store is AWS RDS MySQL on an r6g.large instance (2 vCPU, 16 GB RAM).

Problem
While running the 12k maru scenario, workflows are executed at a good rate until we hit a short period where no workflows get executed at all. After this period, the backlogged workflows are worked off at a much slower closing rate. What could be the possible reason for this behaviour? Any pointers on how to avoid it would be much appreciated.

histogram csv (attached)

maru config

{
    "steps": [{
        "count": 12000,
        "ratePerSecond": 100,
        "concurrency": 10
    }],
    "workflow": {
        "name": "basic-workflow",
        "args": {
            "sequenceCount": 3
        }
    },
    "report": {
        "intervalInSeconds": 10
    }
}

Hi, sorry for the late reply. Were you able to move forward with this issue?

In these situations it would help to look at some of the following metrics:

Server:
Persistence latencies:
histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))

Sync match rate:
sum(rate(poll_success_sync{}[1m])) / sum(rate(poll_success{}[1m]))
and
sum(rate(persistence_requests{operation="CreateTask"}[1m]))
which can both be a good indication of not having enough workers on your end.
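
If those metrics do point at the workers, one option is to run more worker processes or raise their poller/executor limits. Below is a minimal Go SDK sketch of doing that; the task queue name, the concrete numbers, and the default client options are assumptions to adapt to your bench setup, not maru's actual worker code.

package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	// Connect to the cluster frontend; adjust client.Options (HostPort,
	// Namespace) for your environment.
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// Raise poller and executor concurrency so the workers can keep up
	// with a 100 wf/s start rate; the numbers are illustrative, not tuned.
	w := worker.New(c, "temporal-basic", worker.Options{
		MaxConcurrentWorkflowTaskPollers:       8,
		MaxConcurrentActivityTaskPollers:       8,
		MaxConcurrentWorkflowTaskExecutionSize: 256,
		MaxConcurrentActivityExecutionSize:     256,
	})

	// Register the benchmark workflow and activities here before running.
	// w.RegisterWorkflow(YourWorkflow)
	// w.RegisterActivity(YourActivity)

	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker exited:", err)
	}
}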

Lock contention:
histogram_quantile(0.99, sum(rate(cache_latency_bucket{operation="HistoryCacheGetOrCreate"}[1m])) by (le))
which can be a good indication that you should restructure your workflow code (for example, not starting too many activities/child workflows in a very short amount of time).
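
On the restructuring point, here is a hedged Go sketch of what starting activities in bounded batches (instead of all at once) can look like; BatchedWorkflow, ProcessItem, and the batch size are hypothetical placeholders, not the maru basic workflow.

package app

import (
	"context"
	"time"

	"go.temporal.io/sdk/workflow"
)

// ProcessItem is a placeholder activity standing in for whatever work each
// step of the real workflow does.
func ProcessItem(ctx context.Context, item string) error {
	return nil
}

// BatchedWorkflow starts activities in bounded batches instead of firing
// them all at once, which keeps pressure off the per-execution history
// cache / mutable state lock.
func BatchedWorkflow(ctx workflow.Context, items []string) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Minute,
	})

	const batchSize = 10 // illustrative; tune for your workload
	for start := 0; start < len(items); start += batchSize {
		end := start + batchSize
		if end > len(items) {
			end = len(items)
		}

		futures := make([]workflow.Future, 0, batchSize)
		for _, item := range items[start:end] {
			futures = append(futures, workflow.ExecuteActivity(ctx, ProcessItem, item))
		}

		// Wait for the current batch to finish before starting the next one.
		for _, f := range futures {
			if err := f.Get(ctx, nil); err != nil {
				return err
			}
		}
	}
	return nil
}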

You could start from these, and if you can share the results we can go from there.

@tihomir I see a lot of these errors, could this be the reason?

Most of the errors are of this type:

component:transfer-queue-processor
error:"context deadline exceeded"

Then there are the below errors in moderate amounts:

"service failures",  
error:"GetWorkflowExecution: failed to get timer info. Error: Failed to get timer info. Error: context deadline exceeded"

----
"Update workflow execution operation failed.", 
error: "context deadline exceeded"

----

"Operation failed with internal error."
error: "GetWorkflowExecution: failed to get timer info. Error: Failed to get timer info. Error: context deadline exceeded"

So it looks as if there is a timeout ("context deadline exceeded") when trying to get data from the timer_info_maps table. If it is transient, I think it can be ignored.
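
If it keeps happening, it may be worth scoping the persistence latency histogram from above to that operation (an assumption on my part, reusing the same metric and operation label as the earlier query):

histogram_quantile(0.95, sum(rate(persistence_latency_bucket{operation="GetWorkflowExecution"}[1m])) by (le))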

Were you able to run the persistence latencies Grafana query by chance?