Workflow backlog while running maru 12k test in kubernetes cluster

Cluster Info
I have deployed Temporal using the Helm chart in an EKS cluster. The cluster has 1024 shards, 2 history, 2 matching, and 2 frontend pods; the Docker image is temporalio/server:1.18.4. Persistence storage is AWS RDS MySQL on an r6g.large instance (2 vCPU, 16 GB RAM).

Problem
While running the 12k maru run, workflows execute at a good rate until we hit a short period in which no workflows are executed at all. After this period, the backlogged workflows are worked off at a slow workflow-close rate. What could be the reason for this behaviour? Any pointers for avoiding it are much appreciated.

histogram csv

maru config

{
    "steps": [{
        "count": 12000,
        "ratePerSecond": 100,
        "concurrency": 10
    }],
    "workflow": {
        "name": "basic-workflow",
        "args": {
            "sequenceCount": 3
        }
    },
    "report": {
        "intervalInSeconds": 10
    }
}
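As a sanity check on what this config demands of the cluster: starting 12000 workflows at 100/s gives a 120-second injection window, and the cluster must also *close* workflows at roughly that rate on average or a backlog accumulates. A small sketch of that arithmetic (values copied from the config above):

```go
package main

import "fmt"

func main() {
	// Values taken from the maru step config above.
	count := 12000.0
	ratePerSecond := 100.0

	// Time maru spends just starting workflows.
	injectionSeconds := count / ratePerSecond
	fmt.Printf("injection window: %.0f s\n", injectionSeconds)

	// To avoid a growing backlog, workflows must also close at
	// >= ratePerSecond on average; anything slower accumulates.
	fmt.Printf("required average close rate: >= %.0f workflows/s\n", ratePerSecond)
}
```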

Hi, sorry for the late reply. Were you able to move forward with this issue?

In these situations it would help to look at some of the following metrics.

Server:
Persistence latencies:
histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))
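For intuition on what this query reports: `histogram_quantile` locates the cumulative `le` bucket containing the target rank and linearly interpolates inside it. A minimal sketch of that calculation, with made-up bucket values (not real measurements), assuming non-negative latencies so the lowest bucket's lower bound is 0:

```go
package main

import "fmt"

// bucket is one cumulative Prometheus histogram bucket: the count of
// observations with value <= le.
type bucket struct {
	le    float64 // upper bound of the bucket
	count float64 // cumulative count up to le
}

// quantile mirrors the idea behind PromQL's histogram_quantile: find the
// bucket that contains the q-th observation and interpolate linearly
// between the bucket's bounds.
func quantile(q float64, buckets []bucket) float64 {
	total := buckets[len(buckets)-1].count
	rank := q * total
	lowerBound, lowerCount := 0.0, 0.0
	for _, b := range buckets {
		if b.count >= rank {
			return lowerBound + (b.le-lowerBound)*(rank-lowerCount)/(b.count-lowerCount)
		}
		lowerBound, lowerCount = b.le, b.count
	}
	return buckets[len(buckets)-1].le
}

func main() {
	// Hypothetical persistence_latency buckets, in seconds.
	bs := []bucket{{0.05, 90}, {0.1, 96}, {0.25, 99}, {1, 100}}
	fmt.Printf("p95 ≈ %.3f s\n", quantile(0.95, bs)) // p95 ≈ 0.092 s
}
```

If the p95 here climbs toward the RDS instance's limits during the stall, the database is the likely bottleneck.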

Sync match rate
sum(rate(poll_success_sync{}[1m])) / sum(rate(poll_success{}[1m]))
and
sum(rate(persistence_requests{operation="CreateTask"}[1m]))
both of which can be a good indication that you don't have enough workers on your end.

Lock contention:
histogram_quantile(0.99, sum(rate(cache_latency_bucket{operation="HistoryCacheGetOrCreate"}[1m])) by (le))
which can be a good indication that you should restructure your workflow code (for example, avoid starting too many activities or child workflows in a very short amount of time).

You could start from these, and if you can share the results we can go from there.