Workflow backlog while running maru 12k test in Kubernetes cluster

Cluster Info
I have deployed Temporal using the Helm chart in an EKS cluster. The cluster has 1024 shards, 2 history, 2 matching, and 2 frontend pods; the Docker image is temporalio/server:1.18.4. The persistence store is AWS RDS MySQL on an r6g.large instance (2 vCPU, 16 GB RAM).

Problem
While running the 12k maru scenario, workflows are executed at a good rate until we hit a short period where no workflows get executed at all. After this period, the backlogged workflows are worked off at a much slower closing rate. What could be the possible reason for this behaviour? Any pointers on how to avoid it would be much appreciated.

histogram csv (attached)

maru config

{
    "steps": [{
        "count": 12000,
        "ratePerSecond": 100,
        "concurrency": 10
    }],
    "workflow": {
        "name": "basic-workflow",
        "args": {
            "sequenceCount": 3
        }
    },
    "report": {
        "intervalInSeconds": 10
    }
}

Hi, sorry for the late reply. Were you able to move forward with this issue?

In these situations it would help to look at some of the following metrics:

Server:
Persistence latencies:
histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))

Sync match rate:
sum(rate(poll_success_sync{}[1m])) / sum(rate(poll_success{}[1m]))
and
sum(rate(persistence_requests{operation="CreateTask"}[1m]))
which can both be a good indication of not having enough workers on your end.
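
If those metrics do point at the workers, one option is to run more worker processes or raise their poller/executor limits. Below is a minimal Go SDK sketch of doing that; the task queue name, the concrete numbers, and the default client options are assumptions to adapt to your bench setup, not maru's actual worker code.

package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	// Connect to the cluster frontend; adjust client.Options (HostPort,
	// Namespace) for your environment.
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// Raise poller and executor concurrency so the workers can keep up
	// with a 100 wf/s start rate; the numbers are illustrative, not tuned.
	w := worker.New(c, "temporal-basic", worker.Options{
		MaxConcurrentWorkflowTaskPollers:       8,
		MaxConcurrentActivityTaskPollers:       8,
		MaxConcurrentWorkflowTaskExecutionSize: 256,
		MaxConcurrentActivityExecutionSize:     256,
	})

	// Register the benchmark workflow and activities here before running.
	// w.RegisterWorkflow(YourWorkflow)
	// w.RegisterActivity(YourActivity)

	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker exited:", err)
	}
}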

Lock contention:
histogram_quantile(0.99, sum(rate(cache_latency_bucket{operation="HistoryCacheGetOrCreate"}[1m])) by (le))
which can be a good indication that you should restructure your workflow code (for example, not starting too many activities/child workflows in a very short amount of time).
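
On the restructuring point, here is a hedged Go sketch of what starting activities in bounded batches (instead of all at once) can look like; BatchedWorkflow, ProcessItem, and the batch size are hypothetical placeholders, not the maru basic workflow.

package app

import (
	"context"
	"time"

	"go.temporal.io/sdk/workflow"
)

// ProcessItem is a placeholder activity standing in for whatever work each
// step of the real workflow does.
func ProcessItem(ctx context.Context, item string) error {
	return nil
}

// BatchedWorkflow starts activities in bounded batches instead of firing
// them all at once, which keeps pressure off the per-execution history
// cache / mutable state lock.
func BatchedWorkflow(ctx workflow.Context, items []string) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Minute,
	})

	const batchSize = 10 // illustrative; tune for your workload
	for start := 0; start < len(items); start += batchSize {
		end := start + batchSize
		if end > len(items) {
			end = len(items)
		}

		futures := make([]workflow.Future, 0, batchSize)
		for _, item := range items[start:end] {
			futures = append(futures, workflow.ExecuteActivity(ctx, ProcessItem, item))
		}

		// Wait for the current batch to finish before starting the next one.
		for _, f := range futures {
			if err := f.Get(ctx, nil); err != nil {
				return err
			}
		}
	}
	return nil
}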

You could start from these, and if you can share the results we can go from there.

@tihomir I see a lot of these errors, could this be the reason?

Most of the errors are of this type:

component:transfer-queue-processor
error:"context deadline exceeded"

Then there are the below errors in moderate amounts:

"service failures",  
error:"GetWorkflowExecution: failed to get timer info. Error: Failed to get timer info. Error: context deadline exceeded"

----
"Update workflow execution operation failed.", 
error: "context deadline exceeded"

----

"Operation failed with internal error."
error: "GetWorkflowExecution: failed to get timer info. Error: Failed to get timer info. Error: context deadline exceeded"

So it looks as if there is a timeout ("context deadline exceeded") when trying to get data from the timer_info_maps table. If it is transient, I think it can be ignored.
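
If it keeps happening, it may be worth scoping the persistence latency histogram from above to that operation (an assumption on my part, reusing the same metric and operation label as the earlier query):

histogram_quantile(0.95, sum(rate(persistence_latency_bucket{operation="GetWorkflowExecution"}[1m])) by (le))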

Were you able to run the persistence latencies Grafana query by chance?