How to read Grafana Performance metrics

andyng · January 27, 2025, 2:24am

On this Friday afternoon from 15:45 PM to 16:30 PM, our temporal server encountered the error “potential deadlock detected” on over 10 shards. (See the loki error).

We also noticed that the server encountered resource exhausted by busy workflow issue. The workflow lock contention latency also hiked during that time.

Our persistence latency at that time is quite stable. Max persistence latency is from matching service which is below around 80ms.

Our sync match rate is above 0.96, it did drop from 1 to 0.96 several times.

shard lock contention is 0.000990 which is very low.

IMHO, resource exhausted on BusyWorkflow means we probably started too many activities/child workflows from single workflow execution?

What caused the history shard deadlock? is it due to workflows holding the lock?

It will be great if you guys can chime in for potential issues in our server.

Thank you!

@tihomir @maxim

andyng · January 27, 2025, 2:30am

Follow up: the goroutine profile dump shows possible bottleneck is at processTask method fifo_scheduler.go: 213.

It will be great if you can share how to analyze the profile dump here. Thank you!

tihomir · January 27, 2025, 4:00am

Can you compare your resource exhausted graph (times) with:

service errors by type
sum(rate(service_errors{service_name="frontend"}[1m]) or on () vector(0))
persistene latency by operation:
sum(rate(service_errors{service_name="frontend"}[1m]) or on () vector(0))
Activity and Workflow task timeouts:

sum(rate(start_to_close_timeout{operation="TimerActiveTaskActivityTimeout"}[5m])) by(namespace,operation)

sum by (temporal_namespace,operation) (rate(schedule_to_start_timeout{operation="TimerActiveTaskActivityTimeout"}[1m]))

sum(rate(start_to_close_timeout{operation="TimerActiveTaskWorkflowTaskTimeout"}[1m])) by(namespace,operation)

we probably started too many activities/child workflows from single workflow execution?

that is typically the case, yes. operation for busyworkflow is RecordWorkflowTaskStarted my guess here is that we probably started a large number of activities in parallel that completed at the same time or very close to each other. idk your use case to tell for sure but looks like it

andyng · January 27, 2025, 5:21am

Thank you @tihomir

service errors by type - frontend service no error during that time
persistence latency by operation
At 16:00 PM, ListClustermetadata latency is 1.92 min. At 16:23pm, UpdateWorkflowExecution is 2.75 min.

3.Time to Close timeout and schedule_to_start_timeout metrics somehow is missing. I can only see system workflows not workflows under custom namespaces.

Please feel free to let me know what you think. Thank you!

andyng · January 27, 2025, 5:43am

And one more question, What caused the history shard deadlock? is it due to workflows holding the lock?

andyng · January 27, 2025, 5:51am

with regards to workflow activities, the workflow with most activities is only with 5 activities.

andyng · January 28, 2025, 3:10am

@tihomir Sorry to keep bugging you. We have a scheduled workflow runs every 1 minute. It take less than 5 seconds to finish the execution. I do see several consecutive workflow executions with same workflow id and different run id.

Do you think the issue might be related this workflow?

Topic		Replies	Views
How to investigate or solve occasional shard operations causing ResourceExhausted errors Server Deployment	2	125	January 14, 2025
Workflow Performance with Java SDK Community Support java-sdk	1	744	February 20, 2023
Clarification on metrics (client + server) Community Support java-sdk , metrics	14	2643	April 13, 2022
Temporal Shards ( configured in 'shards' table ) Community Support java-sdk	6	270	February 10, 2025
Temporal performance issues Community Support java-sdk , performance , worker , kubernetes	1	1840	April 26, 2023

How to read Grafana Performance metrics

Related topics