How to read Grafana Performance metrics

This Friday afternoon, from 15:45 to 16:30, our Temporal server reported the error “potential deadlock detected” on more than 10 shards (see the Loki error).

We also noticed that the server hit resource exhausted errors with the cause BusyWorkflow. The workflow lock contention latency also spiked during that time.

Our persistence latency was quite stable during that time. The maximum came from the matching service and stayed below about 80 ms.

Our sync match rate stayed above 0.96, although it dropped from 1 to 0.96 several times.

Shard lock contention is 0.000990, which is very low.

IMHO, resource exhausted on BusyWorkflow means we probably started too many activities/child workflows from a single workflow execution?

What caused the history shard deadlock? Is it due to workflows holding the lock?

It would be great if you could chime in on potential issues with our server.

Thank you!






@tihomir @maxim

Follow-up: the goroutine profile dump suggests a possible bottleneck in the processTask method at fifo_scheduler.go:213.

It would be great if you could share how to analyze the profile dump here. Thank you!
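One rough way to get a first overview of a goroutine dump is to count goroutines by wait state and by the first frame outside the Go runtime. The Python sketch below assumes the dump is the plain-text format with `goroutine N [state, ...]:` headers (what the pprof goroutine endpoint returns with `debug=2`); the prefixes it skips and the function names are my own choices for illustration, not anything from Temporal.

```python
import re
from collections import Counter

# Header of one goroutine block, e.g. "goroutine 42 [semacquire, 3 minutes]:"
HEADER = re.compile(r"^goroutine (\d+) \[([^\],]+)(?:, [^\]]*)?\]:$")
# Trailing argument list of a frame line, e.g. "(0xc000010000, 0x0)"
ARGS = re.compile(r"\([^()]*\)$")
# Frames in these packages are usually just the blocking primitive (assumption).
SKIP_PREFIXES = ("runtime.", "sync.", "internal/")


def summarize(dump: str):
    """Return (goroutines per wait state, goroutines per first non-runtime frame)."""
    by_state = Counter()
    by_frame = Counter()
    # Goroutine blocks are separated by blank lines.
    for block in re.split(r"\n\s*\n", dump.strip()):
        lines = block.splitlines()
        if not lines:
            continue
        header = HEADER.match(lines[0].strip())
        if not header:
            continue
        by_state[header.group(2)] += 1
        for line in lines[1:]:
            # Skip indented "file.go:line" rows and "created by" rows.
            if not line.strip() or line[:1].isspace() or line.startswith("created by "):
                continue
            func = ARGS.sub("", line.strip())
            if not func.startswith(SKIP_PREFIXES):
                by_frame[func] += 1
                break
    return by_state, by_frame
```

Calling `summarize(open("goroutines.txt").read())` (the file name is a placeholder) and printing `by_frame.most_common(10)` shows where most goroutines are parked. If many of them sit behind the same lock-acquiring frame, that is a hint about contention, not proof of a deadlock.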

Can you compare the times on your resource exhausted graph with:

  1. service errors by type
    sum(rate(service_errors{service_name="frontend"}[1m]) or on () vector(0))

  2. persistence latency by operation:
    histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))

  3. Activity and Workflow task timeouts:
    sum(rate(start_to_close_timeout{operation="TimerActiveTaskActivityTimeout"}[5m])) by(namespace,operation)
    sum by (temporal_namespace,operation) (rate(schedule_to_start_timeout{operation="TimerActiveTaskActivityTimeout"}[1m]))
    sum(rate(start_to_close_timeout{operation="TimerActiveTaskWorkflowTaskTimeout"}[1m])) by(namespace,operation)
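To line the graphs up, it can help to pull every query over the same incident window. Here is a minimal Python sketch that builds a Prometheus `/api/v1/query_range` URL for that window. The base URL, the date, and the UTC timezone are placeholders I'm assuming, not values from this thread.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def query_range_url(base_url, promql, start, end, step="15s"):
    """Build a Prometheus range-query URL for one PromQL expression."""
    params = {
        "query": promql,
        "start": start.timestamp(),  # Unix seconds
        "end": end.timestamp(),
        "step": step,
    }
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder incident window: 15:45 to 16:30 on an assumed date, in UTC.
    start = datetime(2024, 1, 5, 15, 45, tzinfo=timezone.utc)
    end = datetime(2024, 1, 5, 16, 30, tzinfo=timezone.utc)
    query = 'sum(rate(service_errors{service_name="frontend"}[1m]) or on () vector(0))'
    # Placeholder Prometheus address; replace with your own.
    print(query_range_url("http://localhost:9090", query, start, end))
```

Reusing the same `start`, `end`, and `step` for each query keeps the time axes identical, so spikes can be compared directly.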

we probably started too many activities/child workflows from a single workflow execution?

That is typically the case, yes. The operation for BusyWorkflow is RecordWorkflowTaskStarted, so my guess is that the workflow started a large number of activities in parallel that completed at the same time or very close to each other. I don't know your use case well enough to tell for sure, but that is what it looks like.


Thank you @tihomir

  1. Service errors by type: the frontend service shows no errors during that time.

  2. Persistence latency by operation:
    At 16:00, ListClustermetadata latency is 1.92 min. At 16:23, UpdateWorkflowExecution is 2.75 min.

  3. The start_to_close_timeout and schedule_to_start_timeout metrics are somehow missing. I can only see system workflows, not workflows under our custom namespaces.


Please feel free to let me know what you think. Thank you!

One more question: what caused the history shard deadlock? Is it due to workflows holding the lock?

Regarding workflow activities, the workflow with the most activities has only 5.

@tihomir Sorry to keep bugging you. We have a scheduled workflow that runs every minute and takes less than 5 seconds to finish. I do see several consecutive workflow executions with the same workflow ID and different run IDs.

Do you think the issue might be related to this workflow?