On Friday afternoon, from 15:45 to 16:30, our Temporal server reported the error “potential deadlock detected” on more than 10 shards (see the Loki error).
We also noticed that the server hit ResourceExhausted errors caused by BusyWorkflow, and the workflow lock contention latency spiked during that window.
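For reference, this is roughly how we checked the cause breakdown (assuming the server exposes the resource_exhausted_cause label on service_errors_resource_exhausted; label names may differ depending on server version and relabeling):
sum(rate(service_errors_resource_exhausted[1m])) by (resource_exhausted_cause)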
Our persistence latency during that time was quite stable; the highest latency came from the matching service and stayed below roughly 80 ms.
Our sync match rate stayed above 0.96, although it did drop from 1 to 0.96 several times.
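For context, we compute the sync match rate roughly like this (assuming the standard matching service metrics poll_success_sync and poll_success; adjust the service_name label to your setup):
sum(rate(poll_success_sync{service_name="matching"}[1m])) / sum(rate(poll_success{service_name="matching"}[1m]))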
Shard lock contention is 0.000990, which is very low.
Can you compare the times on your resource exhausted graph with the following:
service errors by type: sum(rate(service_errors{service_name="frontend"}[1m]) or on () vector(0))
persistence latency by operation: histogram_quantile(0.95, sum(rate(persistence_latency_bucket[1m])) by (operation, le))
Activity and Workflow task timeouts:
sum(rate(start_to_close_timeout{operation="TimerActiveTaskActivityTimeout"}[5m])) by(namespace,operation)
sum by (temporal_namespace,operation) (rate(schedule_to_start_timeout{operation="TimerActiveTaskActivityTimeout"}[1m]))
sum(rate(start_to_close_timeout{operation="TimerActiveTaskWorkflowTaskTimeout"}[1m])) by(namespace,operation)
We probably started too many activities/child workflows from a single workflow execution?
That is typically the case, yes. The operation for BusyWorkflow is RecordWorkflowTaskStarted. My guess here is that you probably started a large number of activities in parallel that completed at the same time, or very close to each other. I don't know your use case well enough to say for sure, but it looks like it.
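To double-check which operation the BusyWorkflow errors come from, you could try something along these lines (assuming your server version tags service_errors_resource_exhausted with resource_exhausted_cause and operation; label names may vary):
sum(rate(service_errors_resource_exhausted{resource_exhausted_cause="BusyWorkflow"}[1m])) by (operation)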
1. Service errors by type: the frontend service had no errors during that time.
2. Persistence latency by operation: at 16:00, ListClusterMetadata latency was 1.92 min; at 16:23, UpdateWorkflowExecution latency was 2.75 min.
3. The start_to_close_timeout and schedule_to_start_timeout metrics are somehow missing; I can only see system workflows, not workflows under our custom namespaces.
@tihomir Sorry to keep bugging you. We have a scheduled workflow that runs every minute and takes less than 5 seconds to finish. I do see several consecutive workflow executions with the same workflow ID but different run IDs.
Do you think the issue might be related to this workflow?
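In case it helps narrow things down, I was going to break the BusyWorkflow errors down by namespace to see whether they line up with the namespace this scheduled workflow runs in (the label may be namespace or temporal_namespace depending on relabeling):
sum(rate(service_errors_resource_exhausted{resource_exhausted_cause="BusyWorkflow"}[1m])) by (namespace)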