Context:
We deploy Temporal in k8s. We had a sudden influx of traffic that caused more workflows than usual to be executed. For about 15 minutes we saw a high rate of packets dropped by our NAT gateway, but it recovered on its own. After all this, the Temporal cluster was left in a bad state: all the history pods started showing context deadline exceeded errors for all sorts of operations like GetVisibilityTasks, GetWorkflowExecution, etc.
Questions:
- After that brief network problem was resolved, why was Temporal not able to get back to a steady state?
- Even now, there are a bunch of workflows stuck in the Running state. Any insight into what might have caused this? (A sketch of how we are listing them follows below.)
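A minimal sketch of how the stuck executions can be enumerated with the Go SDK; the frontend address `temporal-frontend:7233` and the `default` namespace are placeholders for our setup, adjust as needed:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"go.temporal.io/api/workflowservice/v1"
	"go.temporal.io/sdk/client"
)

func main() {
	// Placeholder frontend address and namespace; adjust for your deployment.
	c, err := client.Dial(client.Options{
		HostPort:  "temporal-frontend:7233",
		Namespace: "default",
	})
	if err != nil {
		log.Fatalf("unable to create Temporal client: %v", err)
	}
	defer c.Close()

	ctx := context.Background()
	var nextPageToken []byte
	for {
		resp, err := c.ListWorkflow(ctx, &workflowservice.ListWorkflowExecutionsRequest{
			Namespace:     "default",
			PageSize:      100,
			NextPageToken: nextPageToken,
			// Visibility query for executions still reported as Running
			// (relies on the visibility store being reachable/healthy).
			Query: "ExecutionStatus = 'Running'",
		})
		if err != nil {
			log.Fatalf("ListWorkflow failed: %v", err)
		}
		for _, info := range resp.Executions {
			fmt.Printf("%s run=%s start=%v\n",
				info.GetExecution().GetWorkflowId(),
				info.GetExecution().GetRunId(),
				info.GetStartTime())
		}
		nextPageToken = resp.NextPageToken
		if len(nextPageToken) == 0 {
			break
		}
	}
}
```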
We saw spikes in the metrics below:
- persistence_error_with_type → serviceerror_Unavailable
- service_errors_resource_exhausted → resource_exhausted_cause="BusyWorkflow"
- persistence_latency_bucket → GetWorkflowExecution latency usually hovers around 500ms; it went up to 5s during the network problem, but after the network recovered it just stayed there and never came back down.
- cache_latency_bucket → HistoryCacheGetOrCreate went up from its normal ~450ms to ~1s and stayed there even after the network fix.
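For reference, a sketch of how we read these latencies back out of the histogram buckets (p99 shown); the Prometheus address, the `operation` label names, and the quantile are assumptions about our monitoring setup, not anything Temporal-specific:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Placeholder Prometheus address for our cluster.
	promClient, err := api.NewClient(api.Config{Address: "http://prometheus:9090"})
	if err != nil {
		log.Fatalf("error creating Prometheus client: %v", err)
	}
	promAPI := v1.NewAPI(promClient)

	queries := map[string]string{
		// p99 persistence latency for GetWorkflowExecution, derived from the buckets.
		"persistence p99": `histogram_quantile(0.99, sum(rate(persistence_latency_bucket{operation="GetWorkflowExecution"}[5m])) by (le))`,
		// p99 history cache acquisition latency.
		"cache p99": `histogram_quantile(0.99, sum(rate(cache_latency_bucket{operation="HistoryCacheGetOrCreate"}[5m])) by (le))`,
	}

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	for name, q := range queries {
		result, warnings, err := promAPI.Query(ctx, q, time.Now())
		if err != nil {
			log.Fatalf("query %q failed: %v", name, err)
		}
		if len(warnings) > 0 {
			log.Printf("warnings for %q: %v", name, warnings)
		}
		fmt.Printf("%s: %v\n", name, result)
	}
}
```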