Context:
We deploy Temporal in k8s. We had a sudden influx of traffic that caused more workflows than usual to be executed. For about 15 minutes we saw a high rate of packets dropped by our NAT gateway, but it recovered on its own. After all this, the Temporal cluster was left in a bad state: all the history pods started showing context deadline exceeded errors for all sorts of operations like GetVisibilityTasks, GetWorkflowExecution, etc.
Questions:
- After that brief network problem was resolved, why was Temporal not able to get back to a steady state?
- Even now, there are a bunch of workflows stuck in the Running state. Any insight into what might have caused this? (A sketch of how we are listing them follows below.)
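A minimal sketch of how the stuck executions can be enumerated with the Go SDK; the frontend address `temporal-frontend:7233` and the `default` namespace are placeholders for our setup, adjust as needed:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"go.temporal.io/api/workflowservice/v1"
	"go.temporal.io/sdk/client"
)

func main() {
	// Placeholder frontend address and namespace; adjust for your deployment.
	c, err := client.Dial(client.Options{
		HostPort:  "temporal-frontend:7233",
		Namespace: "default",
	})
	if err != nil {
		log.Fatalf("unable to create Temporal client: %v", err)
	}
	defer c.Close()

	ctx := context.Background()
	var nextPageToken []byte
	for {
		resp, err := c.ListWorkflow(ctx, &workflowservice.ListWorkflowExecutionsRequest{
			Namespace:     "default",
			PageSize:      100,
			NextPageToken: nextPageToken,
			// Visibility query for executions still reported as Running
			// (relies on the visibility store being reachable/healthy).
			Query: "ExecutionStatus = 'Running'",
		})
		if err != nil {
			log.Fatalf("ListWorkflow failed: %v", err)
		}
		for _, info := range resp.Executions {
			fmt.Printf("%s run=%s start=%v\n",
				info.GetExecution().GetWorkflowId(),
				info.GetExecution().GetRunId(),
				info.GetStartTime())
		}
		nextPageToken = resp.NextPageToken
		if len(nextPageToken) == 0 {
			break
		}
	}
}
```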
We saw spikes in the metrics below:
- persistence_error_with_type → serviceerror_Unavailable
- service_errors_resource_exhausted → resource_exhausted_cause="BusyWorkflow"
- persistence_latency_bucket → GetWorkflowExecution latency usually hovers around 500ms; it went up to 5s during the network problem, but after the network recovered it just stayed there and never came back down.
- cache_latency_bucket → HistoryCacheGetOrCreate went up from its normal ~450ms to ~1s and stayed there even after the network fix.
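For reference, a sketch of how we read these latencies back out of the histogram buckets (p99 shown); the Prometheus address, the `operation` label names, and the quantile are assumptions about our monitoring setup, not anything Temporal-specific:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Placeholder Prometheus address for our cluster.
	promClient, err := api.NewClient(api.Config{Address: "http://prometheus:9090"})
	if err != nil {
		log.Fatalf("error creating Prometheus client: %v", err)
	}
	promAPI := v1.NewAPI(promClient)

	queries := map[string]string{
		// p99 persistence latency for GetWorkflowExecution, derived from the buckets.
		"persistence p99": `histogram_quantile(0.99, sum(rate(persistence_latency_bucket{operation="GetWorkflowExecution"}[5m])) by (le))`,
		// p99 history cache acquisition latency.
		"cache p99": `histogram_quantile(0.99, sum(rate(cache_latency_bucket{operation="HistoryCacheGetOrCreate"}[5m])) by (le))`,
	}

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	for name, q := range queries {
		result, warnings, err := promAPI.Query(ctx, q, time.Now())
		if err != nil {
			log.Fatalf("query %q failed: %v", name, err)
		}
		if len(warnings) > 0 {
			log.Printf("warnings for %q: %v", name, warnings)
		}
		fmt.Printf("%s: %v\n", name, result)
	}
}
```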