Cross posting here since I think maybe the first post was in the wrong place.
We set up Temporal in a local Docker env backed by Postgres for our dev environment (read: lightly used). I have noticed that the error message in the title pops up fairly frequently when trying to view workflows. Googling suggests this is related to load, but this is a dev environment and there's not a whole lot going on. Neither a browser refresh nor a server restart (temporal, temporal ui) fixes it, and the server remains in that state.
How would the server get in this state, and more importantly how can I get it out of it?
This points to persistence QPS limits being reached. You can look at your persistence request rate via server metrics:
sum(rate(persistence_requests{}[1m]))
(You can also group by operation if you want to see which operations contribute most.)
Then check which service(s) report that QPS limits were reached:
sum by (resource_exhausted_cause,namespace,operation) (rate(service_errors_resource_exhausted{}[1m]))
(Look at this by service_name, with the resource exhausted type being SystemOverloaded.)
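If you would rather script the check than use the Prometheus UI, here is a minimal Python sketch that runs the persistence query grouped by operation through the standard Prometheus HTTP API and sorts the result. The Prometheus address and the helper names are assumptions for illustration, and your metric names may carry a prefix depending on how the server metrics are exported.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus scrapes the Temporal server and listens here.
PROMETHEUS_URL = "http://localhost:9090"

# Persistence request rate, broken down by operation.
QUERY = "sum by (operation) (rate(persistence_requests{}[1m]))"


def build_query_url(base_url: str, promql: str) -> str:
    """Return the Prometheus instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def fetch(base_url: str, promql: str) -> dict:
    """Run an instant query and return the decoded JSON response."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=10) as resp:
        return json.load(resp)


def rank_by_rate(response: dict, label: str) -> list[tuple[str, float]]:
    """Sort the series of an instant-query response by value, highest first."""
    rows = []
    for series in response["data"]["result"]:
        name = series["metric"].get(label, "<none>")
        # Each instant-vector sample is [timestamp, "value-as-string"].
        rows.append((name, float(series["value"][1])))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Calling rank_by_rate(fetch(PROMETHEUS_URL, QUERY), "operation") gives (operation, rate) pairs with the busiest operation first. The same helper works for the service_errors_resource_exhausted query if you pass a different label, for example "resource_exhausted_cause".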
Then, if your DB can handle the extra volume, try adjusting the relevant dynamic configs:
frontend.persistenceMaxQPS
matching.persistenceMaxQPS
history.persistenceMaxQPS
Per service type dynamic configs: history.persistenceGlobalMaxQPS
Per shard dynamic configs: history.persistencePerShardNamespaceMaxQPS
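For reference, entries in the file-based dynamic config are YAML lists of value/constraints pairs. Below is a small Python sketch that renders such entries so you can paste them into whichever dynamic config file your server loads. The helper name is hypothetical and the numbers are placeholders for illustration only, not recommended limits.

```python
def render_dynamic_config(limits: dict[str, int]) -> str:
    """Render dynamic config keys as YAML text, one unconstrained value each."""
    lines = []
    for key, value in limits.items():
        lines.append(f"{key}:")
        lines.append(f"  - value: {value}")
        # Empty constraints means the value applies everywhere.
        lines.append("    constraints: {}")
    return "\n".join(lines) + "\n"


# Placeholder values (assumption): pick numbers your database can sustain.
EXAMPLE_LIMITS = {
    "frontend.persistenceMaxQPS": 3000,
    "matching.persistenceMaxQPS": 4000,
    "history.persistenceMaxQPS": 10000,
}

print(render_dynamic_config(EXAMPLE_LIMITS), end="")
```

Raise one limit at a time and watch the two metrics queries after each change, so you can tell which limit was actually the bottleneck and whether Postgres keeps up.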