State transition builds up and suddenly drops

chillguy16 · February 4, 2025, 5:02am

We are doing capacity planning. We are finding our target state transition per second to estimate resources. However, we notice a weird pattern of state transition in the current setup. The Datadog metric we are reading from is temporal.server.state_transition.count

We have around 1000 workflows running per 30 seconds. In each workflow, 4 child workflows are running. After each workflow run, it will sleep for 30 seconds and start again with continue-as-new.

Do you know why the state transition graph builds up and suddenly drops like this?

tihomir · February 9, 2025, 2:46am

Can you compare graph with that of persistence request for create and update workflow exec (which is same thing really).
Grafana query would look like:

sum(rate(persistence_requests{operation=~"CreateWorkflowExecution|UpdateWorkflowExecution"}[1m]))

Would also try see if dips match possible service host restarts, service emits restarts metric

Topic		Replies	Views
Workflow state transition count doesn't match history Community Support python-sdk	2	183	June 13, 2024
How to get best Temporal Performance? Community Support general-impl , performance , metrics , best-practices , typescript-sdk	4	2319	January 24, 2024
Workflow Performance with Java SDK Community Support java-sdk	1	740	February 20, 2023
Understanding Temporal Grafana Metrics Community Support java-sdk , metrics	2	1583	August 23, 2021
Concurrent Configuration Community Support activity , typescript-sdk , workflow-config	0	37	January 16, 2025

State transition builds up and suddenly drops

Related topics