Workflow state transition count doesn't match history

I have a workflow execution that failed after one of its activities failed on a ScheduleToClose timeout of 3 days. There are about 15 activities in the workflow, each activity is listed with attempts:1 or attempts:2, however the workflow summary shows “State Transitions: 4536”. Generally we see about 90 state transitions when the workflow completes normally. The timeline view shows 3 days spent on the activity with the timeout failure, but only two attempts.

Each activity attempt should be quick, it’s polling for the existence of some files, and throwing an exception if they’re not there. A ScheduleToClose timeout would make sense with this activity if there were over 4000 attempts.

We do see many of our activity failures come from unregistering the heartbeater. We are using ProcessPoolExecutor to run sync Python tasks (and are considering moving to ThreadPoolExecutor to maybe avoid these heartbeat exceptions).

What might cause this discrepancy? I suppose maybe heartbeats could inflate the count?

Details:
Temporal Server version: 1.22.1
Temporal UI version: 2.21.0
Python SDK version: 1.6.0
Python version: 3.8
Sync tasks run with ProcessPoolExecutor

Does any of the activities heartbeat? Each heartbeat consumes a state transition as well.

Yes, they do. Successful workflow runs only have 80-90 state transitions, and we have some aggressive heartbeats that would easily push way past that. So it seemed like generally heartbeats might not have been counting against state transitions.

What I failed to consider is the max_heartbeat_throttle_interval defaults to 60 seconds. 3 day timeout * 24 hours * 60 minutes = 4320 heartbeats, which is close to what we see when we hit the 3 day ScheduleToClose timeout on the activity.

This is good news! Assuming throttled heartbeats don’t count towards state transitions, then counts are adding up. It’s a problem with our activity and not the platform.

Thank you so much for your reply and for making Temporal. There’s a lot to learn, but our systems are getting increasingly robust as we gain experience.

1 Like