I have a workflow execution that failed after one of its activities hit a ScheduleToClose timeout of 3 days. The workflow has about 15 activities, and each one is listed with attempts:1 or attempts:2, yet the workflow summary shows “State Transitions: 4536”. We generally see about 90 state transitions when the workflow completes normally. The timeline view shows 3 days spent on the activity that timed out, but only two attempts.
Each activity attempt should be quick: it polls for the existence of some files and throws an exception if they’re not there. A ScheduleToClose timeout would make sense for this activity if there had been over 4000 attempts.
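For a rough sense of scale, here is a minimal sketch of how many attempts could start inside a 3-day ScheduleToClose window. It assumes the activity runs with Temporal's default retry policy values (1 s initial interval, backoff coefficient 2.0, 100 s maximum interval), which I have not confirmed for this workflow. The function name `estimate_attempts` and its parameters are made up for illustration, and the model ignores server-side scheduling details.

```python
def estimate_attempts(window_s, initial_interval_s=1.0, backoff=2.0,
                      max_interval_s=100.0, attempt_duration_s=0.0):
    """Roughly count the attempts that start within a ScheduleToClose window.

    Assumption: retry waits grow by `backoff` each time, capped at
    `max_interval_s`. The defaults are assumed, not taken from the workflow.
    """
    elapsed = 0.0
    attempts = 0
    interval = initial_interval_s
    while elapsed < window_s:
        attempts += 1                  # this attempt starts inside the window
        elapsed += attempt_duration_s  # time spent running the attempt
        elapsed += interval            # backoff wait before the next attempt
        interval = min(interval * backoff, max_interval_s)
    return attempts


three_days_s = 3 * 24 * 60 * 60
print(estimate_attempts(three_days_s))
```

Under those assumptions the estimate lands in the low thousands, which is the same order of magnitude as the state transition count and far more than the two attempts shown in the timeline.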
We do see many of our activity failures come from unregistering the heartbeater. We are using ProcessPoolExecutor to run sync Python tasks (and are considering moving to ThreadPoolExecutor to maybe avoid these heartbeat exceptions).
What might cause this discrepancy? Could heartbeats be inflating the state transition count?
Details:
Temporal Server version: 1.22.1
Temporal UI version: 2.21.0
Python SDK version: 1.6.0
Python version: 3.8
Sync tasks run with ProcessPoolExecutor