Hey Temporal community,
I found an issue from 2010 that looks a lot like what I’m experiencing: Temporal activity timeout issue
Our activity configuration is minimal: mostly just a startToCloseTimeout of 5 minutes.
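For reference, a minimal sketch of that configuration, assuming the Temporal Java SDK. `ActivityConfigSketch` and `defaultOptions` are placeholder names for illustration, not our actual code:

```java
import io.temporal.activity.ActivityOptions;
import java.time.Duration;

// Hypothetical holder class; only the builder calls come from the Temporal Java SDK.
public final class ActivityConfigSketch {

    private ActivityConfigSketch() {}

    // Builds activity options with only a start-to-close timeout set,
    // mirroring the "mostly just startToCloseTimeout" setup described above.
    public static ActivityOptions defaultOptions() {
        return ActivityOptions.newBuilder()
                // Maximum time a single activity attempt may run before it times out.
                .setStartToCloseTimeout(Duration.ofMinutes(5))
                .build();
    }
}
```

With only this timeout set, an attempt whose result never reaches the server is not reported as failed until the full 5 minutes have elapsed.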
When we trigger a large spike of workflows, a number of them seem to hang on trivial activities that normally take <50ms (e.g. reading from or writing to a DB).
I can see the activity enter PENDING_ACTIVITY_STATE_STARTED, and then it eventually times out.
While a retry will ultimately succeed, I’m worried about scale long-term.
In some cases the change was committed to the DB; in others it was not.
Things I’ve checked so far:
- DB has no pending connections/throttling/timeouts
- No blocked threads (from jstack)
- Nothing notable in the Temporal debug logs (though maybe I’m missing something)
We run on an Istio service mesh, so I wonder (as was raised in that issue) whether we’re sending information about the activity over a closed gRPC channel, but I’m just speculating.
Any other ideas on how I can debug this?