Debugging Activity Timeouts

Hey Temporal community,

I found an issue from 2010 that is close to what I’m experiencing: “Temporal activity timeout issue”

Our activity configuration is mostly just a startToCloseTimeout of 5 minutes.

When we trigger a large spike of workflows, a number of them seem to hang on innocuous activities that normally take <50ms (e.g. reading from or writing to a DB).

The activity enters PENDING_ACTIVITY_STATE_STARTED and then eventually hits the timeout.

While a retry will ultimately succeed, I’m worried about scale long-term.

In some cases I’ve seen the change committed to the DB; in others it was not.

Things I’ve checked so far:

  • DB has no pending connections/throttling/timeouts
  • No blocked threads (from jstack)
  • Nothing notable in debug temporal logs (though maybe I’m missing something)
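
In case anyone wants to repeat the blocked-thread check from inside the worker process rather than with jstack, here is a minimal sketch using the JDK’s ThreadMXBean. The class and method names (BlockedThreadCheck, findBlockedThreads, findDeadlockedThreads) are made up for illustration and are not part of the Temporal SDK; it only reports threads that are BLOCKED at the instant it runs, so it would need to be called during a spike to be useful.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: an in-process approximation of scanning a jstack dump.
public class BlockedThreadCheck {

    // Returns one description per thread that is in the BLOCKED state right now.
    static List<String> findBlockedThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> blocked = new ArrayList<>();
        // false, false: skip locked-monitor and synchronizer details to keep the dump cheap.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                blocked.add(info.getThreadName()
                        + " waiting on " + info.getLockName()
                        + " held by " + info.getLockOwnerName());
            }
        }
        return blocked;
    }

    // Returns the IDs of deadlocked threads, or an empty array if there are none.
    static long[] findDeadlockedThreads() {
        long[] ids = ManagementFactory.getThreadMXBean().findDeadlockedThreads();
        return ids == null ? new long[0] : ids;
    }

    public static void main(String[] args) {
        List<String> blocked = findBlockedThreads();
        System.out.println("Blocked threads: " + blocked.size());
        for (String line : blocked) {
            System.out.println("  " + line);
        }
        System.out.println("Deadlocked threads: " + findDeadlockedThreads().length);
    }
}
```

Calling findBlockedThreads() on a timer while the spike is running and logging any non-empty result would catch short-lived contention that a single manual jstack could miss.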

We are on an Istio service mesh, so I wonder (as that issue suggested) whether we’re sending the activity result over a closed gRPC channel, but I’m spitballing.

Any other ideas on how I can debug this?