My team and I spent a fair amount of time debugging some stuck workflows with large histories. We were seeing workflow task deadlines exceeded, but no running activities. We're using the Datadog large payload codec for activity payloads over 128 KB, but we eventually discovered that concurrent activities were getting batched, and the total gRPC message size exceeded the 4 MB limit. We were able to work around it by adding sleeps to stagger batches of activity requests (rough sketch below).
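In case it's useful to anyone, here's roughly what the workaround looks like. This is just a sketch: the batch size, the "ProcessItem" activity name, and the 2-second stagger are made up for illustration, not our actual values.

```go
package sample

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// StaggeredBatchWorkflow starts activities in small batches with a
// workflow.Sleep between them, so the scheduled-activity commands (and
// their large payloads) don't all land in one oversized gRPC request.
func StaggeredBatchWorkflow(ctx workflow.Context, items []string) error {
	ao := workflow.ActivityOptions{
		StartToCloseTimeout: 5 * time.Minute,
	}
	ctx = workflow.WithActivityOptions(ctx, ao)

	const batchSize = 20 // illustrative only
	var futures []workflow.Future

	for i, item := range items {
		futures = append(futures, workflow.ExecuteActivity(ctx, "ProcessItem", item))

		// Sleeping ends the current workflow task, so the commands queued so
		// far go out in that task's completion instead of piling up.
		if (i+1)%batchSize == 0 {
			if err := workflow.Sleep(ctx, 2*time.Second); err != nil {
				return err
			}
		}
	}

	// Wait for all activities to finish.
	for _, f := range futures {
		if err := f.Get(ctx, nil); err != nil {
			return err
		}
	}
	return nil
}
```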
The problem then (besides the request batching) is that the gRPC error was completely swallowed by the retry interceptor, and we couldn't find it in the Temporal UI or in any logs. I wonder whether something changed here, because I can find a few posts from people who hit the same problem but were able to see it in the UI. What appears to be happening is that the message size error carries a ResourceExhausted status code, which the retry middleware treats as retryable. It holds onto the error until it's replaced by context deadline exceeded, which is what we see in the UI. We only found out what was actually going on after some tedious debugging.
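To make that failure visible, something like the following is what I have in mind: a logging unary interceptor chained into the Temporal client through ConnectionOptions.DialOptions. This is only a sketch; loggingUnaryInterceptor is my own name, and I haven't confirmed where a user-supplied interceptor lands relative to the SDK's internal retry interceptor, so it may only see the final DeadlineExceeded rather than each attempt's ResourceExhausted.

```go
package main

import (
	"context"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/status"

	"go.temporal.io/sdk/client"
)

// loggingUnaryInterceptor logs every gRPC error returned by the underlying
// call, including the status code, before any retry layer can replace it.
func loggingUnaryInterceptor(
	ctx context.Context,
	method string,
	req, reply interface{},
	cc *grpc.ClientConn,
	invoker grpc.UnaryInvoker,
	opts ...grpc.CallOption,
) error {
	err := invoker(ctx, method, req, reply, cc, opts...)
	if err != nil {
		if s, ok := status.FromError(err); ok {
			log.Printf("gRPC %s failed with code %s: %s", method, s.Code(), s.Message())
		} else {
			log.Printf("gRPC %s failed: %v", method, err)
		}
	}
	return err
}

func main() {
	c, err := client.Dial(client.Options{
		HostPort: client.DefaultHostPort,
		ConnectionOptions: client.ConnectionOptions{
			DialOptions: []grpc.DialOption{
				grpc.WithChainUnaryInterceptor(loggingUnaryInterceptor),
			},
		},
	})
	if err != nil {
		log.Fatalln("unable to create Temporal client", err)
	}
	defer c.Close()
}
```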
I guess my question is: is there a supported way to configure logging of retryable gRPC errors, short of hand-rolling an interceptor like the sketch above? I know Temporal is supposed to let us not worry about this layer, but clearly sometimes we need to. I see that there's a newer version of go-grpc-middleware whose retry interceptor takes an on-retry callback; maybe that's an option.
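If I'm reading the v2 API right, the callback would look something like this. I haven't verified the exact signatures (WithOnRetryCallback and its argument types are my assumption from the docs), and since the SDK ships its own retry interceptor, this would presumably need to be adopted by the SDK or layered on separately rather than dropped in as-is.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/grpc-ecosystem/go-grpc-middleware/v2/interceptors/retry"
	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
)

// newRetryInterceptor sketches a retry interceptor that logs every retried
// error, so a ResourceExhausted from an oversized message would show up in
// the logs instead of being silently retried until the deadline expires.
// The option names are assumed from go-grpc-middleware v2 and not verified.
func newRetryInterceptor() grpc.UnaryClientInterceptor {
	return retry.UnaryClientInterceptor(
		retry.WithMax(5),
		retry.WithBackoff(retry.BackoffLinear(500*time.Millisecond)),
		retry.WithCodes(codes.ResourceExhausted, codes.Unavailable),
		retry.WithOnRetryCallback(func(ctx context.Context, attempt uint, err error) {
			log.Printf("gRPC retry attempt %d after error: %v", attempt, err)
		}),
	)
}
```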