Low scheduled-to-start + activity execution but super high end-to-end latency

When I built out the charts for activity execution, I noticed scheduled-to-start + activity execution are pretty low but end-to-end latency is super high.

In the log, I can see a number of DEADLINE_EXCEEDED where workflow or activity fails to report result to the Temporal cluster.

2025-03-06 16:27:17.997  WARN 1 --- [Workflow Executor taskQueue="MarketingExecutionWorkflowTaskQueue", namespace="lifecycle-prod": 277] i.t.i.w.WorkflowWorker                   --- [Workflow Executor taskQueue="MarketingExecutionWorkflowTaskQueue", namespace="lifecycle-prod": 277raceId]: Failure while reporting workflow progress to the server. If seen continuously the workflow might be stuck. WorkflowId=marketing-workflow-135-75439915, RunId=c3dfcc4b-a91b-4891-b4d9-fa95a32a1ad2, startedEventId=129

io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 29.999953340s. [closed=[], open=[[remote_addr=temporalapi.okg.com/10.2.1.170:443]]]

I increased the RPC timeout in WorkflowServiceStubOptions to 30s previously which matches the number I saw in the log.

I’d be very grateful if you could tell me where the bottleneck might be on the Temporal cluster that can cause end-to-end latency to be so high.

Are you starting many activities in parallel from a single workflow execution? The activity completion requests to the same workflow ID compete for the DB lock.

Hi maxim. Thanks for your reply.

I attached a picture of my workflow history above. I don’t have any parallel activities in this particular workflow type.

At 10:40:37, my FindExecutablePath activity got scheduled. Although execution seems fast as start & end are 10:41:02.85 and 10:41:02.87, I actually got 4 consecurive DEADLINE_EXCEEDED error in my log. Note that I changed RPC timeout to 9s after creating this post.

2025-03-09 10:41:54.233  WARN 1 --- [Activity Executor taskQueue="MarketingExecutionalActivityTaskQueue", namespace="lifecycle-prod": 24034] i.t.i.w.ActivityWorker                   --- [Activity Executor taskQueue="MarketingExecutionalActivityTaskQueue", namespace="lifecycle-prod": 24034raceId]: Failure during reporting of activity result to the server. ActivityId = 6c873e7f-6fcb-314b-b15b-24b0863223db, ActivityType = FindExecutablePath, WorkflowId=marketing-workflow-133-75556914, WorkflowType=MarketingExecutionWorkflow, RunId=d1d31b44-f80c-4b41-8d1a-bd759f965428

io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 8.999947702s. [closed=[], open=[[remote_addr=temporalapi.okg.com/10.2.1.170:443]]]

I learned 2 other things today after studying the charts from last week.

  1. When I ran a load test with 100k workflows, they completed in 15mins. During this 15 mins, end-to-end latency dropped to near 0ms and then went up again afterward.

I think it means when the sample size is big enough, the latency is actually closer to 0ms. However, during normal period of low traffic, the RPC connection is quite bad.

  1. Previously, we used to use Local activity for FindExecutablePath as it’s a read from local cache, which is super fast. However, the workflow task kept getting timeout and got stuck. Since Local activity was totally unusable, I switched to Remote activity and my workflows can make some progress with occasional timeouts. I didn’t know why back then but now I guess it’s also because of this RPC connection problem.