WorkflowTaskTimedOut while running large number of activities

Hello,
We are seeing the following error when running workflows with a large number of activities.

We run the activities in batches using the iterator pattern: each run of the workflow works on 1000 activities and then continues as new. As we scale to around 30k activities, we start seeing this:

io.grpc.StatusRuntimeException: NOT_FOUND: Workflow task not found.

The workflow task is then placed on another worker host, and we see the same logs there too. What can we change to prevent this?

Could you check the worker log for more messages?

Try decreasing each iteration to less than 1k.

Hey!

The worker prints the first few logs in the workflow, and then after some time we see the “Failure while reporting workflow progress to the server” log.

I see this:

2024/02/12 06:08:05.643 WARN [WorkflowWorker] [Workflow Executor taskQueue="PWI_TASK_QUEUE", namespace="trusttools_mat_poc": 1] [mat-worker] [] Failure while reporting workflow progress to the server. If seen continuously the workflow might be stuck. WorkflowId=MAT_WF.PASSWORD_INVALIDATION.MASS_ACTION_BATCH.T-1707718022051.JOB_ID.128, RunId=da56afcb-6033-4eda-8f71-49d80921e70c, startedEventId=3
io.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED: grpc: received message larger than max (107019200 vs. 4194304)
        at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:271) ~[io.grpc.grpc-stub-1.52.1.jar:1.52.1]
        at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:252) ~[io.grpc.grpc-stub-1.52.1.jar:1.52.1]
        at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:165) ~[io.grpc.grpc-stub-1.52.1.jar:1.52.1]
        at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.respondWorkflowTaskCompleted(WorkflowServiceGrpc.java:3764) ~[io.temporal.temporal-serviceclient-1.18.1.jar:?]
        at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.lambda$sendTaskCompleted$0(WorkflowWorker.java:370) ~[io.temporal.temporal-sdk-1.18.1.jar:?]
        at io.temporal.internal.retryer.GrpcSyncRetryer.retry(GrpcSyncRetryer.java:69) ~[io.temporal.temporal-serviceclient-1.18.1.jar:?]
        at io.temporal.internal.retryer.GrpcRetryer.retryWithResult(GrpcRetryer.java:60) ~[io.temporal.temporal-serviceclient-1.18.1.jar:?]
        at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.sendTaskCompleted(WorkflowWorker.java:365) ~[io.temporal.temporal-sdk-1.18.1.jar:?]
        at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:260) ~[io.temporal.temporal-sdk-1.18.1.jar:?]
        at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:189) ~[io.temporal.temporal-sdk-1.18.1.jar:?]
        at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:103) ~[io.temporal.temporal-sdk-1.18.1.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:834) [?:?]

During another run I saw this:

2024/02/12 07:16:15.132 INFO [AbstractIteratorWorkflow] [workflow-method-MAT_WF.PASSWORD_INVALIDATION.MASS_ACTION_BATCH.T-1...-efc89d5b-cec4-4381-824b-c49ac00b4dc8] [mat-worker] [] [MAT_WF.PASSWORD_INVALIDATION.ORCHESTRATION.T-1707722171916.JOB_ID.129] processing batch
2024/02/12 07:16:15.133 INFO [AbstractIteratorWorkflow] [workflow-method-MAT_WF.PASSWORD_INVALIDATION.MASS_ACTION_BATCH.T-1...-efc89d5b-cec4-4381-824b-c49ac00b4dc8] [mat-worker] [] [MAT_WF.PASSWORD_INVALIDATION.ORCHESTRATION.T-1707722171916.JOB_ID.129] processing 1000 DTOs
2024/02/12 07:16:29.895 INFO [MetricsEventTrackerImpl] [pool-4-thread-1] [mat-worker] [] ServiceMetricsEvent: Event Count: 1 Sensor Count: 280 Event Interval: 60000 Creation Time: 51 Send Time: 0 Total Time: 51}
2024/02/12 07:17:18.469 WARN [WorkflowWorker] [Workflow Executor taskQueue="PWI_TASK_QUEUE", namespace="trusttools_mat_poc": 2] [mat-worker] [] Failure while reporting workflow progress to the server. If seen continuously the workflow might be stuck. WorkflowId=MAT_WF.PASSWORD_INVALIDATION.MASS_ACTION_BATCH.T-1707722174589.JOB_ID.129, RunId=efc89d5b-cec4-4381-824b-c49ac00b4dc8, startedEventId=3
io.grpc.StatusRuntimeException: NOT_FOUND: Workflow task not found.
	at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:271) ~[io.grpc.grpc-stub-1.52.1.jar:1.52.1]
	at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:252) ~[io.grpc.grpc-stub-1.52.1.jar:1.52.1]
	at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:165) ~[io.grpc.grpc-stub-1.52.1.jar:1.52.1]
	at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.respondWorkflowTaskCompleted(WorkflowServiceGrpc.java:3764) ~[io.temporal.temporal-serviceclient-1.18.1.jar:?]
	at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.lambda$sendTaskCompleted$0(WorkflowWorker.java:370) ~[io.temporal.temporal-sdk-1.18.1.jar:?]
	at io.temporal.internal.retryer.GrpcSyncRetryer.retry(GrpcSyncRetryer.java:69) ~[io.temporal.temporal-serviceclient-1.18.1.jar:?]
	at io.temporal.internal.retryer.GrpcRetryer.retryWithResult(GrpcRetryer.java:60) ~[io.temporal.temporal-serviceclient-1.18.1.jar:?]
	at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.sendTaskCompleted(WorkflowWorker.java:365) ~[io.temporal.temporal-sdk-1.18.1.jar:?]
	at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:260) ~[io.temporal.temporal-sdk-1.18.1.jar:?]
	at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:189) ~[io.temporal.temporal-sdk-1.18.1.jar:?]
	at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:103) ~[io.temporal.temporal-sdk-1.18.1.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]

After this error, the workflow task shifts to another worker, and we see the same sequence of logs there too.

Will try decreasing each iteration to see if that fixes the problem.

gRPC requests have a 4 MB limit by default. My guess is that your activities have large inputs: when you schedule 1k of them in parallel, all their inputs go into a single RespondWorkflowTaskCompleted request and together exceed that limit. Your first log shows exactly this — 107,019,200 bytes against the 4,194,304-byte limit, i.e. roughly 100 KB of input per activity across 1000 activities.

The solution is to schedule them in smaller sub-batches. For example, start 50 (or whatever number keeps the request under the limit), then call Workflow.sleep(Duration.ofMillis(100)) so the current workflow task completes and its commands are flushed to the server, and then iterate until all 1k are started.
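A minimal sketch of that sub-batching loop. The class and method names here are illustrative, not from the thread, and it is written as plain Java so it runs outside a worker; in a real workflow method, `startActivities` would kick off the activities asynchronously and you would add the `Workflow.sleep(...)` call shown in the comment between passes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative sub-batching helper (names are hypothetical, not from the thread).
public final class SubBatcher {

    // Split the full input list into chunks small enough that the commands
    // generated by one workflow task stay well under the 4 MB gRPC limit.
    static <T> List<List<T>> chunk(List<T> items, int size) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < items.size(); i += size) {
            out.add(new ArrayList<>(items.subList(i, Math.min(i + size, items.size()))));
        }
        return out;
    }

    // Drive the loop: start one sub-batch of activities per pass. In a real
    // workflow method, a Workflow.sleep(Duration.ofMillis(100)) after each
    // pass lets the current workflow task complete and flush its commands.
    static <T> void processInSubBatches(List<T> items, int size, Consumer<List<T>> startActivities) {
        for (List<T> subBatch : chunk(items, size)) {
            startActivities.accept(subBatch);
            // Workflow.sleep(Duration.ofMillis(100)); // in the real workflow
        }
    }
}
```

With 1000 items and a sub-batch size of 50, this yields 20 workflow tasks of 50 activity-schedule commands each instead of one task carrying all 1000.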