Problems emitting 10,000 activity tasks in a workflow

Greetings. I'm hoping for advice on a workflow that quickly emits 10,000 activity tasks. Each task represents one part of an S3 multipart upload (S3 allows up to 10,000 parts). I'm running Temporal with docker-compose and have assigned it 16 GB. The tasks are emitted by calling RespondWorkflowTaskCompleted, but the call to the server exceeds the timeout (the default of 10 seconds). I don't think the message size limit is being exceeded; there is nothing in the logs to indicate that.

Yes, I could recode this to use child workflows to reduce the activity count, but that just adds complexity.

My question is twofold. First, I tried increasing the RpcTimeout to 60 seconds, but it appears to make no difference; calls still time out at 10 seconds (yes, I checked in the debugger to ensure the value is being set in GrpcDeadlineInterceptor). Any ideas? (I still need to debug deeper.) Second, what configuration changes would be needed to speed up the server's acceptance of the message? What's reasonable: should it even accept 10,000 tasks in under 10 seconds? For example, would more Cassandra instances, partitions, or something else help? Is there any documentation on how to size the installation, what metrics to track, etc.?
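For reference, here is roughly where the client-side RPC timeout is set in the Java SDK (which the GrpcDeadlineInterceptor mention suggests you are using). This is a hedged sketch, not a verified fix for your case; exact builder and factory names vary between SDK versions:

```java
// Sketch only: assumes the Temporal Java SDK; method names may differ by version.
WorkflowServiceStubsOptions options =
    WorkflowServiceStubsOptions.newBuilder()
        .setRpcTimeout(Duration.ofSeconds(60)) // deadline applied to unary gRPC calls
        .build();
WorkflowServiceStubs service = WorkflowServiceStubs.newInstance(options);
WorkflowClient client = WorkflowClient.newInstance(service);
```

If the worker creates its own service stubs elsewhere, the timeout must be set on the stubs the worker actually uses, or the default will still apply.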

Many thanks,


Temporal scales horizontally on the number of workflow executions, but a single execution has limited scalability. Do you really need one activity for each S3 part? For this kind of use case we typically recommend a long-running activity that heartbeats back to the server. Please look at the split-merge sample, which is modeled along similar lines.

If that does not work for your use case then I recommend breaking it into multiple child workflows.
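To make the child-workflow option concrete, the 10,000 parts can be partitioned into contiguous batches, one child workflow per batch, so each child schedules a bounded number of activities. A minimal sketch of the partitioning (class and method names are illustrative, not Temporal API):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchPlanner {
    // Split `totalParts` S3 part numbers into contiguous batches of at most
    // `batchSize` parts, one batch per child workflow. Part numbers are
    // 1-based, matching S3 multipart upload numbering.
    public static List<int[]> planBatches(int totalParts, int batchSize) {
        List<int[]> batches = new ArrayList<>();
        for (int start = 1; start <= totalParts; start += batchSize) {
            int end = Math.min(start + batchSize - 1, totalParts);
            batches.add(new int[] {start, end}); // inclusive [start, end] range
        }
        return batches;
    }

    public static void main(String[] args) {
        List<int[]> batches = planBatches(10_000, 100);
        System.out.println(batches.size()); // prints 100
        int[] last = batches.get(batches.size() - 1);
        System.out.println(last[0] + "-" + last[1]); // prints 9901-10000
    }
}
```

With 100 children of 100 activities each, no single workflow execution ever schedules more than 100 activities, which keeps each workflow task result well under the limits you are hitting.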

Thanks for the reply. I realize that 10,000 activities is a stretch. Yes, I have considered the long-running activity solution, but it requires retry logic within the activity, and I was trying to punt that to Temporal. A variation would be to batch some tasks into one activity call and, if an error occurs, return the state/progress to the workflow and have it retry; that could get complex, though. So we'll probably start with the child workflow approach; I see no reason it wouldn't work.


You can schedule the activity with a retry policy and have the workflow retry any activity failures. If the long-running activity is heartbeating, you can also include progress in the heartbeat. This allows the activity to continue from its previous progress on retries. Please look at the retry activity sample.
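The resume-from-heartbeat pattern described above can be sketched without the SDK. Here the static field stands in for Temporal's heartbeat-details store (the server retains the last heartbeat payload and hands it to the next attempt); all names are illustrative, not Temporal API:

```java
import java.util.ArrayList;
import java.util.List;

public class HeartbeatResume {
    // Stand-in for Temporal's heartbeat-details store: holds the payload of
    // the last heartbeat, available to the next attempt of the activity.
    static Integer lastHeartbeat = null;

    static List<Integer> processed = new ArrayList<>();

    // Process parts [1..totalParts], heartbeating after each one. Fails at
    // `failAt` to simulate a transient error on the first attempt.
    static void uploadParts(int totalParts, int failAt) {
        int resumeFrom = (lastHeartbeat == null) ? 1 : lastHeartbeat + 1;
        for (int part = resumeFrom; part <= totalParts; part++) {
            if (part == failAt) {
                throw new RuntimeException("transient failure at part " + part);
            }
            processed.add(part); // the real S3 UploadPart call would go here
            lastHeartbeat = part; // real SDK: heartbeat(part) with progress details
        }
    }

    public static void main(String[] args) {
        try {
            uploadParts(10, 4); // first attempt fails at part 4
        } catch (RuntimeException e) {
            uploadParts(10, -1); // retry resumes from part 4, not part 1
        }
        System.out.println(processed); // parts 1..10, each uploaded exactly once
    }
}
```

The key point is that the retry attempt reads the last heartbeat details and skips already-completed parts, so Temporal's retry policy can own the retries while the activity stays idempotent.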
