Question about Temporal worker starvation + scalability

Hi! I have a question about Temporal worker scalability, feel free to point me to the docs if something already exists around this. I have a case where I need to fan out a large number of activities (thousands) of the same type, in a single workflow invokation, and I do so using the Java SDK, something like: Promise.allOf(promisesList.values).get() . I have a few questions:

  • Will this immediately enqueue all these tasks to the task queue?
  • If I only have a single worker, can the worker process these in parallel, or will they be sequential? It looks like DEFAULT_MAX_CONCURRENT_ACTIVITY_EXECUTION_SIZE is 200 by default for a worker, does it mean that the worker can process up to 200 of these tasks in parallel (assuming no other workload on the queue)?
  • When I look at the event history, it leads me to believe that the answer to the above is that “they will be sequential” - I don’t see, for example, 200 “ActivityTaskScheduled” back to back - they mainly seem to be scheduled → started → completed, then the worker moves to the next one. I have trouble reconciling that with the config of 200 max activities?
  • Say that the code above gets executed and ~1000 activities get queued, and assuming a different workflow comes right after and invokes a single activity, am I correct in assuming that this activity will get queued behind the 1000 others, and potentially starved, thus leading to a timeout?
  • Say, hypothetically speaking (ha!) we’re in this boat, would the recommendation be to horizontally scale by adding a second worker, or vertically scale and increase max concurrent activity if our worker still has spare cpu/memory? If the latter, then I’m having trouble reconciling how that number comes into play given I don’t see much parallelism in the event history of a workflow running on a single worker.

I’m sure I’m missing something - any guidance would be appreciated, thanks!

Hi,

Will this immediately enqueue all these tasks to the task queue?

Yes, all of these tasks will be enqueued immediately

If I only have a single worker, can the worker process these in parallel, or will they be sequential? It looks like DEFAULT_MAX_CONCURRENT_ACTIVITY_EXECUTION_SIZE is 200 by default for a worker, does it mean that the worker can process up to 200 of these tasks in parallel (assuming no other workload on the queue)?

In the Java SDK concurrent activities map to java threads which implies that you can run as many concurrent (somewhat parallel) activities as allowed via the worker options.
You could simulate this pretty easily via an activity that logs, sleeps and logs again.

When I look at the event history, it leads me to believe that the answer to the above is that “they will be sequential” - I don’t see, for example, 200 “ActivityTaskScheduled” back to back - they mainly seem to be scheduled → started → completed, then the worker moves to the next one. I have trouble reconciling that with the config of 200 max activities?

Did you follow this sample? https://github.com/temporalio/samples-java/blob/main/src/main/java/io/temporal/samples/hello/HelloParallelActivity.java

You should be able to schedule activities to run in parallel and that will be reflected in the workflow history.

Say that the code above gets executed and ~1000 activities get queued, and assuming a different workflow comes right after and invokes a single activity, am I correct in assuming that this activity will get queued behind the 1000 others, and potentially starved, thus leading to a timeout?

Activity queues don’t have any ordering guarantees but you can treat them as FIFO queues.
The 1001st activity in the queue will likely be the last to execute.
In terms of timeout, we generally recommend using startToCloseTimeout which doesn’t count the time the activity was waiting for in the queue.

Say, hypothetically speaking (ha!) we’re in this boat, would the recommendation be to horizontally scale by adding a second worker, or vertically scale and increase max concurrent activity if our worker still has spare cpu/memory? If the latter, then I’m having trouble reconciling how that number comes into play given I don’t see much parallelism in the event history of a workflow running on a single worker.

Temporal workers scale horizontally quite well for a use case like yours, I’d recommend starting with the default WorkerOptions and scaling the worker fleet.

Thank you for your answer! After digging in further, I think one of the issues is that the max number of polling threads is set to 5 by default.

This explains, I think, why for example if I see 200 activities scheduled simultaneously, I don’t see an equivalent 200 activities started simultaneously. It looks like our worker isn’t managing to pull them off the queue fast enough since it only has 5 threads and the activities complete fairly quickly.

I’ll try to increase the polling threads to see if we manage to churn through larger fan outs of fast tasks more quickly!

Take a look at this thread as well as answer in this one. Think it will provide further help.

Also, consider using local activities for very short tasks.