Maximum Concurrent Activities Within a Workflow

Greetings,

One of the teams at our company is working on a use case where they want to fan out a large number of activities within a single workflow in a scatter gather type fashion (sample code below)

What we observe however is that a single workflow is only able to execute about 30 activities per second no matter what worker\server knobs we turn despite resource utilization being seemingly low and 100% sync match rate (happy to share any other relevant metrics)

In the workflow event history - we see that all of the ActivityTaskScheduled events are all dispatched at the same time (presumably the SDK sends all of the commands in a single workflow task execution which is expected)

The ‘ActivityTaskStarted’ events, however, are staggered by about 10-30ms each

We tried increasing number of workers, maximum concurrent activities and pollers, and task queue partitions.

Our question is whether this is some sort of inherent limitation of the platform (i.e. all scheduled activities within a workflow must be delivered to workers sequentially) or whether there are knobs to tune we are not aware of (or if we tuned something incorrectly).
My understanding is that Temporal’s unit of concurrency is a history shard and that a given workflow always maps to the same history shard so I wonder if higher concurrency is just not possible within a single workflow.

We are currently on server version 1.21.6.
This is how we scatter-gather these activities in Python:

    @workflow.run
    async def execute(self) -> None:
        activities = [workflow.start_activity(dummy_fanout, start_to_close_timeout=timedelta(seconds=11))
            for _ in range(1000)
        ]
        await asyncio.gather(*activities)

It also looks like temporal.lock_latency.99percentile goes up to 3.3 (presumably milliseconds though not sure what unit the server emits. The code just takes ‘Duration’)

I tried searching slack and forum and did not find a definitive answer to this but let me know if I missed a post.
Thank you!

For posterity - we received a response on slack confirming that each activity execution must be recorded in the database as ‘ActivityTaskStarted’ which is done sequentially as it requires the workflow lock. This can be alleviated with local activities to an extent but they have other drawbacks. Otherwise in order to attain highest degrees of parallelization - work must be isolated by workflow run.