Workflows Remain Stuck on the Same Worker Pod After ContinueAsNew and Fail with StartToCloseTimeout

I am experiencing an issue with Temporal workflows. Here’s the scenario:

I have a parent workflow that spawns 25 child workflows in parallel. Each child workflow runs an activity and, after completing the activity, uses ContinueAsNew to create a new instance of itself. I have 4 worker pods processing these workflows, but I noticed that:

  1. After calling ContinueAsNew, the child workflow often stays bound to the same worker pod instead of being picked up by another pod.
  2. Many workflows get stuck at the activity execution stage, specifically at the ActivityTaskScheduled step, seemingly waiting for their turn, and eventually fail with a timeout (StartToCloseTimeout).

My expectation is that after ContinueAsNew, the new workflow instance would be distributed to any available worker pod. However, this doesn’t seem to be happening. It appears that Temporal is not balancing workflows effectively across the pods.

Here are my questions:

  1. Is it expected behavior for workflows to remain on the same worker pod after calling ContinueAsNew?
  2. How can I configure Temporal or the worker options to ensure better load distribution across worker pods, especially after ContinueAsNew?
  3. Could this behavior be related to Sticky Execution or task queue configuration, and how should I address it?

Parent Workflow

#[Workflow\WorkflowInterface]
class BatchProcessWorkflow
{
    use ActivityOptionTrait;

    public const WORKFLOW_NAME = 'BatchProcess';

    #[Workflow\WorkflowMethod(self::WORKFLOW_NAME)]
    public function run(): \Generator
    {
        $activity = Workflow::newActivityStub(
            FetchIdsActivity::class,
            $this->activityOptions(
                'default',
                timeout: CarbonInterval::year(),
                retryAttempts: 1
            ),
        );

        $ids = yield $activity->fetch();

        $childWorkflows = [];
        foreach ($ids as $id) {
            $childWorkflows[] = Workflow::executeChildWorkflow(
                SingleProcessWorkflow::WORKFLOW_NAME,
                [
                    $id,
                ],
                Workflow\ChildWorkflowOptions::new()
            );
        }

        yield Promise::all($childWorkflows);
    }
}

Child Workflow

#[Workflow\WorkflowInterface]
class SingleProcessWorkflow
{
    use ActivityOptionTrait;

    public const WORKFLOW_NAME = 'SingleProcess';

    #[Workflow\WorkflowMethod(self::WORKFLOW_NAME)]
    public function run(int $id): \Generator
    {
        $activity = Workflow::newActivityStub(
            ProcessActivity::class,
            $this->activityOptions(
                'default',
                timeout: CarbonInterval::minute(5),
                retryAttempts: 1,
            ),
        );

        $isProcessed = yield $activity->process($id);

        if ($isProcessed === false) {
            /** @psalm-suppress TooManyTemplateParams */
            yield Workflow::timer(CarbonInterval::hour());
        }

        yield Workflow::newContinueAsNewStub(self::class)->run($id);
    }
}

Hello. Could you provide the Workflow history before and after ContinueAsNew?

Is it expected behavior for workflows to remain on the same worker pod after calling ContinueAsNew?

No. A child workflow is a new execution, and so is each run started by ContinueAsNew. The first workflow task of a new execution is dispatched from the non-sticky task queue, which is not tied to any specific worker.

How can I configure Temporal or the worker options to ensure better load distribution across worker pods, especially after ContinueAsNew?

How many workers do you have, and what are your worker options? Task distribution from a non-sticky task queue is fairly even across workers, provided you have enough workflow task pollers to saturate the task queue partitions.
Are you scaling your workers?
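
For reference, poller counts and concurrency limits are set through the worker options. Here is a minimal sketch of a worker entry point, assuming the Temporal PHP SDK running under RoadRunner, and assuming that 'default' (the first argument passed to activityOptions() in the question) is the task queue name. The numeric values are placeholders for illustration, not recommendations.

```php
<?php

declare(strict_types=1);

use Temporal\Worker\WorkerOptions;
use Temporal\WorkerFactory;

require __DIR__ . '/vendor/autoload.php';

$factory = WorkerFactory::create();

// Placeholder values: tune them against your own load and pod resources.
$options = WorkerOptions::new()
    // Pollers that fetch workflow tasks from the task queue.
    ->withMaxConcurrentWorkflowTaskPollers(4)
    // Pollers that fetch activity tasks from the task queue.
    ->withMaxConcurrentActivityTaskPollers(4)
    // Upper bound on activities this worker executes at the same time.
    ->withMaxConcurrentActivityExecutionSize(10);

// 'default' is assumed to be the task queue used by the workflows above.
$worker = $factory->newWorker('default', $options);

$worker->registerWorkflowTypes(
    BatchProcessWorkflow::class,
    SingleProcessWorkflow::class,
);

// Register FetchIdsActivity and ProcessActivity here in the same way
// your existing worker script already does.

$factory->run();
```

Posting the values you currently use for these options, along with the number of activity processes per pod, would help narrow down why activities wait at ActivityTaskScheduled.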

Could this behavior be related to Sticky Execution or task queue configuration, and how should I address it?

Not in this case. The first workflow task of each child workflow's initial start, and of each continued execution, is not placed on a worker's sticky task queue. Once that first task has been dispatched to a worker, then yes: the service tries to dispatch subsequent workflow tasks for that execution to the sticky task queue of the same worker.