Scaling Temporal Batch Signal Operations to Lakhs of Workflows – Best Practices and Error Handling

Jignesh_Tailor · June 25, 2025, 12:23pm

Hi Temporal Community,

I’m currently using Temporal in a production scenario where, at any given point, lakhs (hundreds of thousands) of workflows are waiting for a signal to proceed, while new workflows are continuously being created.

To handle this, I’m using a StartBatchOperationRequest to signal all matching workflows in a single batch operation. Here’s a simplified version of my code:

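import io.temporal.api.batch.v1.BatchOperationSignal;
import io.temporal.api.workflowservice.v1.StartBatchOperationRequest;

// All workflows belonging to this batch share the same workflow-ID prefix.
String widPattern = WORKFLOW_ID + "-Batch-" + batchId;

StartBatchOperationRequest request =
    StartBatchOperationRequest.newBuilder()
        .setNamespace(client.getOptions().getNamespace())
        // jobId identifies this batch job and should be unique per batch operation.
        .setJobId(jobId)
        // Target only running GreetingWorkflow executions whose IDs start with the batch prefix.
        .setVisibilityQuery(
            "WorkflowType = 'GreetingWorkflow' AND ExecutionStatus = 'Running' AND WorkflowId STARTS_WITH '"
                + widPattern + "'")
        .setReason("signal to notify waiting requests")
        .setSignalOperation(
            BatchOperationSignal.newBuilder()
                .setSignal("waitForName")
                // .setInput(inputPayloads) // optional signal arguments
                .setIdentity(client.getOptions().getIdentity())
                .build())
        .build();

// Starts the batch job on the server; matching workflows are signaled asynchronously after this call returns.
client.getWorkflowServiceStubs().blockingStub().startBatchOperation(request);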

I have a few questions and concerns about using this approach at scale:

Scalability: How well does this scale in a production environment?
If I have 500,000 workflows waiting for a signal, will a single StartBatchOperationRequest handle them efficiently, or are there known limits and recommended patterns for scaling?
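One pattern sometimes used for very large target sets is to split them into several smaller batch jobs, so that each job has its own jobId and can be tracked or retried independently. The sketch below only builds the job IDs and visibility queries. It assumes, hypothetically, that workflow IDs embed a shard number after the batch prefix (for example `<widPattern>-shard-3-...`), which the snippet above does not do, and the `ShardedBatchPlanner` and `BatchJob` names are illustrative only. The server may also cap how many batch operations can run concurrently in a namespace, so the resulting requests may have to be submitted one after another.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds one (jobId, visibility query) pair per shard,
// assuming workflow IDs look like "<widPattern>-shard-<n>-...".
final class ShardedBatchPlanner {

    static final class BatchJob {
        final String jobId;
        final String visibilityQuery;

        BatchJob(String jobId, String visibilityQuery) {
            this.jobId = jobId;
            this.visibilityQuery = visibilityQuery;
        }
    }

    static List<BatchJob> plan(String baseJobId, String widPattern, int shardCount) {
        List<BatchJob> jobs = new ArrayList<>();
        for (int shard = 0; shard < shardCount; shard++) {
            // The trailing "-" keeps shard 1 from also matching shard 10, 11, ...
            String shardPrefix = widPattern + "-shard-" + shard + "-";
            String query = "WorkflowType = 'GreetingWorkflow' AND ExecutionStatus = 'Running'"
                    + " AND WorkflowId STARTS_WITH '" + shardPrefix + "'";
            // One unique jobId per shard so each job can be described or retried on its own.
            jobs.add(new BatchJob(baseJobId + "-" + shard, query));
        }
        return jobs;
    }
}
```

Each `BatchJob` would then feed `setJobId` and `setVisibilityQuery` of its own StartBatchOperationRequest, built the same way as in the snippet above.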
Worker Configuration: What would be the ideal number of workers in such a setup?
Should the number of workers scale linearly with the number of waiting workflows, or is there a better strategy (e.g., caching or load balancing)?
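A rough way to reason about worker count: a workflow that is blocked waiting for a signal does not occupy a workflow task slot on a worker, so what has to be sized for is the burst of workflow tasks the signals trigger. The back-of-the-envelope sketch below assumes the drain rate is limited by whichever is slower, the rate at which the batch job delivers signals or the number of workflow tasks per second the worker fleet can complete. Both rates are hypothetical inputs that would have to be measured for a given deployment; they are not Temporal defaults, and `DrainEstimate` is an illustrative name.

```java
// Hypothetical back-of-the-envelope model; all rates are inputs, not Temporal defaults.
final class DrainEstimate {

    /**
     * Seconds needed to process a signal burst, assuming throughput is bounded by
     * the slower of signal delivery and worker processing capacity.
     */
    static double secondsToDrain(long waitingWorkflows,
                                 double signalsPerSecond,
                                 int workers,
                                 double tasksPerSecondPerWorker) {
        if (signalsPerSecond <= 0 || workers <= 0 || tasksPerSecondPerWorker <= 0) {
            throw new IllegalArgumentException("rates and worker count must be positive");
        }
        double workerCapacity = workers * tasksPerSecondPerWorker;
        double effectiveRate = Math.min(signalsPerSecond, workerCapacity);
        return waitingWorkflows / effectiveRate;
    }

    /** Smallest worker count whose combined capacity keeps up with signal delivery. */
    static int workersToKeepUp(double signalsPerSecond, double tasksPerSecondPerWorker) {
        return (int) Math.ceil(signalsPerSecond / tasksPerSecondPerWorker);
    }
}
```

Under this model, adding workers beyond `workersToKeepUp` does not shorten the drain, because signal delivery is then the bottleneck; the number of waiting workflows changes how long the burst lasts, not how many workers are needed to keep up.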
Error Handling: What happens if an exception occurs during the batch signal operation?

For example, if I have 5,000 workflows and an error occurs while signaling the first one, will the operation continue for the remaining 4,999 workflows, or will it fail entirely?
Is the batch operation atomic, or can it be partially applied?
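Whatever the exact per-workflow failure semantics turn out to be, one defensive pattern is to make the signal handler idempotent, so that re-running the same batch (or re-signaling workflows that were missed) cannot make a workflow act on the signal twice. The sketch below is plain Java: the Temporal `@WorkflowInterface`/`@SignalMethod` annotations and `Workflow.await` are left out so it compiles on its own. The method name mirrors the `waitForName` signal above, and the `String` argument is a hypothetical payload corresponding to the commented-out `setInput` call.

```java
// Plain-Java model of a workflow's signal state. In a real workflow this class would
// carry @WorkflowInterface/@SignalMethod annotations, and the workflow method would
// block on something like Workflow.await(() -> name != null).
final class GreetingSignalState {
    private String name;          // set by the first accepted signal
    private int duplicateSignals; // signals ignored because one was already applied

    /** Applies only the first signal; later duplicates are counted and ignored. */
    void waitForName(String incomingName) {
        if (name != null) {
            duplicateSignals++;
            return;
        }
        name = incomingName;
    }

    boolean readyToProceed() {
        return name != null;
    }

    String name() {
        return name;
    }

    int duplicateSignals() {
        return duplicateSignals;
    }
}
```

With a handler like this, sending the batch signal again to the same set of workflows is harmless: workflows that already received it simply ignore the duplicate.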

Monitoring and Observability: Are there recommended tools or approaches to monitor and trace the progress or status of such batch operations?
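For monitoring, a batch job can be polled by its jobId through the DescribeBatchOperation RPC on the same service stub (a `DescribeBatchOperationRequest` carrying the namespace and jobId). Its response reports the job state together with total, completed and failed operation counts. The sketch below only shows bookkeeping on top of those three counters; `BatchProgress` is a hypothetical stand-in for the response so the code runs without a Temporal server, and the exact getter names on the generated response class should be checked against the API version in use.

```java
import java.util.Locale;

// Hypothetical stand-in for the counters reported by a DescribeBatchOperation response.
final class BatchProgress {
    final long total;
    final long completed;
    final long failed;

    BatchProgress(long total, long completed, long failed) {
        this.total = total;
        this.completed = completed;
        this.failed = failed;
    }

    /** Percentage of targeted workflows processed so far, successfully or not. */
    double percentProcessed() {
        return total == 0 ? 100.0 : 100.0 * (completed + failed) / total;
    }

    boolean hasFailures() {
        return failed > 0;
    }

    /** One-line status suitable for logs; Locale.ROOT keeps "." as the decimal separator. */
    String summary() {
        return String.format(Locale.ROOT, "%d/%d signaled, %d failed (%.1f%% processed)",
                completed, total, failed, percentProcessed());
    }
}
```

In practice the counters would be refreshed periodically until the job state is no longer running, and a non-zero failure count would be the trigger for a follow-up batch targeting the workflows that are still waiting.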

Any insights, architecture suggestions, or experience from others who have done something similar at scale would be really helpful.

Thanks in advance!

Related topics

- Batch signal workflows from Java SDK (Community Support; tags: java-sdk, signals, tctl): 2 replies, 838 views, March 29, 2023
- Batch Signals: How to send a signal to multiple running workflows? (Community Support): 3 replies, 1658 views, November 13, 2020
- Send signal in batch using typescript sdk (Community Support; tags: typescript-sdk): 6 replies, 770 views, May 25, 2023
- Signal Multiple Workflow Executions (Community Support; tags: java-sdk, signals): 2 replies, 1021 views, May 11, 2023
- Performance of SignalWithStartWorkflow() (Community Support): 12 replies, 949 views, December 3, 2020