Scaling Temporal Batch Signal Operations to Lakhs of Workflows – Best Practices and Error Handling

Hi Temporal Community,

I’m running Temporal in production where, at any given time, lakhs of workflows (hundreds of thousands) are waiting for a signal to proceed while new workflows are continuously being created.

To handle this, I’m using StartBatchOperationRequest to signal all matching workflows in a single batch operation. Here’s a simplified version of my code:

System.out.println("====== start ========");

String widPattern = WORKFLOW_ID + "-Batch-" + batchId;

StartBatchOperationRequest request =
    StartBatchOperationRequest.newBuilder()
        .setNamespace(client.getOptions().getNamespace())
        .setJobId(jobId)
        .setVisibilityQuery(
            "WorkflowType = 'GreetingWorkflow' and ExecutionStatus = 'Running' and WorkflowId STARTS_WITH '"
            + widPattern + "'")
        .setReason("signal to notify waiting requests")
        .setSignalOperation(
            BatchOperationSignal.newBuilder()
                .setSignal("waitForName")
                // .setInput(inputPayloads)
                .setIdentity(client.getOptions().getIdentity())
                .build())
        .build();

System.out.println("====== sending ========");

client.getWorkflowServiceStubs().blockingStub().startBatchOperation(request);
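
Since the visibility query above is built by string concatenation, here is a minimal sketch of how I could validate the inputs before building it. `BatchSignalQuery` is a hypothetical helper of my own (not part of the Temporal SDK), and the allowed-character pattern is an assumption about what my workflow IDs look like, not a Temporal rule.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Temporal SDK.
final class BatchSignalQuery {
    // Assumption: my workflow types and ID prefixes only use these characters.
    private static final Pattern SAFE = Pattern.compile("[A-Za-z0-9_.:-]+");

    private BatchSignalQuery() {}

    /** Builds the same visibility query as above after validating both inputs. */
    static String build(String workflowType, String workflowIdPrefix) {
        requireSafe("workflowType", workflowType);
        requireSafe("workflowIdPrefix", workflowIdPrefix);
        return "WorkflowType = '" + workflowType + "'"
            + " and ExecutionStatus = 'Running'"
            + " and WorkflowId STARTS_WITH '" + workflowIdPrefix + "'";
    }

    // Rejects null, empty, or values with quotes/whitespace that could alter the query.
    private static void requireSafe(String name, String value) {
        if (value == null || !SAFE.matcher(value).matches()) {
            throw new IllegalArgumentException(
                name + " contains unsupported characters: " + value);
        }
    }
}
```

The result would be passed to `setVisibilityQuery(...)` in place of the inline concatenation, e.g. `BatchSignalQuery.build("GreetingWorkflow", widPattern)`.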

I have a few questions and concerns about using this approach at scale:

  1. Scalability: How well does this scale in a production environment?
    If I have 500,000 workflows waiting for a signal, will a single StartBatchOperationRequest handle them efficiently, or are there known limits and recommended patterns for scaling?
  2. Worker Configuration: What would be the ideal number of workers in such a setup?
    Should the number of workers scale linearly with the number of workflows waiting, or is there a better strategy (e.g., caching, load balancing, etc.)?
  3. Error Handling: What happens if an exception occurs during the batch signal operation?
    • For example, if I have 5,000 workflows and an error occurs while signaling the first workflow, will the operation continue for the remaining 4,999 workflows, or will it fail entirely?
    • Is the batch operation atomic, or can it be partially applied?
  4. Monitoring and Observability: Are there recommended tools or approaches to monitor and trace the progress or status of such batch operations?
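
For the monitoring question, this is a rough sketch of the polling loop I have in mind. `BatchProgressPoller` and `Progress` are hypothetical stand-ins of my own. I am assuming the `Supplier` would wrap a `describeBatchOperation` call with the same job ID and copy its state and operation counts, but I have not confirmed that this is the recommended approach.

```java
import java.util.function.Supplier;

// Hypothetical types for illustration; not part of the Temporal SDK.
final class BatchProgressPoller {

    /** One snapshot of batch progress, as reported by whatever source is plugged in. */
    static final class Progress {
        final String state;   // assumed values: "RUNNING", "COMPLETED", "FAILED"
        final long total;
        final long complete;
        final long failed;

        Progress(String state, long total, long complete, long failed) {
            this.state = state;
            this.total = total;
            this.complete = complete;
            this.failed = failed;
        }

        boolean isRunning() {
            return "RUNNING".equals(state);
        }
    }

    private BatchProgressPoller() {}

    /**
     * Polls the source until the batch leaves the RUNNING state or maxPolls is
     * reached, logging each snapshot. Returns the last snapshot seen.
     */
    static Progress awaitCompletion(Supplier<Progress> source, long pollMillis, int maxPolls) {
        if (maxPolls < 1) {
            throw new IllegalArgumentException("maxPolls must be at least 1");
        }
        Progress last = null;
        for (int i = 0; i < maxPolls; i++) {
            last = source.get();
            System.out.printf("state=%s complete=%d/%d failed=%d%n",
                last.state, last.complete, last.total, last.failed);
            if (!last.isRunning()) {
                return last;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and stop polling early.
                Thread.currentThread().interrupt();
                return last;
            }
        }
        return last;
    }
}
```

If the final snapshot shows a non-zero `failed` count alongside a completed state, that would at least tell me the operation was partially applied rather than atomic.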

Any insights, architecture suggestions, or experience from others who have done something similar at scale would be really helpful.

Thanks in advance! :folded_hands: