Mass Workflow bursts cause occasional ContextDeadlineExceeded errors

Question

When I start 1,000+ Workflows in quick succession, I see roughly 0-4 ContextDeadlineExceeded failures per 1,000 workflow creation attempts. What can I tune to stop this from occurring? What are the best practices for launching huge bursts of workflows?

Background

There are a few cases where I want to create 10,000+ new workflows at once.

Attempted Remedies

I tried adding a 10-minute connection timeout to the WorkflowClient. It’s unclear whether it had any effect, since the ContextDeadlineExceeded errors still occur; I’ll probably bump it to 30 minutes just in case.

import { Connection, WorkflowClient } from "@temporalio/client";

// Establish the connection first, then build the client from it.
// Note: connectTimeout only covers establishing the connection, not individual RPC deadlines.
const connection = await Connection.connect({ connectTimeout: 600_000 }); // 10 minutes, in ms
const workflowDaddy = new WorkflowClient({ connection });

I also switched from starting them all in parallel to starting them one at a time. This added a lot of extra execution time but still didn’t stop the ContextDeadlineExceeded errors.

// Start workflows all at once
const promises = [];
for (const entry of manyThings) {
  promises.push(
    workflowDaddy.start(someWorkflow, {
      args: [entry],
      taskQueue: "workflows",
      workflowId: "[someWorkflow]~" + entry.id,
    })
  );
}
await Promise.allSettled(promises);

// Start workflows one at a time
for(entry of manyThings){
  await workflowDaddy.start(someWorkflow, {
    args: [entry],
    taskQueue: "workflows",
    workflowId:"[someWorkflow]~" + entry.id
  })
}
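
A middle ground I’m considering is capping how many start calls are in flight at once, rather than choosing between fully parallel and fully serial. This is just a sketch using the same placeholders as above (manyThings, someWorkflow); the chunk size is an arbitrary guess to be tuned against whatever the server’s rate limits turn out to be.

// Start workflows in fixed-size chunks so only chunkSize start calls are in flight at a time
const chunkSize = 50; // arbitrary; tune against the server's rps limits
for (let i = 0; i < manyThings.length; i += chunkSize) {
  const chunk = manyThings.slice(i, i + chunkSize);
  const results = await Promise.allSettled(
    chunk.map((entry) =>
      workflowDaddy.start(someWorkflow, {
        args: [entry],
        taskQueue: "workflows",
        workflowId: "[someWorkflow]~" + entry.id,
      })
    )
  );
  // Track failures so they can be retried later instead of failing the whole batch
  const failures = results.filter((r) => r.status === "rejected");
  if (failures.length > 0) {
    console.warn(`${failures.length}/${chunk.length} starts failed in this chunk`);
  }
}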

Example Error + Trace

ServiceError: Failed to start Workflow
    at WorkflowClient.rethrowGrpcError (/mnt/SSD/Sorcery/Pulsr/pulsr-temporal/node_modules/@temporalio/client/src/workflow-client.ts:606:13)
    at WorkflowClient._startWorkflowHandler (/mnt/SSD/Sorcery/Pulsr/pulsr-temporal/node_modules/@temporalio/client/src/workflow-client.ts:756:12)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async WorkflowClient.start (/mnt/SSD/Sorcery/Pulsr/pulsr-temporal/node_modules/@temporalio/client/src/workflow-client.ts:417:19)
    at async Activity.nftFinesseFlows [as fn] (/mnt/SSD/Sorcery/Pulsr/pulsr-temporal/src/temporal/shared-activities.ts:89:4)
    at async Activity.execute (/mnt/SSD/Sorcery/Pulsr/pulsr-temporal/node_modules/@temporalio/worker/src/activity.ts:64:12)
    at async ActivityInboundLogInterceptor.execute (/mnt/SSD/Sorcery/Pulsr/pulsr-temporal/node_modules/@temporalio/worker/src/activity-log-interceptor.ts:38:14)
    at async /mnt/SSD/Sorcery/Pulsr/pulsr-temporal/node_modules/@temporalio/worker/src/activity.ts:71:24
    at async /mnt/SSD/Sorcery/Pulsr/pulsr-temporal/node_modules/@temporalio/worker/src/worker.ts:906:26 {
  cause: Error: 4 DEADLINE_EXCEEDED: context deadline exceeded
      at Object.callErrorFromStatus (/mnt/SSD/Sorcery/Pulsr/pulsr-temporal/node_modules/@grpc/grpc-js/src/call.ts:81:17)
      at Object.onReceiveStatus (/mnt/SSD/Sorcery/Pulsr/pulsr-temporal/node_modules/@grpc/grpc-js/src/client.ts:352:36)
      at /mnt/SSD/Sorcery/Pulsr/pulsr-temporal/node_modules/@grpc/grpc-js/src/call-stream.ts:206:27
      at onReceiveStatus (/mnt/SSD/Sorcery/Pulsr/pulsr-temporal/node_modules/@temporalio/client/src/grpc-retry.ts:176:17)
      at Object.onReceiveStatus (/mnt/SSD/Sorcery/Pulsr/pulsr-temporal/node_modules/@temporalio/client/src/grpc-retry.ts:180:13)
      at InterceptingListenerImpl.onReceiveStatus (/mnt/SSD/Sorcery/Pulsr/pulsr-temporal/node_modules/@grpc/grpc-js/src/call-stream.ts:202:19)
      at Object.onReceiveStatus (/mnt/SSD/Sorcery/Pulsr/pulsr-temporal/node_modules/@grpc/grpc-js/src/client-interceptors.ts:462:34)
      at Object.onReceiveStatus (/mnt/SSD/Sorcery/Pulsr/pulsr-temporal/node_modules/@grpc/grpc-js/src/client-interceptors.ts:424:48)
      at /mnt/SSD/Sorcery/Pulsr/pulsr-temporal/node_modules/@grpc/grpc-js/src/call-stream.ts:330:24
      at processTicksAndRejections (node:internal/process/task_queues:78:11)
  for call at
      at ServiceClientImpl.makeUnaryRequest (/mnt/SSD/Sorcery/Pulsr/pulsr-temporal/node_modules/@grpc/grpc-js/src/client.ts:324:26)
      at Service.rpcImpl (/mnt/SSD/Sorcery/Pulsr/pulsr-temporal/node_modules/@temporalio/client/src/connection.ts:340:21)
      at Service.rpcCall (/mnt/SSD/Sorcery/Pulsr/pulsr-temporal/node_modules/protobufjs/src/rpc/service.js:94:21)
      at executor (/mnt/SSD/Sorcery/Pulsr/pulsr-temporal/node_modules/@protobufjs/aspromise/index.js:44:16)
      at new Promise (<anonymous>)
      at Object.asPromise (/mnt/SSD/Sorcery/Pulsr/pulsr-temporal/node_modules/@protobufjs/aspromise/index.js:28:12)
      at Service.rpcCall (/mnt/SSD/Sorcery/Pulsr/pulsr-temporal/node_modules/protobufjs/src/rpc/service.js:86:21)
      at Service.startWorkflowExecution (eval at Codegen (/mnt/SSD/Sorcery/Pulsr/pulsr-temporal/node_modules/@protobufjs/codegen/index.js:50:33), <anonymous>:4:15)
      at WorkflowClient._startWorkflowHandler (/mnt/SSD/Sorcery/Pulsr/pulsr-temporal/node_modules/@temporalio/client/src/workflow-client.ts:746:46)
      at runMicrotasks (<anonymous>) {
    code: 4,
    details: 'context deadline exceeded',
    metadata: Metadata { internalRepr: [Map], options: {} }
  }
}

This seems like a server issue. Which server are you using? Our docker-compose setup?
I’d check the server logs.

There are a few cases where I want to create 10,000+ new workflows at once.

Each client request is a gRPC call to your frontend service. The frontend has per-namespace RPS limits (frontend.namespaceRps in your dynamic config); see this post for configuration info.

Temporal also has RPS limits:
frontend.rps - frontend overall rps limit
history.rps - history rps limit
matching.rps - matching rps limit

as well as persistence QPS limits you can set (see the dynamic config sketch after this list):

frontend.persistenceMaxQPS - frontend persistence max qps
history.persistenceMaxQPS - history persistence max qps
matching.persistenceMaxQPS - matching persistence max qps
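
For reference, these knobs live in the server’s dynamic config file. A minimal sketch of what overriding them might look like, assuming a file-based dynamic config mount (the values below are purely illustrative, not recommendations):

# dynamicconfig/development.yaml (path depends on how your deployment mounts dynamic config)
frontend.rps:
  - value: 2400        # illustrative value only
    constraints: {}
frontend.namespaceRps:
  - value: 1200        # illustrative per-namespace limit
    constraints: {}
frontend.persistenceMaxQPS:
  - value: 2000        # illustrative value only
    constraints: {}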

Would check your frontend logs for messages like "namespace rps exceeded".

For server metrics check:

sum(rate(service_errors_resource_exhausted{}[1m])) by (resource_exhausted_cause)

You should also watch your persistence latencies (server metrics):

histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))

I’m using your default docker compose up setup (PostgreSQL and Elasticsearch) with no custom configs. The failures do seem to cause server issues (assuming I’m looking at the right log). Various errors I’ve witnessed after a ContextDeadlineExceeded include:
"Skipping duplicate ES request for visibility task key", "Fail to process task", "Persistent store operation failure", "transaction rollback error", "Persistent fetch operation Failure"
Some of those errors appear long afterwards, though, and are possibly unrelated. I’ll try to capture the log around the time a ContextDeadlineExceeded error occurs and post back here.

Feels reminiscent of [Bug] Maru Spike Test Corrupts AWS RDS Postgresql DB · Issue #3131 · temporalio/temporal · GitHub

The default docker compose (I assume you use the auto-setup image) is not suited to any kind of load testing, as it starts all Temporal services in the same container and process (same as temporalite).

If you have to load test on Docker, I would recommend a service-per-container setup (see here to get an idea); at least then you would be able to scale individual services if needed during your testing.

I’m also not sure you should test with persistence set up via docker compose; you might want to run load tests against some sort of production DB setup as well.

The default docker compose (I assume you use the auto-setup image) is not suited to any kind of load testing, as it starts all Temporal services in the same container and process (same as temporalite).

This makes a lot of sense. I switched my local deploy to my-temporal-dockercompose (w/o auto setup), but I still get ContextDeadlineExceeded errors when workflow creation throughput is high.

Can I increase the number of instances of some services to achieve higher workflow creation throughput, and/or is there a better way to increase throughput?


EDIT: As it stands, if I want to start 10,000 workflows it takes 30+ minutes to do them one at a time (and only a few fail). If I try to start all 10,000 workflows in parallel, then only about 900 succeed and 9,100 fail with ContextDeadlineExceeded.

Is this an expected performance limitation of the default config from my-temporal-dockercompose (w/o auto setup)? I just want to know whether it’s even worth debugging this locally, or whether I should deploy a real cluster and expect these issues to clear up.

Yes, this is a limitation of the local cluster’s configuration. A real cluster should be able to handle almost any load if configured correctly.

Okay, so I’ve looked at:
The Docker-Compose K8s
The Helm-Charts
The Production Deployment Docs
The Kubernetes Self-Deploy Article

I get the sense that neither the helm-charts nor the k8s manifests provide a production-ready cluster config?


This is what I’d plan for a production cluster:

  1. New GKE cluster
  2. Install Cassandra to the cluster (w/persistence + auto-scale)
    a. Some type of setup so the DB is ready to be used by temporal
  3. Install dedicated pods for the four temporal services (Frontend/Matching/History/Worker)
    a. Some type of horizontal auto-scale policy for these.
  4. Ensure Cassandra and Temporal services can all connect with one another
  5. Figure out ingress/egress and maybe one last pod to host Web-UIs

Does the above sound right? Or can I just chuck the k8s manifests onto a cluster in the cloud and be somewhat set?

kubectl create namespace temporal
kubectl apply -n temporal -R -f https://raw.githubusercontent.com/temporalio/docker-compose/main/k8s/temporal-cass.yaml
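
For completeness, the Helm route I was weighing looks roughly like this, based on my reading of the temporalio/helm-charts README; the --set values are illustrative and not production settings:

git clone https://github.com/temporalio/helm-charts.git
cd helm-charts
helm dependency update
# Minimal install; bump replica counts and cluster size for anything production-like
helm install temporaltest . \
  --set server.replicaCount=1 \
  --set cassandra.config.cluster_size=1 \
  --timeout 15m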

The above sounds right. Another option is to use Temporal Cloud.

Thanks @maxim,

I plan to try out a proper cluster on GKE this week. I’d also love to try out Temporal Cloud and have submitted a cloud-access request.


As an aside, re: docker-compose configs: I had much better luck with the Cassandra-only config (docker-compose -f docker-compose-cass.yml up) than with the default Postgres/Elasticsearch flavor. The Cassandra-only setup let me start 10,000+ workflows without any of the errors or system lockups I experienced with the default setup.