Starting 100K workflows at the same time caused this

My teammate tried to start 100K workflows in a for loop (roughly like the sketch below), and all of the workflows became zombies like in the picture: they neither proceeded nor timed out. Checking the server, we found errors like the one below. Any ideas what's going on, and how do we make sure high load won't leave workflows in a zombie state?
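
A minimal sketch of the kind of start loop described above, assuming the current Go SDK (go.temporal.io/sdk); the workflow name, task queue, and workflow IDs are placeholders:

    package main

    import (
        "context"
        "fmt"
        "log"

        "go.temporal.io/sdk/client"
    )

    func main() {
        // Dial the Temporal frontend (localhost:7233 by default).
        c, err := client.Dial(client.Options{})
        if err != nil {
            log.Fatalln("unable to create Temporal client:", err)
        }
        defer c.Close()

        // Fire off 100K workflow starts back to back, with no throttling.
        for i := 0; i < 100000; i++ {
            _, err := c.ExecuteWorkflow(context.Background(), client.StartWorkflowOptions{
                ID:        fmt.Sprintf("load-test-%d", i),
                TaskQueue: "load-test-queue",
            }, "LoadTestWorkflow")
            if err != nil {
                log.Println("start failed:", err)
            }
        }
    }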
Zombie:


Error:

 insertId: "9wjul5iv6bvxr2vjg"  
     jsonPayload: {
      error: "context deadline exceeded"   
      level: "error"   
      logging-call-at: "workflowHandler.go:3383"   
      msg: "Unknown error"   
      service: "frontend"   
      stacktrace: "github.com/temporalio/temporal/common/log/loggerimpl.(*loggerImpl).Error
    	/temporal/common/log/loggerimpl/logger.go:138
    github.com/temporalio/temporal/service/frontend.(*WorkflowHandler).error
    	/temporal/service/frontend/workflowHandler.go:3383
    github.com/temporalio/temporal/service/frontend.(*WorkflowHandler).StartWorkflowExecution
    	/temporal/service/frontend/workflowHandler.go:494
    github.com/temporalio/temporal/service/frontend.(*DCRedirectionHandlerImpl).StartWorkflowExecution.func2
    	/temporal/service/frontend/dcRedirectionHandler.go:1114
    github.com/temporalio/temporal/service/frontend.(*NoopRedirectionPolicy).WithNamespaceRedirect
    	/temporal/service/frontend/dcRedirectionPolicy.go:116
    github.com/temporalio/temporal/service/frontend.(*DCRedirectionHandlerImpl).StartWorkflowExecution
    	/temporal/service/frontend/dcRedirectionHandler.go:1110
    github.com/temporalio/temporal/service/frontend.(*AccessControlledWorkflowHandler).StartWorkflowExecution
    	/temporal/service/frontend/accessControlledHandler.go:702
    github.com/temporalio/temporal/service/frontend.(*WorkflowNilCheckHandler).StartWorkflowExecution
    	/temporal/service/frontend/workflowNilCheckHandler.go:112
    go.temporal.io/temporal-proto/workflowservice._WorkflowService_StartWorkflowExecution_Handler.func1
    	/go/pkg/mod/go.temporal.io/temporal-proto@v0.23.1/workflowservice/service.pb.go:1015
    github.com/temporalio/temporal/service/frontend.interceptor
    	/temporal/service/frontend/service.go:316
    go.temporal.io/temporal-proto/workflowservice._WorkflowService_StartWorkflowExecution_Handler
    	/go/pkg/mod/go.temporal.io/temporal-proto@v0.23.1/workflowservice/service.pb.go:1017
    google.golang.org/grpc.(*Server).processUnaryRPC
    	/go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:1082
    google.golang.org/grpc.(*Server).handleStream
    	/go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:1405
    google.golang.org/grpc.(*Server).serveStreams.func1.1
    	/go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:746"   
      ts: "2020-07-09T01:14:14.567Z"

It looks like you are using a local Mac instance. It is not provisioned or configured for load testing, only for development.

@maxim The Temporal server is hosted on GKE with multiple pods, and the client/worker runs on a local Mac. So are you saying this is a client-side constraint? But shouldn't the server time out the workflows, since they were already scheduled by the server?


I believe it is the client, as the workflow tasks are scheduled but not picked up by a worker. If you run a worker on the same machine that starts the workflows, it just might not have enough capacity to execute tasks for that many simultaneously started workflows. Also, for larger-scale testing, a task queue with a single partition may not be enough, so you have to configure multiple partitions for it through the dynamic config (see the sketch below).
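
A sketch of what those dynamic config entries could look like, assuming a recent Temporal server release (older releases used "tasklist" rather than "taskqueue" in the key names); the partition count and task queue name are placeholders:

    # Increase the read/write partitions for a specific task queue via dynamic config.
    # Key names assume a recent Temporal server; "load-test-queue" is a placeholder.
    matching.numTaskqueueReadPartitions:
      - value: 8
        constraints:
          taskQueueName: "load-test-queue"
    matching.numTaskqueueWritePartitions:
      - value: 8
        constraints:
          taskQueueName: "load-test-queue"

Read and write partition counts are normally kept equal, so that pollers read from every partition that tasks are written to.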

But shouldn't the server time out the workflows, since they were already scheduled by the server?

The configured timeout for your workflows is 87600 hours, so you have to wait that long for them to time out.
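
If that timeout is coming from the start options, setting a shorter execution timeout would let stuck runs time out much sooner. A hedged sketch, assuming the current Go SDK; the helper name, workflow ID, queue, and timeout value are hypothetical:

    package loadtest

    import (
        "context"
        "time"

        "go.temporal.io/sdk/client"
    )

    // startWithShorterTimeout is a hypothetical helper: it starts one workflow with a
    // 24h execution timeout instead of relying on a very long value like 87600h.
    func startWithShorterTimeout(c client.Client) error {
        _, err := c.ExecuteWorkflow(context.Background(), client.StartWorkflowOptions{
            ID:                       "load-test-1",     // placeholder
            TaskQueue:                "load-test-queue", // placeholder
            WorkflowExecutionTimeout: 24 * time.Hour,
        }, "LoadTestWorkflow")
        return err
    }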


Got it. Last time we chatted about how to pass dynamic config values through the kube YAML file; I can create another topic on that for everyone's sake.