Starting 100K workflows at the same time caused this

My teammate tried to start 100K workflows in a for loop (roughly like the sketch below), and all of the workflows became zombies like in the picture: they neither proceeded nor timed out. Checking the server, we found errors like the one below. Any ideas what's going on, and how do we make sure high load won't leave workflows in a zombie state?
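
A minimal sketch of the kind of start loop described above, assuming the current Go SDK (go.temporal.io/sdk); the workflow name, task queue, and workflow IDs are placeholders:

    package main

    import (
        "context"
        "fmt"
        "log"

        "go.temporal.io/sdk/client"
    )

    func main() {
        // Dial the Temporal frontend (localhost:7233 by default).
        c, err := client.Dial(client.Options{})
        if err != nil {
            log.Fatalln("unable to create Temporal client:", err)
        }
        defer c.Close()

        // Fire off 100K workflow starts back to back, with no throttling.
        for i := 0; i < 100000; i++ {
            _, err := c.ExecuteWorkflow(context.Background(), client.StartWorkflowOptions{
                ID:        fmt.Sprintf("load-test-%d", i),
                TaskQueue: "load-test-queue",
            }, "LoadTestWorkflow")
            if err != nil {
                log.Println("start failed:", err)
            }
        }
    }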
Zombie:


Error:

 insertId: "9wjul5iv6bvxr2vjg"  
     jsonPayload: {
      error: "context deadline exceeded"   
      level: "error"   
      logging-call-at: "workflowHandler.go:3383"   
      msg: "Unknown error"   
      service: "frontend"   
      stacktrace: "github.com/temporalio/temporal/common/log/loggerimpl.(*loggerImpl).Error
    	/temporal/common/log/loggerimpl/logger.go:138
    github.com/temporalio/temporal/service/frontend.(*WorkflowHandler).error
    	/temporal/service/frontend/workflowHandler.go:3383
    github.com/temporalio/temporal/service/frontend.(*WorkflowHandler).StartWorkflowExecution
    	/temporal/service/frontend/workflowHandler.go:494
    github.com/temporalio/temporal/service/frontend.(*DCRedirectionHandlerImpl).StartWorkflowExecution.func2
    	/temporal/service/frontend/dcRedirectionHandler.go:1114
    github.com/temporalio/temporal/service/frontend.(*NoopRedirectionPolicy).WithNamespaceRedirect
    	/temporal/service/frontend/dcRedirectionPolicy.go:116
    github.com/temporalio/temporal/service/frontend.(*DCRedirectionHandlerImpl).StartWorkflowExecution
    	/temporal/service/frontend/dcRedirectionHandler.go:1110
    github.com/temporalio/temporal/service/frontend.(*AccessControlledWorkflowHandler).StartWorkflowExecution
    	/temporal/service/frontend/accessControlledHandler.go:702
    github.com/temporalio/temporal/service/frontend.(*WorkflowNilCheckHandler).StartWorkflowExecution
    	/temporal/service/frontend/workflowNilCheckHandler.go:112
    go.temporal.io/temporal-proto/workflowservice._WorkflowService_StartWorkflowExecution_Handler.func1
    	/go/pkg/mod/go.temporal.io/temporal-proto@v0.23.1/workflowservice/service.pb.go:1015
    github.com/temporalio/temporal/service/frontend.interceptor
    	/temporal/service/frontend/service.go:316
    go.temporal.io/temporal-proto/workflowservice._WorkflowService_StartWorkflowExecution_Handler
    	/go/pkg/mod/go.temporal.io/temporal-proto@v0.23.1/workflowservice/service.pb.go:1017
    google.golang.org/grpc.(*Server).processUnaryRPC
    	/go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:1082
    google.golang.org/grpc.(*Server).handleStream
    	/go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:1405
    google.golang.org/grpc.(*Server).serveStreams.func1.1
    	/go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:746"   
      ts: "2020-07-09T01:14:14.567Z"

It looks like you are using a local Mac instance. It is not provisioned or configured for load testing, only for development.

@maxim The Temporal server is hosted on GKE with multiple pods, and the client/worker runs on a local Mac. So are you saying this is a client-side constraint? But shouldn't the server time out the workflows, since they were already scheduled by the server?


I believe it is the client, as the workflow tasks are scheduled but not picked up by a worker. If you run a worker on the same machine that starts the workflows, it just might not have enough capacity to execute tasks for that many simultaneously started workflows. Also, for larger-scale testing, a task queue with a single partition may not be enough, so you have to configure multiple partitions for it through the dynamic config (see the sketch below).
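
A sketch of what those dynamic config entries could look like, assuming a recent Temporal server release (older releases used "tasklist" rather than "taskqueue" in the key names); the partition count and task queue name are placeholders:

    # Increase the read/write partitions for a specific task queue via dynamic config.
    # Key names assume a recent Temporal server; "load-test-queue" is a placeholder.
    matching.numTaskqueueReadPartitions:
      - value: 8
        constraints:
          taskQueueName: "load-test-queue"
    matching.numTaskqueueWritePartitions:
      - value: 8
        constraints:
          taskQueueName: "load-test-queue"

Read and write partition counts are normally kept equal, so that pollers read from every partition that tasks are written to.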

But shouldn't the server time out the workflows, since they were already scheduled by the server?

The configured timeout for your workflows is 87600 hours, so you have to wait that long for them to time out.
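
If that timeout is coming from the start options, setting a shorter execution timeout would let stuck runs time out much sooner. A hedged sketch, assuming the current Go SDK; the helper name, workflow ID, queue, and timeout value are hypothetical:

    package loadtest

    import (
        "context"
        "time"

        "go.temporal.io/sdk/client"
    )

    // startWithShorterTimeout is a hypothetical helper: it starts one workflow with a
    // 24h execution timeout instead of relying on a very long value like 87600h.
    func startWithShorterTimeout(c client.Client) error {
        _, err := c.ExecuteWorkflow(context.Background(), client.StartWorkflowOptions{
            ID:                       "load-test-1",     // placeholder
            TaskQueue:                "load-test-queue", // placeholder
            WorkflowExecutionTimeout: 24 * time.Hour,
        }, "LoadTestWorkflow")
        return err
    }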


Got it. Last time we chatted about how to pass dynamic config values through the kube YAML file; I can create another topic on that for everyone's sake.