Context deadline exceeded issue

Hi Team,
I am seeing a context deadline exceeded error while writing the message to the queue (while scheduling the workflow). Can someone suggest a solution to resolve this issue?
Any help is appreciated.

Thanks,
Mohit

Hi @Mohit_Sharma, can you provide more info please?
Full error log and a detailed description of what client code you are running would help.

```go
func initTemporalStarter() (client.Client, error) {
	logger.BootstrapLogger.Debug("Entering initTemporalStarter...")
	// Create the client object just once per process
	opts := client.Options{HostPort: temporalHostPort, Namespace: temporalNameSpace}
	c, err := client.NewClient(opts)
	if err != nil {
		return nil, fmt.Errorf("error while creating the temporal client: %v", err)
	}
	return c, nil
}
```


The above is the client connection code.
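Since the comment says the client should be created just once per process, one common way to enforce that in Go is `sync.Once`. This is a general sketch only; `newClient` here is a hypothetical stand-in for `client.NewClient(opts)`, not the poster's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// newClient is a hypothetical stand-in for client.NewClient(opts).
func newClient() (string, error) {
	return "temporal-client", nil
}

var (
	once      sync.Once
	shared    string
	sharedErr error
)

// getClient returns the same client instance to every caller in the process;
// newClient runs at most once no matter how many goroutines call this.
func getClient() (string, error) {
	once.Do(func() {
		shared, sharedErr = newClient()
	})
	return shared, sharedErr
}

func main() {
	c1, _ := getClient()
	c2, _ := getClient()
	fmt.Println(c1 == c2) // both callers see the same instance
}
```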



```go
options := client.StartWorkflowOptions{
	ID:        schedule.ID,
	TaskQueue: worker.LIVETaskQueue,
}
_, err := s.Client.ExecuteWorkflow(context.Background(), options, "CreateWorkflow", schedule)
if err != nil {
	return fmt.Errorf("unable to complete Workflow: %v", err)
}
```

This is the workflow scheduling code.

```
time="2022-07-11 16:43:30" level=debug msg="Entering handler.buildGetSchemaFailureRespBody() ...map[X-Tracking-Id:fc0c5b69-8324-40be-960d-ee2094070499]"
```

This is the app log, and the framework log is:
```Workflow:context deadline exceeded```

Thanks for the info. Can you check if your Temporal frontend service is up?

```
tctl --ad <temporal-frontend-address:port> cl h
```

Does this return "Serving"? Use the same value for <temporal-frontend-address:port> as you set in code (re: HostPort: temporalHostPort).

```
bash-5.1# tctl --ad 10.119.240.10:7233 cl h
temporal.api.workflowservice.v1.WorkflowService: SERVING
```

Yes, it is returning SERVING.

Thanks. Would you be able to provide the whole error log?
Are you able to start your workflow execution via tctl, for example:

```
tctl --ad <temporal-frontend-address:port> --namespace <namespace> wf start --tq <taskqueue> --wt CreateWorkflow
```

What is the "schedule" argument you are passing as the input arg?

Can you check your Temporal service logs to see if anything stands out? These context deadline exceeded errors do need some debugging and are not always easy to figure out.

```
bash-5.1# tctl --ad 10.119.240.10:7233 --namespace live-ctrl-svc wf start --tq LIVE_TASK_QUEUE --wt CreateWorkflow
Error: Failed to create workflow.
Error Details: context deadline exceeded
Stack trace:
goroutine 1 [running]:
runtime/debug.Stack()
        /usr/local/go/src/runtime/debug/stack.go:24 +0x65
runtime/debug.PrintStack()
        /usr/local/go/src/runtime/debug/stack.go:16 +0x19
go.temporal.io/server/tools/cli.printError({0x1dbb50b, 0x1a}, {0x20dbf20, 0xc0000ac120})
```

@tihomir Can you please have a look?

Context deadline exceeded errors can happen for a number of reasons, for example network issues, config issues, services being down, db issues, etc., so we would need more info please:

How are you deploying temporal server (docker compose, helm charts, some other way)?
Do you see any logs reported by either temporal frontend, history, matching services. Do you see any logs reported by your db?
Do you have a load balancer in front of your temporal frontend service?
Can you access the web ui? If you can does it show any errors?

Any other info that you could share that you think would be important? Were you able to start workflows before or is this a fresh cluster install?

We are still getting the context deadline exceeded error while triggering a new workflow.

We are deploying the Temporal server with Helm.

We do have working setups in other clusters with the following versions.

Azure (old, not upgraded, but working without any issues):
Temporal version :: 1.8.1
Kubernetes version :: v1.20.13

GCP (Working)
Temporal version :: 1.14
Kubernetes version :: v1.22.10-gke.600

GCP (Not Working) Recently created the cluster
Temporal version :: 1.16.2
Kubernetes version :: v1.22.9-gke.1500

Our Go version was 1.12; we updated the code to 1.16 and the Temporal SDK to 1.16.0. Even after that we are having the same issue.

I have also attached the history/frontend pod logs for your reference in Google Drive:
Temporal - Google Drive

Please go through the logs and let us know how to mitigate the issue.

That's a lot of logs to go through; typically you would go through your logs yourself and point out the errors :slight_smile:
I looked at your history logs briefly, and errors like:

```
error.","error":"GetVisibilityTasks operation failed. Select failed.
error.","error":"UpdateShard failed. Failed to start transaction.
error.","error":"GetOrCreateShard: failed to get ShardID 177
```

seem to indicate possible db issues (they can also be network issues, which might be worth checking on your end). Can you look at your db logs? What persistence store are you using? What numHistoryShards did you set in your config? Under what load do you get these errors, or is it only during pod restarts?

Do you have server metrics enabled? Couple things worth checking:

Persistence latencies:

```
histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))
```

Visibility latencies:

```
histogram_quantile(0.95, sum(rate(task_latency_bucket{operation=~"VisibilityTask.*", service_name="history"}[1m])) by (operation, le))
```

Resources exhausted:

```
sum(rate(service_errors_resource_exhausted{}[1m])) by (resource_exhausted_cause)
```

Hope this helps.

While debugging the issue locally against the cluster's Temporal, I am getting the context deadline exceeded at c.cc.Invoke().

Any idea why it is failing at this level?

```go
func (c *workflowServiceClient) StartWorkflowExecution(ctx context.Context, in *StartWorkflowExecutionRequest, opts ...grpc.CallOption) (*StartWorkflowExecutionResponse, error) {
	out := new(StartWorkflowExecutionResponse)
	err := c.cc.Invoke(ctx, "/temporal.api.workflowservice.v1.WorkflowService/StartWorkflowExecution", in, out, opts...)
	if err != nil {
		return nil, err
	}
	return out, nil
}
```

@maxim @tihomir @alex

Can you please help on this?

@tihomir , The issue is resolved now.

In our case, the SQL instance was in the ASIA region while Temporal was running in the USW2 region. We had VPC connection and peering enabled, and the SQL IP was reachable from the Temporal pod.

Even so, the workflow was failing consistently with context deadline exceeded.

Later, we deployed a new SQL instance in the same USW2 region and restarted it. Workflows were then created and executed without any issues.

Also facing this issue:

```
error.","error":"GetOrCreateShard: failed to get ShardID 177
```

And numHistoryShards is 512 in my config.