Context deadline exceeded while initiating workflows

Hi, we are currently using Temporal in some of our services, and occasionally workflow initiation fails with “context deadline exceeded”. Once a workflow with a particular workflow ID hits this error, it keeps returning the same error no matter how many times we try to initiate or terminate it. For example, we tried to start and query one of these workflow IDs via tctl and got the following response.

Temporal version: 1.15
Java SDK version: 1.18.2

Error: Failed to run workflow.
Error Details: context deadline exceeded
Stack trace:
goroutine 1 [running]:
runtime/debug.Stack()
	runtime/debug/stack.go:24 +0x64
runtime/debug.PrintStack()
	runtime/debug/stack.go:16 +0x1c
github.com/temporalio/tctl/cli_curr.printError({0x103c9a2db, 0x17}, {0x10440b860, 0x140006881c8})
	github.com/temporalio/tctl/cli_curr/util.go:393 +0x1c0
github.com/temporalio/tctl/cli_curr.ErrorAndExit({0x103c9a2db?, 0x22?}, {0x10440b860?, 0x140006881c8?})
	github.com/temporalio/tctl/cli_curr/util.go:404 +0x28
github.com/temporalio/tctl/cli_curr.startWorkflowHelper.func2()
	github.com/temporalio/tctl/cli_curr/workflowCommands.go:238 +0x148
github.com/temporalio/tctl/cli_curr.startWorkflowHelper(0x140001a7760, 0x1)
	github.com/temporalio/tctl/cli_curr/workflowCommands.go:261 +0x394
github.com/temporalio/tctl/cli_curr.RunWorkflow(...)
	github.com/temporalio/tctl/cli_curr/workflowCommands.go:179
github.com/temporalio/tctl/cli_curr.newWorkflowCommands.func4(0x140002cca70?)
	github.com/temporalio/tctl/cli_curr/workflow.go:65 +0x20
github.com/urfave/cli.HandleAction({0x10415d620?, 0x1044020a8?}, 0x3?)
	github.com/urfave/cli@v1.22.10/app.go:526 +0x94
github.com/urfave/cli.Command.Run({{0x103c6f9d8, 0x3}, {0x0, 0x0}, {0x0, 0x0, 0x0}, {0x103cefe5d, 0x38}, {0x0, ...}, ...}, ...)
	github.com/urfave/cli@v1.22.10/command.go:173 +0x504
github.com/urfave/cli.(*App).RunAsSubcommand(0x140004f9dc0, 0x140001a6f20)
	github.com/urfave/cli@v1.22.10/app.go:405 +0xa68
github.com/urfave/cli.Command.startApp({{0x103c795cc, 0x8}, {0x0, 0x0}, {0x1400011cd70, 0x1, 0x1}, {0x103c9f7be, 0x19}, {0x0, ...}, ...}, ...)
	github.com/urfave/cli@v1.22.10/command.go:378 +0x9c4
github.com/urfave/cli.Command.Run({{0x103c795cc, 0x8}, {0x0, 0x0}, {0x1400011cd70, 0x1, 0x1}, {0x103c9f7be, 0x19}, {0x0, ...}, ...}, ...)
	github.com/urfave/cli@v1.22.10/command.go:102 +0x650
github.com/urfave/cli.(*App).Run(0x140004f9a40, {0x1400003a750, 0xd, 0xd})
	github.com/urfave/cli@v1.22.10/app.go:277 +0x7e4
main.main()
	./main.go:47 +0xc0

Sorry for the late response. “context deadline exceeded” means some timeout happened. Do you have server metrics configured, and do you have access to the service logs? It’s not always clear which Temporal service is timing out and on what operation, so looking at the errors in the logs would help. On the service metrics side, I would start by looking at service errors for the different service types (frontend, matching, history). For example, a Grafana query for frontend:

sum(rate(service_error_with_type{service_type="frontend"}[5m])) by (error_type)
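The same query can be repeated for the other service types mentioned above. As a minimal sketch, the Go snippet below just prints the query string for frontend, matching and history so they can be pasted into Grafana one by one. The helper name errorRateQuery is made up for illustration; only the metric name and labels come from the query above.

```go
package main

import "fmt"

// errorRateQuery returns the per-error-type rate query shown above,
// with the service_type label set to the given Temporal service.
// The function name is hypothetical; the metric and labels are from the post.
func errorRateQuery(serviceType string) string {
	return fmt.Sprintf(
		`sum(rate(service_error_with_type{service_type=%q}[5m])) by (error_type)`,
		serviceType,
	)
}

func main() {
	// Print one query per service type so each can be checked separately.
	for _, svc := range []string{"frontend", "matching", "history"} {
		fmt.Println(errorRateQuery(svc))
	}
}
```

Comparing the three results side by side may help narrow down which service the timeouts originate from.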

Another thing you could check is any possible resource exhausted issues. Sample Grafana query:

sum(rate(service_errors_resource_exhausted{}[1m])) by (operation, resource_exhausted_cause)

Hey, sorry for letting this thread go inactive. We didn’t see any instances of resource exhaustion, but we have seen lots of context_deadlineExceededError for history and serviceerror_DeadlineExceeded for frontend.
This is happening for all the active Temporal namespaces.

I am facing the same issue. Did you resolve it? @Arvind_Narayanan