Context deadline exceeded while initiating workflows

Hi, we are currently using Temporal in some of our services, and occasionally workflow initiation fails with "context deadline exceeded". Once a workflow with a particular workflow ID (wfid) hits this error, it keeps returning the same error no matter how many times we try to start or terminate it. For example, we tried starting/querying one of these wfids via tctl and got the following response.

Temporal version: 1.15
Java SDK version: 1.18.2

Error: Failed to run workflow.
Error Details: context deadline exceeded
Stack trace:
goroutine 1 [running]:
	runtime/debug/stack.go:24 +0x64
	runtime/debug/stack.go:16 +0x1c{0x103c9a2db, 0x17}, {0x10440b860, 0x140006881c8}) +0x1c0{0x103c9a2db?, 0x22?}, {0x10440b860?, 0x140006881c8?}) +0x28 +0x148, 0x1) +0x394 +0x20{0x10415d620?, 0x1044020a8?}, 0x3?) +0x94{{0x103c6f9d8, 0x3}, {0x0, 0x0}, {0x0, 0x0, 0x0}, {0x103cefe5d, 0x38}, {0x0, ...}, ...}, ...) +0x504*App).RunAsSubcommand(0x140004f9dc0, 0x140001a6f20) +0xa68{{0x103c795cc, 0x8}, {0x0, 0x0}, {0x1400011cd70, 0x1, 0x1}, {0x103c9f7be, 0x19}, {0x0, ...}, ...}, ...) +0x9c4{{0x103c795cc, 0x8}, {0x0, 0x0}, {0x1400011cd70, 0x1, 0x1}, {0x103c9f7be, 0x19}, {0x0, ...}, ...}, ...) +0x650*App).Run(0x140004f9a40, {0x1400003a750, 0xd, 0xd}) +0x7e4
	./main.go:47 +0xc0

Sorry for the late response. "Context deadline exceeded" means a timeout happened somewhere. Do you have server metrics configured, and do you have access to the service logs? It's not always clear which Temporal service is timing out and on what operation, so looking at errors in the logs would help. On the service metrics side, I would start by looking at service errors for the different service types (frontend, matching, history). For example, a Grafana query for frontend:

sum(rate(service_error_with_type{service_type="frontend"}[5m])) by (error_type)
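The same query can be pointed at the other two service types by changing the service_type label. A sketch, assuming your matching and history services export this metric with the label values "matching" and "history" (check the label values in your own Prometheus setup before relying on these):

```promql
# Errors per type for the matching service (label value is an assumption)
sum(rate(service_error_with_type{service_type="matching"}[5m])) by (error_type)

# Errors per type for the history service (label value is an assumption)
sum(rate(service_error_with_type{service_type="history"}[5m])) by (error_type)
```

Comparing the three panels side by side should show which service the timeouts originate from.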

Another thing you could check is whether you are hitting any resource exhausted errors. Sample Grafana query:

sum(rate(service_errors_resource_exhausted{}[1m])) by (operation, resource_exhausted_cause)

Hey, sorry for letting this thread go inactive. I didn't see any instances of resource exhaustion, but I have seen lots of context_deadlineExceededError errors for history and serviceerror_DeadlineExceeded errors for frontend.
This is happening for all the active Temporal namespaces.

I am facing the same issue. Did you resolve it? @Arvind_Narayanan