Temporal frontend closing connections, causing 499 errors at nginx ingress

Hi All,

We have deployed Temporal 1.20.2 on Kubernetes. We are seeing 499 error codes at the nginx ingress that routes calls to the Temporal frontend service.

Nginx Error:

  • [22/May/2023:22:26:48 +0000] "POST /temporal.api.workflowservice.v1.WorkflowService/PollWorkflowTaskQueue HTTP/2.0" 499 0 "-" "grpc-go/1.52.3" 507 30.004

nginx debug log:
2023/05/22 21:52:14 [info] 25#25: *37189 client canceled stream 5 while sending request to upstream, client: <>, server: temporal-frontend, request: "POST /temporal.api.workflowservice.v1.WorkflowService/PollWorkflowTaskQueue HTTP/2.0", upstream: "<>", host: "<>"

2023/05/22 21:52:14 [debug] 25#25: *37189 http run request: "/temporal.api.workflowservice.v1.WorkflowService/PollWorkflowTaskQueue?"

2023/05/22 21:52:14 [debug] 25#25: *37189 http upstream check client, write event:0, "/temporal.api.workflowservice.v1.WorkflowService/PollWorkflowTaskQueue"

2023/05/22 21:52:14 [debug] 25#25: *37189 finalize http upstream request: 499

I have tried passing '0' for frontend.keepAliveMaxConnectionAge to disable it, which did not work. Any ideas on how to fix this issue? We are not seeing any functional impact, but there are a lot of these errors in the nginx logs, and it looks like they appear when idle connections are terminated by the frontend service. There are also no errors or related logs in the frontend service itself. Appreciate the help here.
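For reference, frontend.keepAliveMaxConnectionAge appears to map onto gRPC's server-side keepalive MaxConnectionAge, which recycles connections by age rather than reacting to client cancellations. Below is a minimal, generic grpc-go sketch of what that parameter controls (this is not Temporal's actual server wiring; the port and durations are illustrative only):

```go
package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// MaxConnectionAge tells the server to send GOAWAY on any client
	// connection older than this (plus jitter), forcing clients to
	// reconnect. MaxConnectionAgeGrace gives in-flight RPCs (e.g. long
	// polls) extra time to finish before the connection is force-closed.
	// In grpc-go a zero value for these fields means "use the default",
	// which is infinity, i.e. no age-based recycling.
	srv := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionAge:      5 * time.Minute, // illustrative values only
		MaxConnectionAgeGrace: 70 * time.Second,
	}))

	lis, err := net.Listen("tcp", ":7233")
	if err != nil {
		log.Fatalln(err)
	}
	if err := srv.Serve(lis); err != nil {
		log.Fatalln(err)
	}
}
```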


Has anyone faced this issue? It looks like it happens with the default setup.

HTTP/2.0" 499

From what I understand, the most common cause of this is the client closing the connection.

Your workers establish long-poll connections to the frontend service for workflow and activity tasks. These connections have a timeout (default 70s for the Go SDK). The service dispatches workflow/activity tasks to your pollers. If no task is dispatched to a poller (long-poll connection) within its timeout, the worker times out that connection and starts it back up.
From your description, this sounds like it could be the reason you see these errors. To verify, you could look at the poll_timeouts server metric; it is recorded when a long-poll request from a worker reaches its timeout without being dispatched a task to process.
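To make that concrete, here is a minimal Go SDK worker sketch (the host, namespace, task queue name, and poller counts are made up for illustration). Each poller this worker runs keeps a PollWorkflowTaskQueue or PollActivityTaskQueue call open against the frontend, and a poll that times out without receiving a task can show up at the ingress as a client-cancelled request, i.e. a 499:

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	// Dial the frontend service (address is hypothetical).
	c, err := client.Dial(client.Options{
		HostPort:  "temporal-frontend.example.svc.cluster.local:7233",
		Namespace: "default",
	})
	if err != nil {
		log.Fatalln("unable to dial Temporal frontend:", err)
	}
	defer c.Close()

	// Each poller below holds a long-poll call open; when it times out
	// without a task it is closed and re-issued, and the server records
	// a poll_timeouts increment for it.
	w := worker.New(c, "example-task-queue", worker.Options{
		// Fewer pollers => fewer concurrent long polls (and fewer 499
		// lines at the ingress), at the cost of dispatch latency/throughput.
		MaxConcurrentWorkflowTaskPollers: 2,
		MaxConcurrentActivityTaskPollers: 2,
	})
	// Register workflows/activities here, e.g. w.RegisterWorkflow(MyWorkflow).

	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker exited:", err)
	}
}
```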

Just to add: having some poll timeouts is actually desired. A low poll_timeouts value could indicate that you do not have enough pollers in some cases, so if this is the reason for the 499s, I think it is OK to ignore them.

Thanks @tihomir for checking on this and for the explanation. Yeah, I see the poll_timeouts metric showing high values, so it looks like we have far fewer tasks than polling capacity. We are seeing a lot of these log lines. Is it possible to reduce them by increasing the poll timeout in our case? Could you suggest the right properties to tune for this?