Temporal frontend closing connections, causing 499 errors at nginx ingress

Hi All,

We have deployed Temporal 1.20.2 on Kubernetes. We are seeing 499 error codes at the nginx ingress that routes calls to the Temporal frontend service.

Nginx Error:

  • [22/May/2023:22:26:48 +0000] "POST /temporal.api.workflowservice.v1.WorkflowService/PollWorkflowTaskQueue HTTP/2.0" 499 0 "-" "grpc-go/1.52.3" 507 30.004

nginx debug log:
2023/05/22 21:52:14 [info] 25#25: *37189 client canceled stream 5 while sending request to upstream, client: <>, server: temporal-frontend, request: "POST /temporal.api.workflowservice.v1.WorkflowService/PollWorkflowTaskQueue HTTP/2.0", upstream: "<>", host: "<>"

2023/05/22 21:52:14 [debug] 25#25: *37189 http run request: "/temporal.api.workflowservice.v1.WorkflowService/PollWorkflowTaskQueue?"

2023/05/22 21:52:14 [debug] 25#25: *37189 http upstream check client, write event:0, "/temporal.api.workflowservice.v1.WorkflowService/PollWorkflowTaskQueue"

2023/05/22 21:52:14 [debug] 25#25: *37189 finalize http upstream request: 499

I have tried passing '0' for frontend.keepAliveMaxConnectionAge to disable it, which did not work. Any ideas on how to fix this issue? We are not seeing any functional impact, but there are a lot of these errors in the nginx logs, and it looks like they appear when idle connections are terminated by the frontend service. There are also no errors or related logs in the frontend service itself. Appreciate the help here.
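For reference, frontend.keepAliveMaxConnectionAge appears to map onto gRPC's server-side keepalive MaxConnectionAge, which recycles connections by age rather than reacting to client cancellations. Below is a minimal, generic grpc-go sketch of what that parameter controls (this is not Temporal's actual server wiring; the port and durations are illustrative only):

```go
package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// MaxConnectionAge tells the server to send GOAWAY on any client
	// connection older than this (plus jitter), forcing clients to
	// reconnect. MaxConnectionAgeGrace gives in-flight RPCs (e.g. long
	// polls) extra time to finish before the connection is force-closed.
	// In grpc-go a zero value for these fields means "use the default",
	// which is infinity, i.e. no age-based recycling.
	srv := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionAge:      5 * time.Minute, // illustrative values only
		MaxConnectionAgeGrace: 70 * time.Second,
	}))

	lis, err := net.Listen("tcp", ":7233")
	if err != nil {
		log.Fatalln(err)
	}
	if err := srv.Serve(lis); err != nil {
		log.Fatalln(err)
	}
}
```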


Has anyone faced this issue? It looks like it happens with the default setup.

HTTP/2.0" 499

From what I understand, the most common cause of this is the client closing the connection.

Your workers establish long-poll connections to the frontend service for workflow and activity tasks. These connections have a timeout (default 70s for the Go SDK). The service dispatches workflow/activity tasks to your pollers. If no task is dispatched to a poller (long-poll connection) within its timeout, the worker times out that connection and starts it back up.
From your description, this sounds like it could be the reason you see these errors. To verify, you could look at the poll_timeouts server metric; it is recorded when a long-poll request from a worker reaches its timeout without being dispatched a task to process.
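To make that concrete, here is a minimal Go SDK worker sketch (the host, namespace, task queue name, and poller counts are made up for illustration). Each poller this worker runs keeps a PollWorkflowTaskQueue or PollActivityTaskQueue call open against the frontend, and a poll that times out without receiving a task can show up at the ingress as a client-cancelled request, i.e. a 499:

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	// Dial the frontend service (address is hypothetical).
	c, err := client.Dial(client.Options{
		HostPort:  "temporal-frontend.example.svc.cluster.local:7233",
		Namespace: "default",
	})
	if err != nil {
		log.Fatalln("unable to dial Temporal frontend:", err)
	}
	defer c.Close()

	// Each poller below holds a long-poll call open; when it times out
	// without a task it is closed and re-issued, and the server records
	// a poll_timeouts increment for it.
	w := worker.New(c, "example-task-queue", worker.Options{
		// Fewer pollers => fewer concurrent long polls (and fewer 499
		// lines at the ingress), at the cost of dispatch latency/throughput.
		MaxConcurrentWorkflowTaskPollers: 2,
		MaxConcurrentActivityTaskPollers: 2,
	})
	// Register workflows/activities here, e.g. w.RegisterWorkflow(MyWorkflow).

	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker exited:", err)
	}
}
```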

Just to add: having some poll timeouts is actually desired. A low poll_timeouts value could indicate that you do not have enough pollers in some cases, so if this is the reason for the 499s, I think it is OK to ignore them.

Thanks @tihomir for checking on this and for the explanation. Yeah, I see the poll_timeouts metric showing high values, so it looks like we have far fewer tasks than polling capacity. We are seeing a lot of these log lines. Is it possible to reduce them by increasing the poll timeout in our case? Could you suggest the right properties to tune for this?