Frontend service continuously spewing errors

Yesterday I had to redeploy Temporal (still running v0.23.1) because I needed to replace my GKE nodes with bigger VMs. I removed the K8s deployment completely, replaced the nodes, and started everything from scratch.

I probably did something wrong, because since then the frontend service has been spewing errors at a continuous rate. This leads to timeouts in the client app that manages the creation of workflows.

Every minute there is a repeating series of errors that seems to start with a "context deadline exceeded" error and this stacktrace:

 "github.com/temporalio/temporal/common/log/loggerimpl.(*loggerImpl).Error
	/temporal/common/log/loggerimpl/logger.go:138
github.com/temporalio/temporal/service/frontend.(*WorkflowHandler).error
	/temporal/service/frontend/workflowHandler.go:3383
github.com/temporalio/temporal/service/frontend.(*WorkflowHandler).StartWorkflowExecution
	/temporal/service/frontend/workflowHandler.go:494
github.com/temporalio/temporal/service/frontend.(*DCRedirectionHandlerImpl).StartWorkflowExecution.func2
	/temporal/service/frontend/dcRedirectionHandler.go:1114
github.com/temporalio/temporal/service/frontend.(*NoopRedirectionPolicy).WithNamespaceRedirect
	/temporal/service/frontend/dcRedirectionPolicy.go:116
github.com/temporalio/temporal/service/frontend.(*DCRedirectionHandlerImpl).StartWorkflowExecution
	/temporal/service/frontend/dcRedirectionHandler.go:1110
github.com/temporalio/temporal/service/frontend.(*AccessControlledWorkflowHandler).StartWorkflowExecution
	/temporal/service/frontend/accessControlledHandler.go:702
github.com/temporalio/temporal/service/frontend.(*WorkflowNilCheckHandler).StartWorkflowExecution
	/temporal/service/frontend/workflowNilCheckHandler.go:112
go.temporal.io/temporal-proto/workflowservice._WorkflowService_StartWorkflowExecution_Handler.func1
	/go/pkg/mod/go.temporal.io/temporal-proto@v0.23.1/workflowservice/service.pb.go:1015
github.com/temporalio/temporal/service/frontend.interceptor
	/temporal/service/frontend/service.go:316
go.temporal.io/temporal-proto/workflowservice._WorkflowService_StartWorkflowExecution_Handler
	/go/pkg/mod/go.temporal.io/temporal-proto@v0.23.1/workflowservice/service.pb.go:1017
google.golang.org/grpc.(*Server).processUnaryRPC
	/go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:1082
google.golang.org/grpc.(*Server).handleStream
	/go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:1405
google.golang.org/grpc.(*Server).serveStreams.func1.1
	/go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:746

I have no clue where to start investigating. All help is welcome, as I need this POC environment to demo our application to a customer early next week, which means there is a bit of pressure on us to get this working. :wink:


Hey @ringods,
The stacktrace you provided above is not very helpful. All I can tell is that StartWorkflowExecution calls on the frontend are timing out. Do you have other backend errors, persistence errors or service metrics you could share with us? I'd be happy to jump on a Zoom call considering this is time sensitive for you. Please reach out to me on Slack and we can set up a Zoom.


Hello @samar,

Let me first thank you for the short Zoom call you offered. This definitely put me on the right track!

For anyone who runs into this after me, the following could be helpful.

During the call, Samar identified that although I reported problems with the frontend service, it was actually the history service that had MySQL errors. In my setup, I run on GKE with the Google Cloud SQL Proxy for MySQL as a sidecar container. Initially, we suspected this proxy was at fault.

When looking more closely at the logs of the history container, I noticed this line:

[mysql] 2020/07/30 20:37:51 packets.go:33: unexpected EOF

After a search, I came to this ticket:

The issue at hand is that the application tries to use an idle connection that has already been closed by the other end. That is possible because the default MaxConnLifetime is infinite.
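
For context, Temporal's SQL persistence sits on top of Go's standard database/sql connection pool, and the maxConns/maxIdleConns/maxConnLifetime keys presumably map onto that pool's knobs. Below is a minimal sketch of what those knobs do at the database/sql level, assuming a plain go-sql-driver/mysql connection (the DSN is a placeholder, not my real setup):

package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql" // registers the "mysql" driver
)

func main() {
	// Placeholder DSN; replace with your own connection string.
	db, err := sql.Open("mysql", "user:password@tcp(127.0.0.1:3306)/temporal")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// maxConns: cap on the number of open connections to the database.
	db.SetMaxOpenConns(20)
	// maxIdleConns: how many idle connections are kept around for reuse.
	db.SetMaxIdleConns(10)
	// maxConnLifetime: recycle connections after an hour so the pool never
	// hands out a connection that the server or a proxy has already closed.
	db.SetConnMaxLifetime(time.Hour)
}

With a finite connection lifetime, stale connections get dropped and re-opened instead of producing the "unexpected EOF" shown above.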

The comments advise configuring your connection pool. This is what I added to the YAML configuration for the Temporal services, under persistence.datastores.default.sql and persistence.datastores.visibility.sql:

                maxConns: 20
                maxIdleConns: 10
                maxConnLifetime: "1h"

After a redeploy, everything seems to be running fine so far. Where my client app previously timed out after 60 seconds, it now completes in 1.2 seconds.

A glance at the docs and the Helm chart shows that neither mentions these SQL settings:

To conclude: always configure the connection pool for MySQL/PostgreSQL in a production environment!


As an addendum, I learned about most of the go-sql connection pooling settings from this article:

It was linked in one of the comments on the GitHub issue mentioned above.
