Frontend service continuously spewing errors

Yesterday I had to redeploy Temporal (still running v0.23.1) because I needed to replace my GKE nodes with bigger VMs. I removed the K8s deployment completely, replaced the nodes, and started everything from scratch.

I probably did something wrong, because since then the frontend service has been spewing errors at a continuous rate. This leads to timeouts in the client app that manages the creation of workflows.

Every minute there is a repeating series of errors that seems to start with a "context deadline exceeded" error and this stacktrace:

 "github.com/temporalio/temporal/common/log/loggerimpl.(*loggerImpl).Error
	/temporal/common/log/loggerimpl/logger.go:138
github.com/temporalio/temporal/service/frontend.(*WorkflowHandler).error
	/temporal/service/frontend/workflowHandler.go:3383
github.com/temporalio/temporal/service/frontend.(*WorkflowHandler).StartWorkflowExecution
	/temporal/service/frontend/workflowHandler.go:494
github.com/temporalio/temporal/service/frontend.(*DCRedirectionHandlerImpl).StartWorkflowExecution.func2
	/temporal/service/frontend/dcRedirectionHandler.go:1114
github.com/temporalio/temporal/service/frontend.(*NoopRedirectionPolicy).WithNamespaceRedirect
	/temporal/service/frontend/dcRedirectionPolicy.go:116
github.com/temporalio/temporal/service/frontend.(*DCRedirectionHandlerImpl).StartWorkflowExecution
	/temporal/service/frontend/dcRedirectionHandler.go:1110
github.com/temporalio/temporal/service/frontend.(*AccessControlledWorkflowHandler).StartWorkflowExecution
	/temporal/service/frontend/accessControlledHandler.go:702
github.com/temporalio/temporal/service/frontend.(*WorkflowNilCheckHandler).StartWorkflowExecution
	/temporal/service/frontend/workflowNilCheckHandler.go:112
go.temporal.io/temporal-proto/workflowservice._WorkflowService_StartWorkflowExecution_Handler.func1
	/go/pkg/mod/go.temporal.io/temporal-proto@v0.23.1/workflowservice/service.pb.go:1015
github.com/temporalio/temporal/service/frontend.interceptor
	/temporal/service/frontend/service.go:316
go.temporal.io/temporal-proto/workflowservice._WorkflowService_StartWorkflowExecution_Handler
	/go/pkg/mod/go.temporal.io/temporal-proto@v0.23.1/workflowservice/service.pb.go:1017
google.golang.org/grpc.(*Server).processUnaryRPC
	/go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:1082
google.golang.org/grpc.(*Server).handleStream
	/go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:1405
google.golang.org/grpc.(*Server).serveStreams.func1.1
	/go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:746

I have no clue where to start investigating. All help is welcome, as I need this POC environment to demo our application to a customer early next week, which means there is a bit of pressure on us to get this working. :wink:


Hey @ringods,
The stacktrace you provided above is not very helpful. All I can tell is that StartWorkflowExecution calls on the frontend are timing out. Do you have other backend errors, persistence errors or service metrics you could share with us? I'd be happy to jump on a Zoom call considering this is time sensitive for you. Please reach out to me on Slack and we can set up a Zoom.


Hello @samar,

Let me first thank you for the short Zoom call you offered. This definitely put me on the right track!

For anyone who runs into this after me, the following could be helpful.

During the call, Samar identified that although I reported problems with the frontend service, it was actually the history service that had MySQL errors. In my setup, I run on GKE with the Google Cloud SQL Proxy for MySQL as a sidecar container. Initially, we suspected this proxy was at fault.

When looking more closely at the logs of the history container, I noticed this line:

[mysql] 2020/07/30 20:37:51 packets.go:33: unexpected EOF

After a search, I came to this ticket:

The issue at hand is that the application tries to use an idle connection that has already been closed by the other end. That is possible because the default MaxConnLifetime is infinite.
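
For context, Temporal's SQL persistence sits on top of Go's standard database/sql connection pool, and the maxConns/maxIdleConns/maxConnLifetime keys presumably map onto that pool's knobs. Below is a minimal sketch of what those knobs do at the database/sql level, assuming a plain go-sql-driver/mysql connection (the DSN is a placeholder, not my real setup):

package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql" // registers the "mysql" driver
)

func main() {
	// Placeholder DSN; replace with your own connection string.
	db, err := sql.Open("mysql", "user:password@tcp(127.0.0.1:3306)/temporal")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// maxConns: cap on the number of open connections to the database.
	db.SetMaxOpenConns(20)
	// maxIdleConns: how many idle connections are kept around for reuse.
	db.SetMaxIdleConns(10)
	// maxConnLifetime: recycle connections after an hour so the pool never
	// hands out a connection that the server or a proxy has already closed.
	db.SetConnMaxLifetime(time.Hour)
}

With a finite connection lifetime, stale connections get dropped and re-opened instead of producing the "unexpected EOF" shown above.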

The comments advise configuring your connection pool. This is what I added to the YAML configuration for the Temporal services, under persistence.datastores.default.sql and persistence.datastores.visibility.sql:

                maxConns: 20
                maxIdleConns: 10
                maxConnLifetime: "1h"

After a redeploy, everything seems to be running fine so far. Where my client app previously timed out after 60 seconds, it now completes in 1.2 seconds.

A glance at the docs and the Helm chart shows that neither mentions these SQL settings:

To conclude: always configure the connection pool for MySQL/PostgreSQL in a production environment!


As an addendum, I learned about most of the go-sql connection pooling settings from this article:

It was linked in one of the comments on the GitHub issue mentioned above.
