Temporal service upgraded to 1.14.4, frontend & worker pods are crashing

@SergeyBykov - sorry for the delayed response.

I have taken the latest master branch code, v1.15.2.
From what I can see, v1.12 is more stable than the later versions.
The worker pod keeps crashing with the error below, while the other three service pods keep running and stay healthy.
After applying both PRs I still get the context deadline error. I really don't understand what is wrong.

tls:
  frontend:
    client:
      forceTLS: true
      rootCaFiles:
        - "##CA_CERT_FILE##"
  systemWorker:
    certFile: /etc/temporal/cacerts/cluster-internode.pem
    keyFile: /etc/temporal/cacerts/cluster-internode.key
    client:
      serverName: "frontend:7233"
      rootCaFiles:
        - /etc/temporal/cacerts/cluster-ca-intermediate.pem

{"level":"debug","ts":"2022-03-23T07:26:00.195Z","msg":"Membership heartbeat upserted successfully","service":"worker","address":"100.127.110.24","port":6939,"hostId":"79e70db5-aa7a-11ec-8a57-be8783a543f3","logging-call-at":"rpMonitor.go:163"}
{"level":"fatal","ts":"2022-03-23T07:17:53.093Z","msg":"error starting scanner","service":"worker","error":"context deadline exceeded","logging-call-at":"service.go:436","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Fatal\n\t/temporal/common/log/zap_logger.go:150\ngo.temporal.io/server/service/worker.(*Service).startScanner\n\t/temporal/service/worker/service.go:436\ngo.temporal.io/server/service/worker.(*Service).Start\n\t/temporal/service/worker/service.go:343\ngo.temporal.io/server/service/worker.ServiceLifetimeHooks.func1.1\n\t/temporal/service/worker/fx.go:79"}

Hi @SergeyBykov - any update on this?

@maxim @tihomir @alex - can any one of you address the issue I am facing above?

@SergeyBykov - any further update, please?

@jaffarsadik have you tried the recently released version 1.16, which has the mentioned PR applied?
I will ask the server team for more info as well.

{"level":"fatal","ts":"2022-03-23T07:17:53.093Z","msg":"error starting scanner","service":"worker","error":"context deadline exceeded",

Can you check if you have the correct IP set for TEMPORAL_BROADCAST_ADDRESS in your config?

The above PR is already applied, and the broadcast address is also configured.

global:
  membership:
    maxJoinDuration: 300s
    broadcastAddress: "##TEMPORAL_POD_IP##"

In our deployment model, each of the four services (frontend, history, worker & matching) runs in its own pod. The broadcastAddress value "##TEMPORAL_POD_IP##" is replaced with that pod's own IP address.
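For reference, a minimal sketch of how a pod-IP placeholder like ##TEMPORAL_POD_IP## can be populated from the Kubernetes downward API (the env var name and the templating step are assumptions about this deployment, not something confirmed above):

# Sketch: expose the pod's own IP as an environment variable, which the
# config templating can then substitute into broadcastAddress.
# TEMPORAL_BROADCAST_ADDRESS is assumed here; use whatever variable your
# templating actually replaces ##TEMPORAL_POD_IP## with.
containers:
  - name: temporal-worker
    env:
      - name: TEMPORAL_BROADCAST_ADDRESS
        valueFrom:
          fieldRef:
            fieldPath: status.podIP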

What is the expected behavior here?

Here is my deployment.yaml configuration; let me know what is wrong with it:

persistence:
  defaultStore: gcdb-default
  visibilityStore: gcdb-visibility
  #advancedVisibilityStore: es-visibility
  numHistoryShards: 512
  datastores:
    gcdb-default:
      sql:
        user: "##GCDB_USER##"
        password: ""
        pluginName: "postgres"
        databaseName: "##GCDB_DATABASE_NAME##"
        connectAddr: "##GCDB_CONNECT_ADDR##"
        connectProtocol: "tcp"
        maxConns: 512
        maxIdleConns: 512
        maxConnLifetime: "5m"
        tls:
          enabled: true
          caFile: "##CA_CERT_FILE##"
          enableHostVerification: false
    gcdb-visibility:
      sql:
        user: "##GCDB_USER##"
        password: ""
        pluginName: "postgres"
        databaseName: "##GCDB_VISIBILITY_DATABASE_NAME##"
        connectAddr: "##GCDB_CONNECT_ADDR##"
        connectProtocol: "tcp"
        maxConns: 512
        maxIdleConns: 512
        maxConnLifetime: "5m"
        tls:
          enabled: true
          caFile: "##CA_CERT_FILE##"
          enableHostVerification: false

global:
  membership:
    maxJoinDuration: 300s
    broadcastAddress: "##TEMPORAL_POD_IP##"
  pprof:
    port: 7936
  metrics:
    prometheus:
      timerType: "histogram"
      listenAddress: "##TEMPORAL_POD_IP##:8000"
  tls:
    frontend:
      client:
        forceTLS: true
        rootCaFiles:
          - "##CA_CERT_FILE##"
    systemWorker:
      certFile: /etc/temporal/cacerts/cluster-internode.pem
      keyFile: /etc/temporal/cacerts/cluster-internode.key
      client:
        serverName: "frontend:7233"
        rootCaFiles:
          - /etc/temporal/cacerts/cluster-ca-intermediate.pem

services:
  frontend:
    rpc:
      grpcPort: 7233
      membershipPort: 6933
      bindOnLocalHost: false
      bindOnIP: 0.0.0.0

  matching:
    rpc:
      grpcPort: 7235
      membershipPort: 6935
      bindOnLocalHost: false
      bindOnIP: 0.0.0.0

  history:
    rpc:
      grpcPort: 7234
      membershipPort: 6934
      bindOnLocalHost: false
      bindOnIP: 0.0.0.0

  worker:
    rpc:
      grpcPort: 7239
      membershipPort: 6939
      bindOnLocalHost: false
      bindOnIP: 0.0.0.0

clusterMetadata:
  enableGlobalNamespace: true
  failoverVersionIncrement: 10
  masterClusterName: "active"
  currentClusterName: "active"
  clusterInformation:
    active:
      enabled: true
      initialFailoverVersion: 1
      rpcName: "frontend"
      rpcAddress: "##TEMPORAL_POD_IP##:7233"

dcRedirectionPolicy:
  policy: "noop"
  toDC: ""

archival:
  history:
    state: "enabled"
    enableRead: true
    provider:
      s3store:
        region: "us-east-1"
        endpoint: "##S3_URL##"
      filestore:
        fileMode: "0666"
        dirMode: "0766"
  visibility:
    state: "enabled"
    enableRead: true
    provider:
      s3store:
        region: "us-east-1"
        endpoint: "##S3_URL##"
      filestore:
        fileMode: "0666"
        dirMode: "0766"

namespaceDefaults:
  archival:
    history:
      state: "disabled"
      URI: "file:///tmp/temporal_archival/development"
    visibility:
      state: "disabled"
      URI: "file:///tmp/temporal_vis_archival/development"

publicClient:
  hostPort: "frontend:7233"

dynamicConfigClient:
  filepath: "/etc/temporal/config/dynamicconfig/development.yaml"
  pollInterval: "10s"

@jaffarsadik sorry for the late response. Still waiting on input; will report back ASAP.

Is it only the worker role that keeps crashing? Are the other server roles able to start up?
The error log from the worker just means the worker cannot connect to the frontend, and based on the "deadline exceeded" message, it is most likely due to a TLS misconfiguration.

Not sure if this is the problem, but serverName must match a name in the frontend's certificate. In your config it is set to "frontend:7233", which includes the port; a host:port value will not match a certificate SAN.

Also, to narrow the scope while debugging, you can set disableHostVerification to true. Once you can connect with verification disabled, turn it back on; then you know where the problem is.
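A sketch of what that could look like for the worker-to-frontend client TLS section, based on the config posted above (the serverName value here is an assumption; use whatever DNS name or SAN your frontend certificate is actually issued for, without the port):

# Sketch only: under global.tls, adjust the system worker's client settings.
# "frontend" is assumed to be a SAN on the frontend certificate; replace it
# with the actual name in your cert.
tls:
  systemWorker:
    certFile: /etc/temporal/cacerts/cluster-internode.pem
    keyFile: /etc/temporal/cacerts/cluster-internode.key
    client:
      serverName: "frontend"         # host name only, no port
      disableHostVerification: true  # temporary, for debugging; set back to false once connectivity works
      rootCaFiles:
        - /etc/temporal/cacerts/cluster-ca-intermediate.pem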