Temporal io restart problem on Kubernetes node restart

I used the Helm chart to deploy Temporal on Kubernetes. It was working great, but I noticed that the frontend and history servers fail to restart. I see the following log message:
“waiting for default keyspace to become ready”. Where should I start looking to debug this?
Shouldn't it recover automatically when a node restarts?


Hey Shawel,

Thank you for the report!

This can happen when the db schemas have not been created on the database server. It might also happen if the db server is not reachable, although in that case I would expect a different error message; I would need to try to repro to be sure.

Do you think you could share (here or in a direct message):

  • the helm install ... command you ran to install temporal, and
  • the output of the kubectl get pods command?

That would help me reproduce and debug the problem.

Thank you!
Mark.

Hey @markmark, thanks for the reply!
Here is the helm install command I used:

helm install --namespace temporalio-dev \
  --set server.replicaCount=1 \
  --set cassandra.config.cluster_size=1 \
  --set prometheus.enabled=false \
  --set grafana.enabled=false \
  --set elasticsearch.enabled=false \
  --set kafka.enabled=false \
  temporaltest . --timeout 15m

Here is the kubectl get pods output:

➜ ~ kubectl get pods -n temporalio-dev
NAME                                       READY   STATUS     RESTARTS   AGE
temporaltest-admintools-7c5b9798ff-7jqrf   1/1     Running    0          179m
temporaltest-cassandra-0                   1/1     Running    0          24h
temporaltest-frontend-6bf7969c9b-277p5     0/1     Init:2/4   0          27h
temporaltest-history-5ddfb6d5cf-km5hf      0/1     Init:2/4   0          36h
temporaltest-matching-7d67c4fc5-4dskb      0/1     Init:2/4   0          24h
temporaltest-web-7f5bb66cf-lx9xt           1/1     Running    0          24h
temporaltest-worker-7b689c8d-zvrmb         0/1     Init:2/4   0          15h

As for connectivity, I can ping the Cassandra service from the frontend.
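
A status of Init:2/4 means that two of the pod's four init containers have finished, so the third one (zero-based index 2) is the one still waiting. The following is a minimal sketch for picking the stuck init container out of that output so its logs can be read. The helper name stuck_init_containers is made up for illustration; the namespace and pod names are the ones from this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of kubectl or the Temporal chart).
# Reads `kubectl get pods` output on stdin and, for every pod whose STATUS is
# "Init:N/M", prints "<pod-name> <N>". Init containers run one at a time, so N
# (zero-based) is the index of the init container that has not finished yet.
# Other init states such as "Init:Error" are deliberately not matched.
stuck_init_containers() {
  awk '$3 ~ "^Init:[0-9]+/[0-9]+$" {
    split($3, status, ":")          # "Init:2/4" -> status[2] = "2/4"
    split(status[2], counts, "/")   # "2/4"      -> counts[1] = "2"
    print $1, counts[1]
  }'
}

# Example usage against a live cluster (namespace taken from this thread):
#   kubectl get pods -n temporalio-dev | stuck_init_containers |
#   while read -r pod idx; do
#     name="$(kubectl get pod -n temporalio-dev "$pod" \
#       -o jsonpath="{.spec.initContainers[$idx].name}")"
#     kubectl logs -n temporalio-dev "$pod" -c "$name" --tail=20
#   done
```

Looking up the init container name by index avoids assuming what the chart calls its init containers.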

hi @markmark

any more insights on this?

best

Hi Shawel,

Apologies for the slow response on my part!

I am having a hard time reproducing this. Are you still seeing this issue?

Here are my steps (forgive the bird theme;):

  1. Create the namespace:
~/src/helm-charts $ cat ns.json
{
  "apiVersion": "v1",
  "kind": "Namespace",
  "metadata": {
    "name": "duck",
    "labels": {
      "name": "duck"
    }
  }
}
 ~/src/helm-charts $ kubectl create -f ns.json
namespace/duck created
  2. Install Temporal into that namespace:
~/src/helm-charts $ helm install --namespace duck --set server.replicaCount=1 --set cassandra.config.cluster_size=1 --set prometheus.enabled=false --set grafana.enabled=false --set elasticsearch.enabled=false --set kafka.enabled=false temporaltest . --timeout 15m
NAME: temporaltest
LAST DEPLOYED: Tue Sep 15 09:55:57 2020
NAMESPACE: duck
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
To verify that Temporal has started, run:

  kubectl --namespace=duck get pods -l "app.kubernetes.io/instance=temporaltest"
  3. Check that the pods are running (give it a couple of minutes, and ignore kibana for now):
~/src/helm-charts $ kubectl get pods --namespace duck
NAME                                       READY   STATUS    RESTARTS   AGE
temporaltest-admintools-76684c7c59-4scgz   1/1     Running   0          4m9s
temporaltest-cassandra-0                   1/1     Running   0          4m9s
temporaltest-frontend-56c4df6845-qdrwn     1/1     Running   2          4m9s
temporaltest-history-9655b4746-hg54s       1/1     Running   3          4m9s
temporaltest-kibana-8c9c486d7-d66l9        0/1     Running   0          4m9s
temporaltest-matching-85f8c9bcc6-gk6mg     1/1     Running   2          4m9s
temporaltest-web-84864fdddc-qjtsm          1/1     Running   0          4m9s
temporaltest-worker-9bd8b68cf-x6zxh        1/1     Running   2          4m9s
  4. Do a basic check (create a new Temporal namespace, “chickens”;):
~/src/helm-charts $ kubectl --namespace duck exec -it services/temporaltest-admintools -- bash -c 'tctl --ns chickens namespace register'
Namespace chickens successfully registered.
~/src/helm-charts $ kubectl --namespace duck exec -it services/temporaltest-admintools -- bash -c 'tctl --ns chickens namespace describe'
Name: chickens
Id: 7487be56-c238-478a-9b2f-0ffa18e1bb98
Description:
OwnerEmail:
NamespaceData: map[string]string(nil)
State: Registered
RetentionInDays: 72h0m0s
ActiveClusterName: active
Clusters: active
HistoryArchivalState: Disabled
VisibilityArchivalState: Disabled
Bad binaries to reset:
+-----------------+----------+------------+--------+
| BINARY CHECKSUM | OPERATOR | START TIME | REASON |
+-----------------+----------+------------+--------+
+-----------------+----------+------------+--------+

Hey @markmark

No problem, thanks for the reply. The failure actually appeared after I did a rolling update of the Kubernetes nodes. It worked the first time I installed the chart: I created a namespace and everything, with no issues. It also still comes up on new installs. I expected it to come back up after node failures and restarts as well, and that is the current issue.

Best
Shawel

Ah!!! If you could share more details on the rolling update that you performed (here or in a direct message), that would be super-helpful – we will look into testing that scenario (and fixing bugs that, it sounds like, it might have;). Thank you, Mark.

Sure, it was a Kubernetes version upgrade from v15.0 to 15.1 using kops (kops update cluster --yes). It recycles all the nodes in the cluster. Temporal never came back up after that.
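
Because the upgrade recycles every node, any Cassandra data kept only on a node's local disk would not survive it. Whether that applies here depends on how the chart configures Cassandra storage, which this thread does not show, so treat it as an assumption to verify. The following is a minimal sketch for checking whether the Cassandra pod is backed by a PersistentVolumeClaim. The helper name cassandra_pvcs is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper. Reads `kubectl get pvc` output on stdin and prints the
# NAME and STATUS of every claim whose name mentions "cassandra".
# Empty output means no such claim was listed.
cassandra_pvcs() {
  awk 'NR > 1 && $1 ~ /cassandra/ { print $1, $2 }'
}

# Example usage against a live cluster (namespace taken from this thread):
#   kubectl get pvc -n temporalio-dev | cassandra_pvcs
```

If nothing is printed, the Cassandra data directory is probably not on a persistent volume, and anything stored there would have to be recreated after the pod is rescheduled.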


Perfect, this is great information, thank you.

It seems the keyspace information in Cassandra is lost when Kubernetes restarts the node. I don't know why that happens, but I had to run the following from the admin tools pod to get things working again:

# Point temporal-cassandra-tool at the Cassandra service
export CASSANDRA_HOST=temporalio-cassandra.temporalio-dev.svc
export CASSANDRA_PORT=9042

# Recreate both keyspaces
temporal-cassandra-tool create -k temporal
temporal-cassandra-tool create -k temporal_visibility --replication-factor 1

# Set up the schema for the default keyspace
export CASSANDRA_KEYSPACE=temporal
temporal-cassandra-tool setup-schema -v 0.0
temporal-cassandra-tool update-schema -d /etc/temporal/schema/cassandra/temporal/versioned/

# Set up the schema for the visibility keyspace (it has its own schema directory)
export CASSANDRA_KEYSPACE=temporal_visibility
temporal-cassandra-tool setup-schema -v 0.0
temporal-cassandra-tool update-schema -d /etc/temporal/schema/cassandra/visibility/versioned/
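
Before recreating anything, it can help to confirm which keyspaces are actually missing. The following is a minimal sketch, assuming cqlsh is available inside the Cassandra pod and using the two keyspace names from the commands above. The helper name missing_keyspaces is hypothetical, and the pod name temporaltest-cassandra-0 is taken from the earlier kubectl get pods output.

```shell
#!/usr/bin/env bash
# Hypothetical helper. Reads a keyspace listing on stdin (for example the
# output of cqlsh's DESCRIBE KEYSPACES) and prints each expected Temporal
# keyspace that does not appear in it.
missing_keyspaces() {
  local listing ks
  listing="$(cat)"
  for ks in temporal temporal_visibility; do
    # -w matches whole words only, so "temporal" is not satisfied by
    # "temporal_visibility" (underscore counts as a word character).
    printf '%s\n' "$listing" | grep -qw -- "$ks" || echo "$ks"
  done
}

# Example usage against a live cluster (names taken from this thread):
#   kubectl exec -n temporalio-dev temporaltest-cassandra-0 -- \
#     cqlsh -e 'DESCRIBE KEYSPACES' | missing_keyspaces
```

Empty output means both keyspaces exist, in which case the problem lies elsewhere, such as the schema version or connectivity.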