Temporal.io restart problem on Kubernetes node restart

I used the Helm chart to deploy Temporal.io in Kubernetes. It was working great, but I noticed that the frontend and history servers are failing to restart. I see the following log:
“waiting for default keyspace to become ready”. Is there anywhere I should start looking to debug this?
Should it not recover automatically on node restarts?


Hey Shawel,

Thank you for the report!

This might happen when the db schemas have not been created on the database server (or, perhaps, if the db server is not reachable, though that shouldn’t be the case here: I would expect a different error message, and I would need to try a repro to be sure).
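In the meantime, if you want to check the schema side yourself, one way (just a sketch; the pod name assumes the chart’s default temporaltest release, so adjust the namespace and names to your install) is to query Cassandra directly from its pod:

# list keyspaces; temporal and temporal_visibility should be present if the schema was created
kubectl --namespace <your-namespace> exec -it temporaltest-cassandra-0 -- cqlsh -e "DESCRIBE KEYSPACES"

# if they are there, this shows whether the tables were actually created
kubectl --namespace <your-namespace> exec -it temporaltest-cassandra-0 -- cqlsh -e "DESCRIBE KEYSPACE temporal"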

Do you think you could share (here or in a direct message):

  • the helm install ... command you ran to install temporal, and
  • the output of the kubectl get pods command?

That would help me reproduce and debug the problem.

Thank you!
Mark.

Hey @markmark, thanks for the reply!
Here is the helm install command I used:

helm install --namespace temporalio-dev \
  --set server.replicaCount=1 \
  --set cassandra.config.cluster_size=1 \
  --set prometheus.enabled=false \
  --set grafana.enabled=false \
  --set elasticsearch.enabled=false \
  --set kafka.enabled=false \
  temporaltest . --timeout 15m

Here is the kubectl output:

➜ ~ kubectl get pods -n temporalio-dev
NAME                                       READY   STATUS     RESTARTS   AGE
temporaltest-admintools-7c5b9798ff-7jqrf   1/1     Running    0          179m
temporaltest-cassandra-0                   1/1     Running    0          24h
temporaltest-frontend-6bf7969c9b-277p5     0/1     Init:2/4   0          27h
temporaltest-history-5ddfb6d5cf-km5hf      0/1     Init:2/4   0          36h
temporaltest-matching-7d67c4fc5-4dskb      0/1     Init:2/4   0          24h
temporaltest-web-7f5bb66cf-lx9xt           1/1     Running    0          24h
temporaltest-worker-7b689c8d-zvrmb         0/1     Init:2/4   0          15h

Regarding connectivity: I can ping the Cassandra service from the frontend pod.
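In case it helps, this is how I have been inspecting where the pods are stuck (just the standard kubectl commands; the -c value is a placeholder for whichever init container describe reports as still running):

# show which init containers completed and which one is still waiting
kubectl -n temporalio-dev describe pod temporaltest-frontend-6bf7969c9b-277p5

# then tail the logs of the init container that is stuck
kubectl -n temporalio-dev logs temporaltest-frontend-6bf7969c9b-277p5 -c <stuck-init-container>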

Hi @markmark,

any more insights on this?

Best

Hi Shawel,

Apologies for the slow response on my part!

I am having a hard time reproducing this. Are you still seeing this issue?

Here are my steps (forgive the bird theme ;)):

  1. Create the namespace:
~/src/helm-charts $ cat ns.json
{
  "apiVersion": "v1",
  "kind": "Namespace",
  "metadata": {
    "name": "duck",
    "labels": {
      "name": "duck"
    }
  }
}
 ~/src/helm-charts $ kubectl create -f ns.json
namespace/duck created
  2. Install Temporal into that namespace:
~/src/helm-charts $ helm install --namespace duck --set server.replicaCount=1 --set cassandra.config.cluster_size=1 --set prometheus.enabled=false --set grafana.enabled=false --set elasticsearch.enabled=false --set kafka.enabled=false temporaltest . --timeout 15m
NAME: temporaltest
LAST DEPLOYED: Tue Sep 15 09:55:57 2020
NAMESPACE: duck
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
To verify that Temporal has started, run:

  kubectl --namespace=duck get pods -l "app.kubernetes.io/instance=temporaltest"
  3. Check that the pods are running (giving it a couple of minutes, and ignoring Kibana for now):
~/src/helm-charts $ kubectl get pods --namespace duck
NAME                                       READY   STATUS    RESTARTS   AGE
temporaltest-admintools-76684c7c59-4scgz   1/1     Running   0          4m9s
temporaltest-cassandra-0                   1/1     Running   0          4m9s
temporaltest-frontend-56c4df6845-qdrwn     1/1     Running   2          4m9s
temporaltest-history-9655b4746-hg54s       1/1     Running   3          4m9s
temporaltest-kibana-8c9c486d7-d66l9        0/1     Running   0          4m9s
temporaltest-matching-85f8c9bcc6-gk6mg     1/1     Running   2          4m9s
temporaltest-web-84864fdddc-qjtsm          1/1     Running   0          4m9s
temporaltest-worker-9bd8b68cf-x6zxh        1/1     Running   2          4m9s
  4. Do a basic check (create a new Temporal namespace, “chickens”):
~/src/helm-charts $ kubectl --namespace duck exec -it services/temporaltest-admintools -- bash -c 'tctl --ns chickens namespace register'
Namespace chickens successfully registered.
~/src/helm-charts $ kubectl --namespace duck exec -it services/temporaltest-admintools -- bash -c 'tctl --ns chickens namespace describe'
Name: chickens
Id: 7487be56-c238-478a-9b2f-0ffa18e1bb98
Description:
OwnerEmail:
NamespaceData: map[string]string(nil)
State: Registered
RetentionInDays: 72h0m0s
ActiveClusterName: active
Clusters: active
HistoryArchivalState: Disabled
VisibilityArchivalState: Disabled
Bad binaries to reset:
+-----------------+----------+------------+--------+
| BINARY CHECKSUM | OPERATOR | START TIME | REASON |
+-----------------+----------+------------+--------+
+-----------------+----------+------------+--------+

Hey @markmark

No problem, and thanks for the reply. The failure actually came when I did a rolling update of the Kubernetes nodes. It did work the first time I installed the chart: I created a namespace and everything with no issues, and it still comes up fine on new installs. I expected it to also come back up after node failures and restarts, which is the current issue.

Best
Shawel

Ah! If you could share more details on the rolling update that you performed (here or in a direct message), that would be super helpful. We will look into testing that scenario (and fixing the bugs that, it sounds like, it might have ;)). Thank you, Mark.

Sure, it was a Kubernetes version upgrade from v1.15.0 to v1.15.1 using kops (kops update cluster --yes). It recycles all the nodes in the cluster. Temporal never came up after that.


Perfect, this is great information, thank you.

It seems like the keyspace information in Cassandra is gone after the Kubernetes node restart. I don’t know why it happens, but I needed to run the following from the admintools pod to get things working again:

export CASSANDRA_HOST=temporalio-cassandra.temporalio-dev.svc
export CASSANDRA_PORT=9042

temporal-cassandra-tool create -k temporal
temporal-cassandra-tool create -k temporal_visibility --replication-factor 1

export CASSANDRA_KEYSPACE=temporal
temporal-cassandra-tool setup-schema -v 0.0
temporal-cassandra-tool update-schema -d /etc/temporal/schema/cassandra/temporal/versioned/

export CASSANDRA_KEYSPACE=temporal_visibility
temporal-cassandra-tool setup-schema -v 0.0
temporal-cassandra-tool update-schema -d /etc/temporal/schema/cassandra/visibility/versioned/
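For convenience, the same workaround can be wrapped into a single kubectl exec against the admintools service, so it can be run from outside the pod. This is just a sketch of the commands above in one shot; the host, namespace, and service names are the ones from my setup, so adjust them to your release:

kubectl -n temporalio-dev exec services/temporaltest-admintools -- bash -c '
  export CASSANDRA_HOST=temporalio-cassandra.temporalio-dev.svc
  export CASSANDRA_PORT=9042

  # recreate the keyspaces
  temporal-cassandra-tool create -k temporal
  temporal-cassandra-tool create -k temporal_visibility --replication-factor 1

  # reapply the main schema
  CASSANDRA_KEYSPACE=temporal temporal-cassandra-tool setup-schema -v 0.0
  CASSANDRA_KEYSPACE=temporal temporal-cassandra-tool update-schema -d /etc/temporal/schema/cassandra/temporal/versioned/

  # reapply the visibility schema
  CASSANDRA_KEYSPACE=temporal_visibility temporal-cassandra-tool setup-schema -v 0.0
  CASSANDRA_KEYSPACE=temporal_visibility temporal-cassandra-tool update-schema -d /etc/temporal/schema/cassandra/visibility/versioned/
'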

Hi @markmark and @shawel_negussie.

I’m having the same problem with my Temporal.io installation.
Temporal is deployed on Google Kubernetes Engine in a production environment, and when the nodes restart, the server never comes up again; the logs show the message “waiting for default keyspace to become ready”.

The Helm command used to install was:

helm install \
--set server.replicaCount=1 \
--set cassandra.config.cluster_size=1 \
--set prometheus.enabled=true \
--set grafana.enabled=true \
--set elasticsearch.enabled=true \
--set kafka.enabled=true \
temporal . --timeout 15m --namespace temporal

The workaround proposed by @shawel_negussie sometimes works, but it’s not ideal to have to do that after every restart.

I’m wondering if we have a problem with the config or environment variables, because I noticed that temporal-admintools is pointing at IPs to reach the other services. So maybe those IPs changed on the k8s node restart?
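To test that theory, something like this should show whether the Cassandra service DNS name still resolves and what host the pods are actually configured with (only a sketch; the names assume my temporal release in the temporal namespace):

# check that the Cassandra service resolves from inside the cluster
kubectl -n temporal run -it --rm dnscheck --image=busybox --restart=Never -- nslookup temporal-cassandra

# check what Cassandra host/port the pods were given
kubectl -n temporal exec deploy/temporal-admintools -- env | grep -i cassandra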

What do you think about it?

I’ll appreciate your help!

@ryland or @maxim, can you help me?

AFAIK we don’t recommend helm for production deployments and especially upgrades.

I have switched to my own instance of MySQL since opening this ticket, and it seems more stable! Node affinity also seems to help.
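For reference, the install against an external MySQL looked roughly like this. It is based on the values.mysql.yaml example in the helm-charts repo; the schema still has to be created up front with temporal-sql-tool, the placeholders are mine, and the exact value keys may differ between chart versions:

helm install \
  -f values/values.mysql.yaml \
  --set server.replicaCount=1 \
  --set elasticsearch.enabled=false \
  --set server.config.persistence.default.sql.host=<mysql-host> \
  --set server.config.persistence.default.sql.user=<mysql-user> \
  --set server.config.persistence.default.sql.password=<mysql-password> \
  --set server.config.persistence.visibility.sql.host=<mysql-host> \
  --set server.config.persistence.visibility.sql.user=<mysql-user> \
  --set server.config.persistence.visibility.sql.password=<mysql-password> \
  temporaltest . --timeout 15m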

What is the recommendation for production deployments?


Thanks a lot! I’ll try this too.

I just encountered a similar issue in our test environment. We are running a 3-node Cassandra cluster in K8s exclusively for the Temporal keyspaces. A K8s worker node hosting one of the Cassandra pods started evicting pods, and as a result the Cassandra pod was restarted (on the same node). After this, all the Temporal keyspaces were gone from Cassandra. We also run a separate Cassandra cluster on the same worker nodes; those pods were also restarted without any loss of data, so this seems to be specific to the Temporal keyspaces.

Has there been any further investigation into how this could occur? Is there anything in temporal that would trigger the keyspaces to be removed or refreshed?
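One thing I still want to rule out is plain data loss at the storage layer: if the Temporal Cassandra pods are not backed by persistent volumes (or lose them on eviction), a pod restart could wipe the data directory. A quick way to check, as a sketch with placeholder names:

# are there bound persistent volume claims for the Cassandra pods?
kubectl -n <temporal-cassandra-namespace> get pvc

# does the statefulset actually define volume claim templates for the data directory?
kubectl -n <temporal-cassandra-namespace> get statefulset <cassandra-statefulset> -o yaml | grep -A 5 volumeClaimTemplates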

Did you deploy it using the helm chart?

I couldn’t resolve this issue with Cassandra in my deployment.

Yes, temporal was deployed using the helm chart and the keyspaces were created using the temporal-cassandra-tool.

I did notice in one of the posts that it is recommended to set replication to 3 for production, whereas our test environment would have defaulted to a replication factor of 1. I doubt that would have caused the keyspaces to be removed, though?
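For what it’s worth, the current replication settings can be checked (and raised) directly through cqlsh. This is only a sketch with placeholder pod names and an assumed SimpleStrategy keyspace; after raising the factor, a repair is needed so the existing data gets copied to the new replicas:

# show the replication settings for the temporal keyspace
kubectl exec -it <cassandra-pod-0> -- cqlsh -e "SELECT keyspace_name, replication FROM system_schema.keyspaces WHERE keyspace_name = 'temporal';"

# raise the replication factor, then repair so the data is actually re-replicated
kubectl exec -it <cassandra-pod-0> -- cqlsh -e "ALTER KEYSPACE temporal WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};"
kubectl exec -it <cassandra-pod-0> -- nodetool repair temporal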

We strongly advise against running production Cassandra (or any other DB) deployments on K8s, especially using the provided Helm Chart. They are included for quick start scenarios. You are practically guaranteed to run into data loss issues with the Helm Chart based Cassandra deployment.