Temporal.io restart problem on Kubernetes node restart

I used the Helm chart to deploy Temporal.io in Kubernetes. It was working great, but I noticed that the frontend and history servers are failing to restart. I see the following log:
“waiting for default keyspace to become ready”. Is there anywhere I should start looking to debug this?
Should it not recover automatically on node restarts?


Hey Shawel,

Thank you for the report!

This might happen when the db schemas have not been created on the database server (or, perhaps, if the db server is not reachable, though that shouldn’t be the case here: I would expect a different error message, and I would need to try a repro to be sure).
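In the meantime, if you want to check the schema side yourself, one way (just a sketch; the pod name assumes the chart’s default temporaltest release, so adjust the namespace and names to your install) is to query Cassandra directly from its pod:

# list keyspaces; temporal and temporal_visibility should be present if the schema was created
kubectl --namespace <your-namespace> exec -it temporaltest-cassandra-0 -- cqlsh -e "DESCRIBE KEYSPACES"

# if they are there, this shows whether the tables were actually created
kubectl --namespace <your-namespace> exec -it temporaltest-cassandra-0 -- cqlsh -e "DESCRIBE KEYSPACE temporal"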

Do you think you could share (here or in a direct message):

  • the helm install ... command you ran to install temporal, and
  • the output of the kubectl get pods command?

That would help me reproduce and debug the problem.

Thank you!
Mark.

Hey @markmark, thanks for the reply!
Here is the helm install command I used:

helm install --namespace temporalio-dev \
  --set server.replicaCount=1 \
  --set cassandra.config.cluster_size=1 \
  --set prometheus.enabled=false \
  --set grafana.enabled=false \
  --set elasticsearch.enabled=false \
  --set kafka.enabled=false \
  temporaltest . --timeout 15m

Here is the kubectl output:

➜ ~ kubectl get pods -n temporalio-dev
NAME                                       READY   STATUS     RESTARTS   AGE
temporaltest-admintools-7c5b9798ff-7jqrf   1/1     Running    0          179m
temporaltest-cassandra-0                   1/1     Running    0          24h
temporaltest-frontend-6bf7969c9b-277p5     0/1     Init:2/4   0          27h
temporaltest-history-5ddfb6d5cf-km5hf      0/1     Init:2/4   0          36h
temporaltest-matching-7d67c4fc5-4dskb      0/1     Init:2/4   0          24h
temporaltest-web-7f5bb66cf-lx9xt           1/1     Running    0          24h
temporaltest-worker-7b689c8d-zvrmb         0/1     Init:2/4   0          15h

Regarding connectivity: I can ping the Cassandra service from the frontend pod.
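In case it helps, this is how I have been inspecting where the pods are stuck (just the standard kubectl commands; the -c value is a placeholder for whichever init container describe reports as still running):

# show which init containers completed and which one is still waiting
kubectl -n temporalio-dev describe pod temporaltest-frontend-6bf7969c9b-277p5

# then tail the logs of the init container that is stuck
kubectl -n temporalio-dev logs temporaltest-frontend-6bf7969c9b-277p5 -c <stuck-init-container>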

Hi @markmark,

any more insights on this?

Best

Hi Shawel,

Apologies for the slow response on my part!

I am having a hard time reproducing this. Are you still seeing this issue?

Here are my steps (forgive the bird theme ;)):

  1. Create the namespace:
~/src/helm-charts $ cat ns.json
{
  "apiVersion": "v1",
  "kind": "Namespace",
  "metadata": {
    "name": "duck",
    "labels": {
      "name": "duck"
    }
  }
}
 ~/src/helm-charts $ kubectl create -f ns.json
namespace/duck created
  2. Install Temporal into that namespace:
~/src/helm-charts $ helm install --namespace duck --set server.replicaCount=1 --set cassandra.config.cluster_size=1 --set prometheus.enabled=false --set grafana.enabled=false --set elasticsearch.enabled=false --set kafka.enabled=false temporaltest . --timeout 15m
NAME: temporaltest
LAST DEPLOYED: Tue Sep 15 09:55:57 2020
NAMESPACE: duck
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
To verify that Temporal has started, run:

  kubectl --namespace=duck get pods -l "app.kubernetes.io/instance=temporaltest"
  3. Check that the pods are running (giving it a couple of minutes, and ignoring Kibana for now):
~/src/helm-charts $ kubectl get pods --namespace duck
NAME                                       READY   STATUS    RESTARTS   AGE
temporaltest-admintools-76684c7c59-4scgz   1/1     Running   0          4m9s
temporaltest-cassandra-0                   1/1     Running   0          4m9s
temporaltest-frontend-56c4df6845-qdrwn     1/1     Running   2          4m9s
temporaltest-history-9655b4746-hg54s       1/1     Running   3          4m9s
temporaltest-kibana-8c9c486d7-d66l9        0/1     Running   0          4m9s
temporaltest-matching-85f8c9bcc6-gk6mg     1/1     Running   2          4m9s
temporaltest-web-84864fdddc-qjtsm          1/1     Running   0          4m9s
temporaltest-worker-9bd8b68cf-x6zxh        1/1     Running   2          4m9s
  4. Do a basic check (create a new Temporal namespace, “chickens”):
~/src/helm-charts $ kubectl --namespace duck exec -it services/temporaltest-admintools -- bash -c 'tctl --ns chickens namespace register'
Namespace chickens successfully registered.
~/src/helm-charts $ kubectl --namespace duck exec -it services/temporaltest-admintools -- bash -c 'tctl --ns chickens namespace describe'
Name: chickens
Id: 7487be56-c238-478a-9b2f-0ffa18e1bb98
Description:
OwnerEmail:
NamespaceData: map[string]string(nil)
State: Registered
RetentionInDays: 72h0m0s
ActiveClusterName: active
Clusters: active
HistoryArchivalState: Disabled
VisibilityArchivalState: Disabled
Bad binaries to reset:
+-----------------+----------+------------+--------+
| BINARY CHECKSUM | OPERATOR | START TIME | REASON |
+-----------------+----------+------------+--------+
+-----------------+----------+------------+--------+

Hey @markmark

No problem, and thanks for the reply. The failure actually came when I did a rolling update of the Kubernetes nodes. It did work the first time I installed the chart: I created a namespace and everything with no issues, and it still comes up fine on new installs. I expected it to also come back up after node failures and restarts, which is the current issue.

Best
Shawel

Ah! If you could share more details on the rolling update that you performed (here or in a direct message), that would be super helpful. We will look into testing that scenario (and fixing the bugs that, it sounds like, it might have ;)). Thank you, Mark.

Sure, it was a Kubernetes version upgrade from v1.15.0 to v1.15.1 using kops (kops update cluster --yes). It recycles all the nodes in the cluster. Temporal never came up after that.


Perfect, this is great information, thank you.

It seems like the keyspace information in Cassandra is gone after the Kubernetes node restart. I don’t know why it happens, but I needed to run the following from the admintools pod to get things working again:

export CASSANDRA_HOST=temporalio-cassandra.temporalio-dev.svc
export CASSANDRA_PORT=9042

temporal-cassandra-tool create -k temporal
temporal-cassandra-tool create -k temporal_visibility --replication-factor 1

export CASSANDRA_KEYSPACE=temporal
temporal-cassandra-tool setup-schema -v 0.0
temporal-cassandra-tool update-schema -d /etc/temporal/schema/cassandra/temporal/versioned/

export CASSANDRA_KEYSPACE=temporal_visibility
temporal-cassandra-tool setup-schema -v 0.0
temporal-cassandra-tool update-schema -d /etc/temporal/schema/cassandra/visibility/versioned/
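For convenience, the same workaround can be wrapped into a single kubectl exec against the admintools service, so it can be run from outside the pod. This is just a sketch of the commands above in one shot; the host, namespace, and service names are the ones from my setup, so adjust them to your release:

kubectl -n temporalio-dev exec services/temporaltest-admintools -- bash -c '
  export CASSANDRA_HOST=temporalio-cassandra.temporalio-dev.svc
  export CASSANDRA_PORT=9042

  # recreate the keyspaces
  temporal-cassandra-tool create -k temporal
  temporal-cassandra-tool create -k temporal_visibility --replication-factor 1

  # reapply the main schema
  CASSANDRA_KEYSPACE=temporal temporal-cassandra-tool setup-schema -v 0.0
  CASSANDRA_KEYSPACE=temporal temporal-cassandra-tool update-schema -d /etc/temporal/schema/cassandra/temporal/versioned/

  # reapply the visibility schema
  CASSANDRA_KEYSPACE=temporal_visibility temporal-cassandra-tool setup-schema -v 0.0
  CASSANDRA_KEYSPACE=temporal_visibility temporal-cassandra-tool update-schema -d /etc/temporal/schema/cassandra/visibility/versioned/
'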

Hi @markmark and @shawel_negussie.

I’m having the same problem with my Temporal.io installation.
Temporal is deployed on Google Kubernetes Engine in a production environment, and when the nodes restart, the server never comes up again; the logs show the message “waiting for default keyspace to become ready”.

The Helm command used to install was:

helm install \
--set server.replicaCount=1 \
--set cassandra.config.cluster_size=1 \
--set prometheus.enabled=true \
--set grafana.enabled=true \
--set elasticsearch.enabled=true \
--set kafka.enabled=true \
temporal . --timeout 15m --namespace temporal

The workaround proposed by @shawel_negussie sometimes works, but it’s not ideal to have to do that after every restart.

I’m wondering if we have a problem with the config or environment variables, because I noticed that temporal-admintools is pointing at IPs to reach the other services. So maybe those IPs changed on the k8s node restart?
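To test that theory, something like this should show whether the Cassandra service DNS name still resolves and what host the pods are actually configured with (only a sketch; the names assume my temporal release in the temporal namespace):

# check that the Cassandra service resolves from inside the cluster
kubectl -n temporal run -it --rm dnscheck --image=busybox --restart=Never -- nslookup temporal-cassandra

# check what Cassandra host/port the pods were given
kubectl -n temporal exec deploy/temporal-admintools -- env | grep -i cassandra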

What do you think about it?

I’ll appreciate your help!

@ryland or @maxim, can you help me?

AFAIK we don’t recommend helm for production deployments and especially upgrades.

I have switched to my own instance of MySQL since opening this ticket, and it seems more stable! Node affinity also seems to help.
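For reference, the install against an external MySQL looked roughly like this. It is based on the values.mysql.yaml example in the helm-charts repo; the schema still has to be created up front with temporal-sql-tool, the placeholders are mine, and the exact value keys may differ between chart versions:

helm install \
  -f values/values.mysql.yaml \
  --set server.replicaCount=1 \
  --set elasticsearch.enabled=false \
  --set server.config.persistence.default.sql.host=<mysql-host> \
  --set server.config.persistence.default.sql.user=<mysql-user> \
  --set server.config.persistence.default.sql.password=<mysql-password> \
  --set server.config.persistence.visibility.sql.host=<mysql-host> \
  --set server.config.persistence.visibility.sql.user=<mysql-user> \
  --set server.config.persistence.visibility.sql.password=<mysql-password> \
  temporaltest . --timeout 15m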

What is the recommendation for production deployments?


Thanks a lot! I’ll try this too.

I just encountered a similar issue in our test environment. We are running a 3-node Cassandra cluster in K8s exclusively for the Temporal keyspaces. A K8s worker node hosting one of the Cassandra pods started evicting pods, and as a result the Cassandra pod was restarted (on the same node). After this, all the Temporal keyspaces were gone from Cassandra. We also run a separate Cassandra cluster on the same worker nodes; those pods were also restarted without any loss of data, so this seems to be specific to the Temporal keyspaces.

Has there been any further investigation into how this could occur? Is there anything in temporal that would trigger the keyspaces to be removed or refreshed?
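One thing I still want to rule out is plain data loss at the storage layer: if the Temporal Cassandra pods are not backed by persistent volumes (or lose them on eviction), a pod restart could wipe the data directory. A quick way to check, as a sketch with placeholder names:

# are there bound persistent volume claims for the Cassandra pods?
kubectl -n <temporal-cassandra-namespace> get pvc

# does the statefulset actually define volume claim templates for the data directory?
kubectl -n <temporal-cassandra-namespace> get statefulset <cassandra-statefulset> -o yaml | grep -A 5 volumeClaimTemplates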

Did you deploy it using the helm chart?

I couldn’t resolve this issue with Cassandra in my deployment.

Yes, temporal was deployed using the helm chart and the keyspaces were created using the temporal-cassandra-tool.

I did notice in one of the posts that it is recommended to set replication to 3 for production, whereas our test environment would have defaulted to a replication factor of 1. I doubt that would have caused the keyspaces to be removed, though?
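For what it’s worth, the current replication settings can be checked (and raised) directly through cqlsh. This is only a sketch with placeholder pod names and an assumed SimpleStrategy keyspace; after raising the factor, a repair is needed so the existing data gets copied to the new replicas:

# show the replication settings for the temporal keyspace
kubectl exec -it <cassandra-pod-0> -- cqlsh -e "SELECT keyspace_name, replication FROM system_schema.keyspaces WHERE keyspace_name = 'temporal';"

# raise the replication factor, then repair so the data is actually re-replicated
kubectl exec -it <cassandra-pod-0> -- cqlsh -e "ALTER KEYSPACE temporal WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};"
kubectl exec -it <cassandra-pod-0> -- nodetool repair temporal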

We strongly advise against running production Cassandra (or any other DB) deployments on K8s, especially using the provided Helm Chart. They are included for quick start scenarios. You are practically guaranteed to run into data loss issues with the Helm Chart based Cassandra deployment.