Temporal.io restart problem on Kubernetes node restart

Hi Maxim,

Just to clarify, the Cassandra DB on K8s that is used for the Temporal keyspaces was not deployed using the Temporal Helm charts. It’s a separate deployment that Temporal connects to.

Can you provide details on why the Helm-chart-based Cassandra deployment can have data loss? Although we did not use the Temporal Helm charts to deploy Cassandra, it would be handy to know what could cause the losses so I can check our Cassandra configuration for similar issues.

I’m not an expert in K8s. But AFAIK running production-quality persistence on it is very non-trivial. And I believe a Helm chart is not the right technology to manage databases on K8s.

Thanks Maxim.

Hello @maxim, I am running into the same problem. Mine is NOT a production env, btw.

I have deployed to a GKE Autopilot cluster using the Helm chart (with small modifications to set resource limits). The Cassandra headless ClusterIP service & pods are running, but the frontend pod is not coming up due to the “waiting for default keyspace to become ready” message. How do I resolve this?

Thanks, Vikram

Hi @Vikram_Bailur
Could you confirm that your DB schema is fully created before the Temporal server comes up?
The Helm charts run batch jobs to set up the schema, and these should complete first.
The error you mentioned can happen if Cassandra is up but the keyspace hasn’t finished being set up.
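
A quick way to check, assuming the default release name temporaltest and keyspace temporal from the chart values (adjust the names for your install):

    kubectl get jobs                                    # the schema setup/update jobs should show 1/1 completions
    kubectl logs job/temporaltest-schema-setup          # job name depends on your release name and chart version
    kubectl exec -it temporaltest-cassandra-0 -- cqlsh -e "DESCRIBE KEYSPACES"   # 'temporal' should be listed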

Hi @tihomir - thanks for your quick response - I think you might be right. I was trying to figure out what is wrong with Cassandra. Since I’m using GKE Autopilot, resources take some time to get provisioned, and I think the batch job might have tried to run before the Cassandra pods were fully up & running. This is the message on the job pod:
*** Can't find temporaltest-cassandra.default.svc.cluster.local: No answer

The Cassandra pod also has the following message:
containers with unready status: [temporaltest-cassandra]

UPDATE: it finally started up and I think the job has run as well.
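
In case anyone else hits this, a few basic things to check while the pods come up (the pod and service names below assume the default release name temporaltest; adjust for yours):

    kubectl get pods                                # wait for temporaltest-cassandra-0 to report 1/1 Ready
    kubectl describe pod temporaltest-cassandra-0   # readiness probe events explain the "unready" status
    kubectl get svc temporaltest-cassandra          # the headless service the schema job tries to resolve
    kubectl get jobs                                # the schema job should eventually show 1/1 completions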

I did see in the docs a way to set up the Cloud SQL proxy in the Helm chart (to use GCP Cloud SQL as the DB instead of Cassandra), but I don’t see specific instructions on how to install the Helm chart with that (sorry, I am new to Helm). Do I just use a bare-minimum Helm chart and then run the helm install for the Cloud SQL proxy?

That should be fine - everything will keep failing and retrying for a while until it comes up. If the batch job didn’t work and isn’t still trying to run, you can create another one - or just log into the admin-tools pod and run the commands the batch job would have run (see the sketch below).
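
For reference, a rough sketch of what those commands look like, assuming the default keyspace temporal, a replication factor of 1, and the schema path baked into the admin-tools image (check your values.yaml for the actual endpoint, keyspace names, and replication factor; the visibility keyspace needs the same steps with its own schema directory):

    # run from a shell inside the admin-tools pod
    CASSANDRA_ENDPOINT=temporaltest-cassandra.default.svc.cluster.local

    # create the keyspace, install the base schema, then apply the versioned updates
    temporal-cassandra-tool --ep $CASSANDRA_ENDPOINT create -k temporal --rf 1
    temporal-cassandra-tool --ep $CASSANDRA_ENDPOINT -k temporal setup-schema -v 0.0
    temporal-cassandra-tool --ep $CASSANDRA_ENDPOINT -k temporal update-schema \
        -d /etc/temporal/schema/cassandra/temporal/versioned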

If you have enough issues getting Cassandra to work and you want to try Cloud SQL with the proxy, you can deploy it to the cluster by setting a value for server.sidecarContainers - https://github.com/temporalio/helm-charts/blob/master/values.yaml#L21

This is just a list of containers that looks like any other Kubernetes containers: section. Anything you put in there will run in the same pod as the Temporal server containers.
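
As a rough sketch (untested; the image tag, port, and instance connection name are placeholders you would replace), a values override using the Cloud SQL Auth Proxy could look like:

    server:
      sidecarContainers:
        - name: cloud-sql-proxy
          image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.8.0
          args:
            - "--port=5432"
            - "my-project:my-region:my-instance"   # your Cloud SQL instance connection name

Then install with something like helm install temporaltest . -f values.cloudsql.yaml and point the server’s SQL persistence config at 127.0.0.1:5432, since the proxy listens on localhost inside the pod.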
