Frontend crashlooping after failing to create index in ES

Hi Folks

I recently deployed the latest version of Temporal with the following external dependencies:
default: Aurora on AWS
visibility: Aurora on AWS
es-visibility: Elasticsearch (7.9) on AWS

Frontend seems to be constantly restarting.
When I look at the “previous” container’s error, here is what I see:

2021/03/10 21:28:54 Loading config; env=docker,zone=,configDir=/etc/temporal/config
2021/03/10 21:28:54 Loading config files=[/etc/temporal/config/docker.yaml]
{"level":"info","ts":"2021-03-10T21:28:54.919Z","msg":"Starting server for services","value":"[frontend]","logging-call-at":"server.go:110"}
Unable to start server: sql schema version compatibility check failed: dial tcp connect: connection timed out.
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"The mapping definition cannot be nested under a type [_doc] unless include_type_name is set to true."}],"type":"illegal_argument_exception","reason":"The mapping definition cannot be nested under a type [_doc] unless include_type_name is set to true."},"status":400}{"error":{"root_cause":[{"type":"resource_already_exists_exception","reason":"index [temporal-visibility-dev/GtAzt900S-Suqpw0rOvGzw] already exists","index_uuid":"GtAzt900S-Suqpw0rOvGzw","index":"temporal-visibility-dev"}],"type":"resource_already_exists_exception","reason":"index [temporal-visibility-dev/GtAzt900S-Suqpw0rOvGzw] already exists","index_uuid":"GtAzt900S-Suqpw0rOvGzw","index":"temporal-visibility-dev"},"status":400}%

There seem to be two issues:

  1. SQL schema version compatibility check failure
  2. ES index creation failure

But to me it’s not clear what exactly the issue is.
Can you please provide some pointers on how to triage it?

Things to note:

  1. Temporal did create the index successfully on its first try; I can see it in ES. I am not sure why it throws an error on subsequent attempts. Is this error a red herring?
  2. We did follow the temporal-sql-tool instructions to create the schemas in the Aurora DB, so I am not sure what step we are missing here.
➜  temporal-mysql git:(cancel_calendar_wf) ✗ k get po
NAME                                               READY   STATUS             RESTARTS   AGE
temporal-admintools-58f4f7c68-48bbw                1/1     Running            0          138m
temporal-frontend-85647c697f-gsjxv                 0/1     CrashLoopBackOff   22         138m
temporal-grafana-56b45c99c8-kd8c7                  1/1     Running            0          138m
temporal-history-5979c5b4d8-sdwck                  0/1     CrashLoopBackOff   22         138m
temporal-kube-state-metrics-79bdd5c9db-ktsb4       1/1     Running            0          138m
temporal-matching-67fd59449-4cvx7                  0/1     CrashLoopBackOff   22         138m
temporal-prometheus-alertmanager-c69d5f64f-smb2p   2/2     Running            0          138m
temporal-prometheus-pushgateway-6fb4876f8b-fp5mp   1/1     Running            0          138m
temporal-prometheus-server-56b5cd5478-6p9jv        2/2     Running            0          138m
temporal-web-67875b6f5b-7ddj7                      1/1     Running            0          138m
temporal-worker-6b5c6f75f-mj67m                    0/1     CrashLoopBackOff   22         138m

Note the 22 restarts.

The real error is the connection timeout. Can you check that your DB is accessible?
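One quick way to verify that the database endpoint accepts TCP connections at all. This is a minimal sketch assuming bash and coreutils `timeout` are available; `DB_HOST`/`DB_PORT` are placeholders you should point at your Aurora endpoint:

```shell
# Minimal TCP reachability check (assumes bash and coreutils `timeout`).
# DB_HOST / DB_PORT are placeholders -- set them to your Aurora endpoint.
check_tcp() {
  local host=$1 port=$2
  # bash's /dev/tcp pseudo-device opens a raw TCP connection;
  # `timeout 3` bounds the attempt so a blackholed route fails fast.
  if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "reachable"
  else
    echo "unreachable"
  fi
}

check_tcp "${DB_HOST:-127.0.0.1}" "${DB_PORT:-3306}"
```

Run it from a pod that shares the frontend’s network path (e.g. `kubectl exec -it` into the admintools pod); `unreachable` points at security groups or VPC routing rather than at Temporal itself.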

First of all, don’t use AWS ES 7.9; it is broken (unless you have a dedicated deployment from AWS). Use 7.7.
Secondly, how did you deploy Temporal? The Docker image still defaults to ES6, and that _doc-related error clearly points to an ES version mismatch (Temporal is using the schema from ES6). Our Helm charts and docker-compose files are configured to use ES7 by default, but the Docker image itself is not. You need to set the ES_VERSION env variable to v7.
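As a sketch, the env var could be set in the container spec like this (only `ES_VERSION=v7` comes from this thread; the surrounding structure is an assumption about your Deployment/Helm values):

```yaml
# Assumed container-spec fragment -- only ES_VERSION=v7 is the
# documented requirement here; the other keys are illustrative.
env:
  - name: ENABLE_ES
    value: "true"
  - name: ES_VERSION
    value: "v7"
```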

Hi Alex,
Thanks for your answers. I tried setting ES_VERSION=v7 and I am still running into the same issue.

  temporal-mysql git:(temporal_170_upgrade) ✗ k logs temporal-frontend-7f6cdc9fb9-mm8vz
+ DB=cassandra
+ ENABLE_ES=true
+ ES_PORT=443
+ ES_SCHEME=https
+ ES_VIS_INDEX=temporal-visibility-dev
+ RF=1
{"level":"info","ts":"2021-03-15T13:08:27.310Z","msg":"Get dynamic config","name":"history.persistenceMaxQPS","value":"3000","default-value":"3000","logging-call-at":"config.go:79"}
{"level":"info","ts":"2021-03-15T13:08:27.331Z","msg":"Get dynamic config","name":"system.advancedVisibilityWritingMode","value":"on","default-value":"on","logging-call-at":"config.go:79"}
{"level":"info","ts":"2021-03-15T13:08:27.331Z","msg":"Get dynamic config","name":"history.visibilityQueue","value":"internal","default-value":"internal","logging-call-at":"config.go:79"}
Unable to start server: visibility index in missing in Elasticsearch config.
{"acknowledged":true}{"error":{"root_cause":[{"type":"resource_already_exists_exception","reason":"index [temporal-visibility-dev/ko8wbOCESR6-Kwf5-vcPqw] already exists","index_uuid":"ko8wbOCESR6-Kwf5-vcPqw","index":"temporal-visibility-dev"}],"type":"resource_already_exists_exception","reason":"index [temporal-visibility-dev/ko8wbOCESR6-Kwf5-vcPqw] already exists","index_uuid":"ko8wbOCESR6-Kwf5-vcPqw","index":"temporal-visibility-dev"},"status":400}%

(removed parts of logs for brevity)

When I describe the pod, I do see the environment variables being set properly:

      POD_IP:                               (v1:status.podIP)
      ENABLE_ES:                           true
      ES_SEEDS:                            <removed for security purposes>
      ES_PORT:                             443
      ES_SCHEME:                           https
      ES_VERSION:                          v7
      SERVICES:                            frontend
      TEMPORAL_STORE_PASSWORD:             <set to the key 'password' in secret 'temporal-creds'>  Optional: false
      TEMPORAL_VISIBILITY_STORE_PASSWORD:  <set to the key 'password' in secret 'temporal-creds'>  Optional: false

Actually, adding this to the config worked:
visibilityIndex: "temporal-visibility"
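For context, a hedged sketch of where that key could live in the Helm values. Only the `visibilityIndex` key and value are confirmed by the fix above; the surrounding structure is my assumption about the chart layout:

```yaml
# Assumed Helm values fragment; only visibilityIndex is confirmed
# by this thread, the rest is illustrative.
elasticsearch:
  enabled: true
  version: "v7"
  scheme: https
  port: 443
  visibilityIndex: "temporal-visibility"
```

The index name must match the one actually created in Elasticsearch, otherwise the server fails at startup with the “visibility index … missing in Elasticsearch config” error shown above.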