Elasticsearch error for visibility

I get the following error when I try to use Elasticsearch for visibility.

{"level":"error","ts":"2021-04-06T19:42:42.447Z","msg":"Internal service error","service":"frontend","error":"ListOpenWorkflowExecutions failed. Error: elastic: Error 400 (Bad Request): all shards failed [type=search_phase_execution_exception]","logging-call-at":"workflowHandler.go:3406","stacktrace":"go.temporal.io/server/common/log/loggerimpl.(*loggerImpl).Error\n\t/temporal/common/log/loggerimpl/logger.go:138\ngo.temporal.io/server/service/frontend.(*WorkflowHandler).error\n\t/temporal/service/frontend/workflowHandler.go:3406\ngo.temporal.io/server/service/frontend.(*WorkflowHandler).ListOpenWorkflowExecutions\n\t/temporal/service/frontend/workflowHandler.go:2424\ngo.temporal.io/server/service/frontend.(*DCRedirectionHandlerImpl).ListOpenWorkflowExecutions.func2\n\t/temporal/service/frontend/dcRedirectionHandler.go:367\ngo.temporal.io/server/service/frontend.(*NoopRedirectionPolicy).WithNamespaceRedirect\n\t/temporal/service/frontend/dcRedirectionPolicy.go:116\ngo.temporal.io/server/service/frontend.(*DCRedirectionHandlerImpl).ListOpenWorkflowExecutions\n\t/temporal/service/frontend/dcRedirectionHandler.go:363\ngo.temporal.io/api/workflowservice/v1._WorkflowService_ListOpenWorkflowExecutions_Handler.func1\n\t/go/pkg/mod/go.temporal.io/api@v1.4.0/workflowservice/v1/service.pb.go:1389\ngo.temporal.io/server/common/authorization.(*interceptor).Interceptor\n\t/temporal/common/authorization/interceptor.go:136\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.34.0/server.go:1051\ngo.temporal.io/server/common/rpc.ServiceErrorInterceptor\n\t/temporal/common/rpc/grpc.go:100\ngoogle.golang.org/grpc.chainUnaryServerInterceptors.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.34.0/server.go:1037\ngo.temporal.io/api/workflowservice/v1._WorkflowService_ListOpenWorkflowExecutions_Handler\n\t/go/pkg/mod/go.temporal.io/api@v1.4.0/workflowservice/v1/service.pb.go:1391\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.g
olang.org/grpc@v1.34.0/server.go:1210\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.34.0/server.go:1533\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.34.0/server.go:871"}

I am using Elasticsearch version 7.9.2. Has anyone else faced a similar issue?

This is most likely an ES version mismatch. How do you run Temporal? The docker-compose files are configured to use v7, but the docker image itself uses v6 by default. The ES_VERSION env var needs to be set to v7, and the proper index template for v7 needs to be used as well.
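For reference, a minimal sketch of what that env setting might look like in a docker-compose service definition (the service name and image tag are illustrative, not taken from this thread):

```yaml
# Hypothetical docker-compose fragment; adjust names to your setup.
services:
  temporal:
    image: temporalio/auto-setup:1.7.0   # illustrative tag
    environment:
      - ENABLE_ES=true          # turn on Elasticsearch visibility
      - ES_SEEDS=elasticsearch  # hostname of your ES node(s)
      - ES_VERSION=v7           # must match your ES cluster version
```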

Thanks, @alex.

I am running Temporal using the Helm chart from the temporalio/helm-charts repository on GitHub. I used the command

helm install \
    -f values/values.elasticsearch.yaml \
    --set server.replicaCount=1 \
    --set cassandra.config.cluster_size=1 \
    --set prometheus.enabled=false \
    --set grafana.enabled=false \
    --set kafka.enabled=false \
    --set server.kafka.host=kafka-host:9092 \
    temporaltest . --timeout 15m 

I can also see that the ES_VERSION is set to v7 (Helm Chart installation does that automatically).

Please let me know if you see any problem with the steps I am following.

Also, do I need to follow any additional steps to use the index template you mentioned above?

Are you using an external ES, or do you want to deploy ES using helm? If you want to deploy it, you should not use -f values/values.elasticsearch.yaml, because this file configures the cluster to use an ES that is already deployed in the same kube cluster (don’t ask me why). If you just remove this line, it will use the defaults from values.yaml, which will deploy ES and create the index.

You don’t need to do anything manually to create the index or its template.

Hi @alex, I am using an external ES cluster and have configured the details in values.elasticsearch.yaml before running helm install.

I can also see the index temporal-visibility-dev created in the external ES. When I execute a workflow, I can see an entry like the following getting inserted in ES:

{
  "_index" : "temporal-visibility-dev",
  "_type" : "_doc",
  "_id" : "temporal-sys-history-scanner~6295c6bb-3de6-4d97-9faa-0d7ab69ebf67",
  "_score" : 1.0,
  "_source" : {
    "Attr" : { },
    "CloseTime" : 1617883200978285235,
    "ExecutionStatus" : 6,
    "ExecutionTime" : 1617883200899515761,
    "HistoryLength" : 11,
    "NamespaceId" : "32049b68-7872-4094-8e63-d0dd59896a83",
    "RunId" : "6295c6bb-3de6-4d97-9faa-0d7ab69ebf67",
    "StartTime" : 1617840000899515761,
    "TaskQueue" : "temporal-sys-history-scanner-taskqueue-0",
    "VisibilityTaskKey" : "216~1048709",
    "WorkflowId" : "temporal-sys-history-scanner",
    "WorkflowType" : "temporal-sys-history-scanner-workflow"
  }
}

However, when I try to access the Web UI, I get an error.

I also get the following error in the temporaltest-frontend pod:

{"level":"error","ts":"2021-04-08T19:05:48.089Z","msg":"Internal service error","service":"frontend","error":"ListClosedWorkflowExecutions failed. Error: elastic: Error 400 (Bad Request): all shards failed [type=search_phase_execution_exception]","logging-call-at":"workflowHandler.go:3406","stacktrace":"go.temporal.io/server/common/log/loggerimpl.(*loggerImpl).Error\n\t/temporal/common/log/loggerimpl/logger.go:138\ngo.temporal.io/server/service/frontend.(*WorkflowHandler).error\n\t/temporal/service/frontend/workflowHandler.go:3406\ngo.temporal.io/server/service/frontend.(*WorkflowHandler).ListClosedWorkflowExecutions\n\t/temporal/service/frontend/workflowHandler.go:2539\ngo.temporal.io/server/service/frontend.(*DCRedirectionHandlerImpl).ListClosedWorkflowExecutions.func2\n\t/temporal/service/frontend/dcRedirectionHandler.go:337\ngo.temporal.io/server/service/frontend.(*NoopRedirectionPolicy).WithNamespaceRedirect\n\t/temporal/service/frontend/dcRedirectionPolicy.go:116\ngo.temporal.io/server/service/frontend.(*DCRedirectionHandlerImpl).ListClosedWorkflowExecutions\n\t/temporal/service/frontend/dcRedirectionHandler.go:333\ngo.temporal.io/api/workflowservice/v1._WorkflowService_ListClosedWorkflowExecutions_Handler.func1\n\t/go/pkg/mod/go.temporal.io/api@v1.4.0/workflowservice/v1/service.pb.go:1407\ngo.temporal.io/server/common/authorization.(*interceptor).Interceptor\n\t/temporal/common/authorization/interceptor.go:136\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.34.0/server.go:1051\ngo.temporal.io/server/common/rpc.ServiceErrorInterceptor\n\t/temporal/common/rpc/grpc.go:100\ngoogle.golang.org/grpc.chainUnaryServerInterceptors.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.34.0/server.go:1037\ngo.temporal.io/api/workflowservice/v1._WorkflowService_ListClosedWorkflowExecutions_Handler\n\t/go/pkg/mod/go.temporal.io/api@v1.4.0/workflowservice/v1/service.pb.go:1409\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.34.0/server.go:1210\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.34.0/server.go:1533\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.34.0/server.go:871"}

Can you run GET /temporal-visibility-dev/_mapping and GET /_template/temporal-visibility-template against your ES cluster and post the responses here?
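For anyone following along, these two checks can be run with curl (the host below is a placeholder for your ES address):

```shell
# Placeholder address; substitute your Elasticsearch host and port.
ES=http://elasticsearch-host:9200

# Current field mappings of the visibility index.
curl -s "$ES/temporal-visibility-dev/_mapping?pretty"

# The index template Temporal is expected to install.
curl -s "$ES/_template/temporal-visibility-template?pretty"
```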

Hi @alex, here are the responses:

GET /temporal-visibility-dev/_mapping - 200 OK
{
      "temporal-visibility-dev" : {
        "mappings" : {
          "properties" : {
            "Attr" : {
              "properties" : {
                "BinaryChecksums" : {
                  "type" : "text",
                  "fields" : {
                    "keyword" : {
                      "type" : "keyword",
                      "ignore_above" : 256
                    }
                  }
                }
              }
            },
            "CloseTime" : {
              "type" : "long"
            },
            "ExecutionStatus" : {
              "type" : "long"
            },
            "ExecutionTime" : {
              "type" : "long"
            },
            "HistoryLength" : {
              "type" : "long"
            },
            "NamespaceId" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "RunId" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "StartTime" : {
              "type" : "long"
            },
            "TaskQueue" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "VisibilityTaskKey" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "WorkflowId" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "WorkflowType" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            }
          }
        }
      }
    }

GET /_template/temporal-visibility-template - 404 Not Found
{ }

Ok, this is what I suspected. For some reason you don’t have the index template, and it wasn’t applied when the temporal-visibility-dev index was created. This is where all the other problems come from.

To fix this, I would suggest dropping your current index and recreating it with two curl commands, the same way we do it for server development.
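A sketch of those steps with curl, assuming ES at a placeholder host; `index_template.json` stands in for the v7 visibility template file from the Temporal server repository (the exact file path depends on your checkout and server version):

```shell
ES=http://elasticsearch-host:9200   # placeholder; use your ES address

# 1. Drop the index that was created without the template.
curl -X DELETE "$ES/temporal-visibility-dev"

# 2. Install the index template (body taken from the Temporal repo's
#    v7 visibility template JSON).
curl -X PUT "$ES/_template/temporal-visibility-template" \
     -H "Content-Type: application/json" \
     --data-binary @index_template.json

# 3. Recreate the index; the template is applied because the index name
#    matches the template's index pattern.
curl -X PUT "$ES/temporal-visibility-dev"
```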

It would also be great to investigate why this happened. Am I right that you have:

  enabled: false
  external: true

in your values/values.elasticsearch.yaml file?

Another thing to check is the ENABLE_ES env var in any server container. It should be true.
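One way to spot-check that env var on a running pod (the pod name is illustrative, not from this thread):

```shell
# Prints the value of ENABLE_ES inside a server container.
kubectl exec <temporaltest-frontend-pod> -- printenv ENABLE_ES
```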

Can you confirm this?

Thanks, @alex.

I have verified that ENABLE_ES is true in the history, matching, and frontend servers.

Below is the values/values.elasticsearch.yaml I used:

elasticsearch:
  enabled: false
  external: true
  host: "elasticsearch-host"
  port: "9200"
  version: "v7"
  scheme: "http"
  logLevel: "error"
  username: "user"
  password: "pass"

Please let me know if you see anything wrong here. I will recreate the index in the meantime.

Honestly, I don’t know what happened. If the ENABLE_ES env var is set to true, the server should create the index template and the index during startup, but it didn’t. The easiest fix for you would be to just drop the index and create it manually.

Instead of manually creating the index, it would be great if you did another deploy using the same helm command and sent me the startup logs. Unfortunately it is not possible to say which instance actually created the index, so I will need the startup logs from every role.
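If it helps, a sketch for collecting those logs with kubectl; the deployment names assume the `temporaltest` release name from the helm command earlier in the thread (verify yours with `kubectl get deployments`):

```shell
# Capture startup logs from each Temporal role into its own file.
for role in frontend history matching worker; do
  kubectl logs "deployment/temporaltest-$role" > "startup-$role.log"
done
```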

Btw, I am going to change this startup logic soon.

Thanks, @alex. I dropped the index and recreated it manually after creating the template. Things are working fine now.

Before doing that, I also ran the helm command again to check whether this was an intermittent issue. The same problem happened, and I have captured the startup logs. How can I share them with you?

Also, if I don’t want to use Kafka, what changes should I make to the helm command and the various values.yaml files?

When Kafka is not used, does Temporal write to Elasticsearch synchronously? Does it have any effect on performance?

Hi @alex, I had a few queries regarding the visibility feature with Elasticsearch.

  • I am using v1.7.0, installed via the Helm chart. As per the documentation, Kafka is used by default for versions below v1.8.0. When I give any random Kafka host, the installation still goes through and data is inserted into Elasticsearch. I am not sure how this is happening if the Kafka host itself does not exist.

  • How do I disable the usage of Kafka for v1.7.0? Does Temporal write to Elasticsearch synchronously when Kafka is disabled?

  • In the link, the index name is hard-coded as temporal-visibility-dev. Does that mean that if I use any other name, I have to create it manually?

  • In the index, I see only workflow details like the following. Does the visibility UI get the other information, such as activity details, from Cassandra?

{
        "_index" : "temporal-visibility-stage",
        "_type" : "_doc",
        "_id" : "91edd8ae-f19c-448a-bdfb-2d3d86183689~41f0c013-a557-4d7b-ab6c-848df823924e",
        "_score" : null,
        "_source" : {
          "Attr" : { },
          "CloseTime" : 1619719013636897089,
          "ExecutionStatus" : 2,
          "ExecutionTime" : 0,
          "HistoryLength" : 21,
          "NamespaceId" : "9385d477-70c5-485e-983a-39dad3dda3ec",
          "RunId" : "41f0c013-a557-4d7b-ab6c-848df823924e",
          "StartTime" : 1619719011852513773,
          "TaskQueue" : "QARTH_SOURCE_TASK_QUEUE",
          "VisibilityTaskKey" : "109~15260975155",
          "WorkflowId" : "91edd8ae-f19c-448a-bdfb-2d3d86183689",
          "WorkflowType" : "SourceWorkflow"
        },
        "sort" : [
          1619719011852513773
        ]
      }
  • For the default namespace, the retention period is unknown and data is not written to Elasticsearch once the workflow completes. Is that a bug or expected behavior?

  • I got the Temporal version from the Temporal UI. Is there any command or other means of getting the Temporal version?

  1. I believe the docs say that starting with 1.8.0 Kafka is no longer supported, but the default was changed in 1.7.0, and its release notes explicitly say this.
  2. So it is disabled by default. An internal queue is used, though, and all previous guarantees are still there: if Elasticsearch is down, Temporal will continue to work and it won’t affect core services. Visibility won’t be updated (of course), but all requests will be stored, and as soon as ES is back, everything will be flushed.
  3. Currently the index is created when the server starts, using the name from the $ES_VIS_INDEX env var. The Makefile is just for development purposes. You can set the name to anything, but if it doesn’t start with "temporal-visibility-", you need to change the template as well.
  4. Yes. Basically only the list is served from Elasticsearch. As soon as we know the Workflow Id/Run Id, we go to the main database to fetch all the required info (history, etc.).
  5. Default retention for the default namespace should be 1 day. What is unknown? Where do you see it? Even without retention, completed workflows should be marked as Completed in Elasticsearch; retention only controls deletion of the records. If you see a workflow not being completed, I need more info on it.
  6. The best way to get the server version is to run the server binary (temporal-server), which will output its version. tctl also prints its version, and it is currently in sync with the server version. The Web version can be different.
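For example (the pod name is illustrative, and the exact flags may vary by tctl/server version):

```shell
# Ask the server binary inside a server pod for its version.
kubectl exec <temporaltest-frontend-pod> -- temporal-server --version

# tctl's version currently tracks the server version.
tctl --version

# Retention for the default namespace can be inspected with:
tctl --namespace default namespace describe
```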