Elasticsearch contains incomplete data

We are using Elasticsearch for visibility, but we are seeing a glitch where some workflow executions are not recorded in Elasticsearch even though the workflows execute completely. This behaviour is only seen in one particular environment.
Deployment: all pods are running the 1.9.2 Docker image.

Elasticsearch documents (i.e. visibility records) are deleted when retention kicks in. Can you double-check the retention settings for the namespace that has the missing workflows?
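
One way to double-check that, assuming tctl is pointed at this cluster and <your-namespace> stands in for the affected namespace, is:

tctl --namespace <your-namespace> namespace describe

The output includes the retention period configured for the namespace.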

The ES retention policy is set to 3 days.
In my case, when I register a workflow, sometimes the workflow metadata is written to Elasticsearch and sometimes it is not, for the same namespace. I could not understand why the behaviour differs.

This is odd and shouldn’t happen. To debug the root cause, can you double-check that you can get the workflow info from the main database with:

tctl workflow describe --workflow_id <workflow_id>

but not from the visibility database:

tctl workflow list --query "WorkflowId='<workflow_id>'"

If that is the case and you see the running workflow with the describe command but not with the list command, then something is wrong with the visibility task processor. Check the server logs first; if nothing looks suspicious there, I will give you metric names which might give some insight.
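
As a rough first pass on the logs (pod name and Kubernetes namespace below are placeholders for your deployment), you can grep the history service for visibility-related errors:

kubectl logs <history-pod> -n <k8s-namespace> | grep -i visibility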

I was able to describe the workflow from the database, but was not able to run the list command:

 tctl workflow list --query "WorkflowId: workflow_31c1f59c-6855-4243-8563-6f04fc6f952e"
Error: Failed to list workflow.
Error Details: Invalid query.
('export TEMPORAL_CLI_SHOW_STACKS=1' to see stack traces)

I also checked the logs of the history pod; sometimes I am getting errors like:

{"level":"error","ts":"2022-01-26T18:11:30.051Z","msg":"Error updating ack level for shard","service":"history","shard-id":302,"address":"192.168.114.15:7934","shard-item":"0xc0013f3c80","component":"visibility-queue-processor","error":"Failed to update shard. previous_range_id: 76, columns: (range_id=77)","operation-result":"OperationFailed","logging-call-at":"queueAckMgr.go:223","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/service/history.(*queueAckMgrImpl).updateQueueAckLevel\n\t/temporal/service/history/queueAckMgr.go:223\ngo.temporal.io/server/service/history.(*queueProcessorBase).processorPump\n\t/temporal/service/history/queueProcessor.go:240"}
{"level":"error","ts":"2022-01-27T07:00:12.038Z","msg":"elastic: http://elasticsearch-master-headless.temporal-direct:9200 is dead","logging-call-at":"logger.go:48","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/common/persistence/elasticsearch/client.(*errorLogger).Printf\n\t/temporal/common/persistence/elasticsearch/client/logger.go:48\ngithub.com/olivere/elastic/v7.(*Client).errorf\n\t/go/pkg/mod/github.com/olivere/elastic/v7@v7.0.22/client.go:836\ngithub.com/olivere/elastic/v7.(*Client).healthcheck\n\t/go/pkg/mod/github.com/olivere/elastic/v7@v7.0.22/client.go:1142\ngithub.com/olivere/elastic/v7.(*Client).healthchecker\n\t/go/pkg/mod/github.com/olivere/elastic/v7@v7.0.22/client.go:1080"}

The query fails because it needs to be in valid SQL WHERE clause format. In your case:

WorkflowId='workflow_31c1f59c-6855-4243-8563-6f04fc6f952e'
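
So the full command for your workflow would be:

tctl workflow list --query "WorkflowId='workflow_31c1f59c-6855-4243-8563-6f04fc6f952e'"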

The first error message indicates that you have shard movements, which usually happens when two clusters use the same database. It might even be two history services which can’t join the same cluster and essentially form two independent clusters. You can get more details with the

tctl admin cluster describe

command.

The second error message literally says that the Elasticsearch server is not accessible. It either went down unexpectedly or wasn’t configured properly from the very beginning. Temporal is designed to keep working when Elasticsearch is unavailable: it won’t affect workflows, but visibility won’t be updated, of course.
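
If you want to confirm reachability, one quick check, assuming curl is available in one of the Temporal pods (pod name and Kubernetes namespace are placeholders), is to hit the Elasticsearch health endpoint at the address from your log:

kubectl exec <temporal-frontend-pod> -n <k8s-namespace> -- curl -s http://elasticsearch-master-headless.temporal-direct:9200/_cluster/health

A timeout or connection error there matches what the healthchecker in the log is reporting.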

Thanks @alex, I scaled down the pods of the other cluster. They were using the same DB but a different ES. The ES of the second cluster was down, so I think that’s why the “elasticsearch is dead” error was being printed, and why the ES address in the log was wrong.

I scaled all the pods of the second cluster down to 0 and Temporal is working now.
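For reference, the scale-down was just the standard kubectl operation, roughly like this (deployment name and namespace are placeholders), repeated for each of the second cluster’s services:

kubectl scale deployment <second-cluster-history> --replicas=0 -n <k8s-namespace>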
Just one question: is there anything else I have to do, or is my cluster fine now?

  1. Check the logs. If you don’t see errors there, then you are most likely good.
  2. Check the cluster view from the Temporal perspective with
    tctl admin cluster describe
    
    and compare it with what kubectl (or k9s) shows you; see the sketch after this list. Make sure that the number of history nodes and history pods (and the other services) match.
  3. Check the configs for other typos and wrong addresses.
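
For step 2, a quick way to count pods on the Kubernetes side (namespace and pod naming are assumptions about your deployment):

kubectl get pods -n <k8s-namespace> | grep -c history

The count should match the number of history hosts that tctl admin cluster describe reports for this cluster.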