Elasticsearch contains incomplete data

We are using Elasticsearch for visibility, but we are seeing a glitch where some workflow executions are not recorded in Elasticsearch even though the workflows execute completely. This behaviour is only seen in one particular environment.
Deployment: all pods are running the 1.9.2 Docker image.

Elasticsearch documents (i.e. visibility records) are deleted when retention kicks in. Can you double-check the retention settings for the namespace that has the missing workflows?
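
One way to double-check that, assuming tctl is pointed at this cluster and <your-namespace> stands in for the affected namespace, is:

tctl --namespace <your-namespace> namespace describe

The output includes the retention period configured for the namespace.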

The ES retention policy is set to 3 days.
In my case, when I register a workflow, sometimes the workflow metadata is written to Elasticsearch and sometimes it is not, for the same namespace. I could not understand why the behaviour differs.

This is odd and shouldn’t happen. To debug the root cause, can you double-check that you can get the workflow info from the main database with:

tctl workflow describe --workflow_id <workflow_id>

but not from the visibility database:

tctl workflow list --query "WorkflowId='<workflow_id>'"

If that is the case and you see the running workflow with the describe command but not with the list command, then something is wrong with the visibility task processor. Check the server logs first; if nothing looks suspicious there, I will give you metric names which might give some insight.
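
As a rough first pass on the logs (pod name and Kubernetes namespace below are placeholders for your deployment), you can grep the history service for visibility-related errors:

kubectl logs <history-pod> -n <k8s-namespace> | grep -i visibility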

I was able to describe the workflow from the database, but was not able to run the list command:

 tctl workflow list --query "WorkflowId: workflow_31c1f59c-6855-4243-8563-6f04fc6f952e"
Error: Failed to list workflow.
Error Details: Invalid query.
('export TEMPORAL_CLI_SHOW_STACKS=1' to see stack traces)

I also checked the logs of the history pod; sometimes I am getting errors like:

{"level":"error","ts":"2022-01-26T18:11:30.051Z","msg":"Error updating ack level for shard","service":"history","shard-id":302,"address":"192.168.114.15:7934","shard-item":"0xc0013f3c80","component":"visibility-queue-processor","error":"Failed to update shard. previous_range_id: 76, columns: (range_id=77)","operation-result":"OperationFailed","logging-call-at":"queueAckMgr.go:223","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/service/history.(*queueAckMgrImpl).updateQueueAckLevel\n\t/temporal/service/history/queueAckMgr.go:223\ngo.temporal.io/server/service/history.(*queueProcessorBase).processorPump\n\t/temporal/service/history/queueProcessor.go:240"}
{"level":"error","ts":"2022-01-27T07:00:12.038Z","msg":"elastic: http://elasticsearch-master-headless.temporal-direct:9200 is dead","logging-call-at":"logger.go:48","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/common/persistence/elasticsearch/client.(*errorLogger).Printf\n\t/temporal/common/persistence/elasticsearch/client/logger.go:48\ngithub.com/olivere/elastic/v7.(*Client).errorf\n\t/go/pkg/mod/github.com/olivere/elastic/v7@v7.0.22/client.go:836\ngithub.com/olivere/elastic/v7.(*Client).healthcheck\n\t/go/pkg/mod/github.com/olivere/elastic/v7@v7.0.22/client.go:1142\ngithub.com/olivere/elastic/v7.(*Client).healthchecker\n\t/go/pkg/mod/github.com/olivere/elastic/v7@v7.0.22/client.go:1080"}

The query fails because it needs to be in valid SQL WHERE clause format. In your case:

WorkflowId='workflow_31c1f59c-6855-4243-8563-6f04fc6f952e'
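
So the full command for your workflow would be:

tctl workflow list --query "WorkflowId='workflow_31c1f59c-6855-4243-8563-6f04fc6f952e'"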

The first error message indicates that you have shard movements, which usually happens when two clusters use the same database. It might even be two history services which can’t join the same cluster and essentially form two independent clusters. You can get more details with the

tctl admin cluster describe

command.

The second error message literally says that the Elasticsearch server is not accessible. It either went down unexpectedly or wasn’t configured properly from the very beginning. Temporal is designed to keep working when Elasticsearch is unavailable: it won’t affect workflows, but visibility won’t be updated, of course.
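
If you want to confirm reachability, one quick check, assuming curl is available in one of the Temporal pods (pod name and Kubernetes namespace are placeholders), is to hit the Elasticsearch health endpoint at the address from your log:

kubectl exec <temporal-frontend-pod> -n <k8s-namespace> -- curl -s http://elasticsearch-master-headless.temporal-direct:9200/_cluster/health

A timeout or connection error there matches what the healthchecker in the log is reporting.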

Thanks @alex, I scaled down the pods of the other cluster. They were using the same DB but a different ES. The ES of the second cluster was down, so I think that’s why the “elasticsearch is dead” error was being printed, and why the ES address in the log was wrong.

I scaled all the pods of the second cluster down to 0 and Temporal is working now.
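For reference, the scale-down was just the standard kubectl operation, roughly like this (deployment name and namespace are placeholders), repeated for each of the second cluster’s services:

kubectl scale deployment <second-cluster-history> --replicas=0 -n <k8s-namespace>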
Just one question: is there anything else I have to do, or is my cluster fine now?

  1. Check the logs. If you don’t see errors there, then you are most likely good.
  2. Check the cluster view from the Temporal perspective with
    tctl admin cluster describe
    
    and compare it with what kubectl (or k9s) shows you; see the sketch after this list. Make sure that the number of history nodes and history pods (and the other services) match.
  3. Check the configs for other typos and wrong addresses.
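
For step 2, a quick way to count pods on the Kubernetes side (namespace and pod naming are assumptions about your deployment):

kubectl get pods -n <k8s-namespace> | grep -c history

The count should match the number of history hosts that tctl admin cluster describe reports for this cluster.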