I opened a thread here with the details of our initial issues, but its title no longer makes much sense now that we have progressed past that point (feel free to take a look at that one and archive it).
Instead, I will summarise here where we are at. We need to resolve this because it's a major blocker for us: we cannot implement Temporal without advanced visibility.
This week we attempted to add Elasticsearch as the visibility store (since it's recommended here “for any use case that spawns more than a few Workflow Executions”; surely that would be most people?).
However, we are now hitting an issue where the web UI, tctl, and the TypeScript SDK will not list anything when a query is used. The persistence store does work, and certain commands work fine with it, as I will summarise below. So we need some ideas on what we are doing wrong with our Elasticsearch setup.
- We are using the Helm charts to deploy (Temporal server v1.21.1, I believe, but the admin output suggests v1.21.2).
- We are using an external Elasticsearch.
- This is the output from tctl admin describe (some details hidden):
{
"supportedClients": {
"temporal-cli": "\u003c2.0.0",
"temporal-go": "\u003c2.0.0",
"temporal-java": "\u003c2.0.0",
"temporal-php": "\u003c2.0.0",
"temporal-server": "\u003c2.0.0",
"temporal-typescript": "\u003c2.0.0",
"temporal-ui": "\u003c3.0.0"
},
"serverVersion": "1.21.2",
"membershipInfo": {
"currentHost": {
"identity": "***:7233"
},
"reachableMembers": [
"***:6933",
"***:6935",
"***:6939",
"***:6934"
],
"rings": [
{
"role": "frontend",
"memberCount": 1,
"members": [
{
"identity": "***:7233"
}
]
},
{
"role": "history",
"memberCount": 1,
"members": [
{
"identity": "***:7234"
}
]
},
{
"role": "matching",
"memberCount": 1,
"members": [
{
"identity": "***:7235"
}
]
},
{
"role": "worker",
"memberCount": 1,
"members": [
{
"identity": "***:7239"
}
]
}
]
},
"clusterId": "f76485bd-2a88-4de0-95fb-f67fffbdd29e",
"clusterName": "active",
"historyShardCount": 512,
"persistenceStore": "postgres",
"visibilityStore": "elasticsearch",
"versionInfo": {
"current": {
"version": "1.21.2",
"releaseTime": "2023-07-15T02:00:00Z"
},
"recommended": {
"version": "1.21.2",
"releaseTime": "2023-07-15T02:00:00Z"
},
"alerts": [
{
"message": "🪐 A new release is available!",
"severity": "Low"
}
],
"lastUpdateTime": "2023-09-15T14:01:30.099277938Z"
},
"failoverVersionIncrement": "10",
"initialFailoverVersion": "1"
}
- If I do tctl --address <our-temporal-cluster>:7233 --ns workflow-service-local workflow describe --workflow_id onboarding-owrNJQbjO7ogpUCYu1pIO (which is a workflow I definitely started via app code), I get:
{
"executionConfig": {
"taskQueue": {
"name": "onboarding-local",
"kind": "Normal"
},
"defaultWorkflowTaskTimeout": "10s"
},
"workflowExecutionInfo": {
"execution": {
"workflowId": "onboarding-owrNJQbjO7ogpUCYu1pIO",
"runId": "2384a4f9-1e13-4ab0-ba6c-4bc6ff39cf29"
},
"type": {
"name": "onboardingWorkflow"
},
"startTime": "2023-09-16T09:17:27.287372185Z",
"closeTime": "2023-09-16T09:18:22.204908200Z",
"status": "Completed",
"historyLength": "11",
"memo": {
},
"searchAttributes": {
"indexedFields": {
"BuildIds": "[\"unversioned\",\"unversioned:@temporalio/worker@1.8.2+dfc0e48fcf9fef5275a9f0336af1ea3398b7f4246c70877a36520a4013f0861c\"]",
"FeasibilityCheckId": "[\"feasibility-101\"]"
}
},
"autoResetPoints": {
"points": [
{
"binaryChecksum": "@temporalio/worker@1.8.2+dfc0e48fcf9fef5275a9f0336af1ea3398b7f4246c70877a36520a4013f0861c",
"runId": "2384a4f9-1e13-4ab0-ba6c-4bc6ff39cf29",
"firstWorkflowTaskCompletedId": "5",
"createTime": "2023-09-16T09:17:27.635288581Z",
"resettable": true
}
]
},
"stateTransitionCount": "6"
},
"pendingActivities": [
{
"activityId": "1",
"activityType": {
"name": "placeHolderActivity"
},
"state": "Scheduled",
"attempt": 1,
"scheduledTime": "2023-09-16T09:18:22.204830005Z",
"expirationTime": "0001-01-01T00:00:00Z"
}
]
}
- If I do tctl --address <our-temporal-cluster>:7233 --ns workflow-service-local workflow show --workflow_id onboarding-owrNJQbjO7ogpUCYu1pIO -r 2384a4f9-1e13-4ab0-ba6c-4bc6ff39cf29 --output_filename myhistory.json, the JSON output is:
{
"events": [
{
"eventId": "1",
"eventTime": "2023-09-16T09:17:27.287372185Z",
"eventType": "WorkflowExecutionStarted",
"taskId": "1048576",
"workflowExecutionStartedEventAttributes": {
"workflowType": {
"name": "onboardingWorkflow"
},
"taskQueue": {
"name": "onboarding-local",
"kind": "Normal"
},
"input": {
},
"workflowTaskTimeout": "10s",
"originalExecutionRunId": "2384a4f9-1e13-4ab0-ba6c-4bc6ff39cf29",
"identity": "158768@L-VKRA2PMX",
"firstExecutionRunId": "2384a4f9-1e13-4ab0-ba6c-4bc6ff39cf29",
"attempt": 1,
"firstWorkflowTaskBackoff": "0s",
"searchAttributes": {
"indexedFields": {
"FeasibilityCheckId": {
"metadata": {
"encoding": "anNvbi9wbGFpbg==",
"type": "S2V5d29yZA=="
},
"data": "WyJmZWFzaWJpbGl0eS0xMDEiXQ=="
}
}
},
"header": {
}
}
},
{
"eventId": "2",
"eventTime": "2023-09-16T09:17:27.287419338Z",
"eventType": "WorkflowExecutionSignaled",
"taskId": "1048577",
"workflowExecutionSignaledEventAttributes": {
"signalName": "feasibility-check-set",
"input": {
"payloads": [
{
"metadata": {
"encoding": "anNvbi9wbGFpbg=="
},
"data": "eyJydW5TdGF0dXMiOiJyZXF1ZXN0ZWQifQ=="
}
]
},
"identity": "158768@L-VKRA2PMX",
"header": {
}
}
},
{
"eventId": "3",
"eventTime": "2023-09-16T09:17:27.287422522Z",
"eventType": "WorkflowTaskScheduled",
"taskId": "1048578",
"workflowTaskScheduledEventAttributes": {
"taskQueue": {
"name": "onboarding-local",
"kind": "Normal"
},
"startToCloseTimeout": "10s",
"attempt": 1
}
},
{
"eventId": "4",
"eventTime": "2023-09-16T09:17:27.338072835Z",
"eventType": "WorkflowTaskStarted",
"taskId": "1048582",
"workflowTaskStartedEventAttributes": {
"scheduledEventId": "3",
"identity": "157044@L-VKRA2PMX",
"requestId": "5567872e-2db3-4ea5-b5b6-5ffe27e219d1",
"historySizeBytes": "470"
}
},
{
"eventId": "5",
"eventTime": "2023-09-16T09:17:27.635281941Z",
"eventType": "WorkflowTaskCompleted",
"taskId": "1048586",
"workflowTaskCompletedEventAttributes": {
"scheduledEventId": "3",
"startedEventId": "4",
"identity": "157044@L-VKRA2PMX",
"workerVersioningId": {
"workerBuildId": "@temporalio/worker@1.8.2+dfc0e48fcf9fef5275a9f0336af1ea3398b7f4246c70877a36520a4013f0861c"
},
"sdkMetadata": {
"coreUsedFlags": [
2,
1
]
},
"meteringMetadata": {
}
}
},
{
"eventId": "6",
"eventTime": "2023-09-16T09:18:22.099776704Z",
"eventType": "WorkflowExecutionSignaled",
"taskId": "1048589",
"workflowExecutionSignaledEventAttributes": {
"signalName": "feasibility-check-set",
"input": {
"payloads": [
{
"metadata": {
"encoding": "anNvbi9wbGFpbg=="
},
"data": "eyJydW5TdGF0dXMiOiJjb21wbGV0ZWQifQ=="
}
]
},
"identity": "158943@L-VKRA2PMX",
"header": {
}
}
},
{
"eventId": "7",
"eventTime": "2023-09-16T09:18:22.099782291Z",
"eventType": "WorkflowTaskScheduled",
"taskId": "1048590",
"workflowTaskScheduledEventAttributes": {
"taskQueue": {
"name": "157044@L-VKRA2PMX-05683f1fc9044b7c95fa277973476379",
"kind": "Sticky"
},
"startToCloseTimeout": "10s",
"attempt": 1
}
},
{
"eventId": "8",
"eventTime": "2023-09-16T09:18:22.123643347Z",
"eventType": "WorkflowTaskStarted",
"taskId": "1048594",
"workflowTaskStartedEventAttributes": {
"scheduledEventId": "7",
"identity": "157044@L-VKRA2PMX",
"requestId": "a29409b9-7eb0-43e4-8b7a-de4ec0be793a",
"historySizeBytes": "939"
}
},
{
"eventId": "9",
"eventTime": "2023-09-16T09:18:22.204748365Z",
"eventType": "WorkflowTaskCompleted",
"taskId": "1048598",
"workflowTaskCompletedEventAttributes": {
"scheduledEventId": "7",
"startedEventId": "8",
"identity": "157044@L-VKRA2PMX",
"workerVersioningId": {
"workerBuildId": "@temporalio/worker@1.8.2+dfc0e48fcf9fef5275a9f0336af1ea3398b7f4246c70877a36520a4013f0861c"
},
"sdkMetadata": {
},
"meteringMetadata": {
}
}
},
{
"eventId": "10",
"eventTime": "2023-09-16T09:18:22.204830005Z",
"eventType": "ActivityTaskScheduled",
"taskId": "1048599",
"activityTaskScheduledEventAttributes": {
"activityId": "1",
"activityType": {
"name": "placeHolderActivity"
},
"taskQueue": {
"name": "onboarding-local",
"kind": "Normal"
},
"header": {
},
"scheduleToCloseTimeout": "0s",
"scheduleToStartTimeout": "0s",
"startToCloseTimeout": "7200s",
"heartbeatTimeout": "0s",
"workflowTaskCompletedEventId": "9",
"retryPolicy": {
"initialInterval": "1s",
"backoffCoefficient": 1,
"maximumInterval": "100s"
}
}
},
{
"eventId": "11",
"eventTime": "2023-09-16T09:18:22.204908200Z",
"eventType": "WorkflowExecutionCompleted",
"taskId": "1048600",
"workflowExecutionCompletedEventAttributes": {
"result": {
"payloads": [
{
"metadata": {
"encoding": "YmluYXJ5L251bGw="
}
}
]
},
"workflowTaskCompletedEventId": "9"
}
}
]
}
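For reference, the metadata and data fields in the history above are base64-encoded. Here is a small sketch for decoding them, assuming Node.js (for its global Buffer); the helper name decodeBase64 is my own and not part of any Temporal SDK:

```typescript
// Hypothetical helper (not part of any Temporal SDK): decode a base64 string
// from the exported history JSON into UTF-8 text. Assumes Node.js Buffer.
function decodeBase64(value: string): string {
  return Buffer.from(value, "base64").toString("utf8");
}

// Values copied verbatim from the FeasibilityCheckId search attribute in event 1.
const searchAttribute = {
  metadata: { encoding: "anNvbi9wbGFpbg==", type: "S2V5d29yZA==" },
  data: "WyJmZWFzaWJpbGl0eS0xMDEiXQ==",
};

const encoding = decodeBase64(searchAttribute.metadata.encoding);
const attributeType = decodeBase64(searchAttribute.metadata.type);
// The data field is JSON once decoded, so parse it into a value.
const value: unknown = JSON.parse(decodeBase64(searchAttribute.data));

console.log({ encoding, attributeType, value });
```

Decoded this way, the attribute comes out as encoding json/plain, type Keyword, and value ["feasibility-101"], which matches the workflow describe output above. So the search attribute does appear to be attached to the workflow as intended.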
- In the app code, I can start the workflow fine, and if I explicitly use the workflow ID, I can get the handle and signal the workflow to complete, etc.
- However, no list filter via a query works. The web UI just shows an empty list. I have checked the browser console and no errors stand out to me; I can see the query string being added as expected. It worked fine before, when we didn't have a compatible visibility store added.
- Similarly, if I do tctl --address <our-temporal-cluster>:7233 --ns workflow-service-local workflow l -q "ExecutionStatus='Running'", or query for ExecutionStatus when closed, or use a custom search attribute, it always returns an empty list.
- I looked at the logs from the frontend and matching pods but couldn't see any relevant errors. One error I did spot in the history logs is:
{"level":"error","ts":"2023-09-15T14:00:54.689Z","msg":"Unable to process new range","shard-id":81,"address":"172.16.65.134:7234","component":"timer-queue-processor","error":"shard status unknown","logging-call-at":"queue_base.go:316","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/service/history/queues.(*queueBase).processNewRange\n\t/home/builder/temporal/service/history/queues/queue_base.go:316\ngo.temporal.io/server/service/history/queues.(*scheduledQueue).processEventLoop\n\t/home/builder/temporal/service/history/queues/queue_scheduled.go:218"}
- The issue is definitely in some Elasticsearch configuration or setup we have done, but right now we are confused as to what it could be. Is there anything else we can look at, or anything we can provide, to help debug? This is a big blocker for us as we need to make sure we have advanced visibility (I'm guessing Elasticsearch is the way to go?). Any help would be greatly appreciated.
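One check we could run next is to ask Elasticsearch directly whether it holds a visibility document for this workflow at all. Below is a rough sketch of building that request, with the assumptions flagged. The index name temporal_visibility_v1_dev is what I understand the Helm chart defaults to and may differ in our setup. The Elasticsearch URL is a placeholder. WorkflowId being a keyword field is my understanding of Temporal's index template rather than something confirmed in this thread.

```typescript
// Sketch only: builds (but does not send) an Elasticsearch _search request
// that looks up a visibility document by workflow ID.
interface SearchRequest {
  url: string;
  body: { query: { term: Record<string, string> }; size: number };
}

// ASSUMPTIONS: the index name and the WorkflowId field are not confirmed by
// anything in this thread; adjust them to the actual cluster configuration.
function buildVisibilityLookup(
  elasticsearchUrl: string,
  index: string,
  workflowId: string,
): SearchRequest {
  const base = elasticsearchUrl.replace(/\/+$/, ""); // drop trailing slashes
  return {
    url: `${base}/${encodeURIComponent(index)}/_search`,
    body: { query: { term: { WorkflowId: workflowId } }, size: 1 },
  };
}

const request = buildVisibilityLookup(
  "http://elasticsearch.example:9200", // placeholder address
  "temporal_visibility_v1_dev", // assumed index name
  "onboarding-owrNJQbjO7ogpUCYu1pIO", // workflow ID from this post
);

// Print the request so it can be sent with curl or any HTTP client.
console.log("POST", request.url);
console.log(JSON.stringify(request.body, null, 2));
```

If that returns zero hits, the documents are presumably never reaching the index (for example a wrong index name or a missing template). If it returns a hit, the problem is more likely on the query side.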