Hi all,
We encountered a strange issue where a workflow is returned by ListOpenWorkflowExecutions, but describing it or executing any command against it fails with an error saying the workflow is already closed.
Here’s what it looks like in the UI when listing workflows:
And this is what is shown when opening its details page:
Initially, I thought it might be an eventual consistency issue, but nothing strange has been logged for it, nor has its state changed since the 25th of January.
Any ideas on how to handle this issue?
Edit: Note that while this has been going on, other workflows have been created and have finished successfully.
Are you using standard or advanced visibility (ES)?
Can you check visibility latencies (server metrics) around this execution's completion time? Also check the visibility store, as well as the history and ES logs:
histogram_quantile(0.95, sum(rate(task_latency_bucket{operation=~"VisibilityTask.*", service_name="history"}[1m])) by (operation, le))
I assume that if you run
tctl wf desc -w <wfid> -r <runid>
you will see it as completed as well. It seems the completion update failed to be written to visibility for some reason.
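If you want to confirm that the stale record is coming from the visibility store, listing open executions should still show it. Something along these lines (flag names can vary slightly between tctl versions):
tctl wf list --open --workflow_id <wfid>
If the workflow shows up there while describe reports it as completed, the open visibility record was never updated.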
We are using standard visibility, stored in Cassandra. The Temporal version is 1.18.4.
This is how the requested graph looks around the time of the workflow:
The first green line marks the time the workflow was halfway done, and the first red line is around the time the workflow was closed, according to the times in the UI.
Here’s the output of the tctl command for this particular workflow (yes, it shows completed status):
./tctl --address <addr>:7233 wf desc -w "Jh65n4CJNyUqhSkc/0"
{
  "executionConfig": {
    "taskQueue": {
      "name": "WorkflowRunner",
      "kind": "Normal"
    },
    "defaultWorkflowTaskTimeout": "10s"
  },
  "workflowExecutionInfo": {
    "execution": {
      "workflowId": "Jh65n4CJNyUqhSkc/0",
      "runId": "394e4e02-a3f8-4eaa-813d-1d579ae18db4"
    },
    "type": {
      "name": ""
    },
    "startTime": "2023-01-25T13:02:34.680222961Z",
    "closeTime": "2023-01-25T13:05:22.503776559Z",
    "status": "Completed",
    "historyLength": "113",
    "memo": {},
    "searchAttributes": {
      "indexedFields": {
        "BinaryChecksums": "[\"7338b8b8d0c371f416f788c76533c44b\"]"
      }
    },
    "autoResetPoints": {
      "points": [
        {
          "binaryChecksum": "7338b8b8d0c371f416f788c76533c44b",
          "runId": "394e4e02-a3f8-4eaa-813d-1d579ae18db4",
          "firstWorkflowTaskCompletedId": "4",
          "createTime": "2023-01-25T13:02:34.957424254Z",
          "resettable": true
        }
      ]
    },
    "stateTransitionCount": "77"
  }
}
I checked the history service logs but did not find anything interesting for that particular workflow ID. Is it possible that the error logs are not tagged with it? Should I search for errors around the completion time of the workflow instead?
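For reference, this is roughly how I searched (the pod name is a placeholder from my setup):
kubectl logs <history-pod> --since-time=2023-01-25T13:00:00Z | grep 'Jh65n4CJNyUqhSkc/0'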
We are using standard visibility, stored in Cassandra
Can you check for any resource exhausted issues during the same time:
sum(rate(service_errors_resource_exhausted{}[1m])) by (operation, resource_exhausted_cause)
Check for any “SystemOverloaded” cause.
I couldn’t find such a metric, so I used the service_error_with_type one (roughly the query shown after the list below). There are two types of errors around that time, but I don’t think there is a cause in the metric tags:
- AddWorkflowTask (green)
- RecordWorkflowTaskStarted (brown)
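For reference, the query I ran was roughly the following (exact label names may differ depending on your metrics setup):
sum(rate(service_error_with_type{}[1m])) by (operation)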
OK, so it seems visibility latencies were low and there was no system overload. I would re-check the history service logs, and maybe check your Cassandra logs too.
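You could also look at the standard visibility tables directly. Assuming the default temporal_visibility keyspace and schema, something like this should show whether the record is still sitting in the open executions table:
cqlsh -e "SELECT workflow_id, run_id, start_time FROM temporal_visibility.open_executions WHERE workflow_id='Jh65n4CJNyUqhSkc/0' ALLOW FILTERING;"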
You mentioned you haven't experienced this issue since, right?
You mentioned you haven't experienced this issue since, right?
Yes, this happened for exactly one workflow and never since.
Today, I noticed that there were some pod restarts of all Temporal services (frontend, history, matching, and worker) very close to the time of the issue. Since this is a test environment, I haven't configured any horizontal scaling, so the services were not restarted gracefully - there was a period of downtime for each of them. Could that be related to my issue?
Edit: Actually, everything in the Kubernetes cluster was restarted at some point, including Cassandra. This was caused by an upgrade of the cluster to a newer version.
Edit (2): I guess that pretty much settles what happened. I will configure proper horizontal scaling for the Temporal services and test such scenarios again before deploying to production. Is there a way to manually fix the current workflow so that it doesn't show as running?
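For the horizontal scaling part, as a first step I'll probably just run more than one replica of each service, something like this (the deployment names are specific to my setup):
kubectl scale deployment/temporal-frontend deployment/temporal-history deployment/temporal-matching deployment/temporal-worker --replicas=2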