VisibilityUpsertExecution error unable to search attribute hogging temporal queue and causing other requests to temporal fail

Hi All,

Need help here. Got problem where requests in temporal fail due to time out because queue is full.
The suspect why queue is full is due to old requests to Cassandra fail due to some attributes not found in table in keyspace in cassandra (because the attributes not created yet in cassandra). We need to terminate the problematic workflow manually from temporal web console.

The problem here seems like those workflows / requests are repeatedly executed without retry limit. The question here is how to put the retry limit in temporal for request to cassandra if the request fails?

Please help.

Below are the snippet of the error log regarding this issue:

{"level":"error","ts":"2022-03-18T03:36:21.137Z","msg":"Fail to process task","service":"history","shard-id":2,"address":"10.11.11.116:7234","shard-item":"0xc001ca8280","component":"visibility-queue-processor","shard-id":2,"queue-task-id":8396587,"queue-task-visibility-timestamp":"2022-02-15T05:31:57.516Z","xdc-failover-version":0,"queue-task-type":"VisibilityUpsertExecution","wf-namespace-id":"78d10593-d54f-433f-863a-05cd5c9b345c","wf-id":"test003","wf-run-id":"dac0a45b-6417-4757-b062-4ada8f4e98de","error":"Unable to decode search attributes: invalid search attribute type: Unspecified","lifecycle":"ProcessingFailed","logging-call-at":"taskProcessor.go:341","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/service/history.(*taskProcessor).handleTaskError\n\t/temporal/service/history/taskProcessor.go:341\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:222\ngo.temporal.io/server/common/backoff.Retry.func1\n\t/temporal/common/backoff/retry.go:104\ngo.temporal.io/server/common/backoff.RetryContext\n\t/temporal/common/backoff/retry.go:125\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:105\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:248\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:171"}
{"level":"error","ts":"2022-03-18T03:36:21.137Z","msg":"Critical error processing task, retrying.","service":"history","shard-id":2,"address":"10.11.11.116:7234","shard-item":"0xc001ca8280","component":"visibility-queue-processor","shard-id":2,"queue-task-id":8396587,"queue-task-visibility-timestamp":"2022-02-15T05:31:57.516Z","xdc-failover-version":0,"queue-task-type":"VisibilityUpsertExecution","wf-namespace-id":"78d10593-d54f-433f-863a-05cd5c9b345c","wf-id":"test003","wf-run-id":"dac0a45b-6417-4757-b062-4ada8f4e98de","error":"Unable to decode search attributes: invalid search attribute type: Unspecified","operation-result":"OperationCritical","queue-task-type":"VisibilityUpsertExecution","logging-call-at":"taskProcessor.go:227","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:227\ngo.temporal.io/server/common/backoff.Retry.func1\n\t/temporal/common/backoff/retry.go:104\ngo.temporal.io/server/common/backoff.RetryContext\n\t/temporal/common/backoff/retry.go:125\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:105\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:248\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:171"}
{"level":"error","ts":"2022-03-18T03:36:21.249Z","msg":"Fail to process task","service":"history","shard-id":2,"address":"10.11.11.116:7234","shard-item":"0xc001ca8280","component":"visibility-queue-processor","shard-id":2,"queue-task-id":8396570,"queue-task-visibility-timestamp":"2022-02-15T05:31:56.434Z","xdc-failover-version":0,"queue-task-type":"VisibilityUpsertExecution","wf-namespace-id":"78d10593-d54f-433f-863a-05cd5c9b345c","wf-id":"test003","wf-run-id":"dac0a45b-6417-4757-b062-4ada8f4e98de","error":"Unable to decode search attributes: invalid search attribute type: Unspecified","lifecycle":"ProcessingFailed","logging-call-at":"taskProcessor.go:341","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/service/history.(*taskProcessor).handleTaskError\n\t/temporal/service/history/taskProcessor.go:341\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:222\ngo.temporal.io/server/common/backoff.Retry.func1\n\t/temporal/common/backoff/retry.go:104\ngo.temporal.io/server/common/backoff.RetryContext\n\t/temporal/common/backoff/retry.go:125\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:105\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:248\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:171"}
{"level":"error","ts":"2022-03-18T03:36:21.249Z","msg":"Critical error processing task, retrying.","service":"history","shard-id":2,"address":"10.11.11.116:7234","shard-item":"0xc001ca8280","component":"visibility-queue-processor","shard-id":2,"queue-task-id":8396570,"queue-task-visibility-timestamp":"2022-02-15T05:31:56.434Z","xdc-failover-version":0,"queue-task-type":"VisibilityUpsertExecution","wf-namespace-id":"78d10593-d54f-433f-863a-05cd5c9b345c","wf-id":"test003","wf-run-id":"dac0a45b-6417-4757-b062-4ada8f4e98de","error":"Unable to decode search attributes: invalid search attribute type: Unspecified","operation-result":"OperationCritical","queue-task-type":"VisibilityUpsertExecution","logging-call-at":"taskProcessor.go:227","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:227\ngo.temporal.io/server/common/backoff.Retry.func1\n\t/temporal/common/backoff/retry.go:104\ngo.temporal.io/server/common/backoff.RetryContext\n\t/temporal/common/backoff/retry.go:125\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:105\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:248\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:171"}
{"level":"error","ts":"2022-03-18T03:36:21.332Z","msg":"Fail to process task","service":"history","shard-id":2,"address":"10.11.11.116:7234","shard-item":"0xc001ca8280","component":"visibility-queue-processor","shard-id":2,"queue-task-id":8396571,"queue-task-visibility-timestamp":"2022-02-15T05:31:56.434Z","xdc-failover-version":0,"queue-task-type":"VisibilityUpsertExecution","wf-namespace-id":"78d10593-d54f-433f-863a-05cd5c9b345c","wf-id":"test003","wf-run-id":"dac0a45b-6417-4757-b062-4ada8f4e98de","error":"Unable to decode search attributes: invalid search attribute type: Unspecified","lifecycle":"ProcessingFailed","logging-call-at":"taskProcessor.go:341","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/service/history.

Additional info:

Maybe needs to set Maximum Attempts for Retry Policy?
How to set it?

  1. Are you using Elasticsearch for advanced visibility?
  2. Can you run tctl admin cluster get-search-attributes and post its output here?

Hi Alex,

(1) Yes, I am using ElasticSearch for advanced visibility
(2) Here is the output:

/etc/temporal # tctl admin cluster get-search-attributes
Custom search attributes:
+---------------------+----------+
|        NAME         |   TYPE   |
+---------------------+----------+
| ActivityAssignee    | Keyword  |
| ActivityName        | Keyword  |
| ActivityType        | Keyword  |
| Channel             | Keyword  |
| ClosedTime          | Keyword  |
| CustomBoolField     | Bool     |
| CustomDatetimeField | Datetime |
| CustomDoubleField   | Double   |
| CustomIntField      | Int      |
| CustomKeywordField  | Keyword  |
| CustomStringField   | Text     |
| CustomTextField     | Text     |
| CustomerOrderId     | Keyword  |
| IncidentAssignee    | Keyword  |
| IncidentClosedDate  | Keyword  |
| IncidentRaisedDate  | Keyword  |
| InitiatorId         | Keyword  |
| InitiatorIdType     | Keyword  |
| LastUpdatedDate     | Keyword  |
| OrderStatus         | Keyword  |
| ParentWorkflowId    | Keyword  |
| StartDate           | Keyword  |
| Status              | Keyword  |
| TransactionId       | Keyword  |
+---------------------+----------+
System search attributes:
+-----------------------+----------+
|         NAME          |   TYPE   |
+-----------------------+----------+
| BatcherNamespace      | Keyword  |
| BatcherUser           | Keyword  |
| BinaryChecksums       | Keyword  |
| CloseTime             | Datetime |
| ExecutionDuration     | Int      |
| ExecutionStatus       | Keyword  |
| ExecutionTime         | Datetime |
| HistoryLength         | Int      |
| RunId                 | Keyword  |
| StartTime             | Datetime |
| StateTransitionCount  | Int      |
| TaskQueue             | Keyword  |
| TemporalChangeVersion | Keyword  |
| WorkflowId            | Keyword  |
| WorkflowType          | Keyword  |
+-----------------------+----------+
Storage mappings:
+-----------------------+--------------+
|      COLUMN NAME      | COLUMN TYPE  |
+-----------------------+--------------+
| ActivityAssignee      | keyword      |
| ActivityName          | keyword      |
| ActivityType          | keyword      |
| BatcherNamespace      | keyword      |
| BatcherUser           | keyword      |
| BinaryChecksums       | keyword      |
| Channel               | keyword      |
| CloseTime             | date_nanos   |
| ClosedTime            | keyword      |
| CustomBoolField       | boolean      |
| CustomDatetimeField   | date_nanos   |
| CustomDoubleField     | scaled_float |
| CustomIntField        | long         |
| CustomKeywordField    | keyword      |
| CustomStringField     | text         |
| CustomTextField       | text         |
| CustomerOrderId       | keyword      |
| ExecutionDuration     | long         |
| ExecutionStatus       | keyword      |
| ExecutionTime         | date_nanos   |
| HistoryLength         | long         |
| IncidentAssignee      | keyword      |
| IncidentClosedDate    | keyword      |
| IncidentRaisedDate    | keyword      |
| InitiatorId           | keyword      |
| InitiatorIdType       | keyword      |
| LastUpdatedDate       | keyword      |
| NamespaceId           | keyword      |
| OrderStatus           | keyword      |
| ParentWorkflowId      | keyword      |
| RunId                 | keyword      |
| StartDate             | keyword      |
| StartTime             | date_nanos   |
| StateTransitionCount  | long         |
| Status                | keyword      |
| TaskQueue             | keyword      |
| TemporalChangeVersion | keyword      |
| TransactionId         | keyword      |
| WorkflowId            | keyword      |
| WorkflowType          | keyword      |
+-----------------------+--------------+
Workflow info:
{
  "execution": {
    "workflowId": "temporal-sys-add-search-attributes-workflow",
    "runId": "85e662ed-1038-4184-a8b2-1660719561a0"
  },
  "startTime": "2022-03-17T12:03:11.960876040Z",
  "closeTime": "2022-03-17T12:03:12.095405054Z",
  "status": "Completed"
}
/etc/temporal # date
Sun Mar 20 19:18:51 UTC 2022

Hi Alex,

I come across think link: WorkflowTaskTimeout ignores RetryPolicy · Issue #1848 · temporalio/temporal · GitHub
Seems like setting MaximumAttempts also does no affect.
I tried to modify these 2 files: development.yaml and development_es.yaml (both files are in \etc\temporal\config\dynamicconfig\) below for MaximumAttempts: 3 also does not work (still retry unlimited).

Am I doing it correctly? Please advise.

Thanks & regards,

ridwan

development.yaml:

history.defaultActivityRetryPolicy:
- value:
    InitialIntervalInSeconds: 1
    MaximumIntervalCoefficient: 100.0
    BackoffCoefficient: 2.0
    MaximumAttempts: 3
history.defaultWorkflowRetryPolicy:
- value:
    InitialIntervalInSeconds: 1
    MaximumIntervalCoefficient: 100.0
    BackoffCoefficient: 2.0
    MaximumAttempts: 3
system.advancedVisibilityWritingMode:
  - value: "off"
    constraints: {}

development_es.yaml:

history.defaultActivityRetryPolicy:
- value:
    InitialIntervalInSeconds: 1
    MaximumIntervalCoefficient: 100.0
    BackoffCoefficient: 2.0
    MaximumAttempts: 3
history.defaultWorkflowRetryPolicy:
- value:
    InitialIntervalInSeconds: 1
    MaximumIntervalCoefficient: 100.0
    BackoffCoefficient: 2.0
    MaximumAttempts: 3
system.advancedVisibilityWritingMode:
  - value: "on"
    constraints: {}
system.enableReadVisibilityFromES:
  - value: true
    constraints: {}