Temporal server with the error "UpdateTaskQueue failed"

I found many error logs like the following in my Temporal server's logs.

{"level":"error","ts":"2023-04-07T10:36:11.662+0800","msg":"Operation failed with internal error.","error":"UpdateTaskQueue failed. Failed to start transaction. Error: invalid connection","operation":"UpdateTaskQueue","logging-call-at":"persistenceMetricClients.go:1171","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/gaia/workspace-job/git.xiaojukeji.com/soda-server/temporal-server/internal/common/log/zap_logger.go:150\ngo.temporal.io/server/common/persistence.updateErrorMetric\n\t/gaia/workspace-job/git.xiaojukeji.com/soda-server/temporal-server/internal/common/persistence/persistenceMetricClients.go:1171\ngo.temporal.io/server/common/persistence.(*metricEmitter).recordRequestMetrics\n\t/gaia/workspace-job/git.xiaojukeji.com/soda-server/temporal-server/internal/common/persistence/persistenceMetricClients.go:1148\ngo.temporal.io/server/common/persistence.(*taskPersistenceClient).UpdateTaskQueue.func1\n\t/gaia/workspace-job/git.xiaojukeji.com/soda-server/temporal-server/internal/common/persistence/persistenceMetricClients.go:555\ngo.temporal.io/server/common/persistence.(*taskPersistenceClient).UpdateTaskQueue\n\t/gaia/workspace-job/git.xiaojukeji.com/soda-server/temporal-server/internal/common/persistence/persistenceMetricClients.go:557\ngo.temporal.io/server/service/matching.(*taskQueueDB).updateTaskQueue\n\t/gaia/workspace-job/git.xiaojukeji.com/soda-server/temporal-server/internal/service/matching/db.go:376\ngo.temporal.io/server/service/matching.(*taskQueueDB).UpdateState\n\t/gaia/workspace-job/git.xiaojukeji.com/soda-server/temporal-server/internal/service/matching/db.go:186\ngo.temporal.io/server/service/matching.(*taskReader).persistAckLevel\n\t/gaia/workspace-job/git.xiaojukeji.com/soda-server/temporal-server/internal/service/matching/taskReader.go:297\ngo.temporal.io/server/service/matching.(*taskReader).getTasksPump\n\t/gaia/workspace-job/git.xiaojukeji.com/soda-server/temporal-server/internal/service/matching/taskReader.go:209\ngo.temporal.io/server/internal/goro.(*Group).Go.func1\n\t/gaia/workspace-job/git.xiaojukeji.com/soda-server/temporal-server/internal/internal/goro/group.go:58"}
{"level":"error","ts":"2023-04-07T10:36:11.662+0800","msg":"Persistent store operation failure","service":"matching","component":"matching-engine","wf-task-queue-name":"/_sys/temporal-sys-tq-scanner-taskqueue-0/1","wf-task-queue-type":"Activity","wf-namespace":"temporal-system","store-operation":"update-task-queue","error":"UpdateTaskQueue failed. Failed to start transaction. Error: invalid connection","logging-call-at":"taskReader.go:212","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/gaia/workspace-job/git.xiaojukeji.com/soda-server/temporal-server/internal/common/log/zap_logger.go:150\ngo.temporal.io/server/service/matching.(*taskReader).getTasksPump\n\t/gaia/workspace-job/git.xiaojukeji.com/soda-server/temporal-server/internal/service/matching/taskReader.go:212\ngo.temporal.io/server/internal/goro.(*Group).Go.func1\n\t/gaia/workspace-job/git.xiaojukeji.com/soda-server/temporal-server/internal/internal/goro/group.go:58"}

Temporal server version: 1.20.0
MySQL version: 5.7

It seems that the error was reported from this part of the code, specifically from the m.txExecute call (the "Failed to start transaction" wording suggests the failure happens while the DB transaction is being opened):

func (m *sqlTaskManager) UpdateTaskQueue(
	ctx context.Context,
	request *persistence.InternalUpdateTaskQueueRequest,
) (*persistence.UpdateTaskQueueResponse, error) {
	nidBytes, err := primitives.ParseUUID(request.NamespaceID)
	if err != nil {
		return nil, serviceerror.NewInternal(err.Error())
	}

	tqId, tqHash := m.taskQueueIdAndHash(nidBytes, request.TaskQueue, request.TaskType)
	var resp *persistence.UpdateTaskQueueResponse
	err = m.txExecute(ctx, "UpdateTaskQueue", func(tx sqlplugin.Tx) error {
		if err := lockTaskQueue(ctx,
			tx,
			tqHash,
			tqId,
			request.PrevRangeID,
		); err != nil {
			return err
		}
		result, err := tx.UpdateTaskQueues(ctx, &sqlplugin.TaskQueuesRow{
			RangeHash:    tqHash,
			TaskQueueID:  tqId,
			RangeID:      request.RangeID,
			Data:         request.TaskQueueInfo.Data,
			DataEncoding: request.TaskQueueInfo.EncodingType.String(),
		})
		if err != nil {
			return err
		}
		rowsAffected, err := result.RowsAffected()
		if err != nil {
			return err
		}
		if rowsAffected != 1 {
			return fmt.Errorf("%v rows were affected instead of 1", rowsAffected)
		}
		resp = &persistence.UpdateTaskQueueResponse{}
		return nil
	})
	return resp, err
}

The issue is that the sample workflow occasionally gets stuck at either the WorkflowTaskScheduled event or the ActivityTaskScheduled event, and remains there until the workflow eventually times out.

I’m not sure if this issue is related to the error logs mentioned above.

Error: invalid connection

Looks like issues connecting to your persistence store. Does this only happen with matching hosts? Check your matching hosts' static configs (/etc/temporal/config/docker.yaml).

Maybe also worth checking persistence latencies:

histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))

and resource-exhausted errors with the SystemOverloaded cause:

sum(rate(service_errors_resource_exhausted{}[1m])) by (operation, resource_exhausted_cause)

The metric persistence_latency_bucket appears to be normal.

And the metric service_errors_resource_exhausted didn't emit any data.

We are using a DBProxy in front of our MySQL database. Could you please tell me what SQL statements are used in the m.txExecute method? I want to check whether our DBProxy supports them.
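
My rough guess from reading the code above is that the transaction boils down to an explicit BEGIN, a SELECT ... FOR UPDATE on the task_queues row (lockTaskQueue), an UPDATE of that row, and a COMMIT. If that is about right, I can replay the same kind of statements through the DBProxy with a small standalone program like this (the query text, DSN, and parameter values are my guesses and placeholders, not copied from the v1.20 MySQL plugin):

package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// DSN is a placeholder; point it at the DBProxy, not at MySQL directly.
	db, err := sql.Open("mysql", "user:pass@tcp(dbproxy-host:3306)/temporal")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// "Failed to start transaction" in the server log corresponds to this step.
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		log.Fatal("begin: ", err)
	}
	defer tx.Rollback()

	// Row-lock read, as lockTaskQueue appears to do (guessed query text).
	var rangeID int64
	err = tx.QueryRowContext(ctx,
		"SELECT range_id FROM task_queues WHERE range_hash = ? AND task_queue_id = ? FOR UPDATE",
		0, []byte("dummy")).Scan(&rangeID)
	if err != nil && err != sql.ErrNoRows {
		log.Fatal("select for update: ", err)
	}

	// Conditional update, as tx.UpdateTaskQueues appears to do (guessed query text).
	if _, err := tx.ExecContext(ctx,
		"UPDATE task_queues SET range_id = ?, data = ?, data_encoding = ? WHERE range_hash = ? AND task_queue_id = ?",
		rangeID+1, []byte{}, "Proto3", 0, []byte("dummy")); err != nil {
		log.Fatal("update: ", err)
	}

	if err := tx.Commit(); err != nil {
		log.Fatal("commit: ", err)
	}
	log.Println("transaction round-trip through the DBProxy looks OK")
}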

Yes, it only happens with the matching service. My frontend, matching, history, and worker services use the same static config. I start the services with this command:

temporal-server --config="config/" start --services="frontend,matching,history,worker"

Any relevant errors in your dbproxy logs?
The errors you shared point to connection issues, “Error: invalid connection”.

The issue is that the sample workflow occasionally gets stuck at either the WorkflowTaskScheduled event or the ActivityTaskScheduled event, and remains there until the workflow eventually times out.

Also if you can share event history of one of those “stuck” executions please.
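
One thing worth ruling out, since there is a DBProxy in the path: go-sql-driver/mysql returns "invalid connection" when it tries to reuse a pooled connection that the server or a proxy has already closed, and the usual fix is to keep the pool's connection lifetime below the proxy's idle timeout. If I remember the config correctly, Temporal's sql persistence section exposes the matching knobs (maxConns, maxIdleConns, maxConnLifetime). A minimal plain database/sql sketch of the idea, with placeholder DSN and durations:

package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "user:pass@tcp(dbproxy-host:3306)/temporal")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Keep pooled connections younger than the proxy's idle/connection timeout,
	// so the driver never hands out a connection the proxy has already dropped.
	db.SetConnMaxLifetime(3 * time.Minute)
	db.SetMaxOpenConns(20)
	db.SetMaxIdleConns(20)

	if err := db.Ping(); err != nil {
		log.Fatal(err)
	}
	log.Println("pool configured; connections recycled every 3 minutes")
}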

There aren't any relevant logs in the DBProxy.

And here is a stuck workflow that hung at the WorkflowTaskScheduled event.

[
  {
    "eventId": "1",
    "eventTime": "2023-04-16T11:49:00.299533562Z",
    "eventType": "WorkflowExecutionStarted",
    "version": "0",
    "taskId": "4468293",
    "workerMayIgnore": false,
    "workflowExecutionStartedEventAttributes": {
      "workflowType": {
        "name": "FeaturePaletteProducerWorkflow"
      },
      "parentWorkflowNamespace": "",
      "parentWorkflowNamespaceId": "",
      "parentWorkflowExecution": null,
      "parentInitiatedEventId": "0",
      "taskQueue": {
        "name": "feature-palette",
        "kind": "Normal"
      },
      "input": {
        "payloads": [
          {
            "metadata": {
              "encoding": "anNvbi9wbGFpbg=="
            },
            "data": "eyJmZWF0dXJlX25hbWUiOiJhcmVhLkRpc3BhdGNoLmRpc3BhdGNoX3N0b2NrZWQiLCJzcWwiOiJzZWxlY3QgY2l0eV9pZCBhcyBjaXR5X2lkLCBmcm9tX2FyZWFfaWQgYXMgb2JqZWN0X2lkLCAnYXJlYTpEaXNwYXRjaDpkaXNwYXRjaF9zdG9ja2VkJyBhcyBmZWF0dXJlX25hbWUsIHN1bShjYXNlIHdoZW4gZmlyc3Rfc2NvcmVfdHMgPSAwIHRoZW4gMSBlbHNlIDAgZW5kKSAgYXMgZmVhdHVyZV92YWx1ZSBmcm9tIGRfZGVsaXZlcnlfYWxsICAgIHdoZXJlIChjcmVhdGVfdGltZSBcdTAwM2UgVU5JWF9USU1FU1RBTVAoKSAtIDYwMCBhbmQgZnJvbV9hcmVhX2lkIFx1MDAzZSAwKSAgICBncm91cCBieSBjaXR5X2lkLCBmcm9tX2FyZWFfaWQiLCJkYXRhX3NvdXJjZSI6ImRlbGl2ZXJ5X29yZGVyX3dpZGVfdGFibGUiLCJjcm9uX3N0ciI6IiogKiAqICogKiIsInRpbWVvdXQiOjUwMDAwMDAwMDAwfQ=="
          }
        ]
      },
      "workflowExecutionTimeout": "0s",
      "workflowRunTimeout": "50s",
      "workflowTaskTimeout": "10s",
      "continuedExecutionRunId": "b6f741c8-325c-464b-860c-74ce58352091",
      "initiator": "CronSchedule",
      "continuedFailure": null,
      "lastCompletionResult": {
        "payloads": [
          {
            "metadata": {
              "encoding": "anNvbi9wbGFpbg=="
            },
            "data": "InN1Y2Nlc3Mi"
          }
        ]
      },
      "originalExecutionRunId": "5872de82-3541-4d47-b11e-e49168987c10",
      "identity": "",
      "firstExecutionRunId": "9546ae2b-cbbd-407d-95ac-f88732c20a99",
      "retryPolicy": {
        "initialInterval": "1s",
        "backoffCoefficient": 2,
        "maximumInterval": "20s",
        "maximumAttempts": 3,
        "nonRetryableErrorTypes": []
      },
      "attempt": 1,
      "workflowExecutionExpirationTime": null,
      "cronSchedule": "* * * * *",
      "firstWorkflowTaskBackoff": "60s",
      "memo": null,
      "searchAttributes": null,
      "prevAutoResetPoints": {
        "points": [
          {
            "binaryChecksum": "e09e4f7f586b102723272ca02a06a1ed",
            "runId": "9546ae2b-cbbd-407d-95ac-f88732c20a99",
            "firstWorkflowTaskCompletedId": "4",
            "createTime": "2023-04-11T03:05:00.820153573Z",
            "expireTime": "2023-04-14T03:05:00.901427868Z",
            "resettable": true
          },
          {
            "binaryChecksum": "76b3d3238b8dcf11c2cb3c82cf2903d0",
            "runId": "82449981-2ca9-42b0-9068-ea28fa6ed570",
            "firstWorkflowTaskCompletedId": "4",
            "createTime": "2023-04-13T07:36:00.695644487Z",
            "expireTime": "2023-04-16T07:36:00.749381633Z",
            "resettable": true
          }
        ]
      },
      "header": {
        "fields": {}
      },
      "parentInitiatedEventVersion": "0"
    }
  },
  {
    "eventId": "2",
    "eventTime": "2023-04-16T11:50:00.301273463Z",
    "eventType": "WorkflowTaskScheduled",
    "version": "0",
    "taskId": "4468303",
    "workerMayIgnore": false,
    "workflowTaskScheduledEventAttributes": {
      "taskQueue": {
        "name": "feature-palette",
        "kind": "Normal"
      },
      "startToCloseTimeout": "10s",
      "attempt": 1
    }
  },
  {
    "eventId": "3",
    "eventTime": "2023-04-16T11:50:50.302433608Z",
    "eventType": "WorkflowExecutionTimedOut",
    "version": "0",
    "taskId": "4468307",
    "workerMayIgnore": false,
    "workflowExecutionTimedOutEventAttributes": {
      "retryState": "InProgress",
      "newExecutionRunId": "8bc15396-163b-42d9-9e3d-e0124cf8cf33"
    }
  }
]

Do you have workers polling on the feature-palette task queue?
You can check with the CLI or tctl, for example:

temporal task-queue describe -t feature-palette

tctl tq desc -tq feature-palette

What's the reason for setting the workflow run timeout of this cron execution to 50s?
The service in this case correctly scheduled the first workflow task and timed out the execution when the workflow run timeout was reached, so I'm not sure this is related to the original issue. I would check whether you have workers polling. If you do, maybe also check service errors:

sum(rate(service_error_with_type{service_type="frontend"}[5m])) by (error_type)

and your workers' workflow task schedule-to-start latencies:

histogram_quantile(0.95, sum(rate(temporal_workflow_task_schedule_to_start_latency_bucket[5m])) by (namespace, le))

1. There are three workers polling on feature-palette. If no workers were available on the feature-palette task queue, every workflow run on it would time out; instead, only some workflow runs occasionally hit this problem.

2. I need to run this workflow every minute, so each run must complete within 50 seconds so that it doesn't affect the next minute's execution (roughly how we start this cron is sketched after this list). WorkflowTaskScheduled is normally a very quick step and should not take 50 seconds to complete. My Temporal cluster is only running two workflows.

3. As for the service errors, there are some errors in the namespace I'm using, but the times those errors occurred don't match the times when WorkflowTaskScheduled got stuck.

4. The metric temporal_workflow_task_schedule_to_start_latency_bucket only contains data from the temporal_system namespace. There were only two data points recorded for this metric in one day.
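
For reference, this is roughly how we start the cron with the Go SDK; the schedule, timeouts, and retry policy mirror the event history above, while the workflow ID and input are placeholders:

package main

import (
	"context"
	"log"
	"time"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/temporal"
)

func main() {
	c, err := client.Dial(client.Options{}) // defaults to localhost:7233
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	_, err = c.ExecuteWorkflow(context.Background(), client.StartWorkflowOptions{
		ID:                  "feature-palette-producer", // placeholder ID
		TaskQueue:           "feature-palette",
		CronSchedule:        "* * * * *",      // run every minute
		WorkflowRunTimeout:  50 * time.Second, // must finish before the next tick
		WorkflowTaskTimeout: 10 * time.Second,
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval:    time.Second,
			BackoffCoefficient: 2,
			MaximumInterval:    20 * time.Second,
			MaximumAttempts:    3,
		},
	}, "FeaturePaletteProducerWorkflow") // the real call also passes the JSON feature config as input
	if err != nil {
		log.Fatal(err)
	}
	log.Println("cron workflow started")
}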