Temporal server with the error "UpdateTaskQueue failed"

zhangkyou · April 7, 2023, 3:21am

I found there is many error log in my Temporal server’s log.

{"level":"error","ts":"2023-04-07T10:36:11.662+0800","msg":"Operation failed with internal error.","error":"UpdateTaskQueue failed. Failed to start transaction. Error: invalid connection","operation":"UpdateTaskQueue","logging-call-at":"persistenceMetricClients.go:1171","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/gaia/workspace-job/git.xiaojukeji.com/soda-server/temporal-server/internal/common/log/zap_logger.go:150\ngo.temporal.io/server/common/persistence.updateErrorMetric\n\t/gaia/workspace-job/git.xiaojukeji.com/soda-server/temporal-server/internal/common/persistence/persistenceMetricClients.go:1171\ngo.temporal.io/server/common/persistence.(*metricEmitter).recordRequestMetrics\n\t/gaia/workspace-job/git.xiaojukeji.com/soda-server/temporal-server/internal/common/persistence/persistenceMetricClients.go:1148\ngo.temporal.io/server/common/persistence.(*taskPersistenceClient).UpdateTaskQueue.func1\n\t/gaia/workspace-job/git.xiaojukeji.com/soda-server/temporal-server/internal/common/persistence/persistenceMetricClients.go:555\ngo.temporal.io/server/common/persistence.(*taskPersistenceClient).UpdateTaskQueue\n\t/gaia/workspace-job/git.xiaojukeji.com/soda-server/temporal-server/internal/common/persistence/persistenceMetricClients.go:557\ngo.temporal.io/server/service/matching.(*taskQueueDB).updateTaskQueue\n\t/gaia/workspace-job/git.xiaojukeji.com/soda-server/temporal-server/internal/service/matching/db.go:376\ngo.temporal.io/server/service/matching.(*taskQueueDB).UpdateState\n\t/gaia/workspace-job/git.xiaojukeji.com/soda-server/temporal-server/internal/service/matching/db.go:186\ngo.temporal.io/server/service/matching.(*taskReader).persistAckLevel\n\t/gaia/workspace-job/git.xiaojukeji.com/soda-server/temporal-server/internal/service/matching/taskReader.go:297\ngo.temporal.io/server/service/matching.(*taskReader).getTasksPump\n\t/gaia/workspace-job/git.xiaojukeji.com/soda-server/temporal-server/internal/service/matching/taskReader.go:209\ngo.temporal.io/server/internal/goro.(*Group).Go.func1\n\t/gaia/workspace-job/git.xiaojukeji.com/soda-server/temporal-server/internal/internal/goro/group.go:58"}

{"level":"error","ts":"2023-04-07T10:36:11.662+0800","msg":"Persistent store operation failure","service":"matching","component":"matching-engine","wf-task-queue-name":"/_sys/temporal-sys-tq-scanner-taskqueue-0/1","wf-task-queue-type":"Activity","wf-namespace":"temporal-system","store-operation":"update-task-queue","error":"UpdateTaskQueue failed. Failed to start transaction. Error: invalid connection","logging-call-at":"taskReader.go:212","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/gaia/workspace-job/git.xiaojukeji.com/soda-server/temporal-server/internal/common/log/zap_logger.go:150\ngo.temporal.io/server/service/matching.(*taskReader).getTasksPump\n\t/gaia/workspace-job/git.xiaojukeji.com/soda-server/temporal-server/internal/service/matching/taskReader.go:212\ngo.temporal.io/server/internal/goro.(*Group).Go.func1\n\t/gaia/workspace-job/git.xiaojukeji.com/soda-server/temporal-server/internal/internal/goro/group.go:58"}

Temporal server version: 1.20.0
Mysql version: v57

It seems that the error was reported from this part of the code.The m.txExecute

func (m *sqlTaskManager) UpdateTaskQueue(
	ctx context.Context,
	request *persistence.InternalUpdateTaskQueueRequest,
) (*persistence.UpdateTaskQueueResponse, error) {
	nidBytes, err := primitives.ParseUUID(request.NamespaceID)
	if err != nil {
		return nil, serviceerror.NewInternal(err.Error())
	}

	tqId, tqHash := m.taskQueueIdAndHash(nidBytes, request.TaskQueue, request.TaskType)
	var resp *persistence.UpdateTaskQueueResponse
	err = m.txExecute(ctx, "UpdateTaskQueue", func(tx sqlplugin.Tx) error {
		if err := lockTaskQueue(ctx,
			tx,
			tqHash,
			tqId,
			request.PrevRangeID,
		); err != nil {
			return err
		}
		result, err := tx.UpdateTaskQueues(ctx, &sqlplugin.TaskQueuesRow{
			RangeHash:    tqHash,
			TaskQueueID:  tqId,
			RangeID:      request.RangeID,
			Data:         request.TaskQueueInfo.Data,
			DataEncoding: request.TaskQueueInfo.EncodingType.String(),
		})
		if err != nil {
			return err
		}
		rowsAffected, err := result.RowsAffected()
		if err != nil {
			return err
		}
		if rowsAffected != 1 {
			return fmt.Errorf("%v rows were affected instead of 1", rowsAffected)
		}
		resp = &persistence.UpdateTaskQueueResponse{}
		return nil
	})
	return resp, err
}

The issue is that the sample workflow occasionally gets stuck at either the WorkflowTaskScheduled event or the ActivityTaskSchedule event, and remains there until the workflow eventually times out.

I’m not sure if this issue is related to the error logs mentioned above.

tihomir · April 7, 2023, 3:58pm

Error: invalid connection

Looks like issues connecting to your persistence store. Does this only happen with matching hosts? Check your matching hosts static configs (/etc/temporal/config/docker.yaml).

Maybe also worth checking persistence latencies:

histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))

and resource exhausted on SystemOverloaded cause:

sum(rate(service_errors_resource_exhausted{}[1m])) by (operation, resource_exhausted_cause)

zhangkyou · April 10, 2023, 2:44am

The metric persistence_latency_bucket appears to be normal.

And the metric service_errors_resource_exhausted didn’t emit.

zhangkyou · April 10, 2023, 7:04am

We are using a DBProxy in front of our MySQL database. Could you please tell me what SQL statements are used in the m.txExecute method? I want to check if it’s because our DBProxy does not support them.

zhangkyou · April 10, 2023, 7:50am

Yes，it only happen with matching service.My frontend, matching, history, worker services use the same static config. I start the service with the cmd

temporal-server --config="config/" start --services="frontend,matching,history,worker"

tihomir · April 14, 2023, 2:34pm

Any relevant errors in your dbproxy logs?
The errors you shared point to connection issues, “Error: invalid connection”.

The issue is that the sample workflow occasionally gets stuck at either the WorkflowTaskScheduled event or the ActivityTaskSchedule event, and remains there until the workflow eventually times out.

Also if you can share event history of one of those “stuck” executions please.

zhangkyou · April 17, 2023, 3:06am

There isn’t any log in dbproxy.

And there is a stuck workflow that hang at WorkflowTaskScheduled event.

[
  {
    "eventId": "1",
    "eventTime": "2023-04-16T11:49:00.299533562Z",
    "eventType": "WorkflowExecutionStarted",
    "version": "0",
    "taskId": "4468293",
    "workerMayIgnore": false,
    "workflowExecutionStartedEventAttributes": {
      "workflowType": {
        "name": "FeaturePaletteProducerWorkflow"
      },
      "parentWorkflowNamespace": "",
      "parentWorkflowNamespaceId": "",
      "parentWorkflowExecution": null,
      "parentInitiatedEventId": "0",
      "taskQueue": {
        "name": "feature-palette",
        "kind": "Normal"
      },
      "input": {
        "payloads": [
          {
            "metadata": {
              "encoding": "anNvbi9wbGFpbg=="
            },
            "data": "eyJmZWF0dXJlX25hbWUiOiJhcmVhLkRpc3BhdGNoLmRpc3BhdGNoX3N0b2NrZWQiLCJzcWwiOiJzZWxlY3QgY2l0eV9pZCBhcyBjaXR5X2lkLCBmcm9tX2FyZWFfaWQgYXMgb2JqZWN0X2lkLCAnYXJlYTpEaXNwYXRjaDpkaXNwYXRjaF9zdG9ja2VkJyBhcyBmZWF0dXJlX25hbWUsIHN1bShjYXNlIHdoZW4gZmlyc3Rfc2NvcmVfdHMgPSAwIHRoZW4gMSBlbHNlIDAgZW5kKSAgYXMgZmVhdHVyZV92YWx1ZSBmcm9tIGRfZGVsaXZlcnlfYWxsICAgIHdoZXJlIChjcmVhdGVfdGltZSBcdTAwM2UgVU5JWF9USU1FU1RBTVAoKSAtIDYwMCBhbmQgZnJvbV9hcmVhX2lkIFx1MDAzZSAwKSAgICBncm91cCBieSBjaXR5X2lkLCBmcm9tX2FyZWFfaWQiLCJkYXRhX3NvdXJjZSI6ImRlbGl2ZXJ5X29yZGVyX3dpZGVfdGFibGUiLCJjcm9uX3N0ciI6IiogKiAqICogKiIsInRpbWVvdXQiOjUwMDAwMDAwMDAwfQ=="
          }
        ]
      },
      "workflowExecutionTimeout": "0s",
      "workflowRunTimeout": "50s",
      "workflowTaskTimeout": "10s",
      "continuedExecutionRunId": "b6f741c8-325c-464b-860c-74ce58352091",
      "initiator": "CronSchedule",
      "continuedFailure": null,
      "lastCompletionResult": {
        "payloads": [
          {
            "metadata": {
              "encoding": "anNvbi9wbGFpbg=="
            },
            "data": "InN1Y2Nlc3Mi"
          }
        ]
      },
      "originalExecutionRunId": "5872de82-3541-4d47-b11e-e49168987c10",
      "identity": "",
      "firstExecutionRunId": "9546ae2b-cbbd-407d-95ac-f88732c20a99",
      "retryPolicy": {
        "initialInterval": "1s",
        "backoffCoefficient": 2,
        "maximumInterval": "20s",
        "maximumAttempts": 3,
        "nonRetryableErrorTypes": []
      },
      "attempt": 1,
      "workflowExecutionExpirationTime": null,
      "cronSchedule": "* * * * *",
      "firstWorkflowTaskBackoff": "60s",
      "memo": null,
      "searchAttributes": null,
      "prevAutoResetPoints": {
        "points": [
          {
            "binaryChecksum": "e09e4f7f586b102723272ca02a06a1ed",
            "runId": "9546ae2b-cbbd-407d-95ac-f88732c20a99",
            "firstWorkflowTaskCompletedId": "4",
            "createTime": "2023-04-11T03:05:00.820153573Z",
            "expireTime": "2023-04-14T03:05:00.901427868Z",
            "resettable": true
          },
          {
            "binaryChecksum": "76b3d3238b8dcf11c2cb3c82cf2903d0",
            "runId": "82449981-2ca9-42b0-9068-ea28fa6ed570",
            "firstWorkflowTaskCompletedId": "4",
            "createTime": "2023-04-13T07:36:00.695644487Z",
            "expireTime": "2023-04-16T07:36:00.749381633Z",
            "resettable": true
          }
        ]
      },
      "header": {
        "fields": {}
      },
      "parentInitiatedEventVersion": "0"
    }
  },
  {
    "eventId": "2",
    "eventTime": "2023-04-16T11:50:00.301273463Z",
    "eventType": "WorkflowTaskScheduled",
    "version": "0",
    "taskId": "4468303",
    "workerMayIgnore": false,
    "workflowTaskScheduledEventAttributes": {
      "taskQueue": {
        "name": "feature-palette",
        "kind": "Normal"
      },
      "startToCloseTimeout": "10s",
      "attempt": 1
    }
  },
  {
    "eventId": "3",
    "eventTime": "2023-04-16T11:50:50.302433608Z",
    "eventType": "WorkflowExecutionTimedOut",
    "version": "0",
    "taskId": "4468307",
    "workerMayIgnore": false,
    "workflowExecutionTimedOutEventAttributes": {
      "retryState": "InProgress",
      "newExecutionRunId": "8bc15396-163b-42d9-9e3d-e0124cf8cf33"
    }
  }
]

tihomir · April 17, 2023, 3:30am

Do you have workers polling on the feature-palette task queue?
Could check with cli or tctl, for example:

temporal task-queue describe -t feature-palette

tctl tq desc -tq feature-palette

What’s the reason to set workflow run timeout for this cron execution to 50s?
Service in this case correctly scheduled the first workflow task and timed out the execution when workflow run timeout was reached, so not sure if this might be related to original issue. Would check if you have workers polling. If you do maybe check service errors:

sum(rate(service_error_with_type{service_type="frontend"}[5m])) by (error_type)

and your workers workflow task schedule to start latencies:

histogram_quantile(0.95, sum(rate(temporal_workflow_task_schedule_to_start_latency_bucket[5m])) by (namespace, le))

zhangkyou · April 20, 2023, 7:55am

1.There are three workers polling on the feature-palette. If there are no workers available in the feature-palette task queue, all the workflow run instances on the feature-palette will timeout. However, some workflow run instances occasionally encounter this problem.

2.I need to run this workflow every minute, so each run instance of the workflow must be completed within 50 seconds in order not to affect the execution of the next minute. WorkflowTaskScheduled is a very quick process and should not take 50 seconds to complete. My Temporal cluster is only running two workflows.

3.As for the issue with service errors, there are some errors in the namespace I’m using, but the time when the errors occurred does not match the time when the WorkflowTaskScheduled was hung up.

4. The metric temporal_workflow_task_schedule_to_start_latency_bucket only contains data from the namespace temporal_system . There were only two data points recorded for this metric in one day.

Topic		Replies	Views
Scheduled WorkflowTasks occasionally not starting Community Support	15	1153	July 1, 2024
DEADLINE_EXCEEDED: deadline exceeded after 9.999933037s Community Support java-sdk	9	2426	July 13, 2023
Occasionally workflow task won't be started after scheduled Community Support	16	614	November 9, 2022
Persistence layer problems Community Support	4	694	December 16, 2024
Frontend service continuously spewing errors Community Support mysql , deployment	3	2548	July 30, 2020

Temporal server with the error "UpdateTaskQueue failed"

Related topics