Operation failed with error

Self-hosting Temporal server (currently v1.12.3) in K8s (EKS) and seeing the following errors on every workflow invocation:

{"level":"error","ts":"2022-02-07T22:48:59.553Z","msg":"Operation failed with an error.","logging-call-at":"visibility_manager_metrics.go:255","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/common/persistence/visibility.(*visibilityPersistenceClient).updateErrorMetric\n\t/temporal/common/persistence/visibility/visibility_manager_metrics.go:255\ngo.temporal.io/server/common/persistence/visibility.(*visibilityPersistenceClient).RecordWorkflowExecutionStarted\n\t/temporal/common/persistence/visibility/visibility_manager_metrics.go:68\ngo.temporal.io/server/common/persistence/visibility.(*visibilityManagerWrapper).RecordWorkflowExecutionStarted\n\t/temporal/common/persistence/visibility/visibility_manager_dual_write.go:78\ngo.temporal.io/server/service/history.(*visibilityQueueTaskExecutor).recordStartExecution\n\t/temporal/service/history/visibilityQueueTaskExecutor.go:262\ngo.temporal.io/server/service/history.(*visibilityQueueTaskExecutor).processStartOrUpsertExecution\n\t/temporal/service/history/visibilityQueueTaskExecutor.go:188\ngo.temporal.io/server/service/history.(*visibilityQueueTaskExecutor).execute\n\t/temporal/service/history/visibilityQueueTaskExecutor.go:114\ngo.temporal.io/server/service/history.(*visibilityQueueProcessorImpl).process\n\t/temporal/service/history/visibilityQueueProcessor.go:316\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskOnce\n\t/temporal/service/history/taskProcessor.go:269\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:221\ngo.temporal.io/server/common/backoff.Retry.func1\n\t/temporal/common/backoff/retry.go:104\ngo.temporal.io/server/common/backoff.RetryContext\n\t/temporal/common/backoff/retry.go:125\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:105\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:248\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:171"}
{"level":"error","ts":"2022-02-07T22:48:59.915Z","msg":"Operation failed with an error.","logging-call-at":"visibility_manager_metrics.go:255","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/common/persistence/visibility.(*visibilityPersistenceClient).updateErrorMetric\n\t/temporal/common/persistence/visibility/visibility_manager_metrics.go:255\ngo.temporal.io/server/common/persistence/visibility.(*visibilityPersistenceClient).RecordWorkflowExecutionClosed\n\t/temporal/common/persistence/visibility/visibility_manager_metrics.go:81\ngo.temporal.io/server/common/persistence/visibility.(*visibilityManagerWrapper).RecordWorkflowExecutionClosed\n\t/temporal/common/persistence/visibility/visibility_manager_dual_write.go:94\ngo.temporal.io/server/service/history.(*visibilityQueueTaskExecutor).recordCloseExecution\n\t/temporal/service/history/visibilityQueueTaskExecutor.go:421\ngo.temporal.io/server/service/history.(*visibilityQueueTaskExecutor).processCloseExecution\n\t/temporal/service/history/visibilityQueueTaskExecutor.go:371\ngo.temporal.io/server/service/history.(*visibilityQueueTaskExecutor).execute\n\t/temporal/service/history/visibilityQueueTaskExecutor.go:118\ngo.temporal.io/server/service/history.(*visibilityQueueProcessorImpl).process\n\t/temporal/service/history/visibilityQueueProcessor.go:316\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskOnce\n\t/temporal/service/history/taskProcessor.go:269\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:221\ngo.temporal.io/server/common/backoff.Retry.func1\n\t/temporal/common/backoff/retry.go:104\ngo.temporal.io/server/common/backoff.RetryContext\n\t/temporal/common/backoff/retry.go:125\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:105\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:248\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:171"}

So all I have to go on here is “Operation failed with an error” and a stack trace.

The stack trace points to this bit of code:

func (p *visibilityPersistenceClient) updateErrorMetric(scope metrics.Scope, err error) {
	switch err.(type) {
	case *persistence.ConditionFailedError:
		scope.IncCounter(metrics.VisibilityPersistenceConditionFailed)
	case *persistence.TimeoutError:
		scope.IncCounter(metrics.VisibilityPersistenceTimeout)
		scope.IncCounter(metrics.VisibilityPersistenceFailures)
	case *serviceerror.NotFound:
		scope.IncCounter(metrics.VisibilityPersistenceNotFound)
	case *serviceerror.ResourceExhausted:
		scope.IncCounter(metrics.VisibilityPersistenceResourceExhausted)
		scope.IncCounter(metrics.VisibilityPersistenceFailures)
	default:
		p.logger.Error("Operation failed with an error.", tag.Error(err))
		scope.IncCounter(metrics.VisibilityPersistenceFailures)
	}
}

So there appears to be an unhandled error, but that error is not being logged.

The DB (RDS MySQL) health metrics look fine. I confirmed that the history service has connectivity to the database (exec’ed onto the container, installed a mysql client and connected to the DB). Any advice on what I should do next for troubleshooting?

This is a phantom error. I mean, there is no error and err is actually nil (this is why there is no error message in the log) but check for nil was accidentally removed and nil falls into default case.

There are two way to mitigate this:

  1. Upgrade to 1.13.x (whatever is the latest). Visibility was refactored there and the check is back.
  2. Use advanced visibility with Elasticsearch because there is another metrics wrapper for advanced visibility and it also has that nil check.