A sub-process with workflow.Sleep() encountered a non-deterministic error after a Cassandra server migration

Our workflow has a sub-process that calls an activity (check delivery status), checks the result, and, if the delivery is not yet successful, sleeps for 3 hours and repeats. The loop continues until 2 days have passed:

The sleep is implemented with workflow.Sleep(ctx, subProcessDurationLeft). In the Temporal Web UI the sub-process looks like this (image 2).

In the production environment there was a Cassandra server migration, and it affected some running workflow executions. Some of them encountered WF Task Timed Out (image 3).

We gathered some error messages from Temporal Server log:

  • Corrupted non-contiguous event batch
  • Operation failed with internal error.
  • encounter data loss event
  • unavailable error, data loss

When the Cassandra server was available (running) again, the transaction ran normally (image 4).

However, when the next sleep timer fired, a non-deterministic error occurred (image 5).

We suspect the sleep timers conflicted: while the Cassandra server was down, the timer could only fire after Cassandra came back, and by then it was already time for the next sleep timer to start.

We tried to replicate this issue in an NFT scenario by shutting down the Cassandra server and the Temporal server, but no non-deterministic error occurred; the transaction ran normally once the servers were available again.

Is there any way to find the root cause of the NDE in production? And is there any way to prevent this scenario from happening again?

Go module versions:
go.temporal.io/api v1.6.1-0.20211110205628-60c98e9cbfe2
go.temporal.io/sdk v1.13.1

Image 2: [screenshot: the sub-process as shown in the Temporal Web UI]
Image 3: [screenshot: WF Task Timed Out on affected executions]
Image 4: [screenshot: the transaction running normally after Cassandra recovered]
Image 5: [screenshot: the non-deterministic error]

The Temporal Server version is 1.13.1
The Cassandra server version is 4.0.1.

We suspect the sleep timers conflicted: while the Cassandra server was down, the timer could only fire after Cassandra came back, and by then it was already time for the next sleep timer to start.

Can you show the full event history for this execution please?

tctl wf show -w <wfid> -r <runid> --of myhistory.json
(and share myhistory.json)

Can you show how your workflow code calculates the sleep duration?

Hi Tihomir, I’ll try to get the event history.
This is how the sleep duration is calculated:

deliveryHasIncident := false
subProcessDurationLeft := time.Duration(deliveryStatusSubProcessTimerInMinute) * time.Minute
sleepDuration := time.Duration(deliveryStatusSleepTimerInMinute) * time.Minute

for {
	_ = commonutil.UpsertSearchAtb(ctx, utility.GetDeliveryOrder, commonutil.ServiceTask, "", queryState)
	var getDeliveryOrderRes GetOrderDeliveryResponse
	_ = workflow.ExecuteActivity(ctx, wf.getOrderDeliveryAct.GetOrderDelivery, getDeliveryOrderReq).Get(ctx, &getDeliveryOrderRes)

	if getDeliveryOrderRes.Transaction.StatusCode == SUCCESS {
		break
	} else {
		if subProcessDurationLeft <= 0 {
			// Timeout: the sub-process window has been exhausted
			deliveryHasIncident = true
			break
		} else if subProcessDurationLeft < sleepDuration {
			// Sleep with the remaining time
			_ = commonutil.UpsertSearchAtb(ctx, utility.PauseGetDeliveryOrder, commonutil.ServiceTask, "", queryState)
			_ = workflow.Sleep(ctx, subProcessDurationLeft)
		} else {
			// Sleep with the default sleep duration
			_ = commonutil.UpsertSearchAtb(ctx, utility.PauseGetDeliveryOrder, commonutil.ServiceTask, "", queryState)
			_ = workflow.Sleep(ctx, sleepDuration)
		}
		subProcessDurationLeft = subProcessDurationLeft - sleepDuration
	}
}

Are deliveryStatusSubProcessTimerInMinute and deliveryStatusSleepTimerInMinute workflow inputs, or are they calculated as well?

Hi Tihomir, both are inputs from environment variables.

Are you reading these variables directly, or through an activity or side effect? If directly, this is not deterministic, because the workflow code path changes if the value changes.
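
For example, a minimal sketch of the side-effect approach (the package, function, and env var names here are only illustrations, not the poster's actual code); the value is recorded into workflow history on first execution and reused as-is on replay:

package deliverywf // hypothetical package

import (
	"os"
	"strconv"
	"time"

	"go.temporal.io/sdk/workflow"
)

// sleepDurationFromEnv is a sketch: it reads an environment-derived value
// through workflow.SideEffect so that replay always sees the value that was
// recorded in history, even if the env var changes later.
func sleepDurationFromEnv(ctx workflow.Context) (time.Duration, error) {
	var minutes int
	err := workflow.SideEffect(ctx, func(workflow.Context) interface{} {
		// The non-deterministic read happens only once; the result is persisted.
		m, _ := strconv.Atoi(os.Getenv("DELIVERY_STATUS_SLEEP_TIMER_IN_MINUTE")) // illustrative name
		return m
	}).Get(&minutes)
	if err != nil {
		return 0, err
	}
	return time.Duration(minutes) * time.Minute, nil
}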

Hi Maxim,
The service class reads the values from environment variables and passes them to the workflow when calling ExecuteWorkflow(). If an env var changes, the new value does not apply to currently running workflows; it only takes effect the next time ExecuteWorkflow() is called. So within the workflow code the values are effectively constant. Please correct me if I'm wrong.
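
For clarity, here is a minimal sketch of that starter-side pattern (the workflow name, task queue, and env var names are placeholders, not our real code): the values are read once outside the workflow and become fixed inputs of each execution.

package main

import (
	"context"
	"os"
	"strconv"

	"go.temporal.io/sdk/client"
)

// startDeliverySubProcess sketches the service-layer call described above:
// env vars are read here and passed as workflow arguments, so a running
// execution never observes a later change to the environment.
func startDeliverySubProcess(ctx context.Context, c client.Client, orderID string) (client.WorkflowRun, error) {
	subProcessMin, _ := strconv.Atoi(os.Getenv("DELIVERY_STATUS_SUBPROCESS_TIMER_IN_MINUTE")) // placeholder name
	sleepMin, _ := strconv.Atoi(os.Getenv("DELIVERY_STATUS_SLEEP_TIMER_IN_MINUTE"))           // placeholder name

	opts := client.StartWorkflowOptions{
		ID:        "delivery-" + orderID,
		TaskQueue: "delivery-task-queue", // placeholder task queue
	}
	// The inputs are recorded in the WorkflowExecutionStarted event.
	return c.ExecuteWorkflow(ctx, opts, "DeliverySubProcessWorkflow", orderID, subProcessMin, sleepMin)
}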

This is the partial log I was able to gather:

{“log”:“{"level":"info","ts":"2022-11-29T22:40:59.551Z","msg":"Queue processor pump shut down.","service":"history","shard-id":105,"address":"10.59.68.76:7234","shard-item":"0xc00acb7600","component":"visibility-queue-processor","logging-call-at":"queueProcessor.go:249"}\r\n”,“stream”:“stdout”,“time”:“2022-11-29T22:40:59.551205925Z”}
{“log”:“{"level":"info","ts":"2022-11-29T22:40:59.551Z","msg":"Task processor shutdown.","service":"history","shard-id":105,"address":"10.59.68.76:7234","shard-item":"0xc00acb7600","component":"visibility-queue-processor","logging-call-at":"taskProcessor.go:155"}\r\n”,“stream”:“stdout”,“time”:“2022-11-29T22:40:59.551251196Z”}
{“log”:“{"level":"info","ts":"2022-11-29T22:40:59.551Z","msg":"none","service":"history","shard-id":105,"address":"10.59.68.76:7234","shard-item":"0xc00acb7600","component":"visibility-queue-processor","lifecycle":"Stopped","component":"transfer-queue-processor","logging-call-at":"queueProcessor.go:179"}\r\n”,“stream”:“stdout”,“time”:“2022-11-29T22:40:59.551274493Z”}
{“log”:“{"level":"info","ts":"2022-11-29T22:40:59.551Z","msg":"none","service":"history","shard-id":105,"address":"10.59.68.76:7234","shard-item":"0xc00acb7600","component":"history-engine","lifecycle":"Stopped","logging-call-at":"historyEngine.go:331"}\r\n”,“stream”:“stdout”,“time”:“2022-11-29T22:40:59.551278235Z”}
{“log”:“{"level":"info","ts":"2022-11-29T22:40:59.551Z","msg":"none","service":"history","shard-id":105,"address":"10.59.68.76:7234","shard-item":"0xc00acb7600","lifecycle":"Stopped","component":"shard-engine","logging-call-at":"controller_impl.go:463"}\r\n”,“stream”:“stdout”,“time”:“2022-11-29T22:40:59.551287714Z”}
{“log”:“{"level":"error","ts":"2022-11-29T22:41:21.407Z","msg":"Persistent store operation failure","service":"matching","component":"matching-engine","wf-task-queue-name":"temporal-sys-history-scanner-taskqueue-0","wf-task-queue-type":"Workflow","store-operation":"update-task-queue","error":"Persistence Max QPS Reached.","logging-call-at":"taskReader.go:171","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/service/matching.(*taskReader).getTasksPump\n\t/temporal/service/matching/taskReader.go:171\ngo.temporal.io/server/internal/goro.(*Group).Go.func1\n\t/temporal/internal/goro/group.go:57"}\r\n”,“stream”:“stdout”,“time”:“2022-11-29T22:41:21.407692761Z”}
{“log”:“{"level":"error","ts":"2022-11-29T22:41:54.586Z","msg":"Corrupted non-contiguous event batch","wf-first-event-id":24,"wf-next-event-id":25,"token-last-event-id":24,"counter":2,"logging-call-at":"history_manager.go:687","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/common/persistence.(*executionManagerImpl).readHistoryBranch\n\t/temporal/common/persistence/history_manager.go:687\ngo.temporal.io/server/common/persistence.(*executionManagerImpl).ReadHistoryBranch\n\t/temporal/common/persistence/history_manager.go:458\ngo.temporal.io/server/common/persistence.(*executionRateLimitedPersistenceClient).ReadHistoryBranch\n\t/temporal/common/persistence/persistenceRateLimitedClients.go:605\ngo.temporal.io/server/common/persistence.(*executionPersistenceClient).ReadHistoryBranch\n\t/temporal/common/persistence/persistenceMetricClients.go:852\ngo.temporal.io/server/common/persistence.ReadFullPageEvents\n\t/temporal/common/persistence/history_manager_util.go:41\ngo.temporal.io/server/service/frontend.(*WorkflowHandler).getHistory\n\t/temporal/service/frontend/workflowHandler.go:3074\ngo.temporal.io/server/service/frontend.(*WorkflowHandler).createPollWorkflowTaskQueueResponse\n\t/temporal/service/frontend/workflowHandler.go:3242\ngo.temporal.io/server/service/frontend.(*WorkflowHandler).PollWorkflowTaskQueue\n\t/temporal/service/frontend/workflowHandler.go:819\ngo.temporal.io/server/service/frontend.(*DCRedirectionHandlerImpl).PollWorkflowTaskQueue.func2\n\t/temporal/service/frontend/dcRedirectionHandler.go:532\ngo.temporal.io/server/service/frontend.(*NoopRedirectionPolicy).WithNamespaceRedirect\n\t/temporal/service/frontend/dcRedirectionPolicy.go:118\ngo.temporal.io/server/service/frontend.(*DCRedirectionHandlerImpl).PollWorkflowTaskQueue\n\t/temporal/service/frontend/dcRedirectionHandler.go:528\ngo.temporal.io/api/workflowservice/v1._WorkflowService_PollWorkflowTaskQueue_Handler.func1\n\t/go/pkg/mod/go.temporal.io/api@v1.5.1-0.20211018190919-a5f4a169cd08/workflowservice/v1/service.pb.go:1088\ngo.temporal.io/server/common/authorization.(*interceptor).Interceptor\n\t/temporal/common/authorization/interceptor.go:152\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1113\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceCountLimitInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/namespace_count_limit.go:89\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceRateLimitInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/namespace_rate_limit.go:87\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116\ngo.temporal.io/server/common/rpc/interceptor.(*RateLimitInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/rate_limit.go:83\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/telemetry.go:108\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116\ngo.temporal.io/server/common/metrics.NewServerMetricsContextInjectorInterceptor.func1\n\t/temporal/common/metrics/grpc.go:66\ngoogle.golang.org/grpc.chainUnaryIntercep
tors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116\ngo.temporal.io/server/common/rpc.ServiceErrorInterceptor\n\t/temporal/common/rpc/grpc.go:131\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceLogInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/namespace_logger.go:83\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1118\ngo.temporal.io/api/workflowservice/v1._WorkflowService_PollWorkflowTaskQueue_Handler\n\t/go/pkg/mod/go.temporal.io/api@v1.5.1-0.20211018190919-a5f4a169cd08/workflowservice/v1/service.pb.go:1090\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1279\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1608\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:923"}\r\n”,“stream”:“stdout”,“time”:“2022-11-29T22:41:54.587357126Z”}
{“log”:“{"level":"error","ts":"2022-11-29T22:41:54.587Z","msg":"Operation failed with internal error.","error":"corrupted history event batch, eventID is not contiguous","metric-scope":231,"logging-call-at":"persistenceMetricClients.go:1229","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/common/persistence.(*metricEmitter).updateErrorMetric\n\t/temporal/common/persistence/persistenceMetricClients.go:1229\ngo.temporal.io/server/common/persistence.(*executionPersistenceClient).ReadHistoryBranch\n\t/temporal/common/persistence/persistenceMetricClients.go:855\ngo.temporal.io/server/common/persistence.ReadFullPageEvents\n\t/temporal/common/persistence/history_manager_util.go:41\ngo.temporal.io/server/service/frontend.(*WorkflowHandler).getHistory\n\t/temporal/service/frontend/workflowHandler.go:3074\ngo.temporal.io/server/service/frontend.(*WorkflowHandler).createPollWorkflowTaskQueueResponse\n\t/temporal/service/frontend/workflowHandler.go:3242\ngo.temporal.io/server/service/frontend.(*WorkflowHandler).PollWorkflowTaskQueue\n\t/temporal/service/frontend/workflowHandler.go:819\ngo.temporal.io/server/service/frontend.(*DCRedirectionHandlerImpl).PollWorkflowTaskQueue.func2\n\t/temporal/service/frontend/dcRedirectionHandler.go:532\ngo.temporal.io/server/service/frontend.(*NoopRedirectionPolicy).WithNamespaceRedirect\n\t/temporal/service/frontend/dcRedirectionPolicy.go:118\ngo.temporal.io/server/service/frontend.(*DCRedirectionHandlerImpl).PollWorkflowTaskQueue\n\t/temporal/service/frontend/dcRedirectionHandler.go:528\ngo.temporal.io/api/workflowservice/v1._WorkflowService_PollWorkflowTaskQueue_Handler.func1\n\t/go/pkg/mod/go.temporal.io/api@v1.5.1-0.20211018190919-a5f4a169cd08/workflowservice/v1/service.pb.go:1088\ngo.temporal.io/server/common/authorization.(*interceptor).Interceptor\n\t/temporal/common/authorization/interceptor.go:152\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1113\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceCountLimitInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/namespace_count_limit.go:89\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceRateLimitInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/namespace_rate_limit.go:87\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116\ngo.temporal.io/server/common/rpc/interceptor.(*RateLimitInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/rate_limit.go:83\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/telemetry.go:108\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116\ngo.temporal.io/server/common/metrics.NewServerMetricsContextInjectorInterceptor.func1\n\t/temporal/common/metrics/grpc.go:66\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116\ngo.temporal.io/server/common/rpc.ServiceErrorInterceptor\n\t/temporal/common/rpc/grpc.go:131\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116\ngo.tem
poral.io/server/common/rpc/interceptor.(*NamespaceLogInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/namespace_logger.go:83\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1118\ngo.temporal.io/api/workflowservice/v1._WorkflowService_PollWorkflowTaskQueue_Handler\n\t/go/pkg/mod/go.temporal.io/api@v1.5.1-0.20211018190919-a5f4a169cd08/workflowservice/v1/service.pb.go:1090\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1279\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1608\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:923"}\r\n”,“stream”:“stdout”,“time”:“2022-11-29T22:41:54.587415281Z”}
{“log”:“{"level":"info","ts":"2022-11-29T22:42:24.749Z","msg":"Timer Task Processor: task event ID \u003e= MS NextEventID, skip.","service":"history","shard-id":82,"address":"10.59.68.76:7234","shard-item":"0xc02249e880","component":"timer-queue-processor","cluster-name":"active","wf-history-event-id":28,"wf-next-event-id":28,"logging-call-at":"nDCTaskUtil.go:159"}\r\n”,“stream”:“stdout”,“time”:“2022-11-29T22:42:24.750203936Z”}
{“log”:“{"level":"error","ts":"2022-11-30T00:46:14.443Z","msg":"Fail to process task","service":"history","shard-id":76,"address":"10.59.68.76:7234","shard-item":"0xc02249e200","component":"timer-queue-processor","cluster-name":"active","component":"timer-queue-processor","shard-id":76,"queue-task-id":31783387140,"queue-task-visibility-timestamp":"2022-11-30T00:46:14.430Z","xdc-failover-version":0,"queue-task-type":"WorkflowTaskTimeout","wf-namespace-id":"813cad55-65fd-45e0-9863-495d7d15b813","wf-id":"SBa0221127015706758894230","wf-run-id":"62162319-489c-4a41-b4ab-0ed8a68bec29","error":"conditional update failed","lifecycle":"ProcessingFailed","logging-call-at":"taskProcessor.go:341","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/service/history.(*taskProcessor).handleTaskError\n\t/temporal/service/history/taskProcessor.go:341\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:222\ngo.temporal.io/server/common/backoff.Retry.func1\n\t/temporal/common/backoff/retry.go:104\ngo.temporal.io/server/common/backoff.RetryContext\n\t/temporal/common/backoff/retry.go:125\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:105\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:248\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:171"}\r\n”,“stream”:“stdout”,“time”:“2022-11-30T00:46:14.443468276Z”}
{“log”:“{"level":"error","ts":"2022-11-30T04:12:49.718Z","msg":"getHistory: incomplete history","service":"frontend","wf-namespace-id":"813cad55-65fd-45e0-9863-495d7d15b813","wf-id":"SBa0221126105723859278e10","wf-run-id":"d62b6714-9bfe-462e-8532-78b6de259510","error":"Incomplete history: expected events [1-289] but got events [1-272] of length 272: isFirstPage=true,isLastPage=true,pageSize=256","logging-call-at":"workflowHandler.go:3104","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/service/frontend.(*WorkflowHandler).getHistory\n\t/temporal/service/frontend/workflowHandler.go:3104\ngo.temporal.io/server/service/frontend.(*WorkflowHandler).createPollWorkflowTaskQueueResponse\n\t/temporal/service/frontend/workflowHandler.go:3242\ngo.temporal.io/server/service/frontend.(*WorkflowHandler).PollWorkflowTaskQueue\n\t/temporal/service/frontend/workflowHandler.go:819\ngo.temporal.io/server/service/frontend.(*DCRedirectionHandlerImpl).PollWorkflowTaskQueue.func2\n\t/temporal/service/frontend/dcRedirectionHandler.go:532\ngo.temporal.io/server/service/frontend.(*NoopRedirectionPolicy).WithNamespaceRedirect\n\t/temporal/service/frontend/dcRedirectionPolicy.go:118\ngo.temporal.io/server/service/frontend.(*DCRedirectionHandlerImpl).PollWorkflowTaskQueue\n\t/temporal/service/frontend/dcRedirectionHandler.go:528\ngo.temporal.io/api/workflowservice/v1._WorkflowService_PollWorkflowTaskQueue_Handler.func1\n\t/go/pkg/mod/go.temporal.io/api@v1.5.1-0.20211018190919-a5f4a169cd08/workflowservice/v1/service.pb.go:1088\ngo.temporal.io/server/common/authorization.(*interceptor).Interceptor\n\t/temporal/common/authorization/interceptor.go:152\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1113\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceCountLimitInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/namespace_count_limit.go:89\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceRateLimitInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/namespace_rate_limit.go:87\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116\ngo.temporal.io/server/common/rpc/interceptor.(*RateLimitInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/rate_limit.go:83\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/telemetry.go:108\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116\ngo.temporal.io/server/common/metrics.NewServerMetricsContextInjectorInterceptor.func1\n\t/temporal/common/metrics/grpc.go:66\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116\ngo.temporal.io/server/common/rpc.ServiceErrorInterceptor\n\t/temporal/common/rpc/grpc.go:131\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceLogInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/namespace_logger.go:83\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.gol
ang.org/grpc@v1.41.0/server.go:1116\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1118\ngo.temporal.io/api/workflowservice/v1._WorkflowService_PollWorkflowTaskQueue_Handler\n\t/go/pkg/mod/go.temporal.io/api@v1.5.1-0.20211018190919-a5f4a169cd08/workflowservice/v1/service.pb.go:1090\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1279\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1608\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:923"}\r\n”,“stream”:“stdout”,“time”:“2022-11-30T04:12:49.718995641Z”}
{“log”:“{"level":"error","ts":"2022-11-30T04:16:10.223Z","msg":"unavailable error, data loss","operation":"PollWorkflowTaskQueue","wf-namespace":"psbweb-1-namespace","error":"Incomplete history: expected events [1-290] but got events [1-272] of length 272: isFirstPage=true,isLastPage=true,pageSize=256","logging-call-at":"telemetry.go:184","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).handleError\n\t/temporal/common/rpc/interceptor/telemetry.go:184\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/telemetry.go:117\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116\ngo.temporal.io/server/common/metrics.NewServerMetricsContextInjectorInterceptor.func1\n\t/temporal/common/metrics/grpc.go:66\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116\ngo.temporal.io/server/common/rpc.ServiceErrorInterceptor\n\t/temporal/common/rpc/grpc.go:131\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceLogInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/namespace_logger.go:83\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1118\ngo.temporal.io/api/workflowservice/v1._WorkflowService_PollWorkflowTaskQueue_Handler\n\t/go/pkg/mod/go.temporal.io/api@v1.5.1-0.20211018190919-a5f4a169cd08/workflowservice/v1/service.pb.go:1090\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1279\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1608\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:923"}\r\n”,“stream”:“stdout”,“time”:“2022-11-30T04:16:10.224023729Z”}

{"log":"{"level":"error","ts":"2022-11-29T22:41:21.407Z","msg":"Persistent store operation failure","service":"matching","component":"matching-engine","wf-task-queue-name":"temporal-sys-history-scanner-taskqueue-0","wf-task-queue-type":"Workflow","store-operation":"update-task-queue","error":"Persistence Max QPS Reached." …

For resource-exhausted issues, track this server metric:
sum(rate(service_errors_resource_exhausted{}[1m])) by (operation, resource_exhausted_cause)

For matching QPS limits, the dynamic config knob is matching.persistenceMaxQPS (default 3000).
My guess here is that your sync match rate is low:
sum(rate(poll_success_sync{}[1m])) / sum(rate(poll_success{}[1m]))
A low sync match rate forces the matching service to write tasks to the database at a much higher volume; it looks like your worker processes are under-provisioned (number of workers / workflow task and activity pollers).
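
If the pollers do turn out to be the bottleneck, they can be raised per worker process; below is a minimal sketch with illustrative values only (the task queue name and numbers are placeholders to be tuned for your load):

package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	c, err := client.NewClient(client.Options{}) // defaults to localhost:7233
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// More pollers per worker (and/or more worker processes) raise the sync
	// match rate, so matching has to persist fewer tasks to the database.
	w := worker.New(c, "delivery-task-queue", worker.Options{ // placeholder task queue
		MaxConcurrentWorkflowTaskPollers: 8, // illustrative values
		MaxConcurrentActivityTaskPollers: 8,
	})
	// Register your workflows and activities here as in the existing worker.

	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker stopped:", err)
	}
}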

"error":"corrupted history event batch, eventID is not contiguous",

This looks like a data corruption issue on the DB side. Can you check the DB logs to see what could be going on? It seems to be an issue with the history_node table.

Hi Tihomir,
Could you share the steps or commands to gather those metrics?

Hi Tihomir,
I think it is related to the server shutdown during the Cassandra server migration. There was also a power outage before the migration.

So I am actually trying to find out whether the non-deterministic error was caused by the code or only by the server shutdown, and whether there is any way to prevent it from happening again. When we tried to replicate the conditions in the NFT environment, the error did not occur. This is the NDE in the production environment:

Could you share the steps or commands to gather those metrics?

Can you share how you are deploying your cluster? To enable server metrics in your static config, see the metrics section here. You would then need to scrape these endpoints with Prometheus, set up Grafana, and add your Prometheus data source.

So I am actually trying to find out whether the non-deterministic error was caused by the code or only by the server shutdown

Hard to tell, but my guess in this case is the workflow code, because the worker's internal replay is done against history events that have already been persisted. Can you share the full event history, please?
tctl wf show -w <wfid> -r <runid> --of myhistory.json
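
Once you have the exported JSON you can also replay it locally against your current workflow code; a sketch along these lines (substitute your actual workflow function) should reproduce a code-side non-deterministic error if there is one:

package main

import (
	"log"

	"go.temporal.io/sdk/worker"
)

func main() {
	// Replays the exported history against the registered workflow code.
	// A determinism problem in the code surfaces as a replay error here.
	replayer := worker.NewWorkflowReplayer()
	replayer.RegisterWorkflow(DeliverySubProcessWorkflow) // your workflow function

	if err := replayer.ReplayWorkflowHistoryFromJSONFile(nil, "myhistory.json"); err != nil {
		log.Fatalln("replay failed:", err)
	}
	log.Println("replay succeeded: history is compatible with the current workflow code")
}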

Hi Tihomir,

We use the default method for deploying the cluster (Docker build). We use the default config_template.yaml and only change the frontend/history/matching limit values for performance tuning.

Unfortunately, the transaction has already been housekept, so its history is no longer available.

Got it. If you see the non-deterministic error again, please share the history.

For server metrics, set the PROMETHEUS_ENDPOINT env var in your Docker build command to, for example, "0.0.0.0:8000" and expose this port; you can then see your metrics at localhost:8000/metrics.

You would also need to set up Prometheus with a scrape config (see example here), and if you use Grafana you could configure the pre-set data source and the server metrics dashboard (see example here).