Go SDK Troubleshooting

Chad_Retz · April 14, 2022, 3:07pm

Go Troubleshooting

What to do when you encounter a bug

When an unexpected error/bug is found in a workflow or activity, there are many ways to fix it.

The most important thing you can do is reliably replicate the bug

If you cannot reliably replicate a bug, you cannot troubleshoot it in depth, and you cannot confirm that a fix actually worked. This is especially important when contacting support. Some bugs are harder to replicate than others. Multiple approaches for replication and debugging are outlined later in Troubleshooting approaches.

Support

If you are unable to solve your bug with any of the techniques in this document and/or you believe it is the result of a bug in Temporal itself, Temporal support can be contacted via the Community Forum or Community Slack. If it’s a Temporal bug, an issue or PR can be opened on the appropriate Temporal GitHub project.

When communicating with support, to help Temporal diagnose the issue, providing the following will be very helpful:

SDK version
Server version or Temporal Cloud URL
SDK and server logs (including any stack traces)
Workflow history
Standalone code to replicate

Common problems

While using the Go SDK, you may run into problems running workflows and activities. Some of the common problems and troubleshooting approaches are listed below.

Workflow non-determinism

If a workflow follows a different code path during a replay than it did during its original execution (i.e. results in different events), it may result in a non-determinism error. This results in a panic internally which may bubble out to an error on the workflow that could contain message content like any of the following:

lookup failed for scheduledEventID to activityID: scheduleEventID: 5, activityID: 5 [...]

unable to find activityID for the scheduledEventID [...]

unknown command CommandType: [...]

nondeterministic workflow: [...]

No cached result found for side effectID[...]

This means the code path is non-deterministic. Some standard Go features cannot be used in workflows, for example time.Sleep or starting a goroutine with the go statement (see Static analysis below).

To troubleshoot or locate non-determinisms in code, see the Troubleshooting Approaches section below.

Workflow deadlock

The actual code of a workflow is meant to execute for only a few milliseconds until all coroutines reach a yield point waiting on the Temporal server (or the primary workflow returns). The server may then return more information, at which point the coroutines are resumed and again expected to run for only a few milliseconds.

Therefore, a workflow should not do anything time consuming or heavily CPU bound. This is important because workflow code is often replayed for various reasons. To enforce this, Temporal requires that the code run within 1 second or it is considered “deadlocked”.

When workflow code runs longer than a second, an internal panic will occur that will bubble out as a workflow error containing something like:

Potential deadlock detected: workflow goroutine "somename" didn't yield for over a second

This means something in the context of a workflow run is too slow. The slow code is not necessarily in the workflow function itself: a slow custom data converter can trigger the same error. A workflow should not contain CPU-bound or otherwise potentially slow code, which also means no network or disk I/O should be done in a workflow.

To troubleshoot or locate deadlocks in code, see the Troubleshooting Approaches section below.

Troubleshooting approaches

Below are several approaches to take for troubleshooting problems when using the Go SDK.

Worker metrics and logs

Workers can be configured to emit metrics and logs. By default no metrics are emitted and all logs are printed in a very basic way.

A Logger that implements the SDK's very basic Logger interface can be set on client.Options. An example implementation using zap can be found in the samples repo. Logs can be used to help diagnose issues.

Also, workers can emit several metrics. See SDK Metrics for all the metrics that are emitted. An example of emitting metrics via Prometheus can be found in the samples repo.

When a worker is struggling to run workflows or activities, logs and metrics can help determine why. See Temporal Workers Tuning for details on how to use the metrics to determine which options to set for workers.

Unit test framework

Using the unit test framework can be a good way to troubleshoot simple bugs. See Testing for how to use the test framework. Of course, many bugs may not be caught or replicated by the testing framework alone.

Local server

Despite the flexibility of unit testing, Temporal recommends all workflows and activities be integration tested as much as is reasonable. This means running real workflows and activities using a real Temporal server. This same local server can also be used for simply running workflows for debugging purposes in a development environment.

A full local Temporal server can be started quickly for use by tests or simple single-file Go apps that execute workflows. This is how the Temporal Go SDK runs its own integration tests, using the Docker Compose quick-install approach.

DataDog also maintains Temporalite which runs an embedded Temporal server using SQLite. This too is very easy to use in simple Go scripts or test suites. A temporaltest package is even made available specifically for integration tests.

See Debugging for information on how to step through code during local development.

Standalone Go files that use a localhost Temporal server are the recommended way to submit bug reproductions to the SDK team.

Replaying

When a non-determinism or deadlock occurs, often it is for specific input or in specific circumstances. Temporal offers the ability to “replay” existing history to attempt to replicate these errors. This uses the worker.WorkflowReplayer.

To use the replayer, first the history must be obtained for the workflow you would like to replay. This can be exported as JSON from the Temporal web interface or done programmatically like so:

import (
	"context"

	"go.temporal.io/api/enums/v1"
	"go.temporal.io/api/history/v1"
	"go.temporal.io/sdk/client"
)

// GetWorkflowHistory collects every event of the given workflow run into a
// single history value that can be handed to the replayer.
func GetWorkflowHistory(ctx context.Context, c client.Client, id, runID string) (*history.History, error) {
	var hist history.History
	// The fourth argument (isLongPoll) is false: read the events recorded
	// so far instead of waiting for new ones.
	iter := c.GetWorkflowHistory(ctx, id, runID, false, enums.HISTORY_EVENT_FILTER_TYPE_ALL_EVENT)
	for iter.HasNext() {
		event, err := iter.Next()
		if err != nil {
			return nil, err
		}
		hist.Events = append(hist.Events, event)
	}
	return &hist, nil
}

This history can then be used to “replay” the workflow like so:

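import (
	"context"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

// ReplayWorkflow fetches the history of an existing run and replays it
// against the current workflow code.
func ReplayWorkflow(ctx context.Context, c client.Client, id, runID string) error {
	hist, err := GetWorkflowHistory(ctx, c, id, runID)
	if err != nil {
		return err
	}
	replayer := worker.NewWorkflowReplayer()
	// MyWorkflow is the workflow function that produced the history.
	replayer.RegisterWorkflow(MyWorkflow)
	// The first argument is the logger to use during the replay.
	return replayer.ReplayWorkflowHistory(nil, hist)
}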

This will run the exact same history that was generated in the original run. If a noticeably different code path was followed or some code caused a deadlock, it will be reported. Many larger companies using Temporal regularly download and execute past histories to ensure their past workflows work as expected even if minor changes are made.

Replaying a workflow locally is a good way to see exactly what code path was taken for given input and events. This is especially beneficial when manually debugging. See Debugging for information.

Debugging

A normal Go IDE debugger can be used with Temporal. This is helpful when running workflows on a local server or replaying or even during unit tests. In order to not have the workflow seen as deadlocking while stepping through code in the debugger, the TEMPORAL_DEBUG environment variable must be set to true.

Debugging with a debugger is especially helpful when replaying. Taking an already executed workflow’s history and running it in a replayer with debugging will allow you to step through the actual workflow code that executed when the workflow was originally run, complete with the same local values and in the same order.

Static analysis

Non-determinism can be hard to catch while writing workflows. The Go SDK does not have a restricted runtime, so nothing prevents workflow code from calling time.Sleep or starting a new goroutine with the go statement. Using those or any other invalid Go constructs in a workflow can lead to ugly non-determinism errors.

To help catch these early, the workflowcheck static analysis tool has been created. It attempts to find all invalid code that can be called from inside a workflow. Please see the workflowcheck README for details on how to use it.

Stack traces

Sometimes a workflow is stuck waiting on something external to happen (e.g. a result from an activity or a signal). A special query can be made to the workflow to dump the Go stack of all coroutines that are currently waiting in the workflow.

From the tctl CLI, this can be called via tctl workflow stack. This can also be done programmatically by issuing a normal client.Client.QueryWorkflow call using the client.QueryTypeStackTrace query name and expecting a string response. Viewing the stack traces of the coroutines can make it clear where in code a workflow is currently paused.
