Activity retries and alerting

Hi!

I’ve spent a good part of the last few months learning about Temporal, and now that I’m planning to deploy it, I have some questions about retrying an activity. I read through several topics without finding an answer, so here I am!

Let’s say I have a simple workflow that starts an Activity A.
Activity A just calls an external HTTP API.

From what I understand from the different guidelines I read, the retry policy for Activity A should have no maximum number of attempts, so if the external API is down, we just keep retrying until it is back up. This approach looks good to me.

But how can I be alerted if there are too many retries for the activity? Is this something I should handle on my own?

I want to easily know when the activity is failing and retrying indefinitely, so I can tell whether the external API is down, whether I made a mistake in the URL I’m calling, etc.
For example, I’m looking for a way to list “all workflows where an activity has been retried more than 5 times”.

Setting the maximum attempts in Activity A’s retry policy to 5 would cover this use case, because I could easily list the failed workflows, but from what I read it does not seem to be the best approach.

The goal is to notice when something strange is happening during an activity execution and to find the root cause.

Thanks!

For a single workflow, you can look at the Web UI summary page for that workflow. The information under “Pending Activities” includes the activity type, the retry attempt count, and the last failure info.

The same goes for the tctl “describe” command, for example:

tctl wf desc -w <my_workflow_id>

Note that you can also get the retry attempt inside your activity code, for example using the Java SDK:

Activity.getExecutionContext().getInfo().getAttempt();
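
Building on the attempt number available inside the activity, one way to get alerted without capping retries is to emit a log line or metric once the attempt count passes a threshold. The sketch below is only an illustration: RetryAlertPolicy and its threshold and repeatEvery parameters are hypothetical names, not part of the Temporal SDK, and the attempt value is assumed to come from the getAttempt() call above (the first attempt is 1).

```java
/**
 * Hypothetical helper (not part of the Temporal SDK) that decides whether a
 * given activity attempt should raise an alert. It fires on the first attempt
 * above the threshold and then once every `repeatEvery` attempts, so a long
 * outage does not produce one alert per retry.
 */
final class RetryAlertPolicy {
    private final int threshold;
    private final int repeatEvery;

    RetryAlertPolicy(int threshold, int repeatEvery) {
        if (threshold < 1 || repeatEvery < 1) {
            throw new IllegalArgumentException("threshold and repeatEvery must be >= 1");
        }
        this.threshold = threshold;
        this.repeatEvery = repeatEvery;
    }

    /** @param attempt attempt number as reported by the activity context, starting at 1 */
    boolean shouldAlert(int attempt) {
        if (attempt <= threshold) {
            return false; // still within the tolerated number of attempts
        }
        // 0 for the first attempt above the threshold, then 1, 2, ...
        int attemptsOverThreshold = attempt - threshold - 1;
        return attemptsOverThreshold % repeatEvery == 0;
    }

    public static void main(String[] args) {
        RetryAlertPolicy policy = new RetryAlertPolicy(5, 10);
        // In a real activity the attempt would be read from the activity context.
        for (int attempt : new int[] {1, 5, 6, 7, 16}) {
            System.out.println("attempt " + attempt + " -> alert: " + policy.shouldAlert(attempt));
        }
    }
}
```

Inside the activity, a true result would be the place to log at error level or increment a metric that your monitoring system alerts on.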

For all workflows in a namespace, you can use the SDK client API to find all open workflows that have pending activities with more than X attempts, for example:

// Assumes `client` (WorkflowClient) and `service` (WorkflowServiceStubs) are existing static fields.
// Needs imports from io.temporal.api.workflow.v1 and io.temporal.api.workflowservice.v1.
private static void getActivitiesWithRetriesOver(int retryCount) {
    String namespace = client.getOptions().getNamespace();

    // List open workflows. Only the first page is read here;
    // follow getNextPageToken() on the response to fetch the rest.
    ListOpenWorkflowExecutionsRequest listRequest =
            ListOpenWorkflowExecutionsRequest.newBuilder()
                    .setNamespace(namespace)
                    .build();
    ListOpenWorkflowExecutionsResponse listResponse =
            service.blockingStub().listOpenWorkflowExecutions(listRequest);

    for (WorkflowExecutionInfo info : listResponse.getExecutionsList()) {
        // Describe each open workflow to get its pending activities.
        DescribeWorkflowExecutionRequest describeRequest =
                DescribeWorkflowExecutionRequest.newBuilder()
                        .setNamespace(namespace)
                        .setExecution(info.getExecution())
                        .build();
        DescribeWorkflowExecutionResponse describeResponse =
                service.blockingStub().describeWorkflowExecution(describeRequest);

        for (PendingActivityInfo activityInfo : describeResponse.getPendingActivitiesList()) {
            if (activityInfo.getAttempt() > retryCount) {
                System.out.println("Workflow id: " + info.getExecution().getWorkflowId());
                System.out.println("Activity type: " + activityInfo.getActivityType().getName());
                System.out.println("Activity attempt: " + activityInfo.getAttempt());
                System.out.println("Last failure message: " + activityInfo.getLastFailure().getMessage());
                // ...
            }
        }
    }
}

Yes, you should rely on timeouts rather than RetryOptions->maximumAttempts. By default, retries continue up to the activity ScheduleToCloseTimeout if it is defined. If it is not defined, they can continue up to the workflow run/execution timeout. If that is not defined either, the retries are “unlimited”.
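
To get a feel for how a timeout bounds the number of retries, here is a rough sketch that counts how many attempts fit into a given ScheduleToCloseTimeout. It is a simplified model, not SDK code: RetryBudget and attemptsWithin are hypothetical names, every attempt is assumed to fail immediately (so only backoff delays consume time), and the numbers used in main assume the default retry policy values of a 1 second initial interval, a backoff coefficient of 2.0, and a maximum interval of 100 times the initial interval.

```java
import java.time.Duration;

/** Hypothetical helper (not part of the Temporal SDK) modelling exponential backoff. */
final class RetryBudget {
    private RetryBudget() {}

    /**
     * Counts the attempts that can start within `budget`, assuming each attempt
     * fails instantly and the next one is scheduled after the backoff delay.
     */
    static int attemptsWithin(Duration initialInterval, double backoffCoefficient,
                              Duration maximumInterval, Duration budget) {
        if (initialInterval.isZero() || initialInterval.isNegative() || backoffCoefficient < 1.0) {
            throw new IllegalArgumentException("need a positive interval and a coefficient >= 1");
        }
        long budgetMillis = budget.toMillis();
        long capMillis = maximumInterval.toMillis();
        double delayMillis = Math.min(initialInterval.toMillis(), capMillis);
        long elapsedMillis = 0;
        int attempts = 1; // the first attempt starts at time zero

        while (true) {
            long wait = (long) delayMillis;
            if (elapsedMillis + wait > budgetMillis) {
                return attempts; // the next retry would start after the timeout
            }
            elapsedMillis += wait;
            attempts++;
            // Grow the delay exponentially, but never beyond the maximum interval.
            delayMillis = Math.min(delayMillis * backoffCoefficient, capMillis);
        }
    }

    public static void main(String[] args) {
        Duration initial = Duration.ofSeconds(1);
        Duration max = initial.multipliedBy(100);
        System.out.println(attemptsWithin(initial, 2.0, max, Duration.ofSeconds(60)));
    }
}
```

Under these assumptions a 60 second timeout allows 6 attempts (started at 0, 1, 3, 7, 15 and 31 seconds). Real activities take time to run, so actual counts will be lower.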

You can also control which types of failures cause retries. You specify the failures that should not be retried by adding them to ActivityOptions->RetryOptions->DoNotRetry. For example, if you do not want your activity to retry on IllegalArgumentException:


ActivityOptions.newBuilder()
    .setRetryOptions(RetryOptions.newBuilder()
        .setDoNotRetry(IllegalArgumentException.class.getName())
        .build())
    .build();

Another option is to throw a non-retryable application failure inside your activity, created via ApplicationFailure.newNonRetryableFailure.

With that, along with the ability to get the retry attempt inside activity code, you can decide, based on your business logic, at what point retries should stop, and then perform compensation logic in your workflow or whatever else you need to do.
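
For the HTTP case from the original question, a minimal sketch of that kind of decision could look like the following. It separates failures worth retrying (the API is down or throttling) from ones that will never succeed on their own (for example a wrong URL). HttpFailureClassifier and isRetryable are hypothetical names, and the choice of status codes is an assumption to adapt to the API you call.

```java
/** Hypothetical helper (not part of the Temporal SDK) for classifying failed HTTP calls. */
final class HttpFailureClassifier {
    private HttpFailureClassifier() {}

    /**
     * Returns true if a failed call with this status code is worth retrying.
     * Assumption: server errors, request timeouts and throttling are transient,
     * while other client errors (such as 404 for a wrong URL) are not.
     */
    static boolean isRetryable(int statusCode) {
        if (statusCode == 408 || statusCode == 429) {
            return true; // request timeout or rate limited: try again later
        }
        return statusCode >= 500 && statusCode <= 599; // server-side failure
    }

    public static void main(String[] args) {
        for (int status : new int[] {404, 429, 503}) {
            // In an activity, the non-retryable branch is where a failure created
            // via ApplicationFailure.newNonRetryableFailure would be thrown.
            System.out.println(status + " retryable: " + isRetryable(status));
        }
    }
}
```

With a split like this, a mistyped URL fails the activity right away and shows up as a failed workflow, while an outage keeps retrying until the API is back.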

In the end, having automatic retries is super helpful, because you can change your activity code and its activity options (and restart the worker) to fix errors without breaking workflow determinism.

Thanks for your answer!

Just a small translation of your response into Go (as it’s the language I’m using):

	// List open workflows in the namespace (first page only).
	openWorkflows, err := client.ListOpenWorkflow(context.Background(), &workflowservice.ListOpenWorkflowExecutionsRequest{
		Namespace: "default",
	})
	if err != nil {
		log.Fatalln("failed to list open workflows", err)
	}

	for _, openWorkflow := range openWorkflows.GetExecutions() {
		// Describe each workflow to get its pending activities.
		describe, err := client.DescribeWorkflowExecution(context.Background(), openWorkflow.GetExecution().GetWorkflowId(), openWorkflow.GetExecution().GetRunId())
		if err != nil {
			log.Fatalln("failed to describe workflow", err)
		}

		for _, pendingActivity := range describe.GetPendingActivities() {
			// Use the nil-safe getters: LastFailure is unset before the first failure.
			log.Println(pendingActivity.GetAttempt(), pendingActivity.GetActivityType().GetName(), pendingActivity.GetLastFailure().GetMessage())
		}
	}

Nice! Much less verbose indeed :)

Hey, can you please help me do the same (getting the number of retries inside activity code) using TypeScript?
Thank you.

Hello @kishore_kumar

In TypeScript you have client.workflowService.describeWorkflowExecution, which returns a DescribeWorkflowExecutionResponse that contains pendingActivities. For each pending activity you can read its attempt.

Antonio

This seems to make a separate request for each workflow. At what point does this become a problem or hit rate limits?