Workflow with promises is not recovering gracefully when restarting client

I’ve been working on a POC for a DSL driven workflow using the java sdk. It executes multiple Promises using Async.function and awaits their completion via Promise.allOf.

It has been working as expected, however, I’ve discovered that if I restart the temporal client while the workflow is running, the (restarted) workflow run repeatedly throws an error as it attempts to resume where it left off. The error will occur at different points depending on where the original workflow run was cut off. The error doesn’t surface in the logs. Instead I just see the same segment of the workflow run repeating. But after debugging I was able to locate the error, which is thrown at POJOWorkflowImplementationFactory.class:236

io.temporal.internal.sync.DestroyWorkflowThreadError
	at io.temporal.internal.sync.WorkflowThreadContext.yield(WorkflowThreadContext.java:87)
	at io.temporal.internal.sync.WorkflowThreadImpl.yield(WorkflowThreadImpl.java:387)
	at io.temporal.internal.sync.WorkflowThread.await(WorkflowThread.java:46)
	at io.temporal.internal.statemachines.WorkflowStateMachines.checkEventLoopExecuting(WorkflowStateMachines.java:1043)
	at io.temporal.internal.statemachines.WorkflowStateMachines.randomUUID(WorkflowStateMachines.java:700)
	at io.temporal.internal.replay.ReplayWorkflowContextImpl.scheduleActivityTask(ReplayWorkflowContextImpl.java:193)
	at io.temporal.internal.sync.SyncWorkflowContext.executeActivityOnce(SyncWorkflowContext.java:257)
	at io.temporal.internal.sync.SyncWorkflowContext.executeActivity(SyncWorkflowContext.java:238)
	at io.temporal.internal.sync.ActivityStubImpl.executeAsync(ActivityStubImpl.java:50)
	at io.temporal.internal.sync.ActivityStubBase.execute(ActivityStubBase.java:39)
	at io.temporal.internal.sync.ActivityInvocationHandler.lambda$getActivityFunc$0(ActivityInvocationHandler.java:78)
	at io.temporal.internal.sync.ActivityInvocationHandlerBase.invoke(ActivityInvocationHandlerBase.java:60)
	at jdk.proxy2/jdk.proxy2.$Proxy147.registerExecutionRun(Unknown Source)
	at com.code.workflow.OperationWorkflowImpl.initRun(OperationWorkflowImpl.java:125)
	at com.code.workflow.OperationWorkflowImpl.execute(OperationWorkflowImpl.java:75)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:578)
	at io.temporal.internal.sync.POJOWorkflowImplementationFactory$POJOWorkflowImplementation$RootWorkflowInboundCallsInterceptor.execute(POJOWorkflowImplementationFactory.java:302)
	at io.temporal.internal.sync.POJOWorkflowImplementationFactory$POJOWorkflowImplementation.execute(POJOWorkflowImplementationFactory.java:277)
	at io.temporal.internal.sync.WorkflowExecutionHandler.runWorkflowMethod(WorkflowExecutionHandler.java:70)
	at io.temporal.internal.sync.SyncWorkflow.lambda$start$0(SyncWorkflow.java:116)
	at io.temporal.internal.sync.CancellationScopeImpl.run(CancellationScopeImpl.java:102)
	at io.temporal.internal.sync.WorkflowThreadImpl$RunnableWrapper.run(WorkflowThreadImpl.java:106)
	at io.temporal.worker.ActiveThreadReportingExecutor.lambda$submit$0(ActiveThreadReportingExecutor.java:53)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:577)
	at java.base/java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:317)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1589)

Is there something I should be doing to ensure that workflows that use Promises recover gracefully?

Are you by chance catching Throwable or Error in your workflow code?

DestroyWorkflowThreadError is used internally by SDK to release threads when a workflow exec is removed from its cache.

Instead I just see the same segment of the workflow run repeating

Can you share the execution event history?

I don’t believe I’m catching errors in a problematic way. Here’s what the activity looks like that I’m executing via promises:

@Override
    public String executeAction(String temporalId, Action action) {

        logger.info("starting action");
        String actionString = "%s - action %s - async %s"
                .formatted(
                        temporalId,
                        action.getName(),
                        action.getAsync()
                );

        for (int i = 0; i < 3; i++) {
            logger.info(
                    "Executing {} - step {}",
                    actionString,
                    i
            );
            try {
                Thread.sleep(action.getTimeout());
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        return "Completed " + actionString;
    }

I noticed that the error occurred when executing a certain activity method which I use to create a log for the run in the db, and then another for getting a dsl from the db. If I remove all of those activity method calls and just call the above executeAction method, I no longer get the error that I mentioned after restarting. However, the workflow still appears to be stuck: it doesn’t complete after restart, and new runs are not started (it’s on a cron schedule).

Below I’m including the workflow function that I’m executing (which I’ve simplified a bit but I’ve confirmed that it behaves the same way) and the event history that I see when a workflow gets stuck (after restarting the client).

@Override
    public WorkflowResult execute(WorkflowExecutionConfig workflowExecutionConfig) {

        this.workflowExecutionConfig = workflowExecutionConfig;
        this.workflowExecutionStatus = WorkflowExecutionStatus.RUNNING;
        this.exception = null;

        this.workflowId = Workflow.getInfo().getWorkflowId();
        this.runId = Workflow.getInfo().getRunId();
        this.temporalId = String.format("%s:%s", this.workflowId, this.runId);

        logger.info(
                "Executing Workflow Id: {}; Run Id: {}",
                this.workflowId,
                this.runId
        );

        initRun(workflowExecutionConfig);

        List<Promise<String>> promiseList = new ArrayList<>();

        promiseList.add(
                Async.function(
                        this.workflowOperationActivity::test,
                        temporalId,
                        "action 1"
                )
        );

        promiseList.add(
                Async.function(
                        this.workflowOperationActivity::test,
                        temporalId,
                        "action 2"
                )
        );

        Promise.allOf(promiseList);

        for (Promise<String> promise : promiseList) {
            if (promise.getFailure() == null) {
                logger.info(promise.get());
            } else {
                logger.error("An error occurred during action execution: ", promise.getFailure());
            }
        }

        List<Promise<String>> promiseList2 = new ArrayList<>();

        promiseList2.add(
                Async.function(
                        this.workflowOperationActivity::test,
                        temporalId,
                        "action 3"
                )
        );

        Promise.allOf(promiseList2);

        for (Promise<String> promise : promiseList2) {
            if (promise.getFailure() == null) {
                logger.info(promise.get());
            } else {
                logger.error("An error occurred during action execution: ", promise.getFailure());
            }
        }

        completeExecutionRun(WorkflowExecutionStatus.COMPLETE);

        return new WorkflowResult(
                "done"
        );
    }

{
  "events": [
    {
      "eventId": "1",
      "eventTime": "2023-02-08T16:45:40.421158545Z",
      "eventType": "WorkflowExecutionStarted",
      "version": "0",
      "taskId": "36703395",
      "workerMayIgnore": false,
      "workflowExecutionStartedEventAttributes": {
        "workflowType": {
          "name": "OperationWorkflow"
        },
        "parentWorkflowNamespace": "",
        "parentWorkflowNamespaceId": "",
        "parentWorkflowExecution": null,
        "parentInitiatedEventId": "0",
        "taskQueue": {
          "name": "OPERATION_TASK_QUEUE",
          "kind": "Normal"
        },
        "input": {
          "payloads": [
            {
              "metadata": {
                "encoding": "anNvbi9wbGFpbg=="
              },
              "data": "eyJ0ZW1wb3JhbElkIjoiT3BlcmF0aW9uV29ya2Zsb3ctMS1jLTEiLCJhc21JZCI6MSwiY2xpZW50SWQiOjEsIm5hbWUiOiJ0ZXN0IDEiLCJwYXJhbXMiOnt9LCJ0eXBlIjoiY29tLm5ldHNwaS50ZW1wb3JhbC5jbGllbnQud29ya2Zsb3cuT3BlcmF0aW9uV29ya2Zsb3ciLCJkZXNjcmlwdGlvbiI6IlNjYW4gV29ya2Zsb3cgdGVzdCAxIGZvciBjbGllbnQgQGV2ZXJ5IDE1cywgY3JlYXRlZCAyMDIzLTAxLTE5IDA1OjQ1OjU1LjIyNiIsImNyb25TY2hlZHVsZSI6IkBldmVyeSAxNXMiLCJ0YXNrUXVldWVOYW1lIjoiT1BFUkFUSU9OX1RBU0tfUVVFVUUiLCJvcGVyYXRpb25Ec2wiOnsibmFtZSI6InRlc3Qgd29ya2Zsb3ciLCJ2ZXJzaW9uIjoiMS4wIiwidXBkYXRlZEF0IjoiMjAyMy0wMi0wNlQyMDoxODoyNy4zNDkrMDA6MDAiLCJhY3Rpb25zIjpbeyJuYW1lIjoiYWN0aW9uIDEiLCJvcmRpbmFsIjowLCJ0aW1lb3V0IjozMDAwLCJhc3luYyI6ZmFsc2UsImNoaWxkcmVuIjpbeyJuYW1lIjoiYWN0aW9uIDEuMiIsIm9yZGluYWwiOjAsInRpbWVvdXQiOjMwMDAsImFzeW5jIjpmYWxzZSwiY2hpbGRyZW4iOm51bGx9LHsibmFtZSI6ImFjdGlvbiAxLjJiIiwib3JkaW5hbCI6MCwidGltZW91dCI6MzAwMCwiYXN5bmMiOmZhbHNlLCJjaGlsZHJlbiI6bnVsbH1dfSx7Im5hbWUiOiJhY3Rpb24gMiBhc3luYyIsIm9yZGluYWwiOjAsInRpbWVvdXQiOjIwMDAwLCJhc3luYyI6dHJ1ZSwiY2hpbGRyZW4iOm51bGx9LHsibmFtZSI6ImFjdGlvbiAzIiwib3JkaW5hbCI6MSwidGltZW91dCI6MzAwMCwiYXN5bmMiOmZhbHNlLCJjaGlsZHJlbiI6bnVsbH1dfX0="
            }
          ]
        },
        "workflowExecutionTimeout": "0s",
        "workflowRunTimeout": "0s",
        "workflowTaskTimeout": "10s",
        "continuedExecutionRunId": "fb350694-11af-4d2d-b895-cf8baff258dc",
        "initiator": "CronSchedule",
        "continuedFailure": null,
        "lastCompletionResult": {
          "payloads": [
            {
              "metadata": {
                "encoding": "anNvbi9wbGFpbg=="
              },
              "data": "eyJyZXN1bHQiOiJkb25lXG4ifQ=="
            }
          ]
        },
        "originalExecutionRunId": "afe3f47b-3df6-4cc7-ade3-4391ea45cfa4",
        "identity": "",
        "firstExecutionRunId": "fb350694-11af-4d2d-b895-cf8baff258dc",
        "retryPolicy": null,
        "attempt": 1,
        "workflowExecutionExpirationTime": null,
        "cronSchedule": "@every 15s",
        "firstWorkflowTaskBackoff": "5s",
        "memo": null,
        "searchAttributes": {
          "indexedFields": {
            "testId": {
              "metadata": {
                "encoding": "anNvbi9wbGFpbg==",
                "type": "SW50"
              },
              "data": "MQ=="
            },
            "clientId": {
              "metadata": {
                "encoding": "anNvbi9wbGFpbg==",
                "type": "SW50"
              },
              "data": "MQ=="
            }
          }
        },
        "prevAutoResetPoints": {
          "points": []
        },
        "header": {
          "fields": {}
        },
        "parentInitiatedEventVersion": "0"
      }
    },
    {
      "eventId": "2",
      "eventTime": "2023-02-08T16:45:45.422540471Z",
      "eventType": "WorkflowTaskScheduled",
      "version": "0",
      "taskId": "36703402",
      "workerMayIgnore": false,
      "workflowTaskScheduledEventAttributes": {
        "taskQueue": {
          "name": "OPERATION_TASK_QUEUE",
          "kind": "Normal"
        },
        "startToCloseTimeout": "10s",
        "attempt": 1
      }
    },
    {
      "eventId": "3",
      "eventTime": "2023-02-08T16:45:45.431392916Z",
      "eventType": "WorkflowTaskStarted",
      "version": "0",
      "taskId": "36703405",
      "workerMayIgnore": false,
      "workflowTaskStartedEventAttributes": {
        "scheduledEventId": "2",
        "identity": "124964@test-1094",
        "requestId": "d4f56569-1627-4185-a66b-0b99442ca844",
        "suggestContinueAsNew": false,
        "historySizeBytes": "0"
      }
    },
    {
      "eventId": "4",
      "eventTime": "2023-02-08T16:45:45.447092403Z",
      "eventType": "WorkflowTaskCompleted",
      "version": "0",
      "taskId": "36703409",
      "workerMayIgnore": false,
      "workflowTaskCompletedEventAttributes": {
        "scheduledEventId": "2",
        "startedEventId": "3",
        "identity": "124964@test-1094",
        "binaryChecksum": "",
        "workerVersioningId": null
      }
    },
    {
      "eventId": "5",
      "eventTime": "2023-02-08T16:45:45.447122791Z",
      "eventType": "ActivityTaskScheduled",
      "version": "0",
      "taskId": "36703410",
      "workerMayIgnore": false,
      "activityTaskScheduledEventAttributes": {
        "activityId": "cf368111-b4e4-33eb-9d2d-fb2bf5377cbc",
        "activityType": {
          "name": "RegisterExecutionRun"
        },
        "taskQueue": {
          "name": "OPERATION_TASK_QUEUE",
          "kind": "Normal"
        },
        "header": {
          "fields": {}
        },
        "input": {
          "payloads": [
            {
              "metadata": {
                "encoding": "anNvbi9wbGFpbg=="
              },
              "data": "ImFmZTNmNDdiLTNkZjYtNGNjNy1hZGUzLTQzOTFlYTQ1Y2ZhNCI="
            },
            {
              "metadata": {
                "encoding": "anNvbi9wbGFpbg=="
              },
              "data": "MQ=="
            },
            {
              "metadata": {
                "encoding": "anNvbi9wbGFpbg=="
              },
              "data": "IlJVTk5JTkci"
            }
          ]
        },
        "scheduleToCloseTimeout": "600s",
        "scheduleToStartTimeout": "600s",
        "startToCloseTimeout": "600s",
        "heartbeatTimeout": "0s",
        "workflowTaskCompletedEventId": "4",
        "retryPolicy": {
          "initialInterval": "1s",
          "backoffCoefficient": 2,
          "maximumInterval": "100s",
          "maximumAttempts": 0,
          "nonRetryableErrorTypes": []
        }
      }
    },
    {
      "eventId": "6",
      "eventTime": "2023-02-08T16:45:45.473452655Z",
      "eventType": "ActivityTaskStarted",
      "version": "0",
      "taskId": "36703416",
      "workerMayIgnore": false,
      "activityTaskStartedEventAttributes": {
        "scheduledEventId": "5",
        "identity": "124964@test-1094",
        "requestId": "9580861e-fa35-45df-938d-2cddf6952e49",
        "attempt": 1,
        "lastFailure": null
      }
    },
    {
      "eventId": "7",
      "eventTime": "2023-02-08T16:45:45.500327478Z",
      "eventType": "ActivityTaskCompleted",
      "version": "0",
      "taskId": "36703417",
      "workerMayIgnore": false,
      "activityTaskCompletedEventAttributes": {
        "result": {
          "payloads": [
            {
              "metadata": {
                "encoding": "anNvbi9wbGFpbg=="
              },
              "data": "NTE1"
            }
          ]
        },
        "scheduledEventId": "5",
        "startedEventId": "6",
        "identity": "124964@test-1094"
      }
    },
    {
      "eventId": "8",
      "eventTime": "2023-02-08T16:45:45.500334672Z",
      "eventType": "WorkflowTaskScheduled",
      "version": "0",
      "taskId": "36703418",
      "workerMayIgnore": false,
      "workflowTaskScheduledEventAttributes": {
        "taskQueue": {
          "name": "124964@test-1094:ceca17c2-a27c-459e-b803-694751236141",
          "kind": "Sticky"
        },
        "startToCloseTimeout": "10s",
        "attempt": 1
      }
    },
    {
      "eventId": "9",
      "eventTime": "2023-02-08T16:45:45.517118021Z",
      "eventType": "WorkflowTaskStarted",
      "version": "0",
      "taskId": "36703422",
      "workerMayIgnore": false,
      "workflowTaskStartedEventAttributes": {
        "scheduledEventId": "8",
        "identity": "124964@test-1094",
        "requestId": "84dd8d87-5362-4887-976a-9a1d17b5eb23",
        "suggestContinueAsNew": false,
        "historySizeBytes": "0"
      }
    },
    {
      "eventId": "10",
      "eventTime": "2023-02-08T16:45:45.535409482Z",
      "eventType": "WorkflowTaskCompleted",
      "version": "0",
      "taskId": "36703426",
      "workerMayIgnore": false,
      "workflowTaskCompletedEventAttributes": {
        "scheduledEventId": "8",
        "startedEventId": "9",
        "identity": "124964@test-1094",
        "binaryChecksum": "",
        "workerVersioningId": null
      }
    },
    {
      "eventId": "11",
      "eventTime": "2023-02-08T16:45:45.535436704Z",
      "eventType": "ActivityTaskScheduled",
      "version": "0",
      "taskId": "36703427",
      "workerMayIgnore": false,
      "activityTaskScheduledEventAttributes": {
        "activityId": "b28bc473-76e4-32c1-ac0d-95ec80db9560",
        "activityType": {
          "name": "GetOperationDsl"
        },
        "taskQueue": {
          "name": "OPERATION_TASK_QUEUE",
          "kind": "Normal"
        },
        "header": {
          "fields": {}
        },
        "input": {
          "payloads": [
            {
              "metadata": {
                "encoding": "anNvbi9wbGFpbg=="
              },
              "data": "MQ=="
            },
            {
              "metadata": {
                "encoding": "anNvbi9wbGFpbg=="
              },
              "data": "MQ=="
            }
          ]
        },
        "scheduleToCloseTimeout": "600s",
        "scheduleToStartTimeout": "600s",
        "startToCloseTimeout": "600s",
        "heartbeatTimeout": "0s",
        "workflowTaskCompletedEventId": "10",
        "retryPolicy": {
          "initialInterval": "1s",
          "backoffCoefficient": 2,
          "maximumInterval": "100s",
          "maximumAttempts": 0,
          "nonRetryableErrorTypes": []
        }
      }
    },
    {
      "eventId": "12",
      "eventTime": "2023-02-08T16:45:45.548948089Z",
      "eventType": "ActivityTaskStarted",
      "version": "0",
      "taskId": "36703433",
      "workerMayIgnore": false,
      "activityTaskStartedEventAttributes": {
        "scheduledEventId": "11",
        "identity": "124964@test-1094",
        "requestId": "f08bf1bc-6000-4ca1-8552-5c542236ff31",
        "attempt": 1,
        "lastFailure": null
      }
    },
    {
      "eventId": "13",
      "eventTime": "2023-02-08T16:45:45.566799784Z",
      "eventType": "ActivityTaskCompleted",
      "version": "0",
      "taskId": "36703434",
      "workerMayIgnore": false,
      "activityTaskCompletedEventAttributes": {
        "result": {
          "payloads": [
            {
              "metadata": {
                "encoding": "anNvbi9wbGFpbg=="
              },
              "data": "IntcbiAgICBcIm5hbWVcIjogXCJ0ZXN0IHdvcmtmbG93XCIsXG4gICAgXCJ2ZXJzaW9uXCI6IFwiMS4wXCIsXG4gICAgXCJ1cGRhdGVkQXRcIjogMTY3NTcxNDcwNzM0OSxcbiAgICBcImFjdGlvbnNcIjogW3tcbiAgICAgICAgICAgIFwibmFtZVwiOiBcImFjdGlvbiAxXCIsXG4gICAgICAgICAgICBcIm9yZGluYWxcIjogMCxcbiAgICAgICAgICAgIFwidGltZW91dFwiOiAzMDAwLFxuICAgICAgICAgICAgXCJhc3luY1wiOiBmYWxzZSxcbiAgICAgICAgICAgIFwiY2hpbGRyZW5cIjogW3tcbiAgICAgICAgICAgICAgICAgICAgXCJuYW1lXCI6IFwiYWN0aW9uIDEuMlwiLFxuICAgICAgICAgICAgICAgICAgICBcIm9yZGluYWxcIjogMCxcbiAgICAgICAgICAgICAgICAgICAgXCJ0aW1lb3V0XCI6IDMwMDAsXG4gICAgICAgICAgICAgICAgICAgIFwiYXN5bmNcIjogZmFsc2UsXG4gICAgICAgICAgICAgICAgICAgIFwiY2hpbGRyZW5cIjogbnVsbFxuICAgICAgICAgICAgICAgIH0sIHtcbiAgICAgICAgICAgICAgICAgICAgXCJuYW1lXCI6IFwiYWN0aW9uIDEuMmJcIixcbiAgICAgICAgICAgICAgICAgICAgXCJvcmRpbmFsXCI6IDAsXG4gICAgICAgICAgICAgICAgICAgIFwidGltZW91dFwiOiAzMDAwLFxuICAgICAgICAgICAgICAgICAgICBcImFzeW5jXCI6IGZhbHNlLFxuICAgICAgICAgICAgICAgICAgICBcImNoaWxkcmVuXCI6IG51bGxcbiAgICAgICAgICAgICAgICB9XG4gICAgICAgICAgICBdXG4gICAgICAgIH0sIHtcbiAgICAgICAgICAgIFwibmFtZVwiOiBcImFjdGlvbiAyIGFzeW5jXCIsXG4gICAgICAgICAgICBcIm9yZGluYWxcIjogMCxcbiAgICAgICAgICAgIFwidGltZW91dFwiOiAyMDAwMCxcbiAgICAgICAgICAgIFwiYXN5bmNcIjogdHJ1ZSxcbiAgICAgICAgICAgIFwiY2hpbGRyZW5cIjogbnVsbFxuICAgICAgICB9LCB7XG4gICAgICAgICAgICBcIm5hbWVcIjogXCJhY3Rpb24gM1wiLFxuICAgICAgICAgICAgXCJvcmRpbmFsXCI6IDEsXG4gICAgICAgICAgICBcInRpbWVvdXRcIjogMzAwMCxcbiAgICAgICAgICAgIFwiYXN5bmNcIjogZmFsc2UsXG4gICAgICAgICAgICBcImNoaWxkcmVuXCI6IG51bGxcbiAgICAgICAgfVxuICAgIF1cbn0i"
            }
          ]
        },
        "scheduledEventId": "11",
        "startedEventId": "12",
        "identity": "124964@test-1094"
      }
    },
    {
      "eventId": "14",
      "eventTime": "2023-02-08T16:45:45.566805645Z",
      "eventType": "WorkflowTaskScheduled",
      "version": "0",
      "taskId": "36703435",
      "workerMayIgnore": false,
      "workflowTaskScheduledEventAttributes": {
        "taskQueue": {
          "name": "124964@test-1094:ceca17c2-a27c-459e-b803-694751236141",
          "kind": "Sticky"
        },
        "startToCloseTimeout": "10s",
        "attempt": 1
      }
    },
    {
      "eventId": "15",
      "eventTime": "2023-02-08T16:45:45.577978226Z",
      "eventType": "WorkflowTaskStarted",
      "version": "0",
      "taskId": "36703439",
      "workerMayIgnore": false,
      "workflowTaskStartedEventAttributes": {
        "scheduledEventId": "14",
        "identity": "124964@test-1094",
        "requestId": "cfd34f25-4867-4dd1-8250-d40e8fcc461c",
        "suggestContinueAsNew": false,
        "historySizeBytes": "0"
      }
    },
    {
      "eventId": "16",
      "eventTime": "2023-02-08T16:45:45.599749368Z",
      "eventType": "WorkflowTaskCompleted",
      "version": "0",
      "taskId": "36703443",
      "workerMayIgnore": false,
      "workflowTaskCompletedEventAttributes": {
        "scheduledEventId": "14",
        "startedEventId": "15",
        "identity": "124964@test-1094",
        "binaryChecksum": "",
        "workerVersioningId": null
      }
    },
    {
      "eventId": "17",
      "eventTime": "2023-02-08T16:45:45.599779656Z",
      "eventType": "ActivityTaskScheduled",
      "version": "0",
      "taskId": "36703444",
      "workerMayIgnore": false,
      "activityTaskScheduledEventAttributes": {
        "activityId": "e56a729d-12dd-3497-8708-6d080b0517b0",
        "activityType": {
          "name": "Test"
        },
        "taskQueue": {
          "name": "OPERATION_TASK_QUEUE",
          "kind": "Normal"
        },
        "header": {
          "fields": {}
        },
        "input": {
          "payloads": [
            {
              "metadata": {
                "encoding": "anNvbi9wbGFpbg=="
              },
              "data": "Ik9wZXJhdGlvbldvcmtmbG93LTEtYy0xOmFmZTNmNDdiLTNkZjYtNGNjNy1hZGUzLTQzOTFlYTQ1Y2ZhNCI="
            },
            {
              "metadata": {
                "encoding": "anNvbi9wbGFpbg=="
              },
              "data": "ImFjdGlvbiAxIg=="
            }
          ]
        },
        "scheduleToCloseTimeout": "600s",
        "scheduleToStartTimeout": "600s",
        "startToCloseTimeout": "600s",
        "heartbeatTimeout": "0s",
        "workflowTaskCompletedEventId": "16",
        "retryPolicy": {
          "initialInterval": "1s",
          "backoffCoefficient": 2,
          "maximumInterval": "100s",
          "maximumAttempts": 0,
          "nonRetryableErrorTypes": []
        }
      }
    },
    {
      "eventId": "18",
      "eventTime": "2023-02-08T16:45:45.599803702Z",
      "eventType": "ActivityTaskScheduled",
      "version": "0",
      "taskId": "36703445",
      "workerMayIgnore": false,
      "activityTaskScheduledEventAttributes": {
        "activityId": "13b21c30-11f6-3517-a771-7e117fc2aa92",
        "activityType": {
          "name": "Test"
        },
        "taskQueue": {
          "name": "OPERATION_TASK_QUEUE",
          "kind": "Normal"
        },
        "header": {
          "fields": {}
        },
        "input": {
          "payloads": [
            {
              "metadata": {
                "encoding": "anNvbi9wbGFpbg=="
              },
              "data": "Ik9wZXJhdGlvbldvcmtmbG93LTEtYy0xOmFmZTNmNDdiLTNkZjYtNGNjNy1hZGUzLTQzOTFlYTQ1Y2ZhNCI="
            },
            {
              "metadata": {
                "encoding": "anNvbi9wbGFpbg=="
              },
              "data": "ImFjdGlvbiAyIg=="
            }
          ]
        },
        "scheduleToCloseTimeout": "600s",
        "scheduleToStartTimeout": "600s",
        "startToCloseTimeout": "600s",
        "heartbeatTimeout": "0s",
        "workflowTaskCompletedEventId": "16",
        "retryPolicy": {
          "initialInterval": "1s",
          "backoffCoefficient": 2,
          "maximumInterval": "100s",
          "maximumAttempts": 0,
          "nonRetryableErrorTypes": []
        }
      }
    },
    {
      "eventId": "19",
      "eventTime": "2023-02-08T16:45:45.617185292Z",
      "eventType": "ActivityTaskStarted",
      "version": "0",
      "taskId": "36703453",
      "workerMayIgnore": false,
      "activityTaskStartedEventAttributes": {
        "scheduledEventId": "17",
        "identity": "124964@test-1094",
        "requestId": "092a15ee-85a0-473e-ae6a-67292d949f34",
        "attempt": 1,
        "lastFailure": null
      }
    },
    {
      "eventId": "20",
      "eventTime": "2023-02-08T16:45:45.611671028Z",
      "eventType": "ActivityTaskStarted",
      "version": "0",
      "taskId": "36703454",
      "workerMayIgnore": false,
      "activityTaskStartedEventAttributes": {
        "scheduledEventId": "18",
        "identity": "124964@test-1094",
        "requestId": "f03c5eb3-1f7b-49ec-b3a4-c1d284b46795",
        "attempt": 1,
        "lastFailure": null
      }
    },
    {
      "eventId": "21",
      "eventTime": "2023-02-08T16:55:45.602108889Z",
      "eventType": "ActivityTaskTimedOut",
      "version": "0",
      "taskId": "36703455",
      "workerMayIgnore": false,
      "activityTaskTimedOutEventAttributes": {
        "failure": {
          "message": "activity ScheduleToClose timeout",
          "source": "Server",
          "stackTrace": "",
          "encodedAttributes": null,
          "cause": null,
          "timeoutFailureInfo": {
            "timeoutType": "ScheduleToClose",
            "lastHeartbeatDetails": null
          }
        },
        "scheduledEventId": "17",
        "startedEventId": "19",
        "retryState": "NonRetryableFailure"
      }
    },
    {
      "eventId": "22",
      "eventTime": "2023-02-08T16:55:45.602113739Z",
      "eventType": "ActivityTaskTimedOut",
      "version": "0",
      "taskId": "36703456",
      "workerMayIgnore": false,
      "activityTaskTimedOutEventAttributes": {
        "failure": {
          "message": "activity ScheduleToClose timeout",
          "source": "Server",
          "stackTrace": "",
          "encodedAttributes": null,
          "cause": null,
          "timeoutFailureInfo": {
            "timeoutType": "ScheduleToClose",
            "lastHeartbeatDetails": null
          }
        },
        "scheduledEventId": "18",
        "startedEventId": "20",
        "retryState": "NonRetryableFailure"
      }
    },
    {
      "eventId": "23",
      "eventTime": "2023-02-08T16:55:45.602117626Z",
      "eventType": "WorkflowTaskScheduled",
      "version": "0",
      "taskId": "36703457",
      "workerMayIgnore": false,
      "workflowTaskScheduledEventAttributes": {
        "taskQueue": {
          "name": "OPERATION_TASK_QUEUE",
          "kind": "Normal"
        },
        "startToCloseTimeout": "10s",
        "attempt": 1
      }
    },
    {
      "eventId": "24",
      "eventTime": "2023-02-08T16:55:45.618226278Z",
      "eventType": "WorkflowTaskStarted",
      "version": "0",
      "taskId": "36703460",
      "workerMayIgnore": false,
      "workflowTaskStartedEventAttributes": {
        "scheduledEventId": "23",
        "identity": "121768@test-1094",
        "requestId": "2ccd087a-8a7a-4445-b849-1d519c538621",
        "suggestContinueAsNew": false,
        "historySizeBytes": "0"
      }
    },
    {
      "eventId": "25",
      "eventTime": "2023-02-08T16:55:45.657957965Z",
      "eventType": "WorkflowTaskCompleted",
      "version": "0",
      "taskId": "36703464",
      "workerMayIgnore": false,
      "workflowTaskCompletedEventAttributes": {
        "scheduledEventId": "23",
        "startedEventId": "24",
        "identity": "121768@test-1094",
        "binaryChecksum": "",
        "workerVersioningId": null
      }
    },
    {
      "eventId": "26",
      "eventTime": "2023-02-08T16:55:45.657981630Z",
      "eventType": "ActivityTaskScheduled",
      "version": "0",
      "taskId": "36703465",
      "workerMayIgnore": false,
      "activityTaskScheduledEventAttributes": {
        "activityId": "7ade32f1-acae-3033-b51b-cb2052947096",
        "activityType": {
          "name": "Test"
        },
        "taskQueue": {
          "name": "OPERATION_TASK_QUEUE",
          "kind": "Normal"
        },
        "header": {
          "fields": {}
        },
        "input": {
          "payloads": [
            {
              "metadata": {
                "encoding": "anNvbi9wbGFpbg=="
              },
              "data": "Ik9wZXJhdGlvbldvcmtmbG93LTEtYy0xOmFmZTNmNDdiLTNkZjYtNGNjNy1hZGUzLTQzOTFlYTQ1Y2ZhNCI="
            },
            {
              "metadata": {
                "encoding": "anNvbi9wbGFpbg=="
              },
              "data": "ImFjdGlvbiAzIg=="
            }
          ]
        },
        "scheduleToCloseTimeout": "600s",
        "scheduleToStartTimeout": "600s",
        "startToCloseTimeout": "600s",
        "heartbeatTimeout": "0s",
        "workflowTaskCompletedEventId": "25",
        "retryPolicy": {
          "initialInterval": "1s",
          "backoffCoefficient": 2,
          "maximumInterval": "100s",
          "maximumAttempts": 0,
          "nonRetryableErrorTypes": []
        }
      }
    }
  ]
}

Last event in history is "eventType": "ActivityTaskScheduled". ActivityTaskStarted is written to history when activity completes (or fails or times out after all its retries).
In your Web ui check “Pending Activities” tab for this execution or could get it via cli:

tctl wf desc -w <wfid> -r <runid>

Also check the stack trace (via web ui again or cli with tctl wf query -w <wfid> -r <runid> --qt __stack_trace

Ah I was missing the stack trace info, thanks for pointing that out. I had to recreate another run with the same error loop. Here’s the pending activities and stack trace from that:

workflow-method-OperationWorkflow-1-c-1-0f1e8095-45fb-4c0d-ba97-6d3d9d9be943: (BLOCKED on Feature.get)
io.temporal.internal.sync.WorkflowThreadContext.yield(WorkflowThreadContext.java:94)
io.temporal.internal.sync.WorkflowThreadImpl.yield(WorkflowThreadImpl.java:387)
io.temporal.internal.sync.WorkflowThread.await(WorkflowThread.java:46)
io.temporal.internal.sync.CompletablePromiseImpl.getFailure(CompletablePromiseImpl.java:145)
com.code.workflow.OperationWorkflowImpl.execute(OperationWorkflowImpl.java:155)
java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
java.base/java.lang.reflect.Method.invoke(Method.java:578)
io.temporal.internal.sync.POJOWorkflowImplementationFactory$POJOWorkflowImplementation$RootWorkflowInboundCallsInterceptor.execute(POJOWorkflowImplementationFactory.java:302)
io.temporal.internal.sync.POJOWorkflowImplementationFactory$POJOWorkflowImplementation.execute(POJOWorkflowImplementationFactory.java:277)

The OperationWorkflowImpl.java line points to promise.getFailure() == null from the workflow function that I posted.

Here’s the pending activities from tctl:

{
  "executionConfig": {
    "taskQueue": {
      "name": "OPERATION_TASK_QUEUE",
      "kind": "Normal"
    },
    "workflowExecutionTimeout": "0s",
    "workflowRunTimeout": "0s",
    "defaultWorkflowTaskTimeout": "10s"
  },
  "workflowExecutionInfo": {
    "execution": {
      "workflowId": "OperationWorkflow-1-c-1",
      "runId": "0f1e8095-45fb-4c0d-ba97-6d3d9d9be943"
    },
    "type": {
      "name": "OperationWorkflow"
    },
    "startTime": "2023-02-08T20:27:24.773970786Z",
    "status": "Running",
    "historyLength": "18",
    "memo": {

    },
    "searchAttributes": {
      "indexedFields": {
        "testId": "1",
        "clientId": "1"
      }
    },
    "autoResetPoints": {

    },
    "stateTransitionCount": "14"
  },
  "pendingActivities": [
    {
      "activityId": "3ab474ee-fef1-3c9a-854d-e18bd27d5866",
      "activityType": {
        "name": "Test"
      },
      "state": "Started",
      "lastStartedTime": "2023-02-08T20:27:31.118983925Z",
      "attempt": 1,
      "expirationTime": "2023-02-08T20:37:31.099028607Z",
      "lastWorkerIdentity": "16028@test-1094"
    },
    {
      "activityId": "26b4d106-d8dd-36ec-919d-97d34ac63ef0",
      "activityType": {
        "name": "Test"
      },
      "state": "Started",
      "lastStartedTime": "2023-02-08T20:27:31.135965466Z",
      "attempt": 1,
      "expirationTime": "2023-02-08T20:37:31.099190386Z",
      "lastWorkerIdentity": "16028@test-1094"
    }
  ]
}
``

Promise.allOf(promiseList);

In your case this is a no-op as you are not waiting for allOf promise to complete, do instead

Promise.allOf(promiseList).get();

This I think also explains the stack trace and the reference to blocking on promise.getFailure() as you have pending activities

Thanks for pointing that out. I just added .get() to my Promise.allOf method calls. Now, after restarting, it’s throwing the following error on the first updated call: Promise.allOf(promiseList).get();:

workflow-method-OperationWorkflow-1-c-1-86153de6-588b-4950-9194-c1a8016cffef: (BLOCKED on Feature.get)
io.temporal.internal.sync.WorkflowThreadContext.yield(WorkflowThreadContext.java:94)
io.temporal.internal.sync.WorkflowThreadImpl.yield(WorkflowThreadImpl.java:387)
io.temporal.internal.sync.WorkflowThread.await(WorkflowThread.java:46)
io.temporal.internal.sync.CompletablePromiseImpl.getImpl(CompletablePromiseImpl.java:85)
io.temporal.internal.sync.CompletablePromiseImpl.get(CompletablePromiseImpl.java:75)
io.temporal.internal.sync.AllOfPromise.get(AllOfPromise.java:82)
io.temporal.internal.sync.AllOfPromise.get(AllOfPromise.java:30)
com.code.workflow.OperationWorkflowImpl.execute(OperationWorkflowImpl.java:152)
java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
java.base/java.lang.reflect.Method.invoke(Method.java:578)
io.temporal.internal.sync.POJOWorkflowImplementationFactory$POJOWorkflowImplementation$RootWorkflowInboundCallsInterceptor.execute(POJOWorkflowImplementationFactory.java:302)
io.temporal.internal.sync.POJOWorkflowImplementationFactory$POJOWorkflowImplementation.execute(POJOWorkflowImplementationFactory.java:277)

I think that’s expected, Promise.allOf becomes complete when either all promises in your list complete, or when a single promise in list fails.

I see that you are setting ScheduleToClose timeout to 10m for these activities. My guess is that currently they are retrying and will exhaust retries when ScheduleToClose is reached. Does your workflow execution block on these activities for longer than 10m?

"scheduleToCloseTimeout": "600s",
"scheduleToStartTimeout": "600s",
"startToCloseTimeout": "600s",

Would remove ScheduleToStart timeout unless you really need it for your use case.
Your StartToClose timeout typically should be smaller than ScheduleToClose. StartToClose is max time of a single activity execution, ScheduleToClose max time for activity execution including all its retries.
See video here thats very helpful.

The way you have it set now, on worker restart service would wait for 10m before trying to retry the activity, which would not happen because of set ScheduleToClose timeout (also 10m) and activity should fail at that point.

For the activity method, I just have dummy code that prints a log every 3s using Thread.sleep (which I see is fine for activity methods though not for workflows). When I don’t restart the client and let a workflow complete it only take about 15s, so it shouldn’t be timing out.

I have made some further observations in the meantime:

  1. I’ve confirmed that I don’t encounter the error If I recreate the activity method in the workflow (using Workflow.sleep instead of Thread.sleep).

  2. Also, I’ve discovered that I get this error regardless of whether the activity is called within a Promise or not. This has made me suspect that something is wrong with how Thread.sleep was being used. But I’ve updated the activity method to more closely copy what’s in the instructional project temporal-java-workshop and it still returns the same error.

Here’s the new version of the activity method that I’m calling, which throws the error on restart even when it’s not called within a promise:

    @Override
    public String test(String temporalString, String name) {

        for (int i = 0; i <= 3; i++) {
            logger.info(
                    "Executing {} - {} - step {}",
                    temporalString,
                    name,
                    i
            );
            sleep(3);
        }

        return "completed";
    }

    private void sleep(int seconds) {
        try {
            Thread.sleep(seconds * 1000);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

Thanks, I’ll try those settings now

I tried updating the activity options to a few different permutations of the following (including the one below as well as one with the second setting uncommented) and in these cases I still got the error after restarting the client mid-workflow.

But I wonder if we can rule this out since the full workflow should take about 15s. Also, I should add that when it ends up in a failing loop following restart, each loop typically takes a few seconds at most.

ActivityOptions workflowOperationActivityOptions =
        ActivityOptions.newBuilder()
              .setScheduleToCloseTimeout(Duration.ofSeconds(600))
              //.setScheduleToStartTimeout(Duration.ofSeconds(600))
              .setStartToCloseTimeout(Duration.ofSeconds(600))
              .build();

private final WorkflowOperationActivity workflowOperationActivity =
        Workflow.newActivityStub(
              WorkflowOperationActivity.class,
              workflowOperationActivityOptions
        );

Will try to reproduce and get back to you. Can you share the Temporal Java SDK version you are using?

Thanks! I’m using version 1.18.0 and running the workflow via CronSchedule

I made another observation here: the same error occurs whether or not the workflow is executing on a cron schedule. Also, I’m only using the startToCloseTimeout setting now.

In an effort to control as many factors as I can, I was able to recreate this error in io.workshop.c4s8.ratelimitactivities.MyWorkflowImpl.

I did the following:

  1. minor code changes: moved the activities calls outside of their promises and extended the startToClose timeout
  2. set a breakpoint within the doB implementation
  3. executed the workflow until it reached the breakpoint and then closed the client: at this point the log showed one pending activity as expected
  4. restarted from main

This showed the same type of error that I was getting in my project:

workflow-method-RateLimited-d1b192fe-dde3-49f4-b2a1-64d6107c43ca: (BLOCKED on Feature.get)
app//io.temporal.internal.sync.WorkflowThreadContext.yield(WorkflowThreadContext.java:89)
app//io.temporal.internal.sync.WorkflowThreadImpl.yield(WorkflowThreadImpl.java:412)
app//io.temporal.internal.sync.WorkflowThread.await(WorkflowThread.java:45)
app//io.temporal.internal.sync.CompletablePromiseImpl.getImpl(CompletablePromiseImpl.java:84)
app//io.temporal.internal.sync.CompletablePromiseImpl.get(CompletablePromiseImpl.java:74)
app//io.temporal.internal.sync.ActivityStubBase.execute(ActivityStubBase.java:44)
app//io.temporal.internal.sync.ActivityInvocationHandler.lambda$getActivityFunc$0(ActivityInvocationHandler.java:77)
app//io.temporal.internal.sync.ActivityInvocationHandler$$Lambda$165/0x0000000800e77d60.apply(Unknown Source)
app//io.temporal.internal.sync.ActivityInvocationHandlerBase.invoke(ActivityInvocationHandlerBase.java:70)
app//com.sun.proxy.$Proxy9.doC(Unknown Source)
app//io.workshop.c4s8.ratelimitactivities.MyWorkflowImpl.exec(MyWorkflowImpl.java:30)
java.base@15.0.2/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
java.base@15.0.2/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
java.base@15.0.2/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.base@15.0.2/java.lang.reflect.Method.invoke(Method.java:564)
app//io.temporal.internal.sync.POJOWorkflowImplementationFactory$POJOWorkflowImplementation$RootWorkflowInboundCallsInterceptor.execute(POJOWorkflowImplementationFactory.java:310)
app//io.temporal.internal.sync.POJOWorkflowImplementationFactory$POJOWorkflowImplementation.execute(POJOWorkflowImplementationFactory.java:285)

This is not an error. This is a thread dump of workflow threads waiting on a Future. This is absolutely normal.

For me, this causes the workflow to continuously repeat itself, failing each time at the same location with the same error until it eventually times out. Ideally, the workflow would pick back up where it stopped and run to competition. I should mention, it does resume and complete fine if I’m not calling any activity methods.

Can you post the error? The above stack is not an error.

If I add some logs, it keeps repeating the same logs prior to the execution of the activity method that was interrupted when I stopped the client. Using the debugger, I can identify the following exception which is thrown when that activity method is called. This is caught in io.temporal.internal.sync.POJOWorkflowImplementationFactory.execute:

	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:564)
	at io.temporal.internal.sync.POJOWorkflowImplementationFactory$POJOWorkflowImplementation$RootWorkflowInboundCallsInterceptor.execute(POJOWorkflowImplementationFactory.java:310)
	at io.temporal.internal.sync.POJOWorkflowImplementationFactory$POJOWorkflowImplementation.execute(POJOWorkflowImplementationFactory.java:285)
	at io.temporal.internal.sync.WorkflowExecuteRunnable.run(WorkflowExecuteRunnable.java:67)
	at io.temporal.internal.sync.SyncWorkflow.lambda$start$0(SyncWorkflow.java:137)
	at io.temporal.internal.sync.CancellationScopeImpl.run(CancellationScopeImpl.java:101)
	at io.temporal.internal.sync.WorkflowThreadImpl$RunnableWrapper.run(WorkflowThreadImpl.java:113)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:264)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
	at java.base/java.lang.Thread.run(Thread.java:832)
Caused by: io.temporal.internal.sync.DestroyWorkflowThreadError
	at io.temporal.internal.sync.WorkflowThreadContext.yield(WorkflowThreadContext.java:91)
	at io.temporal.internal.sync.WorkflowThreadImpl.yield(WorkflowThreadImpl.java:412)
	at io.temporal.internal.sync.WorkflowThread.await(WorkflowThread.java:45)
	at io.temporal.internal.sync.CompletablePromiseImpl.getImpl(CompletablePromiseImpl.java:84)
	at io.temporal.internal.sync.CompletablePromiseImpl.get(CompletablePromiseImpl.java:74)
	at io.temporal.internal.sync.ActivityStubBase.execute(ActivityStubBase.java:44)
	at io.temporal.internal.sync.ActivityInvocationHandler.lambda$getActivityFunc$0(ActivityInvocationHandler.java:77)
	at io.temporal.internal.sync.ActivityInvocationHandlerBase.invoke(ActivityInvocationHandlerBase.java:70)
	at com.sun.proxy.$Proxy8.doB(Unknown Source)
	at io.workshop.c4s8.ratelimitactivities.MyWorkflowImpl.exec(MyWorkflowImpl.java:34)
	... 16 more