Best practice for handling lost external events and resuming paused workflows

Hi Temporal Community,

I have a workflow that pauses while waiting for an asynchronous downstream operation to complete. The downstream system publishes an event to Kafka once the operation is done, and my service listens to this event and triggers a webhook to send a signal to resume the Temporal workflow.

However, there are cases where these downstream events might be lost or not delivered, causing the workflow to remain paused indefinitely and eventually time out.

To handle this, I’m exploring the idea of adding a polling fallback mechanism — so that when the workflow is reset or retried, it can poll the downstream system to verify if the operation has already completed and then resume accordingly.

I wanted to check:

  • Does Temporal provide any built-in support or recommended pattern for handling such scenarios where external signals might be lost?

  • If not, what would be the best practice to implement this kind of recovery or polling mechanism within a Temporal workflow?

Thanks in advance for your insights!

@maxim Also, I plan to poll once every hour.

@maxim I’m planning to implement periodic polling that runs every hour. My idea is to create a child workflow responsible for polling and notifying the parent workflow once it completes (as provided in samples-java repo), based on the downstream state. Do you see any potential concerns or drawbacks with this approach?

@maxim I’m implementing a parent workflow that must do two things at once:

  1. Kick off a child workflow that periodically polls some downstream state and returns true when the polling condition is met.

  2. Listen for an external signal that can set a workflowPaused flag at any time.

Current code (simplified):

PollingChildWorkflow childWorkflow =
  Workflow.newChildWorkflowStub(PollingChildWorkflow.class,
    ChildWorkflowOptions.newBuilder().setWorkflowId("ChildWorkflowPoll").build());

boolean success = childWorkflow.exec(pollingIntervalInSeconds);

Workflow.await(() -> success || workflowPaused);

The issue: childWorkflow.exec(...) blocks the parent until the child completes, so the parent can’t concurrently react to signals while the child is polling.

Question: what are the recommended patterns in Temporal to run the polling child and wait for a signal in parallel? For example, should I invoke the child asynchronously, have the child send a signal back to the parent on completion, convert the poller into an activity, use promises/async APIs, or something else? Any example snippets, pitfalls, or best practices would be really helpful.

Hi,

This post shows how to start a child workflow asynchronously in Java: Best way to create an async child workflow. Sample code here.

Something like this should work:

ChildWorkflow child = Workflow.newChildWorkflowStub(ChildWorkflow.class, childWorkflowOptions);
Promise<String> result = Async.function(child::executeChild);
result.thenApply(
    (String r) -> {
      done = true;
      return r;
    });


Workflow.await(() -> done || signal_received);

You can put the logic within a cancellation scope to cancel the timer if the signal arrives first.

Another approach is to start the child workflow only after the workflow await times out:

Workflow.await(duration, () -> signal_received);

// start the child workflow polling here

This gives Kafka time to deliver the signal to the parent workflow before any polling starts.

Thanks @antonio.perez . I’m implementing periodic polling as a child workflow, which will be triggered by the parent workflow.

In the official Temporal Java samples, the polling logic is implemented directly inside the workflow code. However, that is not good practice, according to Polling in workflow vs. Activity? - #2 by maxim.

Since in my design the polling runs inside a child workflow, and the parent workflow only records the start and end events, am I correct in assuming that this approach won’t negatively impact the parent workflow’s history size or performance?

Hi,

In the official Temporal Java samples, the polling logic is implemented directly inside the workflow code

Could you show me where? Note that this relies on activity retries, not on calling an activity in a loop.

am I correct in assuming that this approach won’t negatively impact the parent workflow’s history size or performance?

It will add three events to the workflow history (StartChildWorkflowExecutionInitiated/Failed.., ChildWorkflowExecutionStarted, ChildWorkflowExecutionCompleted/Failed/Cancelled).

@antonio.perez samples-java/core/src/main/java/io/temporal/samples/polling/periodicsequence/PeriodicPollingChildWorkflowImpl.java at main · temporalio/samples-java · GitHub Here you can see that the activity is called in a loop, once per attempt. Also, if my parent workflow runs for 2-3 days, will the child workflow’s history be a problem if I poll every hour?

Thanks

Also, if my parent workflow runs for 2-3 days, will the child workflow’s history be a problem if I poll every hour?

Sorry, does this answer your question or is this a different one?

It will add three events to the workflow history (StartChildWorkflowExecutionInitiated/Failed.., ChildWorkflowExecutionStarted, ChildWorkflowExecutionCompleted/Failed/Cancelled).

I have the feeling that you don’t need a child workflow; infrequent polling should do what you need.
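
To put rough numbers on the history question, here is a back-of-the-envelope sketch comparing two polling styles over a three-day run with hourly polls. It assumes that an activity called from a loop records three events per call (ActivityTaskScheduled, ActivityTaskStarted, ActivityTaskCompleted), that each sleep between polls records two (TimerStarted, TimerFired), and that server-side activity retries do not add events per attempt. Workflow task events are left out to keep the arithmetic simple, and the class and method names are made up for illustration.

```java
import java.time.Duration;

/** Rough history estimate; names and per-poll event counts are illustrative assumptions. */
final class PollingHistoryEstimate {
    // Assumed events per activity execution: Scheduled, Started, Completed.
    static final int EVENTS_PER_ACTIVITY = 3;
    // Assumed events per workflow timer: TimerStarted, TimerFired.
    static final int EVENTS_PER_TIMER = 2;

    /** Number of polls when polling once per interval for the given total duration. */
    static long polls(Duration total, Duration interval) {
        return total.toMillis() / interval.toMillis();
    }

    /** Activity called in a loop with a sleep between calls (workflow task events excluded). */
    static long loopEvents(long polls) {
        return polls * (EVENTS_PER_ACTIVITY + EVENTS_PER_TIMER);
    }

    /** One activity whose retry policy does the polling: attempts are not recorded one by one. */
    static long retryEvents(long polls) {
        return EVENTS_PER_ACTIVITY;
    }

    public static void main(String[] args) {
        long polls = polls(Duration.ofDays(3), Duration.ofHours(1));
        System.out.println("polls over 3 days: " + polls);
        System.out.println("loop-style events: " + loopEvents(polls));
        System.out.println("retry-style events: " + retryEvents(polls));
    }
}
```

Under these assumptions the loop style grows linearly with the number of polls, while the retry style stays constant, which is why hourly polling through activity retries is unlikely to strain the history.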

@antonio.perez For infrequent polling, I’ll need to start a new workflow as well, right? I was thinking of using a child workflow for this.

Can this be achieved without creating a new workflow? Are you suggesting that I should call an activity directly from my parent workflow instead? IMO this will also add events like ActivityTaskScheduled/Completed for each retry.

@maxim @antonio.perez

I have the following code snippet in my workflow:

Promise<Boolean> result = Async.function(pollHandlerActivity::poll, req);
Workflow.await(() -> isPaused || result.get());

As I understand it, the activity will continue executing asynchronously in the background (including any retries, as per its retry policy). Meanwhile, the workflow will remain paused at the Workflow.await() line until either the activity completes (result.get() == true) or the isPaused flag becomes true.

As I see it, because the predicate calls result.get(), evaluating it will block until the activity completes. Is there a better way to do this?

I think result.get() is not what you want here: get() is a blocking call, so it will wait until the activity has completed. I think what you want is result.isCompleted(), which tells you whether the activity has completed or not.

I’m using Temporal SDK 1.23.1, and it doesn’t have newBuilder() in the ApplicationFailure class. What is the alternative?

Can I do something like

throw Activity.wrap(e);

@awwx When I use result.isCompleted(), if it evaluates to true, that could also mean the activity completed with an exception, not just successfully. In that case, my Workflow.await() condition would still evaluate to true, even though the activity actually failed — which isn’t what I want. So this approach wouldn’t work correctly for my use case.

To use automatic activity retries, you throw an exception from the activity if you check and the operation hasn’t completed yet. The activity won’t be retried if it completes normally, even if it returns false or something. samples-java/core/src/main/java/io/temporal/samples/polling/TestService.java at 626bf032cd168ffd353305a7662c2e72f6bc0ce1 · temporalio/samples-java · GitHub

Now the activity will only complete if the poll determines that the operation has completed, and the activity returns normally without throwing an exception.

If you don’t want to use infinite retries and so need to check whether the activity completed normally or with an exception, I think you’d want to do something like wait for isCompleted() and then check the result; I think just calling get() is going to block your await.
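
A minimal sketch of the "wait for completion, then check the outcome" idea. It uses the JDK's CompletableFuture purely as a stand-in so that it runs on its own; workflow code would use Temporal's Promise instead, where the corresponding calls are isCompleted() and getFailure(). The class name, the Outcome enum, and the classify helper are hypothetical.

```java
import java.util.concurrent.CompletableFuture;

/** Stand-in sketch: CompletableFuture plays the role of Temporal's Promise here. */
final class PollOutcomeCheck {
    enum Outcome { STILL_RUNNING, PAUSED, SUCCEEDED, FAILED }

    /** Never blocks: it only inspects the state of the future. */
    static Outcome classify(boolean isPaused, CompletableFuture<Boolean> result) {
        if (!result.isDone()) {
            // The wait condition would be: isPaused || result.isDone()
            return isPaused ? Outcome.PAUSED : Outcome.STILL_RUNNING;
        }
        if (result.isCompletedExceptionally()) {
            // The activity failed (for example, after its retries were exhausted).
            return Outcome.FAILED;
        }
        // Safe: the future is already complete, so join() returns immediately.
        // A normal return of false is treated as "not successful" in this sketch.
        return Boolean.TRUE.equals(result.join()) ? Outcome.SUCCEEDED : Outcome.FAILED;
    }

    public static void main(String[] args) {
        CompletableFuture<Boolean> failed = new CompletableFuture<>();
        failed.completeExceptionally(new RuntimeException("poll failed"));
        System.out.println(classify(false, new CompletableFuture<>()));
        System.out.println(classify(false, CompletableFuture.completedFuture(true)));
        System.out.println(classify(false, failed));
    }
}
```

The point of the split is that the wait condition only asks whether the call has finished, and the success-or-failure decision is made afterwards, once reading the result can no longer block.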

Workflow.await(() -> isPaused || result.get());

I think this should work; the only problem I see is that .get() will throw an exception if the activity fails (after all retries are exhausted).

This is another approach

Promise<Boolean> result = Async.function(activity::execute);
result.thenApply(
        (Boolean r) -> {
          done = true;
          return r;
        })
    .exceptionally(
        ex -> {
          // your logic to handle the failure here
          done = false;
          return null;
        });

Workflow.await(() -> done || signal_received);

Thanks @antonio.perez. result.thenApply is non-blocking, right?

Right, it returns a Promise without blocking


@antonio.perez @maxim Is this workflow code deterministic?

Live Execution

  • Activity runs

  • Callback (thenApply) fires

  • done is set to true

  • await() resumes

  • Workflow moves forward

Replay Execution (Happens later)

During replay:

  • Activity does not run

  • As a result thenApply(...) is never triggered

  • done.set(true) never happens

  • done.get() remains false

But since:

  • done = false

  • workflowPaused = false

The workflow enters an infinite wait.

Is my understanding correct ??
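
One way to reason about the question above is with a toy model of replay. The assumption, which matches how Temporal replay is generally described, is that the activity is not executed again during replay but its recorded result is still delivered to the same promise, so callbacks registered with thenApply run just as they did live. The sketch uses plain JDK types only (CompletableFuture stands in for Promise, a list stands in for the event history); it does not use the Temporal SDK, and every name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

/** Toy model of live execution vs. replay; not Temporal SDK code. */
final class ReplayToyModel {
    /** Stand-in for the event history: recorded activity results, in order. */
    final List<Boolean> history = new ArrayList<>();
    int activityExecutions = 0;

    /** Runs the same "workflow code" live (empty history) or as a replay (result recorded). */
    boolean runWorkflow(Supplier<Boolean> activity) {
        boolean[] done = {false};
        CompletableFuture<Boolean> result = new CompletableFuture<>();
        // Workflow code: the callback is registered identically in both modes.
        result.thenApply(r -> { done[0] = true; return r; });

        if (history.isEmpty()) {
            // Live: the activity really runs and its result is recorded.
            activityExecutions++;
            boolean value = activity.get();
            history.add(value);
            result.complete(value);
        } else {
            // Replay: the activity is skipped; the recorded result completes the promise.
            result.complete(history.get(0));
        }
        return done[0];
    }

    public static void main(String[] args) {
        ReplayToyModel model = new ReplayToyModel();
        boolean live = model.runWorkflow(() -> true);
        boolean replay = model.runWorkflow(() -> {
            throw new IllegalStateException("activity must not run during replay");
        });
        System.out.println(live + " " + replay + " " + model.activityExecutions);
    }
}
```

In this model done ends up true in both runs even though the activity executes only once, which suggests the callback-based code does not by itself lead to an infinite wait on replay.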