How to set heartbeat timeout to handle heartbeats and cancellation

Parent workflow → child workflow, invoked via Async.procedure and promise.get().
The child workflow has a sync activity that heartbeats every second.

With a heartbeat timeout of 5 seconds, we are seeing random heartbeat timeout errors.
With a higher heartbeat timeout, cancellation doesn't take effect until 80% of the heartbeat timeout has elapsed, due to throttling.
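For context, the activity implementation looks roughly like this (a minimal sketch with illustrative names, not our real code): it does a small chunk of work, heartbeats about once per second with the same payload each time, and reacts to cancellation when heartbeat() throws.

import io.temporal.activity.Activity;
import io.temporal.activity.ActivityExecutionContext;
import io.temporal.activity.ActivityInterface;
import io.temporal.client.ActivityCanceledException;

@ActivityInterface
interface ChildActivity {
    void runLongTask(String input);
}

public class ChildActivityImpl implements ChildActivity {
    @Override
    public void runLongTask(String input) {
        ActivityExecutionContext ctx = Activity.getExecutionContext();
        while (!workIsDone()) {
            doOneUnitOfWork();                  // takes roughly one second
            try {
                ctx.heartbeat("in progress");   // same payload in every heartbeat
            } catch (ActivityCanceledException e) {
                cleanUp();                      // cancellation is surfaced by the heartbeat call
                throw e;                        // rethrow so the server records the cancellation
            }
        }
    }

    private boolean workIsDone() { return true; }    // placeholder so the sketch compiles
    private void doOneUnitOfWork() { /* real work goes here */ }
    private void cleanUp() { /* release resources */ }
}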

We are using temporal-sdk 1.0.7 in Java, and unfortunately our Temporal server is on 1.23.1.
We had to upgrade the Temporal server recently to fix security vulnerabilities but have yet to upgrade the SDK.

Also, does the heartbeat message/payload matter? I was sending the same heartbeat message in all heartbeats.

Any help would be greatly appreciated!

I modified the way the activity is invoked in the child workflow: it is now invoked asynchronously, waiting on promise.get(), and I no longer see the heartbeat failures.

How did this work?

The behavior you observed is by design. Heartbeats are throttled up to 80% of the heartbeat timeout. Changing how an activity is invoked from the workflow doesn't affect the activity execution behavior (including heartbeating) at all.
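To make the throttling concrete, here is the arithmetic implied by the 80% rule (an illustration under that assumption, not SDK output): heartbeat() calls are recorded locally, but a heartbeat RPC goes out at most roughly once per 0.8 × heartbeat timeout, and a pending cancellation is only delivered in that RPC's response.

import java.time.Duration;

class HeartbeatThrottleMath {
    public static void main(String[] args) {
        // At most one heartbeat RPC per ~80% of the heartbeat timeout (assumption from this thread).
        Duration shortTimeout = Duration.ofSeconds(5);
        Duration longTimeout = Duration.ofMinutes(30);
        System.out.println(shortTimeout.multipliedBy(8).dividedBy(10)); // PT4S  -> cancellation seen within ~4 seconds
        System.out.println(longTimeout.multipliedBy(8).dividedBy(10));  // PT24M -> cancellation seen within ~24 minutes
    }
}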

But we have an older workflow where the heartbeat timeout is 30 minutes, which would mean the cancellation should take effect only after 24 minutes, and that workflow invokes the activity in an async fashion.

But the cancellation takes effect very quickly (within a few seconds). I verified on the Temporal web UI that the activity is scheduled with a 30-minute heartbeat timeout and a 5-day start-to-close timeout.

This prompted me to try invoking the activity in an async fashion in the new workflow, and I stopped seeing the heartbeat timeout failures.

I am not entirely sure why it started working; these are just my observations.

Does it have anything to do with the Temporal SDK vs. server version mismatch, or in the worst case, the gRPC dependency version?

Is the workflow invoking the activity in an async fashion, or is the activity implemented using the manual completion client? The completion client doesn't throttle heartbeats.
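By "manual completion client" I mean the pattern below, where the activity method returns without completing and a separate ActivityCompletionClient finishes it later (a minimal sketch with illustrative names):

import io.temporal.activity.Activity;
import io.temporal.activity.ActivityExecutionContext;
import io.temporal.activity.ActivityInterface;
import io.temporal.client.ActivityCompletionClient;
import io.temporal.client.WorkflowClient;

@ActivityInterface
interface ManualActivity {
    void startLongTask(String input);
}

public class ManualActivityImpl implements ManualActivity {
    @Override
    public void startLongTask(String input) {
        ActivityExecutionContext ctx = Activity.getExecutionContext();
        byte[] taskToken = ctx.getTaskToken();
        ctx.doNotCompleteOnReturn();                 // the method returns, but the activity stays open
        handOffToExternalWorker(input, taskToken);   // e.g., enqueue the token for another process
    }

    private void handOffToExternalWorker(String input, byte[] taskToken) { /* enqueue somewhere */ }

    // Elsewhere (e.g., in the process that finishes the work), using the saved task token:
    void completeFromExternalProcess(byte[] taskToken, WorkflowClient workflowClient) {
        ActivityCompletionClient completionClient = workflowClient.newActivityCompletionClient();
        completionClient.heartbeat(taskToken, "still running"); // heartbeats sent this way are not throttled
        completionClient.complete(taskToken, null);
    }
}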

var activityOptions = ActivityOptions.newBuilder()
        .setCancellationType(ActivityCancellationType.WAIT_CANCELLATION_COMPLETED)
        .setHeartbeatTimeout(Duration.ofSeconds(5L))
        .setStartToCloseTimeout(Duration.ofMinutes(120L))
        .setRetryOptions(RetryOptions.newBuilder()
                .setMaximumAttempts(1)
                .setDoNotRetry(new String[0])
                .build())
        .setTaskQueue("ActivityQueue")
        .build();

Promise<Void> activityPromise = Async.function(() -> {
    var activityStub = Workflow.newActivityStub(ActivityClass.class, activityOptions);
    activityStub.activityMethod(activityParameters);
    return null;
});

activityPromise.get();

This is how the workflow is invoking the activity. Sorry, but I am not sure what the manual completion client means.
Can this not be called invoking the activity in an async fashion?

How the activity is invoked (synchronously or asynchronously) doesn't affect the activity execution in any way, so I don't believe changing from sync to async can affect the heartbeating behavior. My guess is that your activities time out because some resource constraint makes them miss heartbeats. 5 seconds is a pretty short interval, and any CPU throttling or network delay might cause the timeout.
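For what it's worth, the two invocation styles look like this from the workflow side (a sketch based on the options you posted); the worker executes the activity the same way in both cases, heartbeats included:

ActivityClass activityStub = Workflow.newActivityStub(ActivityClass.class, activityOptions);

// Synchronous: blocks the workflow method until the activity completes.
activityStub.activityMethod(activityParameters);

// Asynchronous: returns a Promise immediately; get() blocks in the same way.
Promise<Void> activityPromise = Async.procedure(activityStub::activityMethod, activityParameters);
activityPromise.get();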

Would it have anything to do with the Temporal server upgrade or any third-party dependency upgrade, such as the io.grpc library, Spring Boot, or the Temporal SDK?

We started seeing these errors only after the Temporal server upgrade and aren't sure what exactly is causing the problem.

Everything is possible. But I would look into resource exhaustion by the workers or the service, as well as the network.

I have tested on my laptop, where we use Docker Desktop, and loaded the application with more data-heavy and performance-intensive scenarios, but haven't seen a timeout there. Yet I am seeing these failures on very small test cases after the upgrade.

I have also tested on an Azure VM with no other process running, where the application is allocated 8 GB of the total 16 GB, and I still see the timeouts happening. I haven't seen any spikes in RAM or CPU in any of the pods, especially the worker pod. The only doubt I had was networking at the Kubernetes layer, but the problem is consistent, and other non-Temporal scenarios work absolutely fine.