How to set heartbeat timeout to handle heart beats and cancellation

karnati_harish · May 16, 2024, 4:52am

Parent workflow → Child workflow using Async.procedure and promise.get()
Child workflow has a Sync activity that heartbeats every second.

With a heartbeat timeout of 5 seconds, we are seeing random heartbeat timeout errors.
With a higher heartbeat timeout, cancellation doesn’t take effect until 80% of the heartbeat timeout has elapsed, due to throttling.

We are using temporal-sdk 1.0.7 for Java and, unfortunately, our Temporal server is on 1.23.1.
We had to upgrade the Temporal server recently to fix security vulnerabilities but have yet to upgrade the SDK.

Also, does the heartbeat message/payload matter? I was sending the same heartbeat message in all heartbeats.

Any help would be greatly appreciated!

karnati_harish · May 16, 2024, 6:08pm

I modified the way the activity is invoked in the child workflow. It is now invoked asynchronously, with the workflow waiting on promise.get(), and I don’t see the heartbeat failures anymore.

How did this work?

maxim · May 16, 2024, 9:36pm

The behavior you observed is by design. Heartbeats are throttled up to 80% of the heartbeat timeout. Changing how an activity is invoked from the workflow doesn’t affect the activity execution behavior (including heartbeating) at all.
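
As a rough illustration of the 80% rule, here is a minimal sketch in plain Java. It is a simplified model, not the SDK’s implementation: the class name HeartbeatThrottleSketch and both methods are hypothetical, and the model ignores any other limits the SDK may apply. It assumes the first heartbeat call is forwarded immediately and later calls are forwarded only once the throttle interval has passed.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Simplified model of heartbeat throttling; NOT the Temporal SDK's actual code.
final class HeartbeatThrottleSketch {

    // Throttle interval assumed to be 80% of the heartbeat timeout, as stated in the thread.
    static Duration throttleInterval(Duration heartbeatTimeout) {
        return heartbeatTimeout.multipliedBy(8).dividedBy(10);
    }

    // Given the times (milliseconds since activity start) at which the activity
    // calls heartbeat, return the calls this model would forward to the server.
    static List<Long> forwardedHeartbeats(List<Long> callTimesMillis, Duration heartbeatTimeout) {
        long intervalMillis = throttleInterval(heartbeatTimeout).toMillis();
        List<Long> forwarded = new ArrayList<>();
        long lastSent = 0;
        for (long t : callTimesMillis) {
            // Forward the first call, then only calls at least one interval after the last one sent.
            if (forwarded.isEmpty() || t - lastSent >= intervalMillis) {
                forwarded.add(t);
                lastSent = t;
            }
        }
        return forwarded;
    }

    public static void main(String[] args) {
        // The activity heartbeats every second for ten seconds.
        List<Long> calls = new ArrayList<>();
        for (long t = 0; t <= 10_000; t += 1_000) {
            calls.add(t);
        }
        System.out.println(throttleInterval(Duration.ofSeconds(5)));            // PT4S
        System.out.println(throttleInterval(Duration.ofMinutes(30)));           // PT24M
        System.out.println(forwardedHeartbeats(calls, Duration.ofSeconds(5)));  // [0, 4000, 8000]
    }
}
```

Under this model, an activity that heartbeats every second with a 5 second timeout reaches the server only about every 4 seconds, and a cancellation request can only be noticed when one of those forwarded heartbeats returns.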

karnati_harish · May 21, 2024, 6:31am

But we have an older workflow where the heartbeat timeout is 30 minutes, which would mean the cancellation should take effect only after 24 minutes. That workflow invokes the activity asynchronously.

But the cancellation takes effect very soon (within a few seconds). I verified on the Temporal web UI that the activity is scheduled with a 30 minute heartbeat timeout and a 5 day start-to-close timeout.

This prompted me to change the new workflow to invoke the activity asynchronously, and I stopped seeing the heartbeat timeout failures.

I am not entirely sure why it started working, but these are just my observations.

Does it have anything to do with the Temporal SDK vs. server version mismatch or, in the worst case, the gRPC dependency version?

maxim · May 22, 2024, 6:00pm

Is the workflow invoking the activity asynchronously, or is the activity implemented using the manual completion client? The completion client doesn’t throttle heartbeats.

karnati_harish · May 27, 2024, 12:32pm

var activityOptions = ActivityOptions.newBuilder()
    .setCancellationType(ActivityCancellationType.WAIT_CANCELLATION_COMPLETED)
    .setHeartbeatTimeout(Duration.ofSeconds(5L))
    .setStartToCloseTimeout(Duration.ofMinutes(120L))
    .setRetryOptions(RetryOptions.newBuilder().setMaximumAttempts(1).setDoNotRetry(new String[0]).build())
    .setTaskQueue("ActivityQueue")
    .build();

Promise<Void> activityPromise = Async.function(() -> {
    var activityStub = (ActivityClass) Workflow.newActivityStub(ActivityClass.class, activityOptions);
    activityStub.activityMethod(activityParameters);
    return null;
});

activityPromise.get();

This is how the workflow invokes the activity. Sorry, but I am not sure what “manual completion client” means.
Does this not count as invoking the activity asynchronously?

maxim · May 28, 2024, 5:16am

How the activity is invoked (synchronously or asynchronously) doesn’t affect the activity execution in any way. So, I don’t believe changing from sync to async can affect the heartbeating behavior. My guess is that your activities time out because some resource constraint makes them miss heartbeats. 5 seconds is a pretty short interval, and any CPU throttling or network delay might cause the timeout.
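
To make the “pretty short interval” point concrete, here is a back-of-the-envelope sketch in plain Java. It assumes, as a simplification, that heartbeats reach the server only once per throttle interval (80% of the heartbeat timeout, per the earlier reply), so the remaining 20% is the only slack available to absorb CPU stalls or network delay. The class HeartbeatSlackSketch and its methods are hypothetical and are not part of the Temporal SDK.

```java
import java.time.Duration;

// Simplified slack estimate; an assumption-based model, not Temporal SDK behavior.
final class HeartbeatSlackSketch {

    // Slack = heartbeat timeout minus the assumed throttle interval (80% of the timeout).
    static Duration slack(Duration heartbeatTimeout) {
        Duration throttleInterval = heartbeatTimeout.multipliedBy(8).dividedBy(10);
        return heartbeatTimeout.minus(throttleInterval);
    }

    // In this model a heartbeat timeout fires when the extra delay
    // (CPU stall, GC pause, network latency) exceeds the slack.
    static boolean wouldTimeOut(Duration heartbeatTimeout, Duration extraDelay) {
        return extraDelay.compareTo(slack(heartbeatTimeout)) > 0;
    }

    public static void main(String[] args) {
        Duration stall = Duration.ofMillis(1500);
        System.out.println(slack(Duration.ofSeconds(5)));                 // PT1S
        System.out.println(wouldTimeOut(Duration.ofSeconds(5), stall));   // true
        System.out.println(slack(Duration.ofSeconds(30)));                // PT6S
        System.out.println(wouldTimeOut(Duration.ofSeconds(30), stall));  // false
    }
}
```

In this model a 5 second timeout leaves about 1 second of slack, so a single 1.5 second stall is enough to trip it, while a 30 second timeout tolerates the same stall.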

karnati_harish · May 28, 2024, 6:05am

Would it have anything to do with temporal version upgrade or any 3rd party dependency upgrade such as io.grpc library or spring boot or temporal sdk?

We started seeing these errors only after temporal server upgrade and are sort of unsure what exactly is causing the problem.

maxim · May 28, 2024, 6:16am

Everything is possible. But I would look into resource exhaustion by the workers or the service, as well as the network.

karnati_harish · May 28, 2024, 6:35am

I have tested on my laptop, where we use Docker Desktop, and loaded the application with more data-heavy and performance-intensive scenarios, but I haven’t seen a timeout there. However, I am seeing timeouts on very small test cases after the upgrade.

I have also tested on an Azure VM where no other process is running and the application is allocated 8GB of the 16GB total, and I still see the timeouts. I haven’t seen any spikes in RAM or CPU in any of the pods, especially the worker pod. My only doubt was about networking at the Kubernetes layer, but the problem is consistent, and other non-Temporal scenarios are working absolutely fine.
