Cadence activity retries not working

Hello!

I have a workflow implementation that runs long-running workflows which perform health checks. The activity heartbeats every 5 seconds, and the heartbeat timeout is 30 seconds. I have observed that many of my workflows are closed (failed) with ActivityTimeoutException (type = heartbeat) when a worker goes down (due to a new deployment).

"level":"ERROR","logger_name":"com.uber.cadence.internal.sync.POJOWorkflowImplementationFactory","thread_name":"workflow-root","message":"Workflow execution failure WorkflowID=ea27167e-ca2f-4177-9ca5-22aba7c233d1, RunID=8e495171-bddf-43d5-85f3-58d3d8be45a9, WorkflowType=HealthCheckWorkflow::monitorHealthStatus","stack_trace":"com.uber.cadence.workflow.ActivityTimeoutException: TimeoutType=HEARTBEAT, ActivityType="HealthCheckActivityV2::startPeriodicHealthCheck", ActivityID="null", EventID=8\n\tat java.base/java.lang.Thread.getStackTrace(Thread.java:1606)\n\tat com.uber.cadence.internal.sync.ActivityStubBase.execute(ActivityStubBase.java:46)\n\tat com.uber.cadence.internal.sync.ActivityStubImpl.execute(ActivityStubImpl.java:26)\n\tat com.uber.cadence.internal.sync.ActivityInvocationHandler.lambda$getActivityFunc$0(ActivityInvocationHandler.java:51)\n\tat com.uber.cadence.internal.sync.ActivityInvocationHandlerBase.invoke(ActivityInvocationHandlerBase.java:76)\n\tat com.sun.proxy.$Proxy23.startPeriodicHealthCheck(Unknown Source)\n\tat com.cloudera.cdp.healthCheck.workflow.HealthCheckWorkflowImpl.monitorHealthStatus(HealthCheckWorkflowImpl.java:101)\n\tat java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.base/java.lang.reflect.Method.invoke(Method.java:566)\n\tat com.uber.cadence.internal.sync.POJOWorkflowImplementationFactory$POJOWorkflowImplementation.execute(POJOWorkflowImplementationFactory.java:233)\n\tat com.uber.cadence.internal.sync.WorkflowRunnable.run(WorkflowRunnable.java:46)\n\tat com.uber.cadence.internal.sync.CancellationScopeImpl.run(CancellationScopeImpl.java:102)\n\tat com.uber.cadence.internal.sync.WorkflowThreadImpl$RunnableWrapper.run(WorkflowThreadImpl.java:85)\n\tat java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)\n\tat java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat java.base/java.lang.Thread.run(Thread.java:834)\n"

I have retries already configured for my activity.

@ActivityMethod(scheduleToCloseTimeoutSeconds = 631139040, heartbeatTimeoutSeconds = 30)
@MethodRetry(initialIntervalSeconds = 1, backoffCoefficient = 1, maximumAttempts = Integer.MAX_VALUE, maximumIntervalSeconds = 15)
void startPeriodicHealthCheck(Request monitorHealthRequest);

Please let me know how to avoid workflows getting completed with this failure.

Could you post the retry policy-related fields from the WorkflowExecutionStarted event?

Thanks @maxim. I did not set any retry policy for the workflow. Please see the screenshot below:

But I do have a retry policy for my async activity:

@maxim any suggestions what I might be doing wrong here?

Could you DM me or post here the whole workflow history?

The solution is to set the expiration in RetryOptions. The field is required and defaults to 0. Thanks, Maxim, for the help.
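
For anyone landing here later, this is a minimal sketch of how that change might look on the activity from the question. It assumes the `expirationSeconds` attribute of `@MethodRetry` in the Cadence Java client. The value shown is only a placeholder that mirrors the schedule-to-close timeout, the interface name is inferred from the ActivityType in the log, and `Request` is the application's own type from the original snippet.

```java
import com.uber.cadence.activity.ActivityMethod;
import com.uber.cadence.common.MethodRetry;

// Interface name inferred from the ActivityType in the log above (assumption).
public interface HealthCheckActivityV2 {

  @ActivityMethod(scheduleToCloseTimeoutSeconds = 631139040, heartbeatTimeoutSeconds = 30)
  @MethodRetry(
      initialIntervalSeconds = 1,
      backoffCoefficient = 1,
      maximumAttempts = Integer.MAX_VALUE,
      maximumIntervalSeconds = 15,
      // Previously left unset (0). Placeholder value: pick the total
      // window during which retries should keep being attempted.
      expirationSeconds = 631139040)
  void startPeriodicHealthCheck(Request monitorHealthRequest);
}
```

If the retry policy is built programmatically through `RetryOptions` rather than the annotation, the same expiration would need to be set there instead.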
