When we start multiple workflows some activities get stuck with this error in stack trace. The activity shows as “started” in workflow history tab but eventually timeouts on scheduleToClose timeout.
we are running on
Server 1.12.0
Java SDK 1.7.1
workflow-method-dfce73d9-bc67-4567-bcdc-deba363a593-c8a29ca1-13e2-40a8-9208-b07141c1ffb8: (BLOCKED on Feature.get)
io.temporal.internal.sync.CompletablePromiseImpl.getImpl(CompletablePromiseImpl.java:84)
io.temporal.internal.sync.CompletablePromiseImpl.get(CompletablePromiseImpl.java:74)
Hi @atullimaye19 could you show your workflow history and workflow code? Looking to see if your code follows these constraints. Could you also share the full stack trace?
@tihomir - Thanks for the reply. Not all workflows fail with this error but say if we submit 10, 4 will get stuck others will go through just fine. WF history is too big to attach here.
here is the full stack trace -
workflow-method-4bae6cc9-5d85-4ed5-beae-6c189fd5a45-9796b1b1-fdc8-4199-a6e8-0bc2438a2088: (BLOCKED on Feature.get)
io.temporal.internal.sync.CompletablePromiseImpl.getImpl(CompletablePromiseImpl.java:84)
io.temporal.internal.sync.CompletablePromiseImpl.get(CompletablePromiseImpl.java:74)
io.temporal.internal.sync.ActivityStubBase.execute(ActivityStubBase.java:44)
io.temporal.internal.sync.ActivityInvocationHandler.lambda$getActivityFunc$0(ActivityInvocationHandler.java:77)
io.temporal.internal.sync.ActivityInvocationHandler$$Lambda$1192/0x0000000100f6c840.apply(Unknown Source)
io.temporal.internal.sync.ActivityInvocationHandlerBase.invoke(ActivityInvocationHandlerBase.java:70)
com.sun.proxy.$Proxy141.canvertToXYZRequest(Unknown Source)
com.workflow.impl.MyWorkflowImpl.executeWorkflow(MyWorkflowImpl.java:102)
java.base@11.0.11/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
java.base@11.0.11/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
java.base@11.0.11/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.base@11.0.11/java.lang.reflect.Method.invoke(Method.java:566)
io.temporal.internal.sync.POJOWorkflowImplementationFactory$POJOWorkflowImplementation$RootWorkflowInboundCallsInterceptor.execute(POJOWorkflowImplementationFactory.java:317)
io.temporal.common.interceptors.WorkflowInboundCallsInterceptorBase.execute(WorkflowInboundCallsInterceptorBase.java:37)
com.workflow.MyWorkflowInterceptor.execute(MyWorkflowInterceptor.java:20)
It’s hard to tell from the stack trace imo. This could be caused by non-deterministic code (see similar post here where user used system time), or it could be a workflow replay issue where a certain condition is not met for the Future.get to get unblocked (thread block on an un-complete promise).
Would also be worth trying the latest SDK release 1.8.1 to see if you get the same issue.
The stack trace posted is not an error stack trace. It is a thread dump of the workflow and it shows that this workflow is waiting for canvertToXYZRequest activity to complete.
I am getting this error like <1% of the times still. Rest of the time same code works perfectly fine.
SDK : java-sdk 1.22.3
I have a workflow which invokes 3-4 activities in following manner. Workflow completes when all activities are completed. Some time (1 out of 100) workflow stuks into running state due to one or more of the actives get stuck. No error in console, but when check temporal UI stack trace section , I seen error as mentioned below.
Workflow:
WorkflowClient.start(workflow::postFlow, param1)
OR
return workflow.postFlow(param1)
Activities Invocation:
static activityOptions () {
return ActivityOptions.newBuilder()
.setRetryOptions(RetryOptions.newBuilder()
.setBackoffCoefficient(2)
.setMaximumInterval(Duration.ofMinutes(30))
.setInitialInterval(Duration.ofSeconds(5))
.setDoNotRetry(noRetry)
.build())
.setTaskQueue(queueName)
.setScheduleToCloseTimeout(Duration.ofHours(48)).build();
}
var activity1 = Async.function(
initActivity(Activity1.class, activityOptions(ACTIVITY1_QUEUE, EXCEPTION_CLASS)
)::performActivity1, param1, param2, param3);
var activity2 = Async.function(
initActivity(Activity2.class, activityOptions(ACTIVITY2_QUEUE)
)::performActivity2, param1, param2);
.
.
.
var activityn = as above
Promise.allOf(actvity1, activity2 .... activityn).get()
or
new Response(actvity1.get(), actvity2.get() ...)
Activity optoins which I have specified in my post are the exact configuration of timeouts. But 95 percentile for any of the activity response time is within 5 sec. In some cases if there are any issues with backend APIs, then I have kept a margin and possiblity of 48 to address the issue and let temporal retry it.
Backoff Co : 2
Max Interval : 30 min
Init Interval : 5 sec
Schedule to close : 48 hours( for all activity)
It seems workflow is waiting for activity to complete. However, the activity has completed , example if tas needs to update db, even though in temporal is says ActivityTaskSchedules event, but it has updated DB. But the activity in temporal Ui when I check I don’t see ActivityTaskStarted & Completed events and only ActivityTaskScheduled is visible. And when I check stack trace it shows as below. Due to this workflow is in running state till activity times out after 48 hours.
To summarise, the business functionality is working fine. However, we are getting false positive due to activity state. How can I make is complete in case it shows stuck are there any configuration which are incorrect or we need some workaround to check activity status in case it shows stuck.
You didn’t specify StartToClose timeout. So it defaulted to ScheduleToClose, which is 48 hours. So on a worker restart or crash, the activity will timeout and then retry after 48 hours. So your activities are not stuck, they just wait for the timeout.
Thanks @maxim.
I thoughts its takes some default value for startToCloseTimout. I will specify that value. A question though, I read the document for defining both startToCloseTimeout and scheduledToClose timeout. But I couldn’t make out can I specify both and what should be ration.
I understood that, startToCloseTimeout is important in my case, which I need to correct but what should be the ratio of startToCloseTimeout vs scheduledToClose. Here I have schedules 48 hours for activity to timeout , which mean after that my activity fails and never retries. Since my activity execution takes 5-10 sec. Can I specify 1min or something as startToCloseTimeout.
Thanks once again in advance. This is really helpful information.
ScheduleToCloseTimeout is how long a workflow is willing to wait for an activity to complete in the presence of failures and retries. StartToCloseTimeout is the maximum time a single activity invocation can take. If an activity can run for a long time we recommend specifying the HeartbeatTimeout and implementing heartbeating from the activity.
The problem with pending activity is resolved by specified startToCloseTimeout. However, I have see a new issue for some 9 workflow out of ~3.5k workflows run.
Issue is, the activity is attempted twice with first one throwing ‘activity StartToClose timeout’ error. Upon next attempt I get duplicate key issue as the activity is to insert a record in db for given id. Can you please help me understand is there something I can do in configuration to avoid retry for same id or what does this means in general.
Apparently the data is inserted in first attempt itself, but server is not notified of the completion hence second time there is error for duplicate key, which is obvious.
Your activity implementation should be idempotent as in a distributed system there is always a possibility of this activity running more than once unless you disable retries. But when retries are disabled, there is the possibility that the activity never executes.