There are failure scenarios in which an activity task is lost before it starts executing on the worker. Temporal has a hard guarantee that the activity task cannot be lost before the timer for its attempt has started. So timeouts are the way to deal with all sorts of activity task delivery and activity worker failures.
The solution in your case is either to lower the StartToClose timeout to some shorter value or specify a short Heartbeat timeout. If you specify the heartbeat timeout, then the activity must call the heartbeat periodically without exceeding the specified timeout.
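To make the trade-off concrete, here is a minimal sketch of when each timeout would fire. It is a simplified model for illustration, not Temporal's implementation: the function name `timeout_fires_at` and its parameters are hypothetical, and it assumes the heartbeat timer restarts on every recorded heartbeat and starts at the beginning of the attempt.

```python
from datetime import timedelta
from typing import Optional


def timeout_fires_at(
    start_to_close: timedelta,
    heartbeat_timeout: Optional[timedelta] = None,
    last_heartbeat_at: timedelta = timedelta(0),
) -> timedelta:
    """Offset from the attempt start at which this simplified model times out.

    last_heartbeat_at is the offset of the last heartbeat the server saw
    (zero means no heartbeat was recorded after the attempt started).
    """
    deadline = start_to_close
    if heartbeat_timeout is not None:
        # Assumption: the heartbeat timer is measured from the last heartbeat.
        deadline = min(deadline, last_heartbeat_at + heartbeat_timeout)
    return deadline


one_hour = timedelta(hours=1)

# Worker dies silently; with only StartToClose, the loss surfaces after 1 hour.
without_heartbeat = timeout_fires_at(one_hour)

# Same failure, last heartbeat seen at 4m50s, heartbeat timeout of 30s.
with_heartbeat = timeout_fires_at(
    one_hour,
    heartbeat_timeout=timedelta(seconds=30),
    last_heartbeat_at=timedelta(minutes=4, seconds=50),
)
```

In this model the lost attempt is noticed at 5m20s with the heartbeat timeout, versus a full hour with StartToClose alone.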
@maxim thank you for the response. Could you kindly elaborate on the scenarios where an activity task can end up in this state? It seems quite random, and we did not notice it when there was not much load on the workers.
With regard to the potential solutions:
Having a short StartToClose timeout is not feasible: we cannot really predict how long the activity will run, but we want a hard stop at 1 hour.
Heartbeats seem like a good way to indicate whether an activity is up and running. However, in our case the activity makes a blocking call that runs for a varying amount of time. If we set a short heartbeat timeout, it will expire while the activity is inside the blocking step.
Is there any technique we can use in our scenario?
In this case, you can heartbeat from a different thread. Usually, we don’t recommend doing this as it does not detect the stuck main activity thread, but in your case, it is the only option.
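A minimal sketch of that pattern, using only the Python standard library. The helper `run_with_background_heartbeat` and its parameters are hypothetical names for illustration; `heartbeat` stands in for whatever the SDK in use provides to record a heartbeat, and how that call is made safely from a second thread depends on the SDK.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_background_heartbeat(
    blocking_call: Callable[[], T],
    heartbeat: Callable[[], None],
    interval_seconds: float,
) -> T:
    """Run blocking_call while a helper thread calls heartbeat periodically."""
    stop = threading.Event()

    def beat() -> None:
        # Event.wait returns False on timeout and True once stop is set,
        # so the loop ends promptly when the blocking call finishes.
        while not stop.wait(interval_seconds):
            heartbeat()

    worker = threading.Thread(target=beat, daemon=True)
    worker.start()
    try:
        return blocking_call()
    finally:
        # Stop heartbeating whether the call returned or raised.
        stop.set()
        worker.join()
```

The interval should be comfortably shorter than the configured heartbeat timeout. As noted above, this only shows that the process is alive: if the blocking call hangs, heartbeats continue, so the 1 hour StartToClose timeout remains the hard stop.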
Could you kindly elaborate on the scenarios where an activity task can end up in this state?
There are many scenarios. For example, the activity implementation function is invoked, and the process crashes before it makes any progress.