The issue we are facing is that when a worker dies (or is taken out of load balancing), the next activities are not picked up by any other worker (as they should be), and the workflow eventually fails due to a timeout on the activity that was never picked up.
We would like a mechanism to recover from that automatically, as it’s something that could happen fairly often. Any ideas? We were thinking about some kind of reset of the workflow when the timeout is reached, but we don’t know if it’s feasible or not.
Specify a short ScheduleToStart timeout for these activities. When a worker dies, the activity tasks are not picked up within this timeout and fail. The workflow can then redispatch them to another worker.
We had already set a ScheduleToStart timeout for these activities, but the workflow isn’t redispatched to another worker and it fails.
We’re using files that were downloaded to the worker by activities executed earlier on its task queue, so if the activity were redispatched to a different worker it wouldn’t have the files. That’s why we wanted to reset the workflow from the start, so it can be picked up by any worker in the pool.
We had already set a ScheduleToStart timeout for these activities, but the workflow isn’t redispatched to another worker and it fails.
If a workflow is not redispatched to another worker, it is going to time out, not fail. Are you sure that you handle the activity failure from this timeout appropriately?
True, we have set the timeout, but we are not handling it.
After some research, it looks like TimeoutFailure with timeoutType TIMEOUT_TYPE_START_TO_CLOSE is what is thrown, so we would like to retry the whole workflow on that (correct?). In the samples, it’s retried for any error, and I don’t see a way to retry just on a specific exception plus a field value (timeoutType).
Why do you want to retry the whole workflow on a single activity failure? Have you considered increasing that particular activity timeout to avoid it failing?
Following the Java samples provided, we’re using task queues, so Activity2 and Activity3 are executed in the same worker.
The issue we’re trying to avoid is that if a worker dies, all the workflows running on that host time out.
Our understanding is that we would need to re-run the whole workflow on a different worker, otherwise it won’t have the file downloaded in Activity1.
The Java sample retries on any failure, but in our use case we would only want to retry on TimeoutFailure with timeoutType TIMEOUT_TYPE_START_TO_CLOSE, right?
The issue we’re trying to avoid is that if a worker dies, all the workflows running on that host time out.
Let’s be precise about the terminology. Workflows don’t run on a specific worker, so if a worker dies, workflows are not affected. The activities running on that worker will time out, and you want to retry the whole sequence on a different host, as the example demonstrates.
The main timeout you want to see on the host-specific task queue is SCHEDULE_TO_START, as it ensures that an activity task is not going to get stuck in the queue for long if the host is down. I highly recommend reading the blog post (or watching the associated video) that explains activity timeouts in detail.
The Java sample retries on any failure, but in our use case we would only want to retry on TimeoutFailure with timeoutType TIMEOUT_TYPE_START_TO_CLOSE, right?
I agree that Workflow.retry makes it hard to retry on an exception which is not the top-level one but is chained to ActivityFailure. The workaround is to rethrow the cause:
Wouldn’t this snippet do the opposite of what we need? It will retry the workflow on anything but a TimeoutFailure (and it also doesn’t take the TIMEOUT_TYPE_SCHEDULE_TO_START type into account).
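For what it’s worth, here is a minimal sketch of the behavior being asked for: restart the whole sequence only when the failure is caused by a SCHEDULE_TO_START timeout, and let every other failure propagate. It deliberately does not use the Temporal SDK. `TimeoutKind`, `TimeoutError`, `ActivityError` and `HostRetrySketch` are hypothetical stand-ins for the SDK’s timeout type, TimeoutFailure and ActivityFailure, so treat it as an illustration of the control flow rather than as working workflow code.

```java
import java.util.function.IntFunction;

// Hypothetical stand-in for the SDK's timeout type; only the two values discussed here.
enum TimeoutKind { SCHEDULE_TO_START, START_TO_CLOSE }

// Hypothetical stand-in for TimeoutFailure: a timeout that carries its type as a field.
class TimeoutError extends RuntimeException {
    final TimeoutKind kind;

    TimeoutError(TimeoutKind kind) {
        super("activity timed out: " + kind);
        this.kind = kind;
    }
}

// Hypothetical stand-in for ActivityFailure: the wrapper whose cause is the timeout.
class ActivityError extends RuntimeException {
    ActivityError(Throwable cause) {
        super("activity failed", cause);
    }
}

final class HostRetrySketch {
    // True only for an ActivityError whose direct cause is a SCHEDULE_TO_START timeout,
    // i.e. the task sat in the host-specific queue and no worker picked it up.
    static boolean isHostUnavailable(Throwable t) {
        return t instanceof ActivityError
                && t.getCause() instanceof TimeoutError
                && ((TimeoutError) t.getCause()).kind == TimeoutKind.SCHEDULE_TO_START;
    }

    // Runs the whole sequence (download, process, upload) from the start.
    // It starts over only when the host looks dead; any other failure, or the
    // failure of the last allowed attempt, is rethrown unchanged.
    // At least one attempt always runs, even if maxAttempts is less than 1.
    static <T> T runWithHostRetry(int maxAttempts, IntFunction<T> sequence) {
        for (int attempt = 1; ; attempt++) {
            try {
                return sequence.apply(attempt);
            } catch (RuntimeException e) {
                if (!isHostUnavailable(e) || attempt >= maxAttempts) {
                    throw e;
                }
            }
        }
    }
}
```

In a real workflow the same check would presumably be made on the cause of the ActivityFailure and its timeoutType, and each new attempt would begin with the download activity on the shared task queue so that a healthy host gets selected again.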