Advice on Handling Complex Workflow Failures in Temporal

Hi everyone,

I have been experimenting with building workflows that involve multiple dependent activities. While the basics are working fine, I am having trouble handling failures in a clean and efficient way. When an activity fails, I want to retry it a certain number of times before moving on, but I also need to ensure that the overall workflow does not get stuck.

I have been reading through the docs and trying out different retry policies, but I still feel like I am missing something practical. Do most developers here rely on custom error handling logic, or do you let Temporal's built-in retry mechanisms handle the heavy lifting? Also, how do you usually debug tricky scenarios when retries succeed but cause unexpected side effects?

I am also preparing for a CCSP course, so I would like to know whether any best practices overlap between workflow security and cloud security. I have also checked the thread "Breakpoints in @workflow.run Not Triggering in Temporal Python SDK (but work in Activities)" but still need advice.

Thank you.

The majority of developers rely on built-in activity retries when they need to retry individual activities.

If you need to retry a sequence of activities as a unit, the retry loop usually lives in the workflow code, or the sequence is moved into a child workflow.

I want to retry it a certain number of times before moving on but I also need to ensure that the overall workflow does not get stuck.

This should be pretty straightforward with the Temporal SDKs. Is this a general question, or do you have a specific issue with it?

Also, how do you usually debug tricky scenarios when retries succeed but cause unexpected side effects?

Your activities have to be idempotent. I don't think there is a general approach to ensuring idempotency or troubleshooting issues with it.
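One common pattern, sketched here in plain Python without the Temporal SDK, is to attach an idempotency key to each side effect and skip the work when that key has already been recorded. The `IdempotencyLedger` class and the key format are hypothetical. In a real activity the key could plausibly be built from `activity.info().workflow_id` and `activity.info().activity_id`, and the in-memory dict would be a durable store with an atomic insert.

```python
from typing import Callable, Dict, List


class IdempotencyLedger:
    """In-memory stand-in for a durable store of completed operations."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}

    def run_once(self, key: str, action: Callable[[], str]) -> str:
        # If this key was already processed, return the stored result
        # instead of repeating the side effect.
        if key in self._results:
            return self._results[key]
        result = action()
        self._results[key] = result
        return result


ledger = IdempotencyLedger()
calls: List[str] = []


def charge() -> str:
    # Hypothetical side effect; each call would bill the customer.
    calls.append("charge")
    return "receipt-1"


first = ledger.run_once("order-42/charge", charge)
second = ledger.run_once("order-42/charge", charge)  # simulated retry

print(first, second, len(calls))
```

The simulated retry returns the stored result and the side effect runs only once. The sketch is not safe under concurrent callers, since the check and the write are separate steps.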

I want to know if any best practices overlap between workflow security & cloud security.

Do you have a specific question about these?