I’m confused. How would the TypeScript SDK be able to log when a workflow times out and is killed by the server, especially if it’s hung or a worker has crashed?
What is the problem you are trying to solve? When a workflow fails or times out, this is reflected in the workflow execution history. Why do you need this logged?
I’d like to write a Datadog alert so that we know when specific workflows fail or time out.
For context, we have workflows that run as importers for any number of tenants. Once a tenant defines them, they run on cron schedules in Temporal. So once they start running, a timeout or failure is always an error we should know about.
We recently ran into an issue where one tenant’s import workflow failed to finish within the startToClose timeout, and it would just hang when Temporal subsequently retried it, including when the cron schedule started a fresh workflow.
Either way, we’re trying to make sure that we can alert our on-call engineer if an import workflow fails or times out.
Workflow timeouts should not be used for business logic. They are a protection mechanism against runaway resource usage. So in your case I would recommend replacing the workflow timeout with a timer inside the workflow that fails the workflow when it fires. Then you can change your workflow to either emit metrics or notify you in any other way on failure. If you want a more generic solution, put this logic into an interceptor.
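The timer approach described above can be sketched roughly as follows. This is a minimal illustration, not SDK code: `DeadlineExceededError` and `runWithDeadline` are hypothetical names, and the `sleep` function is injected so the sketch stays self-contained. Inside a Temporal workflow you would presumably pass the durable `sleep` from `@temporalio/workflow` instead of a plain `setTimeout` wrapper.

```typescript
// Hypothetical error type marking that the workflow's own deadline fired.
class DeadlineExceededError extends Error {
  constructor(
    public readonly workflowName: string,
    public readonly deadlineMs: number,
  ) {
    super(`${workflowName} exceeded its ${deadlineMs} ms deadline`);
    this.name = 'DeadlineExceededError';
  }
}

// Races the workflow body against a timer. If the timer wins, the returned
// promise rejects with DeadlineExceededError, so the failure surfaces in
// workflow code where it can be caught, logged, or counted.
// `sleep` is injected (assumption): in a workflow it would be a durable timer.
async function runWithDeadline<T>(
  workflowName: string,
  deadlineMs: number,
  work: () => Promise<T>,
  sleep: (ms: number) => Promise<void>,
): Promise<T> {
  const deadline = sleep(deadlineMs).then((): never => {
    throw new DeadlineExceededError(workflowName, deadlineMs);
  });
  return Promise.race([work(), deadline]);
}
```

Note that `Promise.race` does not stop the losing side. In real workflow code the SDK's cancellation scopes may be a better fit, since they can also cancel the work that is still in flight.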
I’m a bit confused… I just want to know when a workflow fails or times out, and what the workflow name was. I don’t want to trigger any business logic when that occurs.
Another question: it looks like interceptors run on the worker. Is that correct? I was hoping for a way to detect timeouts and failures even if the worker crashes or workers become unavailable, but if that’s not possible I’ll just use interceptors.
Do you have any pointers towards which call I would intercept to determine when a workflow has timed out?
In this case, metrics for timed-out and terminated workflows are already reported by the Temporal service (or Temporal Cloud), so you can import them into Datadog and set any alerts you want.
@maxim thank you for all of the help so far. Is there any other way to get failures by workflow name from the logs or metrics? If not, is there any chance your team would be willing to add workflow name as a dimension of the metrics?
So, if you really need to act when a specific workflow fails, then using try-catch or an interceptor is the way to go. Don’t use workflow timeout, but use a timer within the workflow code to fail it after the specified interval.
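As a rough sketch of the try-catch variant, the wrapper below reports the workflow name when the body throws and then rethrows so the workflow still fails. Everything here is illustrative: `WorkflowFailureEvent`, `toFailureEvent`, `reportFailures`, and the `emit` callback are hypothetical names, and how `emit` reaches Datadog (a structured log line, a metric, an activity call) is left as an assumption.

```typescript
// Hypothetical event shape that a log or metric based alert could match on.
interface WorkflowFailureEvent {
  event: 'workflow_failed';
  workflowName: string;
  reason: string;
}

// Builds the event from whatever was thrown, keeping the workflow name
// as an explicit field so alerts can filter on it.
function toFailureEvent(workflowName: string, err: unknown): WorkflowFailureEvent {
  const reason = err instanceof Error ? `${err.name}: ${err.message}` : String(err);
  return { event: 'workflow_failed', workflowName, reason };
}

// Runs the workflow body, reports any failure through `emit`, and rethrows
// so the failure is still recorded in the workflow execution history.
async function reportFailures<T>(
  workflowName: string,
  body: () => Promise<T>,
  emit: (event: WorkflowFailureEvent) => void,
): Promise<T> {
  try {
    return await body();
  } catch (err) {
    emit(toFailureEvent(workflowName, err));
    throw err;
  }
}
```

The same wrapping could live in a workflow interceptor so that every workflow type gets it without code changes. Because it runs on the worker, it cannot see failures that happen while no worker is available; those cases are only visible through the service-side metrics mentioned earlier.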