Hello. I have started using the schedules feature and have found it really beneficial. I want some observability on it though, preferably metrics that expose each schedule and its latest status or last successful run time. In essence I want to be alerted if a specific schedule fails, or stays running for longer than necessary. I found schedule_action_success in the metrics, but it doesn’t seem to have any labels that would allow me to check a single schedule. I am using the Python SDK. What can I do to achieve this?
Hi Marya,
I understand you mean the workflow started by the schedule. In addition to the schedule metrics, you can use a combination of server metrics and custom metrics.
in essence I want to be alerted if a specific schedule fails,
For this you can use the server metrics workflow_failed and workflow_terminated; both have a workflow_type tag.
or stays running for longer than necessary
One option is to use the visibility API to query for workflows started by schedule XYZ more than X hours ago (ExecutionStatus="Running" and TemporalScheduledById="XYZ" and StartTime<="$now -5h") and publish a custom metric from the result.
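A minimal sketch of that query from Python, assuming `client` is a connected temporalio.client.Client and that the "$now -5h" placeholder is replaced by an RFC 3339 timestamp. The helper names stuck_query and count_stuck are hypothetical; list_workflows is the SDK's visibility listing call.

```python
import asyncio
from datetime import datetime, timedelta, timezone


def stuck_query(schedule_id: str, max_age: timedelta, now: datetime) -> str:
    """Build a visibility query for runs of one schedule still running past max_age."""
    cutoff = (now - max_age).astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return (
        f'ExecutionStatus="Running" and TemporalScheduledById="{schedule_id}" '
        f'and StartTime<="{cutoff}"'
    )


async def count_stuck(client, schedule_id: str, max_age: timedelta) -> int:
    """Count matching workflows; `client` is assumed to be a temporalio.client.Client."""
    query = stuck_query(schedule_id, max_age, datetime.now(timezone.utc))
    count = 0
    # list_workflows returns an async iterator over matching executions.
    async for _ in client.list_workflows(query):
        count += 1
    return count
```

The returned count could be run periodically and published as a gauge per schedule ID, then alerted on whenever it is above zero.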
Or create a timer when the workflow starts and send an alert (from an activity or local activity) if the timer fires. The issue with this approach is that the code after the timer might not be executed if the workflow is stuck, for example on a non-determinism error (NDE).
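The timer idea can be sketched with plain asyncio, which is what Python workflow code is written in. This assumes that asyncio.sleep inside a workflow becomes a durable timer and that asyncio.create_task is usable there; run_with_watchdog and alert are hypothetical names, and in a real workflow alert would wrap an activity or local activity call.

```python
import asyncio
from typing import Awaitable, Callable


async def run_with_watchdog(
    work: Awaitable,
    limit_seconds: float,
    alert: Callable[[], Awaitable[None]],
):
    """Await `work`; if it is still running after limit_seconds, call `alert`."""

    async def watchdog() -> None:
        await asyncio.sleep(limit_seconds)  # the timer
        await alert()  # only reached if work has not finished in time

    timer = asyncio.create_task(watchdog())
    try:
        return await work
    finally:
        timer.cancel()  # stop the watchdog once work has finished or failed
```

The caveat above still applies: if the workflow cannot make progress at all, the watchdog code does not run either, so the visibility query is the more robust check.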
Additionally, you can always create custom metrics from workflow and activity code, based on your needs. In Python, use workflow.metric_meter() and activity.metric_meter().
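As a rough sketch, a per-schedule counter could look like this. The metric name, the label names, and the record_schedule_run helper are assumptions for illustration; `meter` is assumed to be the object returned by activity.metric_meter() or workflow.metric_meter().

```python
def record_schedule_run(meter, schedule_id: str, status: str) -> None:
    """Increment a counter labelled with the schedule ID and run outcome."""
    counter = meter.create_counter(
        "schedule_workflow_runs",  # hypothetical metric name
        "Runs of schedule-started workflows by outcome",
        "1",
    )
    # The attributes become labels, so alerts can target a single schedule.
    counter.add(1, {"schedule_id": schedule_id, "status": status})
```

Inside an activity this might be called as record_schedule_run(activity.metric_meter(), "XYZ", "failed"), with the schedule ID passed in as workflow input or read from the workflow's search attributes.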
Antonio