Alerting Strategy without requiring workflows to fail

I have a question about monitoring and alerting for workflows that either take too long or exceed a certain number of activity retries.

General Workflow Philosophy
My idea for designing workflows is to never let them fail, and instead let them retry forever in case of activity errors. In some cases a bounded number of retries might also be ok, but in such a scenario I would like to alert earlier, before the workflow actually fails.
A small number of retries per activity is considered ok. However, if an activity exceeds a defined threshold (e.g. takes longer than 30 minutes, needs more than 3 attempts, etc.), it should be considered a problem that triggers an alert and needs to be looked into.

Alerting on Problems in activities
There already exists a metric, activity_execution_failed, which counts how often a specific activity failed. It helps to see in general whether there are “more problems within a certain activity”, but it does not show whether it is just 1 faulty workflow that retried 100 times, or whether there are 25 workflows that each exceeded 4 retries. I also can’t see whether the activity of a certain workflow is still failing, or whether it eventually succeeded.

What I am generally interested in is the answer to the question “how many workflows have problems, and in which activity”, where “having a problem” is some threshold that can be defined per activity, without requiring the workflows to fail. Once the activity eventually succeeds, it should no longer count towards the monitoring.

For getting the number of affected workflows as a metric, is this something that should be obtained from Elasticsearch instead of the Prometheus metrics, or is there some other way?


if it takes longer than a defined threshold

You can monitor activity execution latencies via the SDK activity_execution_latency metric.
You can filter it by the workflow_type and activity_type properties if needed.

For retry attempts, you can build that into your activity code and produce a custom metric; see here for a simple Java SDK example. You can get the workflow type and activity type inside your activity code.
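
For illustration, a minimal sketch of what that could look like inside an activity implementation (the activity interface, method name, and metric name are hypothetical; the attempt count and types come from the activity’s ActivityInfo):

```java
import com.uber.m3.tally.Scope;
import io.temporal.activity.Activity;
import io.temporal.activity.ActivityExecutionContext;
import io.temporal.activity.ActivityInfo;

import java.util.Map;

public class PaymentActivitiesImpl implements PaymentActivities {

  // Hypothetical threshold after which an attempt is considered "alertworthy".
  private static final int ALERT_ATTEMPT_THRESHOLD = 3;

  @Override
  public void chargeCustomer(String orderId) {
    ActivityExecutionContext ctx = Activity.getExecutionContext();
    ActivityInfo info = ctx.getInfo();

    // getAttempt() starts at 1 for the first execution.
    if (info.getAttempt() > ALERT_ATTEMPT_THRESHOLD) {
      // Emit a custom counter tagged with workflow and activity type so it can be
      // grouped in Prometheus/Grafana. Depending on the SDK version, the metrics
      // scope may already carry these tags.
      Scope scope =
          ctx.getMetricsScope()
              .tagged(
                  Map.of(
                      "workflow_type", info.getWorkflowType(),
                      "activity_type", info.getActivityType()));
      scope.counter("activity_attempts_exceeded").inc(1);
    }

    // ... actual activity logic that may throw and be retried ...
  }
}
```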

Another important SDK metric for activities is activity_schedule_to_start_latency, which can indicate possible issues with your workers being able to process activity tasks.

You can also use the SDK client APIs to, for example, get a list of running workflow executions that have pending activities with a retry count > x. Here is a small example, again using the Java SDK, if it helps.
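
A minimal sketch of that approach, assuming a local Temporal server, the “default” namespace, an advanced visibility store for the ExecutionStatus query, and a placeholder retry threshold (pagination is omitted for brevity):

```java
import io.temporal.api.workflow.v1.PendingActivityInfo;
import io.temporal.api.workflow.v1.WorkflowExecutionInfo;
import io.temporal.api.workflowservice.v1.DescribeWorkflowExecutionRequest;
import io.temporal.api.workflowservice.v1.DescribeWorkflowExecutionResponse;
import io.temporal.api.workflowservice.v1.ListWorkflowExecutionsRequest;
import io.temporal.api.workflowservice.v1.ListWorkflowExecutionsResponse;
import io.temporal.serviceclient.WorkflowServiceStubs;

public class PendingRetriesReport {

  public static void main(String[] args) {
    // Connects to a Temporal frontend on localhost:7233.
    WorkflowServiceStubs service = WorkflowServiceStubs.newLocalServiceStubs();
    String namespace = "default";
    int retryThreshold = 3; // hypothetical "alertworthy" attempt count

    // List currently running workflow executions via a visibility query.
    ListWorkflowExecutionsResponse listResponse =
        service
            .blockingStub()
            .listWorkflowExecutions(
                ListWorkflowExecutionsRequest.newBuilder()
                    .setNamespace(namespace)
                    .setQuery("ExecutionStatus='Running'")
                    .build());

    for (WorkflowExecutionInfo executionInfo : listResponse.getExecutionsList()) {
      // Describe each execution to inspect its pending activities and attempts.
      DescribeWorkflowExecutionResponse description =
          service
              .blockingStub()
              .describeWorkflowExecution(
                  DescribeWorkflowExecutionRequest.newBuilder()
                      .setNamespace(namespace)
                      .setExecution(executionInfo.getExecution())
                      .build());

      for (PendingActivityInfo pending : description.getPendingActivitiesList()) {
        if (pending.getAttempt() > retryThreshold) {
          System.out.printf(
              "workflow=%s activity=%s attempt=%d%n",
              executionInfo.getType().getName(),
              pending.getActivityType().getName(),
              pending.getAttempt());
        }
      }
    }
  }
}
```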

For getting the number of affected workflows as a metric, is this something that should be obtained from Elasticsearch instead of the Prometheus metrics, or is there some other way?

I don’t think it’s possible to filter metrics by the current workflow status, so if this is a requirement, I think using the List APIs (visibility data) as shown would be the way to go.


Thanks @tihomir for the great answer.

What I envision would be a dashboard showing the following:

  • number of “unhealthy” workflows over time (that have activities which exceed a certain threshold) or number of “unhealthy” activities over time
  • grouped by WorkflowType/ActivityType

(Custom) SDK metrics seem to be a good choice when one is interested in finding out the number of activity retries exceeding an “alertworthy” threshold within e.g. the last hour.

The List API approach seems to be a better method to get metrics about the overall health of workflows (meaning I could get the total number of affected workflows). However, I’m a bit worried whether such an enumerative approach could hurt performance?

I also found out about Advanced Visibility features like List Filters, which are backed by Elasticsearch.
Is it possible and recommended to set custom search attributes on workflows whose activities need too many retries? And for that, should I integrate Elasticsearch directly into Grafana?
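
To make the question more concrete, here is roughly what I have in mind (just a sketch, assuming a registered custom Bool search attribute named NeedsAttention, hypothetical OrderWorkflow/PaymentActivities types, and bounded activity retries so the workflow code gets control back and can flag itself via Workflow.upsertSearchAttributes):

```java
import io.temporal.activity.ActivityOptions;
import io.temporal.common.RetryOptions;
import io.temporal.failure.ActivityFailure;
import io.temporal.workflow.Workflow;

import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

public class OrderWorkflowImpl implements OrderWorkflow {

  // Cap retries per "batch" so the workflow regularly gets control back.
  private final PaymentActivities activities =
      Workflow.newActivityStub(
          PaymentActivities.class,
          ActivityOptions.newBuilder()
              .setStartToCloseTimeout(Duration.ofMinutes(1))
              .setRetryOptions(RetryOptions.newBuilder().setMaximumAttempts(3).build())
              .build());

  @Override
  public void processOrder(String orderId) {
    while (true) {
      try {
        activities.chargeCustomer(orderId);
        // Clear the flag once the activity eventually succeeds.
        setNeedsAttention(false);
        return;
      } catch (ActivityFailure e) {
        // 3 attempts exhausted: mark this workflow as needing attention via the
        // custom search attribute (must be registered with the cluster), then
        // back off and retry the activity again.
        setNeedsAttention(true);
        Workflow.sleep(Duration.ofMinutes(1));
      }
    }
  }

  private void setNeedsAttention(boolean value) {
    Map<String, Object> attrs = new HashMap<>();
    attrs.put("NeedsAttention", value);
    Workflow.upsertSearchAttributes(attrs);
  }
}
```

Such a flag could then presumably be queried with a List Filter like NeedsAttention = true to count the currently “unhealthy” workflows.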