I followed the example here https://github.com/temporalio/samples-java/blob/master/src/main/java/io/temporal/samples/hello/HelloCron.java
The cron runs every minute. I changed the execution and run timeouts to this:
.setWorkflowExecutionTimeout(Duration.ofMinutes(5))
.setWorkflowRunTimeout(Duration.ofSeconds(10))
I let it run for 2 minutes and then killed the worker, and here is what I got
How can I detect that the run timeout happened and raise an alert? And what if I want to alert only when 2 consecutive runs have failed or timed out?
Thank you in advance!
samar
November 13, 2020, 12:24am
We emit a workflow_timeout metric tagged with namespace. Unfortunately this is not hooked up for cron workflows yet. Issue #397 tracks this improvement. Once it is fixed you should be able to use this metric to track cron workflows timing out.
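For the "alert only when 2 consecutive runs failed/timed out" part of the question, the counting logic itself is small. The following is a minimal sketch in plain Java, not part of the Temporal SDK: the class name ConsecutiveFailureAlert and its record method are hypothetical, and how each run's outcome is obtained (for example from the workflow_timeout metric once it covers cron workflows) is assumed to happen elsewhere.

```java
import java.util.List;

/**
 * Hypothetical helper, not a Temporal API: reports an alert condition
 * only after `threshold` consecutive failed or timed-out runs.
 */
final class ConsecutiveFailureAlert {
    private final int threshold;
    private int streak = 0; // current number of consecutive bad runs

    ConsecutiveFailureAlert(int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.threshold = threshold;
    }

    /** Records one run outcome; returns true while the alert condition holds. */
    boolean record(boolean runSucceeded) {
        // A successful run resets the streak; a failure or timeout extends it.
        streak = runSucceeded ? 0 : streak + 1;
        return streak >= threshold;
    }

    public static void main(String[] args) {
        ConsecutiveFailureAlert alert = new ConsecutiveFailureAlert(2);
        // Assumed outcomes of five cron runs: ok, timeout, ok, timeout, timeout.
        List<Boolean> outcomes = List.of(true, false, true, false, false);
        for (boolean ok : outcomes) {
            System.out.println(alert.record(ok));
        }
    }
}
```

With a threshold of 2, a single timed-out run followed by a successful one never raises the alert; only the second failure in a row does. The same condition could instead be expressed in the alerting rules of whichever metrics backend is in use.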
Thank you very much, samar.
Is there any documentation on how to use the metrics to set up monitoring and alerts?
Or should I just learn one of the systems listed here? https://docs.temporal.io/docs/configure-temporal-server/#metrics
samar
November 13, 2020, 5:15pm
Lots of users are running Temporal with Prometheus as the metrics backend. Please refer to our helm-charts for documentation on setting it up. If you search the forum you will find a lot of information on this topic. A few examples:
Basically, your question is how to configure "bring your own Prometheus".
I am not too sure, but each component/role has a prometheus section, right? Would you not be able to provide the Prometheus endpoint there?
frontend:
  # replicaCount: 1
  service:
    type: ClusterIP
    port: 7233
  metrics:
    annotations:
      enabled: true
    serviceMonitor: {}
    prometheus: {}  # HERE GOES YOUR STUFF??
    # enabled: false
Can you offer some guidance on using Prometheus + Grafana to monitor Temporal? Some topics that would be useful:
what are the top tier metrics that must be monitored
a brief high level description of what each top-tier metric is measuring
what are the units being displayed
real-world experience on correlating user-visible issues to metrics that can help diagnose those issues
Any help on this topic would be much appreciated!
I am looking for the following metrics to alarm/monitor on for my workflows:
Workflow timeouts
Activity timeouts
Current task queue size (to alarm/autoscale when the backlog is large)
I was not able to find these among the client metrics. I am guessing they are available as part of the server metrics, hopefully broken down by namespace. Could anyone point me to the correct metrics and any documentation available for server-emitted metrics?
Hi, there is not much documentation around metrics, and the statsd one, which has since been removed, seemed very complex too.
I want to understand:
a) What metrics does Temporal expose by default?
b) Are the metrics namespace specific?
c) Can I get queue / task list specific metrics?
d) How do I consume them in Prometheus?
e) If I am to develop custom metrics, what is the best way? Should those be activities in workflows, or interceptors?
We are using Datadog. Is there any support for that?
A Google search shows nothing, and searching for Datadog in the forum returned nothing.
I’ll ask a separate question.