How to monitor ScheduleToStart latency

Temporal Server self-hosted production deployment | Temporal Documentation recommends monitoring ScheduleToStart latency. Looking into the Temporal metrics in Prometheus, I see two metrics: temporal_activity_schedule_to_start_latency_bucket and temporal_workflow_task_schedule_to_start_latency_bucket. I am not sure whether these are the right metrics to monitor. When I run a maru scenario on our cluster, these metrics do not reflect the actual workflows and activities triggered. Are these the right metrics to represent ScheduleToStart latency, or am I missing something here?

On further investigation, I see that the ScheduleToStart latency metrics are reported by the client and worker. To expose them via Prometheus (from our worker and client services) we need to customise the Scope while initialising the client - samples-go/main.go at master · temporalio/samples-go · GitHub. Will try this.

Hi,
I am trying to achieve the same using the Java SDK. I tried to initialize the metrics scope as below in the sample example HelloActivity.java, but I was still unable to find the ScheduleToStart latency metric in the Prometheus server, which is deployed using the Temporal Helm chart.

// Micrometer registry that exposes metrics in Prometheus format
PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
StatsReporter reporter = new MicrometerClientStatsReporter(registry);

// Root metrics scope that the Temporal SDK reports into every second
Scope scope =
    new RootScopeBuilder()
        .reporter(reporter)
        .reportEvery(com.uber.m3.util.Duration.ofSeconds(1));

// Get a Workflow service stub with the metrics scope attached.
WorkflowServiceStubs service =
    WorkflowServiceStubs.newInstance(
        WorkflowServiceStubsOptions.newBuilder()
            .setMetricsScope(scope)
            .build());

I believe there needs to be additional configuration in the Helm deployment to indicate that Prometheus metrics are coming from the client workflow/worker applications. Can someone help point me to the issue?

Take a look at the metrics sample that sets up SDK metrics (in both the worker and the starter), as well as an HTTP server that exposes the SDK metrics on an endpoint Prometheus can scrape. Let me know if it helps.
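
For reference, the piece of that sample that matters here is a small HTTP server serving the registry's scrape output. Below is a minimal sketch, assuming a made-up class name (MetricsHttpServer) and an arbitrary port; it is not the sample's exact code:

import com.sun.net.httpserver.HttpServer;
import io.micrometer.prometheus.PrometheusMeterRegistry;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class MetricsHttpServer {
  // Expose the Micrometer/Prometheus registry on http://<host>:<port>/metrics
  // so Prometheus can scrape the SDK metrics from the worker/starter process.
  public static HttpServer start(PrometheusMeterRegistry registry, int port) throws Exception {
    HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
    server.createContext(
        "/metrics",
        exchange -> {
          byte[] body = registry.scrape().getBytes(StandardCharsets.UTF_8);
          exchange.sendResponseHeaders(200, body.length);
          try (OutputStream os = exchange.getResponseBody()) {
            os.write(body);
          }
        });
    server.start();
    return server;
  }
}

Calling something like MetricsHttpServer.start(registry, 8001) from the worker's (and starter's) main gives Prometheus an endpoint to scrape, matching the worker/starter targets in the scrape config shown further down.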

The SDK metrics you are looking for are:
"temporal_workflow_task_schedule_to_start_latency"
and
"temporal_activity_schedule_to_start_latency"

(note that SDK metrics have a "temporal_" prefix by default)

@tihomir I referred to the above sample, but I am not able to see “temporal_activity_schedule_to_start_latency”; rather, I can see the other ones, i.e. “temporal_activity_schedule_to_start_latency_bucket”, “temporal_activity_schedule_to_start_latency_count”, and “temporal_activity_schedule_to_start_latency_sum”.
The same is the case for the workflow latency metrics.

In the worker application, I have initialized the WorkflowServiceStubs with the metrics scope set. But is there any additional configuration required in the Helm chart to indicate that Prometheus metrics are coming from the worker application?

Those buckets/counts/sums are created because the default reporting type is “histogram”. You could change it by providing a custom PrometheusConfig when you create a new registry. The way the sample uses it now:

PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

it gets the default HistogramFlavor.Prometheus. See for example here for a little more info on histograms.
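
For illustration only (the class name CustomPrometheusRegistry is made up, and whether you want a non-default flavor depends on your setup), a custom PrometheusConfig that overrides the histogram flavor could look roughly like this; HistogramFlavor.VictoriaMetrics is the only built-in alternative to the default in Micrometer's Prometheus registry:

import io.micrometer.prometheus.HistogramFlavor;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;

public class CustomPrometheusRegistry {
  // Sketch only: same registry creation as the sample, but with a custom
  // PrometheusConfig instead of PrometheusConfig.DEFAULT.
  public static PrometheusMeterRegistry create() {
    PrometheusConfig customConfig =
        new PrometheusConfig() {
          @Override
          public String get(String key) {
            return null; // fall back to defaults for properties not overridden below
          }

          @Override
          public HistogramFlavor histogramFlavor() {
            return HistogramFlavor.VictoriaMetrics;
          }
        };
    return new PrometheusMeterRegistry(customConfig);
  }
}

The registry returned by create() would then be passed to MicrometerClientStatsReporter exactly as in the earlier snippet.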

I believe that since you deploy your applications yourself, you need to make sure that the metrics they expose are on endpoints that Prometheus can scrape. In your Prometheus config you then need to set up all of these scrape points:
for the server metrics, but also for all the endpoints your applications (workers, starters, etc.) expose. For example, in your Prometheus config:

global:
  scrape_interval: 5s
  external_labels:
    monitor: 'temporal-monitor'
scrape_configs:
  - job_name: 'prometheus'
    metrics_path: /metrics
    scheme: http
    static_configs:
      - targets:
          - 'temporal:8000'
          - 'worker:8001'
          - 'starter:8002'

The above metrics are being populated by the Temporal server itself. When I enable the metrics endpoint on my worker client, the following metrics are generated (with “seconds” appended at the end):
temporal_workflow_task_schedule_to_start_latency_seconds
temporal_activity_schedule_to_start_latency_seconds
What is the difference between these and the ones you mentioned above?

Got the answer while looking at the tags on these metrics in Prometheus.

The “seconds” metrics are the ones tagged with my workflow’s task queue.

These should be SDK metrics, not server metrics. Yes, workflow/activity task latencies.


Hey Tihomir, am I right to assume that these SDK metrics’ (temporal_workflow_task_schedule_to_start_latency_seconds, temporal_activity_schedule_to_start_latency_seconds) histogram buckets are in seconds?

Yes, that’s my understanding as well.
