How to monitor ScheduleToStart latency

Temporal Server self-hosted production deployment | Temporal Documentation recommends monitoring ScheduleToStart latency. Looking into the Temporal metrics in Prometheus, I see two metrics: temporal_activity_schedule_to_start_latency_bucket and temporal_workflow_task_schedule_to_start_latency_bucket. I am not sure whether these are the right metrics to monitor. When I run a maru scenario on our cluster, these metrics do not reflect the actual workflows and activities triggered. Are these the right metrics to represent ScheduleToStart latency, or am I missing something here?

On further investigation, I see that the ScheduleToStart latency metrics are reported by the client and worker. To expose them via Prometheus (from our worker and client services) we need to customise the Scope while initialising the client - samples-go/main.go at master · temporalio/samples-go · GitHub. Will try this.

Hi,
I am trying to achieve the same using the Java SDK. I tried to initialize the metrics scope as below in the sample example HelloActivity.java, but I was still unable to find the ScheduleToStart latency metric in the Prometheus server, which is deployed using the Temporal Helm chart.

// Micrometer registry that exposes metrics in Prometheus format
PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
StatsReporter reporter = new MicrometerClientStatsReporter(registry);

// Root metrics scope that the Temporal SDK reports into every second
Scope scope =
    new RootScopeBuilder()
        .reporter(reporter)
        .reportEvery(com.uber.m3.util.Duration.ofSeconds(1));

// Get a Workflow service stub with the metrics scope attached.
WorkflowServiceStubs service =
    WorkflowServiceStubs.newInstance(
        WorkflowServiceStubsOptions.newBuilder()
            .setMetricsScope(scope)
            .build());

I believe there needs to be additional configuration in the Helm deployment to indicate that Prometheus metrics are coming from the client workflow/worker applications. Can someone help point me to the issue?

Take a look at the metrics sample that sets up SDK metrics (in both the worker and the starter), as well as an HTTP server that exposes the SDK metrics on an endpoint Prometheus can scrape. Let me know if it helps.
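
For reference, the piece of that sample that matters here is a small HTTP server serving the registry's scrape output. Below is a minimal sketch, assuming a made-up class name (MetricsHttpServer) and an arbitrary port; it is not the sample's exact code:

import com.sun.net.httpserver.HttpServer;
import io.micrometer.prometheus.PrometheusMeterRegistry;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class MetricsHttpServer {
  // Expose the Micrometer/Prometheus registry on http://<host>:<port>/metrics
  // so Prometheus can scrape the SDK metrics from the worker/starter process.
  public static HttpServer start(PrometheusMeterRegistry registry, int port) throws Exception {
    HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
    server.createContext(
        "/metrics",
        exchange -> {
          byte[] body = registry.scrape().getBytes(StandardCharsets.UTF_8);
          exchange.sendResponseHeaders(200, body.length);
          try (OutputStream os = exchange.getResponseBody()) {
            os.write(body);
          }
        });
    server.start();
    return server;
  }
}

Calling something like MetricsHttpServer.start(registry, 8001) from the worker's (and starter's) main gives Prometheus an endpoint to scrape, matching the worker/starter targets in the scrape config shown further down.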

The SDK metrics you are looking for are:
"temporal_workflow_task_schedule_to_start_latency"
and
"temporal_activity_schedule_to_start_latency"

(note that SDK metrics have a "temporal_" prefix by default)

@tihomir I referred to the above sample, but I am not able to see “temporal_activity_schedule_to_start_latency”; rather, I can see the other ones, i.e. “temporal_activity_schedule_to_start_latency_bucket”, “temporal_activity_schedule_to_start_latency_count”, and “temporal_activity_schedule_to_start_latency_sum”.
The same is the case for the workflow latency metrics.

In the worker application, I have initialized the WorkflowServiceStubs with the metrics scope set. But is there any additional configuration required in the Helm chart to indicate that Prometheus metrics are coming from the worker application?

Those buckets/counts/sums are created because the default reporting type is “histogram”. You could change it by providing a custom PrometheusConfig when you create a new registry. The way the sample uses it now:

PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

it gets the default HistogramFlavor.Prometheus. See for example here for a little more info on histograms.
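
For illustration only (the class name CustomPrometheusRegistry is made up, and whether you want a non-default flavor depends on your setup), a custom PrometheusConfig that overrides the histogram flavor could look roughly like this; HistogramFlavor.VictoriaMetrics is the only built-in alternative to the default in Micrometer's Prometheus registry:

import io.micrometer.prometheus.HistogramFlavor;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;

public class CustomPrometheusRegistry {
  // Sketch only: same registry creation as the sample, but with a custom
  // PrometheusConfig instead of PrometheusConfig.DEFAULT.
  public static PrometheusMeterRegistry create() {
    PrometheusConfig customConfig =
        new PrometheusConfig() {
          @Override
          public String get(String key) {
            return null; // fall back to defaults for properties not overridden below
          }

          @Override
          public HistogramFlavor histogramFlavor() {
            return HistogramFlavor.VictoriaMetrics;
          }
        };
    return new PrometheusMeterRegistry(customConfig);
  }
}

The registry returned by create() would then be passed to MicrometerClientStatsReporter exactly as in the earlier snippet.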

I believe that since you deploy your applications yourself, you need to make sure that the metrics they expose are on endpoints that Prometheus can scrape. In your Prometheus config you then need to set up all of these scrape points:
for the server metrics, but also for all the endpoints your applications (workers, starters, etc.) expose. For example, in your Prometheus config:

global:
  scrape_interval: 5s
  external_labels:
    monitor: 'temporal-monitor'
scrape_configs:
  - job_name: 'prometheus'
    metrics_path: /metrics
    scheme: http
    static_configs:
      - targets:
          - 'temporal:8000'
          - 'worker:8001'
          - 'starter:8002'

The above metrics are being populated by the Temporal server itself. When I enable the metrics endpoint on my worker client, the following metrics are generated (with “seconds” appended at the end):
temporal_workflow_task_schedule_to_start_latency_seconds
temporal_activity_schedule_to_start_latency_seconds
What is the difference between these and the ones you mentioned above?

Got the answer while looking at the tags on these metrics in Prometheus.

The “seconds” metrics are the ones tagged with my workflow’s task queue.

These should be SDK metrics, not server metrics. Yes, workflow/activity task latencies.


Hey Tihomir, am I right to assume that these SDK metrics’ (temporal_workflow_task_schedule_to_start_latency_seconds, temporal_activity_schedule_to_start_latency_seconds) histogram buckets are in seconds?

Yes, that’s my understanding as well.
