Question about getting some metrics using the SDK

Hello, I have a few metrics that I’m trying to get using the Java SDK. I’d like to know whether it’s possible to get these metrics, since I need them to perform additional calculations.

The metrics I need are:

  • schedule_to_start_latency
  • number of open workflows
  • available worker count
  • start_to_close_latency

If this is not possible through the SDK, is there another way I can get this information? Thanks in advance.

Hi @aalmacin

Here is documentation that outlines which SDKs each metric name is available in: SDK metrics | Temporal Documentation

Please let me know if you are able to pinpoint the metrics you need from that list.

Best Regards,
Jordan

Hi @jreynolds23, thanks for the response. I looked at the SDK metrics but I’m still not sure whether I can use those in our service. I am writing a formula in our service in Java, and I basically need to know:

  • How long in seconds a workflow takes from being added to the queue until it completes. I am not sure if there’s a metric for this.
  • How long the workflow stays on the queue before starting. I assume I can use schedule_to_start_latency for this, but I’m not sure whether that covers a single workflow or more.
  • I found a way to get the number of open workflows using
    tctl workflow listall --open --print_json | jq length
    so I think this is good for us. Although, if there is a way to get this using the Java SDK, that would be better.
  • The number of workers currently running. We can also keep track of this manually, but it would be nice to know if you have an API for it or if it’s in the SDK.

Please let me know if you know where I can get these. Thanks again in advance.

How long in seconds a workflow takes from being added to the queue until it completes. I am not sure if there’s a metric for this.

You can use workflow_endtoend_latency.
Note that all SDK metrics are prefixed with temporal_. For the Java SDK, this metric is measured in seconds, and each workflow retry (if you specify workflow retries via WorkflowOptions->RetryOptions) is counted independently, as a separate execution, just FYI.
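
For reference, here’s an untested sketch of what specifying workflow retries via WorkflowOptions (io.temporal.client.WorkflowOptions and io.temporal.common.RetryOptions) could look like; the task queue name and retry values are placeholder assumptions. Each retry attempt started under these options would then be counted as a separate execution in the metric:

WorkflowOptions options = WorkflowOptions
  .newBuilder()
  .setTaskQueue("MyTaskQueue") // placeholder task queue name
  .setRetryOptions(RetryOptions
    .newBuilder()
    .setInitialInterval(Duration.ofSeconds(1)) // java.time.Duration
    .setMaximumAttempts(3) // placeholder retry policy
    .build())
  .build();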

How long the workflow stays on the queue before starting. I assume I can use schedule_to_start_latency for this, but I’m not sure whether that covers a single workflow or more.

Yes, on the SDK metrics side you can use workflow_task_schedule_to_start_latency, which measures the latency for a workflow task from when it’s placed on the task queue until it is picked up for processing by a worker.
On the server side you can look at task_schedule_to_start_latency for the RecordWorkflowTaskStarted operation and task_type Workflow, but note this is at the task queue level (it does not contain workflow info).

Although, if there is a way to get this using the Java SDK, that would be better

Yes, you can use ListWorkflowExecutions with a visibility query, or ListOpenWorkflowExecutions (not preferred, as it should be deprecated in the future). See example here; a rough Java sketch also follows below. Your query could be something like:
ExecutionStatus="Running" (all running executions)
or
WorkflowType="MyWorkflowType" AND ExecutionStatus="Running" (for a specific workflow type). You can see the list of all search attributes via
tctl cl gsa
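
Roughly, an untested Java sketch of the query-based approach could look like this (it assumes a service WorkflowServiceStubs and client WorkflowClient set up as in the snippets later in this thread; the workflow type is a placeholder):

ListWorkflowExecutionsRequest request = ListWorkflowExecutionsRequest
  .newBuilder()
  .setNamespace(client.getOptions().getNamespace())
  .setQuery("WorkflowType=\"MyWorkflowType\" AND ExecutionStatus=\"Running\"")
  .build();

ListWorkflowExecutionsResponse response = service.blockingStub()
  .listWorkflowExecutions(request);

// response.getExecutionsList() holds one page of results; List* calls are paginated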

number of workers currently running

You can see how many workers/pollers are currently polling on a task queue via:
tctl tq desc --tq <my_tq_name>
or from the SDK by using DescribeTaskQueueRequest, sample here (a rough sketch also follows below).
From SDK metrics you can also get the total number of pollers started via
poller_start.
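
Here is the rough sketch mentioned above (untested; it assumes the same service and client as the other snippets in this thread, and the task queue name is a placeholder):

DescribeTaskQueueRequest request = DescribeTaskQueueRequest
  .newBuilder()
  .setNamespace(client.getOptions().getNamespace())
  .setTaskQueue(TaskQueue.newBuilder().setName("MyTaskQueue").build()) // placeholder name
  .setTaskQueueType(TaskQueueType.TASK_QUEUE_TYPE_WORKFLOW)
  .build();

DescribeTaskQueueResponse response = service.blockingStub()
  .describeTaskQueue(request);

// Each PollerInfo entry is a poller currently polling this task queue.
return response.getPollersCount();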

Thanks for that answer @tihomir. This is a huge help to our development. As you mentioned, I started using the SDK to get the open workflows. The problem I ran into is that it always returns 1000, which I think is the limit. We need to know how many open workflows there are in total, similar to what we get by running tctl workflow listall --open --print_json | jq length

ListOpenWorkflowExecutionsRequest request = ListOpenWorkflowExecutionsRequest
  .newBuilder()
  .setNamespace(client.getOptions().getNamespace())
  .build();

ListOpenWorkflowExecutionsResponse response = service.blockingStub()
  .listOpenWorkflowExecutions(request);

return response.getExecutionsCount();

I tried using CountWorkflowExecutionsRequest but I ran into an error. It looks like I need Elasticsearch in order to get the result.

CountWorkflowExecutionsRequest countWorkflowExecutionsRequest = CountWorkflowExecutionsRequest
  .newBuilder()
  .setNamespace(client.getOptions().getNamespace())
  .build();
CountWorkflowExecutionsResponse response = service.blockingStub()
  .countWorkflowExecutions(countWorkflowExecutionsRequest);
return response.getCount();

Is there a way to get all the open workflows just by using ListOpenWorkflowExecutionsResponse, or do we need to set up Elasticsearch on our Temporal cluster? Thanks in advance.

The problem I ran into is that it always returns 1000, which I think is the limit.

List* APIs are paginated and you are probably not iterating through all the pages (IIRC the default number of items in a page is indeed 1000). See here for an example of how to deal with pagination in List APIs (the same applies to ListOpenWorkflowExecutions, as it’s paginated as well).
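
Roughly, the iteration could look something like this (an untested sketch building on your snippet above; ByteString is com.google.protobuf.ByteString):

int total = 0;
ByteString nextPageToken = ByteString.EMPTY;

do {
  ListOpenWorkflowExecutionsRequest request = ListOpenWorkflowExecutionsRequest
    .newBuilder()
    .setNamespace(client.getOptions().getNamespace())
    .setNextPageToken(nextPageToken) // empty token fetches the first page
    .build();

  ListOpenWorkflowExecutionsResponse response = service.blockingStub()
    .listOpenWorkflowExecutions(request);

  total += response.getExecutionsCount();
  nextPageToken = response.getNextPageToken();
} while (!nextPageToken.isEmpty()); // an empty token means there are no more pages

return total;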

Thanks for the answer @tihomir

Hi @tihomir, I followed your suggestion and managed to get the e2e latency. I ran into an issue with the values I’m getting, though. It looks like the metric query only gets the e2e latency metric on the same virtual machine where it’s running. What I want is the e2e latency across all virtual machines in my instance group.

Here is how I’m getting the metric data from Temporal at the moment.

Search search = registry.find("temporal_workflow_endtoend_latency");

// Get timers
Collection<Timer> timers = search.timers();

// Sum the per-timer means (averaged over the timer count below)
Optional<Double> totalTime = timers.stream().map(t -> t.mean(TimeUnit.SECONDS))
        .reduce(Double::sum);

if (!totalTime.isPresent()) {
    throw new RuntimeException("No total time returned.");
}

return (float) (totalTime.get() / timers.size());

Please let me know what I’m doing wrong or if I’m using the wrong metric/functions. Thanks

For the registry, I have:

StatsdConfig config = new StatsdConfig() {
    @Override
    public String get(@NotNull String k) {
        return null;
    }

    @Override
    public @NotNull StatsdFlavor flavor() {
        return StatsdFlavor.DATADOG;
    }
};

MeterRegistry registry = new StatsdMeterRegistry(config, Clock.SYSTEM);

It looks like the metric query only gets the e2e latency metric on the same virtual machine where it’s running

I think you need to collect (scrape) metrics from all of your deployed worker processes and then aggregate them (kind of what Prometheus does). Is that possible in your scenario?

or if I’m using the wrong metric/functions

temporal_workflow_endtoend_latency is the correct SDK metric to use. Could you show how it’s provided when using StatsD? The Prometheus format would create buckets for this metric.

Thanks for the response. Here is how we have it set up:

Scope scope = new RootScopeBuilder()
    .prefix(rootPrefix)
    .reporter(new MicrometerClientStatsReporter(registry))
    .reportEvery(com.uber.m3.util.Duration.ofSeconds(1));

WorkflowServiceStubsOptions stubOptions = WorkflowServiceStubsOptions.newBuilder()
        .setMetricsScope(scope)
        .setTarget(clusterFrontendAddress)
        .build();

WorkflowServiceStubs service = WorkflowServiceStubs.newInstance(stubOptions);
WorkflowClient workflowClient = WorkflowClient.newInstance(service);

So far, we are seeing the stats in our Datadog dashboard with no problem.