via the application’s worker
However, as far as the deployment itself goes, the HPA is currently only scaling on memory (or CPU) utilization, so it looks something like
Many people use “available slots” to know when to scale workers. See Developer's guide - Worker performance | Temporal Documentation. You would benchmark your workers, set their max-concurrent-activities to a number you know the resources can handle, and then watch whether available slots get too low.
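For example, a rough sketch of a Python worker with a benchmarked concurrency cap (the task queue name, the activity, and the value 50 are placeholders, not recommendations):

from temporalio import activity
from temporalio.client import Client
from temporalio.worker import Worker

@activity.defn
async def my_activity(name: str) -> str:
    # Placeholder activity just to make the sketch runnable
    return f"Hello, {name}!"

async def run_worker() -> None:
    client = await Client.connect("localhost:7233")
    worker = Worker(
        client,
        task_queue="my-task-queue",       # placeholder task queue name
        activities=[my_activity],
        max_concurrent_activities=50,     # benchmark-derived cap; available slots = cap minus currently running activities
    )
    await worker.run()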
Thank you for the reply! That sounds great! As far as the Python SDK for Temporal, how do I emit metrics like temporal_worker_task_slots_available or the other temporal_ metrics from my activity/worker? Is there a Python example of this?
Are those default metrics that are emitted any time I configure a metrics runtime (in this case, OTel), like
from temporalio.contrib.opentelemetry import TracingInterceptor
from temporalio.runtime import OpenTelemetryConfig, Runtime, TelemetryConfig
runtime = init_runtime_with_telemetry()
Yes. Here is a sample of configuring OTel metrics on the runtime (you only need to create one runtime and use it when creating your client). The interceptor is for tracing, by the way; only the runtime needs to be configured for metrics.
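A minimal sketch in the spirit of the samples-python open_telemetry example, assuming an OTLP gRPC collector at localhost:4317 and a local Temporal server (both are placeholders for your own endpoints):

from temporalio.client import Client
from temporalio.runtime import OpenTelemetryConfig, Runtime, TelemetryConfig

def init_runtime_with_telemetry() -> Runtime:
    # Point the SDK's built-in metrics (temporal_worker_task_slots_available, etc.)
    # at an OTLP gRPC collector endpoint.
    return Runtime(
        telemetry=TelemetryConfig(metrics=OpenTelemetryConfig(url="http://localhost:4317"))
    )

async def main() -> None:
    # Create the runtime once and reuse it for every client you create.
    runtime = init_runtime_with_telemetry()
    client = await Client.connect("localhost:7233", runtime=runtime)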
Awesome, so I tried configuring our OTel collector and our Temporal client to use the OTel runtime, but I’m unable to get the temporal_ metrics, at least for my remote worker. Is there anything else I would need to consider for a remote worker/activity? I also noticed that, for the activities that were not remote, it was writing temporal_ metrics under a service name called temporal-core-sdk despite defining a different service name on the provider: samples-python/open_telemetry/worker.py at 4303a9b15f4ddc4cd770bc0ba33afef90a25d3ae · temporalio/samples-python · GitHub
No, this should work without issue if you give it an OTLP gRPC endpoint.
That link is to a tracer provider and is unrelated to metrics. When configuring TelemetryConfig for metrics, you can set attach_service_name to override the default of temporal-core-sdk.
@Chad_Retz thank you for the reply! I’m a little stumped as to why I’m able to send custom metrics yet unable to send the default temporal_ metrics to our otel instrumentator+dd via our remote worker. Would you suggest revisiting the configs?
As for attach_service_name, would setting it to True overwrite the service_name from temporal-core-sdk to the service name we provide? Or would we need to set it to False? (I noticed it seems to default to True: sdk-python/temporalio/runtime.py at main · temporalio/sdk-python · GitHub)
Yes, make sure that the endpoint you are giving as metrics=OpenTelemetryConfig(url="http://whatever") accepts OTel metrics, and that you are creating that one global runtime and using it across all clients you create.
It defaults to True, which means the SDK attaches the service name; set it to False to stop that. You can set global_tags to set any tags for all metrics (including a service_name tag after setting attach_service_name to False).
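For example, something along these lines (the collector URL and “my-service” are placeholders for your own values):

from temporalio.runtime import OpenTelemetryConfig, Runtime, TelemetryConfig

runtime = Runtime(
    telemetry=TelemetryConfig(
        metrics=OpenTelemetryConfig(url="http://localhost:4317"),  # placeholder collector URL
        attach_service_name=False,                     # stop attaching the default temporal-core-sdk name
        global_tags={"service_name": "my-service"},    # tag applied to every metric
    )
)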
Thank you for the reply. I am using a global runtime and I’m able to write custom metrics using that endpoint (our endpoint is http://localhost:4317 since our app/worker is deployed with an OTel sidecar). It’s just not writing the default metrics. As a workaround, is there a way to access the default metrics programmatically so that I can manually record them in OTel as I would a custom metric? I would love to be able to read temporal_worker_task_slots_available in some way and manually record that metric into OTel. Or is there source code where temporal_worker_task_slots_available is computed?
Just to triple-check I’m understanding this correctly: if I set attach_service_name=False and set global_tags with a service_name tag (in this case "my-service"), those metrics would now be under service_name:my-service instead of service_name:temporal-core-sdk?
Can you clarify “write custom metrics”? Is this using my_runtime.metric_meter().[whatever] or some other tool? Does directly making metrics on the meter work? Are you sure you are setting this runtime in the client options?
This is deep in the Rust core, so consuming the metric via the metrics system may be the best way. Python now has the ability to use a metric buffer, where you manually consume metrics instead of exposing them via OTel/Prometheus. We don’t have a sample yet, but see this test.
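Roughly, based on that test, the sketch below shows the idea; the buffer size, the client address, and the substring filter on the slot gauge are assumptions rather than a prescribed pattern. Note that with a buffer, metrics go only to the buffer, not to OTel/Prometheus.

from temporalio.client import Client
from temporalio.runtime import MetricBuffer, Runtime, TelemetryConfig

# Route SDK metrics into an in-memory buffer instead of an OTel/Prometheus exporter.
buffer = MetricBuffer(10000)  # buffer size taken from the SDK test; adjust as needed
runtime = Runtime(telemetry=TelemetryConfig(metrics=buffer))

async def main() -> None:
    client = await Client.connect("localhost:7233", runtime=runtime)
    # Periodically drain the buffer and forward whichever metrics you care about.
    for update in buffer.retrieve_updates():
        if "worker_task_slots_available" in update.metric.name:
            print(update.metric.name, update.value, update.attributes)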
This is confusing “tracing” and “metrics”. global_tags applies to metrics and is unrelated to the TracerProvider, and yes, if you set it with a service_name tag, it should be on every metric.