Metrics For Temporal Performance Testing

Hi All,

I am setting up performance testing for my Temporal application and need guidance on selecting the right metrics. I followed the Temporal blog post "Scaling Temporal: The Basics", but I am not seeing similar metrics since I am running Temporal in a Docker-based on-prem setup.

For my testing, I plan to run hundreds of dummy workflows and analyze their behavior. My worker metrics are emitted by the Java SDK and exposed at "/actuator/prometheus", but I am struggling to identify the key metrics that effectively measure performance. For example, the scaling post mentions "state_transition_count_count", which would help me understand performance, but I am unable to find it.

Are there specific metrics I should focus on when performance testing from the Java SDK? Additionally, is there anything I need to configure differently to get a more accurate measurement of my Temporal application's performance?

Let me know if any info is required from my end.

Are you testing against an on-prem cluster or Temporal Cloud?
For an on-prem cluster you would measure state transitions per second; info on that and other important server metrics is in the presentation here, if it helps.
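
To make "state transitions per second" concrete: state_transition_count_count is emitted by the Temporal server, not by the SDK, so it would normally be scraped from the server's metrics endpoint and not from the worker's "/actuator/prometheus". In Prometheus the per-second rate is usually obtained with rate() over that counter. The minimal Java sketch below shows the same arithmetic for two counter samples; the class name, method name and sample values are hypothetical and only for illustration.

```java
final class StateTransitionRate {

    /**
     * Average per-second increase of a monotonically increasing counter
     * between two samples taken intervalSeconds apart.
     */
    static double perSecond(double earlier, double later, double intervalSeconds) {
        if (intervalSeconds <= 0) {
            throw new IllegalArgumentException("intervalSeconds must be positive");
        }
        double delta = later - earlier;
        if (delta < 0) {
            // Counter went backwards, e.g. after a server restart: count only
            // what accumulated since the reset.
            delta = later;
        }
        return delta / intervalSeconds;
    }

    public static void main(String[] args) {
        // Hypothetical samples of state_transition_count_count taken 60 s apart.
        System.out.println(perSecond(12_000, 15_000, 60)); // prints 50.0
    }
}
```

Comparing this rate across test runs (for example while increasing the number of dummy workflows) gives a rough view of how much work the cluster is processing.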

From the worker metrics you would look first at temporal_workflow_task_schedule_to_start_latency and temporal_activity_schedule_to_start_latency,
but it would also help if you can share graphs for
temporal_sticky_cache_size, temporal_worker_task_slots_available, temporal_request_failure, temporal_long_request_failure and your worker pod CPU and memory utilization (in % of max if possible).
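
As a quick check that the worker really exposes these metrics, one option is to filter the text returned by "/actuator/prometheus" for the names above. The sketch below is only an illustration under some assumptions: the class and method names are hypothetical, the sample scrape body is made up, and the exact series names and suffixes (for example _seconds_bucket or _total) depend on the SDK version and the Micrometer registry configuration, which is why it matches on name prefixes.

```java
import java.util.ArrayList;
import java.util.List;

final class WorkerMetricFilter {

    /** Returns the sample lines whose metric name starts with one of the given prefixes. */
    static List<String> select(String scrapeBody, List<String> namePrefixes) {
        List<String> selected = new ArrayList<>();
        for (String line : scrapeBody.split("\n")) {
            String trimmed = line.trim();
            // Skip blank lines and "# HELP" / "# TYPE" comment lines.
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            for (String prefix : namePrefixes) {
                if (trimmed.startsWith(prefix)) {
                    selected.add(trimmed);
                    break;
                }
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Made-up scrape output; real output carries labels and more series.
        String body = String.join("\n",
                "# TYPE temporal_sticky_cache_size gauge",
                "temporal_sticky_cache_size 12.0",
                "jvm_threads_live_threads 42.0",
                "temporal_worker_task_slots_available 200.0",
                "temporal_request_failure_total 3.0");
        List<String> prefixes = List.of(
                "temporal_activity_schedule_to_start_latency",
                "temporal_sticky_cache_size",
                "temporal_worker_task_slots_available",
                "temporal_request_failure",
                "temporal_long_request_failure");
        select(body, prefixes).forEach(System.out::println);
    }
}
```

If a prefix returns no lines, the metric may simply not have been recorded yet (for example, no activity has run), or its name may differ in your setup.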