Performance test on GKE

Hi,

Following is our setup on Google Kubernetes Engine (GKE):
2 GKE Clusters:

  1. Yugabyte DB
  2. Microservices (including Temporal workflow and activity workers) + Temporal cluster – 9 microservices that form the whole application, 1 pod each. Both Temporal and the microservices use the Yugabyte DB.

GKE Cluster info:

  1. Number of nodes: 3
  2. Memory available per node: 27 GB

Temporal cluster – 1 pod each for the frontend, matching, and worker services; 6 pods for the history service. numHistoryShards = 2048
Worker configurations (1 worker and task queue per workflow/activity; a minimal WorkerOptions sketch is shown below):
maxConcurrentActivityTaskPollers: 100
maxConcurrentWorkflowTaskPollers: 100
maxConcurrentActivityExecutionSize: 200
maxConcurrentWorkflowTaskExecutionSize: 200
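
For reference, these limits are applied through WorkerOptions roughly as below (a minimal sketch, assuming the Java SDK; the task queue name, frontend address, and commented-out registrations are placeholders, not our actual classes):

```java
import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkerOptions;

public class WorkerSetup {
  public static void main(String[] args) {
    // Connect to the Temporal frontend service (placeholder in-cluster address).
    WorkflowServiceStubs service =
        WorkflowServiceStubs.newServiceStubs(
            WorkflowServiceStubsOptions.newBuilder()
                .setTarget("temporal-frontend:7233")
                .build());
    WorkflowClient client = WorkflowClient.newInstance(service);
    WorkerFactory factory = WorkerFactory.newInstance(client);

    // Poller/executor limits listed above.
    WorkerOptions options =
        WorkerOptions.newBuilder()
            .setMaxConcurrentWorkflowTaskPollers(100)
            .setMaxConcurrentActivityTaskPollers(100)
            .setMaxConcurrentWorkflowTaskExecutionSize(200)
            .setMaxConcurrentActivityExecutionSize(200)
            .build();

    // One worker and task queue per workflow/activity; queue name is a placeholder.
    Worker worker = factory.newWorker("SAMPLE_TASK_QUEUE", options);
    // worker.registerWorkflowImplementationTypes(SampleWorkflowImpl.class);
    // worker.registerActivitiesImplementations(new SampleActivitiesImpl());

    factory.start();
  }
}
```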

Workflow details:

  1. 2 workflows have 1 local activity and 2 normal activities each (this shape is sketched below).
  2. 1 workflow has 1 local activity and 7 normal activities.
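
For illustration, the first workflow shape (1 local activity followed by 2 normal activities) looks roughly like this – a minimal sketch assuming the Java SDK, with all interface, class, and method names being placeholders rather than our real services:

```java
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityOptions;
import io.temporal.activity.LocalActivityOptions;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;
import java.time.Duration;

@WorkflowInterface
interface SampleWorkflow {
  @WorkflowMethod
  String run(String input);
}

@ActivityInterface
interface SampleActivities {
  String validate(String input);     // executed as a local activity
  String callServiceA(String input); // normal activity -> JVM microservice call
  String callServiceB(String input); // normal activity -> JVM microservice call
}

public class SampleWorkflowImpl implements SampleWorkflow {

  private final SampleActivities local =
      Workflow.newLocalActivityStub(
          SampleActivities.class,
          LocalActivityOptions.newBuilder()
              .setStartToCloseTimeout(Duration.ofSeconds(5))
              .build());

  private final SampleActivities remote =
      Workflow.newActivityStub(
          SampleActivities.class,
          ActivityOptions.newBuilder()
              .setStartToCloseTimeout(Duration.ofSeconds(30))
              .build());

  @Override
  public String run(String input) {
    // The local activity runs inside the worker process, so no schedule-to-start hop.
    String validated = local.validate(input);
    // Normal activities go through the matching service and a task queue,
    // so activity schedule-to-start latency applies to each of these calls.
    String a = remote.callServiceA(validated);
    return remote.callServiceB(a);
  }
}
```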

All normal activities make calls to an in-JVM service. Some of these services involve inter-microservice communication and DB calls.

On average, each microservice call takes around 20-40 ms.

With the above configuration, we ran 1000 iterations sequentially, with the average workflow execution time being 1.5-2 s. There seems to be a huge schedule-to-start latency for workflow and activity tasks compared to the actual microservice execution timings.

Metrics observed (screenshots not included here):

  1. Workflow task schedule-to-start latency
  2. Activity schedule-to-start latency
  3. Tracing metrics: per the trace, the workflow starts at 25.36 ms and run workflow happens at 2566.36 ms, which indicates high schedule-to-start latency.

High workflow/activity task schedule-to-start latency (500-1500 ms) is adding to the overall execution time of the workflow despite the above configuration. Can you please advise on how to improve performance and where to look for possible bottlenecks? Also, are there any further tuning parameters for the Temporal services?

Take a look at this post for info on more server and SDK metrics that can help tune your workers (see the worker tuning guide in the docs), as well as here and here for recommendations on load testing / production setup.

From the described latencies it seems the bottleneck is related to worker capacity.
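
To make the worker-side bottleneck visible, one option is to export the SDK metrics (schedule-to-start latencies, poller and executor slot usage) to Prometheus via Micrometer. A minimal sketch, loosely based on the Temporal Java samples; exact class names and SDK versions may differ in your setup:

```java
import com.uber.m3.tally.RootScopeBuilder;
import com.uber.m3.tally.Scope;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;
import io.temporal.common.reporter.MicrometerClientStatsReporter;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;

public class MetricsSetup {
  public static WorkflowServiceStubs stubsWithMetrics() {
    // Micrometer registry; registry.scrape() can be exposed on an HTTP endpoint for Prometheus.
    PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

    // Tally scope that bridges the SDK's worker metrics into Micrometer.
    Scope scope =
        new RootScopeBuilder()
            .reporter(new MicrometerClientStatsReporter(registry))
            .reportEvery(com.uber.m3.util.Duration.ofSeconds(10));

    // Pass the scope to the service stubs used by the WorkflowClient/WorkerFactory.
    return WorkflowServiceStubs.newServiceStubs(
        WorkflowServiceStubsOptions.newBuilder()
            .setMetricsScope(scope)
            .build());
  }
}
```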

Will check on these points @tihomir