History instance CPU cannot scale up

Hi All,

Got a scale-up issue here with the history service. With 2 history instances (while frontend, matching, and worker are all single instances), CPU utilization is around 3 CPUs each (about 6 CPUs total). But when we increase to 3 history instances, the total CPU utilization stays around 6 CPUs (so about 2 CPUs per history instance). The Temporal instances run in Docker on a VM, so there should be no CPU limit (they can max out the host VM, which still has plenty of headroom).

What are the possible causes of the total history CPU not scaling up? Increasing the history shard count from 4096 to 8192 did not have much effect. Increasing the partition count from 8 to 16 actually made performance worse. Increasing workflow workers and pollers didn't really help either. The DB is Cassandra, and I don't suspect the bottleneck is in Cassandra.
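To be concrete about the worker/poller change: we use the Go SDK, and the knobs we bumped look roughly like this (task queue name and values below are illustrative, not our exact production numbers):

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	// Connect to the Temporal frontend (address is an example).
	c, err := client.Dial(client.Options{HostPort: "temporal-frontend:7233"})
	if err != nil {
		log.Fatalln("unable to create Temporal client", err)
	}
	defer c.Close()

	// Worker with increased poller and executor concurrency
	// (the "workflow workers and pollers" mentioned above).
	w := worker.New(c, "my-task-queue", worker.Options{
		MaxConcurrentWorkflowTaskPollers:       16,  // raised from the SDK default
		MaxConcurrentActivityTaskPollers:       16,  // raised from the SDK default
		MaxConcurrentWorkflowTaskExecutionSize: 512, // workflow task executor slots
		MaxConcurrentActivityExecutionSize:     512, // activity executor slots
	})

	// Workflow and activity registrations omitted here.
	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("unable to start worker", err)
	}
}
```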

Any thoughts?

Thanks
-ridwan-

Adding monitoring metrics here:


There are delays in activity started and child workflow execution completed:


Your sync match rate should ideally be over 99%; the dip shown is concerning. This typically indicates you need more workers (increased capacity); see the worker tuning guide here.

On the SDK metrics side, did you have a chance to look at the workflow_task_schedule_to_start_latency and activity_schedule_to_start_latency metrics during this time?
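If those SDK metrics aren't being exported yet, this is roughly what the Prometheus wiring looks like with the Go SDK (following the layout of the Go samples; the listen address and timer type here are placeholders, adapt to your SDK and setup):

```go
package main

import (
	"log"
	"time"

	prom "github.com/prometheus/client_golang/prometheus"
	"github.com/uber-go/tally/v4"
	"github.com/uber-go/tally/v4/prometheus"
	"go.temporal.io/sdk/client"
	sdktally "go.temporal.io/sdk/contrib/tally"
)

// newPrometheusScope builds a tally scope that exposes SDK metrics
// (including workflow_task_schedule_to_start_latency and
// activity_schedule_to_start_latency) on a Prometheus scrape endpoint.
func newPrometheusScope(cfg prometheus.Configuration) tally.Scope {
	reporter, err := cfg.NewReporter(prometheus.ConfigurationOptions{
		Registry: prom.NewRegistry(),
		OnError:  func(err error) { log.Println("prometheus reporter error:", err) },
	})
	if err != nil {
		log.Fatalln("error creating prometheus reporter", err)
	}
	scope, _ := tally.NewRootScope(tally.ScopeOptions{
		CachedReporter:  reporter,
		Separator:       prometheus.DefaultSeparator,
		SanitizeOptions: &sdktally.PrometheusSanitizeOptions,
	}, time.Second)
	return sdktally.NewPrometheusNamingScope(scope)
}

func main() {
	scope := newPrometheusScope(prometheus.Configuration{
		ListenAddress: "0.0.0.0:9090", // placeholder scrape endpoint
		TimerType:     "histogram",
	})

	// Any workers created from this client will report SDK metrics
	// through the handler below.
	c, err := client.Dial(client.Options{
		MetricsHandler: sdktally.NewMetricsHandler(scope),
	})
	if err != nil {
		log.Fatalln("unable to create Temporal client", err)
	}
	defer c.Close()
}
```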

Also, on persistence latencies, can you focus on the CreateWorkflowExecution, UpdateWorkflowExecution, and UpdateShard operations? It's a little hard to see from the picture.
