Following is my Temporal setup on a GKE single-node cluster.
Temporal cluster pointing to a remote Cassandra DB (TLS)
numHistoryShards changed to 32
Matching task queue read/write partitions: 8
maxConns (Temporal persistence config): 200
Activity pollers: 50
Workflow pollers: 50
frontend.namespaceCount: 3000
MaxConcurrentWorkflowTaskExecutionSize: 200
MaxConcurrentActivityExecutionSize: 200
My workflow consists of 3 activities that make HTTP calls to microservices in the same cluster (assume a latency of 40 ms).
Following were my readings while running a load of 50 TPS for 1 min:
avg=235.59ms min=215.6ms med=218.31ms max=520.25ms p(90)=317.25ms p(95)=319.59ms
Following were my readings when I increased the load to 60 TPS:
avg=3.09s min=215.34ms med=518.56ms max=11.06s p(90)=10.23s p(95)=10.25s
Following are the metric graphs captured for 60 TPS:
As you can see, temporal_workflow_task_schedule_to_start_latency_seconds_bucket, temporal_workflow_task_execution_latency_seconds_sum, temporal_workflow_endtoend_latency_seconds_bucket, temporal_activity_schedule_to_start_latency_seconds_sum, and temporal_activity_execution_latency_seconds_sum are all very low. But still, as you can see in the Web UI, all the workflow tasks are getting timed out. Can you please help?
The sync match rate ideally should be above 99%. If the sync match rate is low, it would mean your workers are unable to keep up (you need to increase worker capacity).
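If it helps, something along these lines can chart it in Prometheus (a sketch, assuming the matching-service counters poll_success and poll_success_sync are being scraped; names/labels may differ in your setup):
sum(rate(poll_success_sync{}[1m])) / sum(rate(poll_success{}[1m]))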
Another thing to look at is the SDK task_schedule_to_start_latency metric; can you measure this latency as well? A high latency would also indicate that you should add more workers.
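For example, a p95 over the SDK histogram you already listed could look roughly like this (a sketch, assuming it is exported to Prometheus with the _seconds_bucket suffix shown in your screenshots):
histogram_quantile(0.95, sum(rate(temporal_workflow_task_schedule_to_start_latency_seconds_bucket{}[1m])) by (le))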
numHistoryShards changed to 32
I think this is too low; typically you would go with 512 for a small-scale setup. For a prod setup I would start with 4K.
Another thing to look at is persistence latencies:
histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))
for the operations CreateWorkflowExecution, UpdateWorkflowExecution, and UpdateShard.
Following is my Temporal setup on a GKE single-node cluster.
How many instances of the Temporal services are you running in your test env? See here for recommendations for a prod setup.
Yes Tihomir, we have gone through this link and tried different configs, but whenever we go above 50 TPS, we start to see very high latency.
When you say “add more workers / increase worker capacity”, does that mean increasing concurrent pollers, increasing execution size, increasing the number of pods of the application where we create workers, or something else?
We had tried this with 512 shards and observed low latency with the default matching partitions, but when we try the same run with 8 matching service partitions, the latency increases again. We want to understand why this is happening.
We have gone through this link as well. Currently there is only 1 pod per Temporal service. We wanted to see how much TPS can be achieved with this deployment.
@tihomir Also, some follow-up questions regarding the above setup.
We were able to achieve decent workflow execution response times up to 60 TPS, but it gets much worse above that despite changing the number of shards, increasing the Temporal service pods, and increasing the number of partitions. No luck yet.
Another thing: it looks like when we go above 60 TPS, the workflow active thread count increases drastically, the sync match rate drops, and the latency of the service that invokes the async workflow execution increases. Is there any parameter to check the latency of a workflow invocation?
@Wenquan_Xing We do have one pod each per Temporal service, i.e. frontend, history, matching, etc.
MaxConns is defined as 200; it is the value provided as part of the YAML file for the default and visibility databases. Do you have any recommendation for this value?
We are trying to hit workflows at around 70 TPS; our target is to achieve more than 3000 TPS.
We are trying to hit workflows at around 70 TPS; our target is to achieve more than 3000 TPS.
TPS: what is your definition? 70 workflow completions per second?
A workflow with 3 sequential activities roughly translates to 15 DB transactions, so 70 workflows per second already means on the order of 1,000 DB transactions per second.
You may want to check if your DB is overloaded already.
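One rough way to see that load is to chart the persistence request rate per operation (a sketch, assuming the persistence_requests counter is exported alongside the persistence_latency histogram quoted above; the exact metric name may differ by server version):
sum(rate(persistence_requests{}[1m])) by (operation)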