High Activity Latency

We have a Temporal workflow that processes a payload in a sequence of steps. The activities were taking longer than we expected, so to measure the overhead we created a simple test workflow that runs 5 activities sequentially, where each activity is just an empty method.

Running this workflow, we recorded roughly 150 to 480 ms per activity. Since each activity does nothing, this must be the time Temporal takes to initiate the activity and process the result.

Temporal is running in a non-production Kubernetes cluster and is connected to a 3-node Cassandra cluster that also runs in Kubernetes and is dedicated to Temporal. No other workflows were running during the test. Here is a sample of the logs with the timings: "internal" is the time spent inside the activity (zero, since it is empty), and "total" is the time recorded from the workflow to call the activity, in milliseconds.

2021-03-20 14:27:48 DEBUG LatencyWorkflowImpl:145 workflow-method - a1 internal = 0, total = 334, a2 internal = 0, total = 271, a3 internal = 0, total = 193, a4 internal = 0, total = 319, a5 internal = 0, total = 226
2021-03-20 14:28:10 DEBUG LatencyWorkflowImpl:145 workflow-method - a1 internal = 0, total = 228, a2 internal = 0, total = 192, a3 internal = 0, total = 163, a4 internal = 0, total = 190, a5 internal = 0, total = 242
2021-03-20 14:28:17 DEBUG LatencyWorkflowImpl:145 workflow-method - a1 internal = 0, total = 447, a2 internal = 0, total = 256, a3 internal = 0, total = 478, a4 internal = 0, total = 284, a5 internal = 0, total = 450
2021-03-20 14:28:24 DEBUG LatencyWorkflowImpl:145 workflow-method - a1 internal = 0, total = 443, a2 internal = 0, total = 231, a3 internal = 0, total = 358, a4 internal = 0, total = 474, a5 internal = 0, total = 144
2021-03-20 14:28:30 DEBUG LatencyWorkflowImpl:145 workflow-method - a1 internal = 0, total = 288, a2 internal = 0, total = 173, a3 internal = 0, total = 210, a4 internal = 0, total = 220, a5 internal = 0, total = 271
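
To put numbers on the sample above, here is a minimal sketch that pulls every "total" value out of the log lines and reports count, min, max and mean. The class name `LatencySummary` and its methods are hypothetical helpers written for this post, not part of the Temporal SDK, and the regex assumes the exact "total = N" format shown in the logs.

```java
import java.util.ArrayList;
import java.util.IntSummaryStatistics;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for summarizing the log sample; not a Temporal API.
public class LatencySummary {
    // Matches the "total = <ms>" fields; "internal = 0" is ignored.
    private static final Pattern TOTAL = Pattern.compile("total = (\\d+)");

    /** Extracts every "total" value (in ms) from one log line. */
    public static List<Integer> parseTotals(String line) {
        List<Integer> totals = new ArrayList<>();
        Matcher m = TOTAL.matcher(line);
        while (m.find()) {
            totals.add(Integer.parseInt(m.group(1)));
        }
        return totals;
    }

    /** Count, min, max and mean over all totals found in the given lines. */
    public static IntSummaryStatistics summarize(List<String> lines) {
        return lines.stream()
                .flatMap(l -> parseTotals(l).stream())
                .mapToInt(Integer::intValue)
                .summaryStatistics();
    }

    public static void main(String[] args) {
        // The five sample lines, with the timestamp/logger prefix trimmed.
        List<String> lines = List.of(
            "a1 internal = 0, total = 334, a2 internal = 0, total = 271, a3 internal = 0, total = 193, a4 internal = 0, total = 319, a5 internal = 0, total = 226",
            "a1 internal = 0, total = 228, a2 internal = 0, total = 192, a3 internal = 0, total = 163, a4 internal = 0, total = 190, a5 internal = 0, total = 242",
            "a1 internal = 0, total = 447, a2 internal = 0, total = 256, a3 internal = 0, total = 478, a4 internal = 0, total = 284, a5 internal = 0, total = 450",
            "a1 internal = 0, total = 443, a2 internal = 0, total = 231, a3 internal = 0, total = 358, a4 internal = 0, total = 474, a5 internal = 0, total = 144",
            "a1 internal = 0, total = 288, a2 internal = 0, total = 173, a3 internal = 0, total = 210, a4 internal = 0, total = 220, a5 internal = 0, total = 271");
        IntSummaryStatistics s = summarize(lines);
        System.out.println(String.format(Locale.ROOT,
                "n=%d min=%d max=%d mean=%.1f",
                s.getCount(), s.getMin(), s.getMax(), s.getAverage()));
    }
}
```

For the five lines above this gives 25 samples with a minimum of 144 ms, a maximum of 478 ms and a mean of 283.4 ms per activity call.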

Could anyone suggest why this latency is so high? Are there specific settings in Temporal or Cassandra that can be tuned to improve it? In our real workflows this latency is 5-6 times the time the activities spend processing the actual requests.

You need to enable metrics reporting from the Temporal service to be able to troubleshoot the bottleneck.

Thanks maxim - I’ll look into enabling the metrics.