Temporal performance with golang microservice, Cassandra & Elasticsearch

Hi,

We are currently using Temporal in a project, but we are not able to meet the required performance.
We would appreciate your help checking whether we are using the right settings/recommendations.
Here is our case:

  • We are using Go: just 1 microservice with 18 activities. Each activity only does a simple HTTP call to a backend with a ~100ms response delay (no complex logic); a rough sketch is shown after this list.

  • We run the Go microservice and the Temporal server services in AWS EKS. The Temporal server services use the Docker image temporalio/auto-setup:1.16.2.

  • We run 4 Temporal server services, 1 pod each: temporal-server-frontend, temporal-server-history, temporal-server-matching, and temporal-server-worker.

  • We are using Cassandra and Elasticsearch, each with 3 replicas.

  • We already did a performance test and here are the results. The performance requirement is about 200 workflows/sec.

    • 100 workflows completed in 4 seconds. ~25 workflows/sec
    • 1000 workflows completed in 1 minute 6 seconds. ~15 workflows/sec
    • 5000 workflows completed in 6 minutes 23 seconds. ~13 workflows/sec
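
For reference, here is a rough sketch of the workflow shape (a minimal, hypothetical example: the activity name, the backend URL, the sequential ordering, and the timeout/retry values are assumptions, not our exact code):

package app

import (
	"context"
	"net/http"
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// SampleWorkflow sketches the shape of our workflows: 18 activities,
// each one a simple HTTP call to a backend that answers in ~100ms.
func SampleWorkflow(ctx workflow.Context, input string) error {
	ao := workflow.ActivityOptions{
		StartToCloseTimeout: 10 * time.Second,
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval:    time.Second,
			BackoffCoefficient: 2.0,
			MaximumAttempts:    3,
		},
	}
	ctx = workflow.WithActivityOptions(ctx, ao)

	// 18 activities executed one after another (placeholder for 18 distinct activities).
	for i := 0; i < 18; i++ {
		if err := workflow.ExecuteActivity(ctx, CallBackend, input).Get(ctx, nil); err != nil {
			return err
		}
	}
	return nil
}

// CallBackend stands in for each activity: a simple HTTP call, no complex logic.
func CallBackend(ctx context.Context, input string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://backend.example/api", nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return nil
}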

Here are the configs we use:

  • Go microservice:
worker.Options{
...
		MaxConcurrentActivityTaskPollers:        40,
		MaxConcurrentWorkflowTaskPollers:        40,
		MaxConcurrentSessionExecutionSize:       8196,
		MaxConcurrentWorkflowTaskExecutionSize:  1024,
		MaxConcurrentActivityExecutionSize:      4096,
		MaxConcurrentLocalActivityExecutionSize: 2048,
		WorkerActivitiesPerSecond:               256000,
...
}
  • temporal-server-frontend, temporal-server-history, temporal-server-matching, temporal-server-worker.
    • dynamic config YAML:
matching.numTaskqueueReadPartitions:
- value: 15
  constraints: {}
matching.numTaskqueueWritePartitions:
- value: 15
  constraints: {}
    • environment variables in EKS:
containers:
      - env:
        - name: SERVICES
          value: frontend  # history, matching, or worker for the other services
        - name: CASSANDRA_PORT
          value: "9042"
        - name: CASSANDRA_SEEDS
          value: xx.xx.xx.xx,xx.xx.xx.xx,xx.xx.xx.xx
        - name: DB
          value: cassandra
        - name: KEYSPACE
          value: temporal
        - name: VISIBILITY_KEYSPACE
          value: temporal_visibility
        - name: ENABLE_ES
          value: "true"
        - name: ES_SCHEME
          value: http
        - name: ES_SEEDS
          value: xx.xx.xx.xx
        - name: ES_PORT
          value: "9200"
        - name: ES_VERSION
          value: v7
        - name: ES_VIS_INDEX
          value: temporal_visibility_v1_esb
        - name: NUM_HISTORY_SHARDS
          value: "512"
        - name: CASSANDRA_REPLICATION_FACTOR
          value: "3"
        - name: LOG_LEVEL
          value: debug,info
        - name: CASSANDRA_MAX_CONNS
          value: "64"

My questions:

Since we are not reaching 200 workflows/sec, what configurations do we need to change?

Here is the pprof output for the Go microservice, for reference:

Do you have both SDK and server metrics set up? It would help to figure out latencies, especially on your worker side; see here for more info.

  • name: NUM_HISTORY_SHARDS
    value: "512"

This seems to use the default of 512 shards, which is often too low; see here for general recommendations.

The temporal server services use docker image temporalio/auto-setup:1.16.2

Don't think auto-setup is recommended for production deployments, see here for more info.

Hi Tihomir,
Is there any performance drawback if we use the auto-setup Docker image?
We use auto-setup to simplify our POC (Proof of Concept).

Hi Tihomir,

I already tried using temporalio/server:1.16.2 and set up Cassandra & Elasticsearch myself (created the tables & indexes).
But I got the error below:
"error":"Unable to decode search attributes: invalid search attribute type: Unspecified"

{"level":"error","ts":"2022-05-24T09:07:13.104Z","msg":"Fail to process task","shard-id":1101,"address":"10.1.0.101:7234","component":"visibility-queue-processor","wf-namespace-id":"bb11fa9a-42d3-40fc-9219-1c0a705ddda2","wf-id":"tx1_server_test2","wf-run-id":"e7a9443f-a782-41ea-8d90-d88bfb43b86c","queue-task-id":1048579,"queue-task-visibility-timestamp":"2022-05-24T09:07:10.209Z","queue-task-type":"VisibilityStartExecution","queue-task":{"NamespaceID":"bb11fa9a-42d3-40fc-9219-1c0a705ddda2","WorkflowID":"tx1_server_test2","RunID":"e7a9443f-a782-41ea-8d90-d88bfb43b86c","VisibilityTimestamp":"2022-05-24T09:07:10.209487532Z","TaskID":1048579,"Version":0},"wf-history-event-id":0,"error":"Unable to decode search attributes: invalid search attribute type: Unspecified","lifecycle":"ProcessingFailed","logging-call-at":"taskProcessor.go:333","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/service/history.(*taskProcessor).handleTaskError\n\t/home/builder/temporal/service/history/taskProcessor.go:333\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/home/builder/temporal/service/history/taskProcessor.go:221\ngo.temporal.io/server/common/backoff.Retry.func1\n\t/home/builder/temporal/common/backoff/retry.go:104\ngo.temporal.io/server/common/backoff.RetryContext\n\t/home/builder/temporal/common/backoff/retry.go:125\ngo.temporal.io/server/common/backoff.Retry\n\t/home/builder/temporal/common/backoff/retry.go:105\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/home/builder/temporal/service/history/taskProcessor.go:247\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/home/builder/temporal/service/history/taskProcessor.go:177"}

As a workaround, I scaled up with temporalio/auto-setup:1.16.2 first and then rolled back to temporalio/server:1.16.2, and got no errors.

Changes I already tested based on your recommendations:

  • Using image temporalio/server:1.16.2 in all services.
  • Using NUM_HISTORY_SHARDS=4096.
  • Scaling up the Go SDK service to 4 pods.

But there is still no impact on performance.

  • 1000 workflows still completed in 1 minute.

We noticed that the request operation rate flattens out around 250 - 280 ops/s, and there is also a huge gap between "execution started" and "task started" in the Temporal UI monitoring. Is this related to limited throughput from the workflow worker? Is there any more configuration that can be tuned to improve the throughput?



Is there any more configuration that can be tuned to improve the throughput?

One thing you can check is your persistence latencies. Try Prometheus query:

histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))
for the CreateWorkflowExecution, UpdateWorkflowExecution, and UpdateShard operations. You will have to establish your own baseline for these latencies, but it is a good thing to look at.

Another thing to consider is worker capacity (number of pollers).
See this post and our worker tuning guide for more info.

Regarding shard count, try looking at shard lock contention:

histogram_quantile(0.99, sum(rate(lock_latency_bucket{operation="ShardInfo"}[1m])) by (le))

If this latency is too high, it indicates that you did not set enough shards.

How many frontend servers do you have handling client requests? Typically you want to have an L7 load balancer to round-robin client load among the available frontend servers.

Another thing you could look at are resource exhausted issues.
Prometheus query:
sum(rate(service_errors_resource_exhausted{}[1m])) by (resource_exhausted_cause)

Look for error causes: RpsLimit, ConcurrentLimit, and SystemOverloaded. These are per service instance.

Thank you for the insight, Tihomir.
I already checked resource exhausted errors and found out there were RpsLimit errors in the matching service.

I changed the RPS values in the dynamic config and got some improvements,
but sometimes another RpsLimit error occurred in the frontend service.

Here are the latest performance results and my configurations:

  • 1000 workflows completed in 25 seconds, around 40 workflows/sec.
    Improved from 10 workflows/sec.

  • 1 pod golang SDK service, 1 pod temporal-frontend, 1 pod temporal-history, 1 pod temporal-matching, and 1 pod temporal-worker.

  • Go worker configuration:

worker.Options{
...
		MaxConcurrentActivityTaskPollers:        32,
		MaxConcurrentWorkflowTaskPollers:        32,
		MaxConcurrentSessionExecutionSize:       64,
		MaxConcurrentWorkflowTaskExecutionSize:  64,
		MaxConcurrentActivityExecutionSize:      64,
		MaxConcurrentLocalActivityExecutionSize: 64,
		WorkerActivitiesPerSecond:               256000,
...
}
  • temporal server services and dynamic configuration.
    NUM_HISTORY_SHARDS=4096
matching.numTaskqueueReadPartitions:
- value: 1
  constraints: {}
matching.numTaskqueueWritePartitions:
- value: 1
  constraints: {}
matching.rps:
- value: 51200
  constraints: {}
frontend.rps:
- value: 19200
  constraints: {}
history.rps:
- value: 64000
  constraints: {}

From this exploration, we noticed that the RPS limits, activity/task pollers, and matching task queue partitions are tightly related.
However, we are still in a blind spot figuring out the optimal settings to fulfill our needs. Are there any rules of thumb for tuning these parameters?

Here are some questions that popped up and need clarification or guidance, so we can tune Temporal better in our environment.

  1. How do we correlate the RPS limits between frontend, matching, and history? In our experiment, it seems matching needs the highest RPS while history needs the lowest.
  2. Is the RPS limit in our experiment considered too high (51200 RPS)?
  3. How many activity/task pollers are considered too many?
  4. What is the impact of setting the RPS number too high? Same question for partitions and pollers.
  5. How do we determine how many partitions we need? We tried to increase the queue partitions, but our process became slower.

Have you considered using local activities for short calls?

  1. Try to use 3 frontend pods, 3 matching pods, 3 history pods, and 1 worker pod.
  2. Check CPU/memory utilization of the above pods when running the load.
  3. Make sum(rate(poll_success_sync[1m]))/sum(rate(poll_success[1m])) as close to 100% as possible by increasing MaxConcurrentActivityTaskPollers and MaxConcurrentWorkflowTaskPollers (a hypothetical starting point is sketched after this list).
  4. Use 4 for both matching.numTaskqueueReadPartitions and matching.numTaskqueueWritePartitions.
  5. Try to use local activities, which will improve latency (fewer unnecessary round trips).
  6. Assuming my understanding is correct: per workflow there will be 18 normal activities, and the target is 100 workflows per sec ← this translates to roughly 5K DB transactions per sec.
    • Check whether your DB is overloaded (DB CPU/memory).
    • Check the metrics emitted from the Temporal persistence layer: histogram_quantile(0.99, sum(rate(persistence_latency_bucket[1m])) by (operation, le)) ← p99 latency, target is ~50ms (CreateWorkflowExecution, UpdateWorkflowExecution).
  7. Try to use more workflows for testing; the start-to-completion rate for 1000 workflows may not be accurate (long tail?), so try 10,000 or maybe more.
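
For item 3, a hypothetical starting point on the worker side could look like the snippet below. The specific numbers are placeholders to iterate on while watching the sync-match ratio and the server-side RpsLimit metrics, not recommendations:

package app

import "go.temporal.io/sdk/worker"

// newWorkerOptions sketches item 3: more pollers, generous executor slots.
func newWorkerOptions() worker.Options {
	return worker.Options{
		// Raise pollers gradually while watching
		// sum(rate(poll_success_sync[1m])) / sum(rate(poll_success[1m]));
		// more pollers also means more request load on frontend/matching,
		// which can run into their RPS limits.
		MaxConcurrentWorkflowTaskPollers: 64,
		MaxConcurrentActivityTaskPollers: 64,
		// Keep executor slots above the poller counts so dispatched tasks
		// are not queued behind a small executor pool.
		MaxConcurrentWorkflowTaskExecutionSize: 512,
		MaxConcurrentActivityExecutionSize:     1024,
	}
}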

@maxim
Have you considered using local activities for short calls?

Hi Maxim,

We haven't considered local activities because, even though the activities are short calls, we initially wanted to leverage Temporal's retry mechanism to keep our workflow code simple. From what I read about local activities, we would need to implement retries ourselves and consider the workflow timeout boundary for that particular activity.

@Wenquan_Xing

  1. Try to use 3 frontend pods, 3 matching pods, 3 history pods, and 1 worker pod.

Hi Wen Quan,

Thank you for the setup recommendation; we will try out these settings + 10k workflows and report back to this forum once it's done.

  3. Make sum(rate(poll_success_sync[1m]))/sum(rate(poll_success[1m])) as close to 100% as possible by increasing MaxConcurrentActivityTaskPollers and MaxConcurrentWorkflowTaskPollers.

For the poll success sync to poll success ratio, we did try to increase the poller settings and it degraded performance; we suspect this is due to the RPS limit configuration in the Temporal server services. Could you please enlighten us on how to correlate the RPS limits between the Temporal services, so we can figure out how to set them better? Is there any documentation on tuning the Temporal services via dynamic config?

  6. Assuming my understanding is correct: per workflow there will be 18 normal activities, and the target is 100 workflows per sec ← this translates to roughly 5K DB transactions per sec.

Last question from your feedback: how do you estimate that 1 workflow with 18 normal activities, at 100 workflows per second, translates to roughly 5K DB transactions per second?
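Our own rough guess, assuming each activity costs roughly 2 to 3 persistence writes (task creation plus history updates) on top of the workflow's own start/complete writes, would be 100 workflows/sec × 18 activities × ~3 writes ≈ 5,400, i.e. about 5K DB transactions per second — is that the reasoning?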

This is not correct. Local activities do support retries and their retry options can be changed through LocalActivityOptions.
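
For reference, a minimal Go SDK sketch of a local activity with a retry policy (the activity function, input, and timeout/retry values are placeholders; imports are the same as in the workflow sketch earlier in the thread):

// callBackendLocally runs CallBackend as a local activity from workflow code.
func callBackendLocally(ctx workflow.Context, input string) error {
	lao := workflow.LocalActivityOptions{
		StartToCloseTimeout: 2 * time.Second,
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval:    100 * time.Millisecond,
			BackoffCoefficient: 2.0,
			MaximumAttempts:    5,
		},
	}
	ctx = workflow.WithLocalActivityOptions(ctx, lao)

	// Retries are handled by the SDK according to the RetryPolicy above,
	// not by hand-written workflow code.
	return workflow.ExecuteLocalActivity(ctx, CallBackend, input).Get(ctx, nil)
}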

Hi @maxim
Please help me with this related topic, thanks

I have the same question as Hilman Adam

Could you please enlighten us on how to correlate the RPS limits between the Temporal services, so we can figure out how to set them better? Is there any documentation on tuning the Temporal services via dynamic config?

Hi Wen Quan,

  • Check the metrics emitted from the Temporal persistence layer: histogram_quantile(0.99, sum(rate(persistence_latency_bucket[1m])) by (operation, le)) ← p99 latency, target is ~50ms (CreateWorkflowExecution, UpdateWorkflowExecution).

What's the unit for this metric/expression query? Is it seconds or milliseconds?

Hi @Hilman_Adam, answered in this post.
