Temporal server load test

Hi,

We are running a load test against a MySQL-based Temporal server. I am able to achieve around 1k/s workflow task throughput. With more load, I start to see "context deadline exceeded" errors piling up in the Temporal server logs while MySQL is only at about 50% CPU utilization. I wonder if there is headroom for me to optimize further, or whether we have hit the capacity limit of MySQL? Thanks!

We are also seeing lots of server errors like this:

```
{"level":"error","ts":"2021-01-26T21:48:49.420Z","msg":"Operation failed with internal error.","service":"history","error":"AppendHistoryEvents: context deadline exceeded","metric-scope":222,"logging-call-at":"persistenceMetricClients.go:1235","stacktrace":"go.temporal.io/server/common/log/loggerimpl.(*loggerImpl).Error\n\t/temporal/common/log/loggerimpl/logger.go:138\ngo.temporal.io/server/common/persistence.(*historyV2PersistenceClient).updateErrorMetric\n\t/temporal/common/persistence/persistenceMetricClients.go:1235\ngo.temporal.io/server/common/persistence.(*historyV2PersistenceClient).AppendHistoryNodes\n\t/temporal/common/persistence/persistenceMetricClients.go:1134\ngo.temporal.io/server/service/history/shard.(*ContextImpl).AppendHistoryV2Events\n\t/temporal/service/history/shard/context_impl.go:769\ngo.temporal.io/server/service/history.(*workflowExecutionContextImpl).appendHistoryV2EventsWithRetry.func1\n\t/temporal/service/history/workflowExecutionContext.go:945\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/history.(*workflowExecutionContextImpl).appendHistoryV2EventsWithRetry\n\t/temporal/service/history/workflowExecutionContext.go:949\ngo.temporal.io/server/service/history.(*workflowExecutionContextImpl).persistNonFirstWorkflowEvents\n\t/temporal/service/history/workflowExecutionContext.go:923\ngo.temporal.io/server/service/history.(*workflowExecutionContextImpl).updateWorkflowExecutionWithNew\n\t/temporal/service/history/workflowExecutionContext.go:693\ngo.temporal.io/server/service/history.(*workflowExecutionContextImpl).updateWorkflowExecutionAsActive\n\t/temporal/service/history/workflowExecutionContext.go:601\ngo.temporal.io/server/service/history.(*historyEngineImpl).updateWorkflowHelper\n\t/temporal/service/history/historyEngine.go:2413\ngo.temporal.io/server/service/history.(*historyEngineImpl).updateWorkflow\n\t/temporal/service/history/historyEngine.go:2351\ngo.temporal.io/server/service/history.(*historyEngineImpl).SignalWorkflowExecution\n\t/temporal/service/history/historyEngine.go:1780\ngo.temporal.io/server/service/history.(*Handler).SignalWorkflowExecution\n\t/temporal/service/history/handler.go:956\ngo.temporal.io/server/api/historyservice/v1._HistoryService_SignalWorkflowExecution_Handler.func1\n\t/temporal/api/historyservice/v1/service.pb.go:1073\ngo.temporal.io/server/common/rpc.ServiceErrorInterceptor\n\t/temporal/common/rpc/grpc.go:100\ngo.temporal.io/server/api/historyservice/v1._HistoryService_SignalWorkflowExecution_Handler\n\t/temporal/api/historyservice/v1/service.pb.go:1075\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.33.2/server.go:1210\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.33.2/server.go:1533\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.33.2/server.go:871"}
```

These are my cluster setup params:

```yaml
env:
  - name: NUM_HISTORY_SHARDS
    value: "1024"
  - name: DB
    value: mysql
  - name: DYNAMIC_CONFIG_FILE_PATH
    value: config/dynamicconfig/development.yaml
  - name: MYSQL_PWD
    valueFrom:
      secretKeyRef:
        key: password
        name: cloudsql-db-credentials
  - name: MYSQL_SEEDS
    value: 127.0.0.1
  - name: MYSQL_USER
    valueFrom:
      secretKeyRef:
        key: username
        name: cloudsql-db-credentials
  - name: SQL_MAX_CONNS
    value: "400"
  - name: SQL_MAX_IDLE_CONNS
    value: "200"
  - name: SQL_MAX_CONN_TIME
    value: 1h
  - name: MYSQL_TX_ISOLATION_COMPAT
    value: "true"
  - name: STATSD_ENDPOINT
    value: 127.0.0.1:8125
  - name: BIND_ON_IP
    value: 0.0.0.0
  - name: TEMPORAL_BROADCAST_ADDRESS
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: status.podIP
```
```yaml
data:
  dynamic_config.yaml: |-
    frontend.enableClientVersionCheck:
      - value: true
        constraints: {}
    system.minRetentionDays:
      - value: 0
        constraints: {}
    history.persistenceMaxQPS:
      - value: 35000
        constraints: {}
    frontend.persistenceMaxQPS:
      - value: 30000
        constraints: {}
    frontend.historyMgrNumConns:
      - value: 100
        constraints: {}
    frontend.throttledLogRPS:
      - value: 20
        constraints: {}
    history.historyMgrNumConns:
      - value: 500
        constraints: {}
    system.advancedVisibilityWritingMode:
      - value: "off"
        constraints: {}
    matching.numTaskqueueReadPartitions:
      - value: 64
        constraints: {}
    matching.numTaskqueueWritePartitions:
      - value: 64
        constraints: {}
    matching.maxTaskBatchSize:
      - value: 1000
        constraints: {}
    history.defaultActivityRetryPolicy:
      - value:
          InitialIntervalInSeconds: 1
          MaximumIntervalCoefficient: 100.0
          BackoffCoefficient: 2.0
          MaximumAttempts: 0
    history.defaultWorkflowRetryPolicy:
      - value:
          InitialIntervalInSeconds: 1
          MaximumIntervalCoefficient: 100.0
          BackoffCoefficient: 2.0
          MaximumAttempts: 0
```
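One thing that may be worth double-checking in the settings above (a back-of-the-envelope sketch, not a statement about Temporal internals): `history.historyMgrNumConns` (500) is larger than `SQL_MAX_CONNS` (400). If each history-manager connection competes for a slot in the SQL connection pool (an assumption here), demand can exceed the pool size under load and callers queue until their context deadline expires.

```python
# Hypothetical sanity check; values copied from the config above.
sql_max_conns = 400          # SQL_MAX_CONNS (per service instance)
history_mgr_num_conns = 500  # history.historyMgrNumConns

# Assumption: every history-manager session needs a pooled SQL connection.
# If so, more sessions than pool slots means waiters can time out.
oversubscribed = history_mgr_num_conns > sql_max_conns
print("pool oversubscribed:", oversubscribed)  # → True
```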

Can you please provide the following?

General
MySQL: CPU utilization / CPU load / memory utilization / IO capacity / IO utilization
Temporal: frontend / matching / history server CPU utilization / CPU load / memory utilization

Persistence metrics: persistence_latency
Matching service metrics: poll_success_per_tl and poll_success_sync_per_tl

BTW, these 2 values may be too high.
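Once you have the matching metrics, one way to read them (sample numbers below are hypothetical; the real counters come from your statsd sink): the sync-match rate is `poll_success_sync_per_tl` divided by `poll_success_per_tl`. A sync match hands the task directly to a waiting poller, while a low rate means more tasks go through persistence and add DB load.

```python
# Hypothetical sample readings -- substitute your own metric values.
poll_success_per_tl = 120_000       # total successful task queue polls
poll_success_sync_per_tl = 90_000   # polls matched synchronously to a poller

sync_match_rate = poll_success_sync_per_tl / poll_success_per_tl
print(f"sync match rate: {sync_match_rate:.0%}")  # → sync match rate: 75%
```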