Hi,
We are running a load test against a MySQL-based Temporal server. I am able to achieve around 1k/s workflow task throughput. With more load, "context deadline exceeded" errors start piling up in the Temporal server logs while MySQL is only at about 50% CPU utilization. I wonder whether there is headroom for me to optimize further, or whether we have hit the capacity limit of MySQL? Thanks!
And we are seeing lots of server errors like this:
{"level":"error","ts":"2021-01-26T21:48:49.420Z","msg":"Operation failed with internal error.","service":"history","error":"AppendHistoryEvents: context deadline exceeded","metric-scope":222,"logging-call-at":"persistenceMetricClients.go:1235","stacktrace":"go.temporal.io/server/common/log/loggerimpl.(*loggerImpl).Error \n\t/temporal/common/log/loggerimpl/logger.go:138\ngo.temporal.io/server/common/persistence.(*historyV2PersistenceClient).updateErrorMetric\n\t/temporal/common/persistence/persistenceMetricClients.go:1235\ngo.temporal.io/server/common/persistence.(*historyV2PersistenceClient).AppendHistoryNodes\n\t/temporal/common/persistence/persistenceMetricClients.go:1134\ngo.temporal.io/server/service/history/shard.(*ContextImpl).AppendHistoryV2Events\n\t/temporal/service/history/shard/context_impl.go:769\ngo.temporal.io/server/service/history.(*workflowExecutionContextImpl).appendHistoryV2EventsWithRetry.func1\n\t/temporal/service/history/workflowExecutionContext.go:945\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/history.(*workflowExecutionContextImpl).appendHistoryV2EventsWithRetry\n\t/temporal/service/history/workflowExecutionContext.go:949\ngo.temporal.io/server/service/history.(*workflowExecutionContextImpl).persistNonFirstWorkflowEvents\n\t/temporal/service/history/workflowExecutionContext.go:923\ngo.temporal.io/server/service/history.(*workflowExecutionContextImpl).updateWorkflowExecutionWithNew\n\t/temporal/service/history/workflowExecutionContext.go:693\ngo.temporal.io/server/service/history.(*workflowExecutionContextImpl).updateWorkflowExecutionAsActive\n\t/temporal/service/history/workflowExecutionContext.go:601\ngo.temporal.io/server/service/history.(*historyEngineImpl).updateWorkflowHelper\n\t/temporal/service/history/historyEngine.go:2413\ngo.temporal.io/server/service/history.(*historyEngineImpl).updateWorkflow\n\t/temporal/service/history/historyEngine.go:2351\ngo.temporal.io/server/service/history.(*historyEngineImpl).SignalWorkflowExecution\n\t/temporal/service/history/historyEngine.go:1780\ngo.temporal.io/server/service/history.(*Handler).SignalWorkflowExecution\n\t/temporal/service/history/handler.go:956\ngo.temporal.io/server/api/historyservice/v1._HistoryService_SignalWorkflowExecution_Handler.func1\n\t/temporal/api/historyservice/v1/service.pb.go:1073\ngo.temporal.io/server/common/rpc.ServiceErrorInterceptor\n\t/temporal/common/rpc/grpc.go:100\ngo.temporal.io/server/api/historyservice/v1._HistoryService_SignalWorkflowExecution_Handler\n\t/temporal/api/historyservice/v1/service.pb.go:1075\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.33.2/server.go:1210\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.33.2/server.go:1533\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.33.2/server.go:871"}
These are my cluster setup parameters:
env:
  - name: NUM_HISTORY_SHARDS
    value: "1024"
  - name: DB
    value: mysql
  - name: DYNAMIC_CONFIG_FILE_PATH
    value: config/dynamicconfig/development.yaml
  - name: MYSQL_PWD
    valueFrom:
      secretKeyRef:
        key: password
        name: cloudsql-db-credentials
  - name: MYSQL_SEEDS
    value: 127.0.0.1
  - name: MYSQL_USER
    valueFrom:
      secretKeyRef:
        key: username
        name: cloudsql-db-credentials
  - name: SQL_MAX_CONNS
    value: "400"
  - name: SQL_MAX_IDLE_CONNS
    value: "200"
  - name: SQL_MAX_CONN_TIME
    value: 1h
  - name: MYSQL_TX_ISOLATION_COMPAT
    value: "true"
  - name: STATSD_ENDPOINT
    value: 127.0.0.1:8125
  - name: BIND_ON_IP
    value: 0.0.0.0
  - name: TEMPORAL_BROADCAST_ADDRESS
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: status.podIP
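For reference, my understanding is that the SQL_* env vars above are templated by auto-setup into the server's static persistence config, roughly like this (key names from memory of Temporal's SQL config schema, so treat them as approximate):

```yaml
persistence:
  datastores:
    default:
      sql:
        pluginName: "mysql"
        connectAddr: "127.0.0.1:3306"
        maxConns: 400         # SQL_MAX_CONNS (per service process)
        maxIdleConns: 200     # SQL_MAX_IDLE_CONNS
        maxConnLifetime: "1h" # SQL_MAX_CONN_TIME
```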
data:
  dynamic_config.yaml: |-
    frontend.enableClientVersionCheck:
      - value: true
        constraints: {}
    system.minRetentionDays:
      - value: 0
        constraints: {}
    history.persistenceMaxQPS:
      - value: 35000
        constraints: {}
    frontend.persistenceMaxQPS:
      - value: 30000
        constraints: {}
    frontend.historyMgrNumConns:
      - value: 100
        constraints: {}
    frontend.throttledLogRPS:
      - value: 20
        constraints: {}
    history.historyMgrNumConns:
      - value: 500
        constraints: {}
    system.advancedVisibilityWritingMode:
      - value: "off"
        constraints: {}
    matching.numTaskqueueReadPartitions:
      - value: 64
        constraints: {}
    matching.numTaskqueueWritePartitions:
      - value: 64
        constraints: {}
    matching.maxTaskBatchSize:
      - value: 1000
        constraints: {}
    history.defaultActivityRetryPolicy:
      - value:
          InitialIntervalInSeconds: 1
          MaximumIntervalCoefficient: 100.0
          BackoffCoefficient: 2.0
          MaximumAttempts: 0
    history.defaultWorkflowRetryPolicy:
      - value:
          InitialIntervalInSeconds: 1
          MaximumIntervalCoefficient: 100.0
          BackoffCoefficient: 2.0
          MaximumAttempts: 0
Can you please provide the following?

General
- MySQL: CPU utilization / CPU load / memory utilization / IO capacity / IO utilization
- Temporal: frontend / matching / history server CPU utilization / CPU load / memory utilization

Persistence metrics: persistence_latency
Matching service metrics: poll_success_per_tl and poll_success_sync_per_tl
Minghan_Fu:

matching.numTaskqueueReadPartitions:
  - value: 64
    constraints: {}
matching.numTaskqueueWritePartitions:
  - value: 64

These two values may be too high, BTW.
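If you want to dial the two partition settings quoted above back toward the defaults (the default partition count is 4, if I remember correctly), the dynamic config entries would look like:

```yaml
matching.numTaskqueueReadPartitions:
  - value: 4
    constraints: {}
matching.numTaskqueueWritePartitions:
  - value: 4
    constraints: {}
```

More partitions spread polls across more matching hosts, but each partition adds its own persistence traffic, so oversizing them can increase DB load without improving throughput.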