Matching service high QPS on persistence

Hi,

I have some problem running temporal in production. I am noticing QPS on persistence layer is 10X from matching then it normally should be(without workflow running) and history service generate a lot Queue processor pump shut down and Error updating timer ack level for shard. I wonder what could be the potential issue here.

{“level”:“info”,“ts”:“2021-07-29T19:13:33.256Z”,“msg”:“Queue processor pump shut down.”,“service”:“history”,“shard-id”:162,“address”:“10.32.11.179:7934”,“shard-item”:“0xc018b0a400”,“component”:“visibility-queue-processor”,“logging-call-at”:“queueProcessor.go:248”}

{“level”:“info”,“ts”:“2021-07-29T19:13:33.256Z”,“msg”:“Task processor shutdown.”,“service”:“history”,“shard-id”:162,“address”:“10.32.11.179:7934”,“shard-item”:“0xc018b0a400”,“component”:“visibility-queue-processor”,“logging-call-at”:“taskProcessor.go:145”}

{“level”:“info”,“ts”:“2021-07-29T19:13:33.256Z”,“msg”:“none”,“service”:“history”,“shard-id”:162,“address”:“10.32.11.179:7934”,“shard-item”:“0xc018b0a400”,“component”:“visibility-queue-processor”,“lifecycle”:“Stopped”,“component”:“transfer-queue-processor”,“logging-call-at”:“queueProcessor.go:178”}

{“level”:“info”,“ts”:“2021-07-29T19:13:33.256Z”,“msg”:“none”,“service”:“history”,“shard-id”:162,“address”:“10.32.11.179:7934”,“shard-item”:“0xc018b0a400”,“component”:“history-engine”,“lifecycle”:“Stopped”,“logging-call-at”:“historyEngine.go:375”}

{“level”:“info”,“ts”:“2021-07-29T19:13:33.256Z”,“msg”:“none”,“service”:“history”,“shard-id”:162,“address”:“10.32.11.179:7934”,“shard-item”:“0xc018b0a400”,“lifecycle”:“Stopped”,“component”:“shard-engine”,“logging-call-at”:“controller_impl.go:462”}

{“level”:“info”,“ts”:“2021-07-29T19:13:33.264Z”,“msg”:“Close shard”,“service”:“history”,“shard-id”:1124,“address”:“10.32.11.179:7934”,“shard-item”:“0xc01346f680”,“logging-call-at”:“context_impl.go:807”}

{“level”:“error”,“ts”:“2021-07-29T19:13:33.264Z”,“msg”:“Error updating timer ack level for shard”,“service”:“history”,“shard-id”:1124,“address”:“10.32.11.179:7934”,“shard-item”:“0xc01346f680”,“component”:“timer-queue-processor”,“cluster-name”:“active”,“error”:“Failed to update shard. Previous range ID: 14395; new range ID: 14396”,“logging-call-at”:“timerQueueAckMgr.go:402”,“stacktrace”:“go.temporal.io/server/common/log/loggerimpl.(*loggerImpl).Error\n\t/temporal/common/log/loggerimpl/logger.go:138\ngo.temporal.io/server/service/history.(*timerQueueAckMgrImpl).updateAckLevel\n\t/temporal/service/history/timerQueueAckMgr.go:402\ngo.temporal.io/server/service/history.(*timerQueueProcessorBase).internalProcessor\n\t/temporal/service/history/timerQueueProcessorBase.go:319\ngo.temporal.io/server/service/history.(*timerQueueProcessorBase).processorPump\n\t/temporal/service/history/timerQueueProcessorBase.go:194”}

the error shown above only means some shard movement in history service

any metrics / logs from matching service?

I am seeing persistence high qps from temporal prom metrics persistence_requests.
Does the big range ID indicates that membership ring problem?
“error”:“Failed to update shard. Previous range ID: 14395; new range ID: 14396”,“

I do not see error msg from matching service, mostly just workflow life cycle logs.

first this is history service log, not matching

if this kind of error alway happens, then you need to check your network configs see if your service host can still talk to each other

and maybe you are encountering this issue?

Thanks for the pointer, I am seeing similar log from FE discover two history IP in different clusters.

{“level”:“info”,“ts”:“2021-07-29T21:55:41.351Z”,“msg”:“Current reachable members”,“service”:“frontend”,“component”:“service-resolver”,“service”:“history”,“addresses”:"[10.32.159.46:7934 10.32.41.15:7934",“logging-call-at”:“rpServiceResolver.go:266”}

What is the recommended way to solve, is that assigning different membership ports for all services in each temporal cluster?

How would the cross cluster IP address discovery happen in the first place?

for the time being, try to use different sets of membership port for different Temporal clusters.

the issue can appears if k8s decide to recycle pods, so one pod can move from one Temporal cluster to another Temporal cluster (same k8s cluster, but diff Temporal clusters)

In my case, it is across k8s clusters(one temporal cluster per k8s cluster).

I need to update k8s yaml file membershipPort only, or both membershipPort and temporal dockerfile expose?

is it possible that 2 k8s clusters can talk to each other?.. not a k8s expert

membershipPort should do the job
if dockerfile expose the membershipPort, then this should also be updated?

After adding cluster specific membershipPort in services.service.rpc, I am still seeing service grpc ports in different cluster get discovered. Do I need to use different grpc ports per cluster?

Also tried to remove DB key space in a different region(use to be both us west and us east now us west only), that improved qps load on existing clusters from 10X to 3X in cluster 01 and 02. But the shard steal is still there. However the staging env is good without the issue. Any idea I could try?

history:
rpc:
grpcPort: 7934
membershipPort: 6734
bindOnIP: “0.0.0.0”

{“level”:“info”,“ts”:“2021-07-30T05:29:57.931Z”,“msg”:“Current reachable members”,“service”:“frontend”,“component”:“service-resolver”,“service”:“history”,“addresses”:"[10.32.187.66:7934 10.32.36.243:7934",“logging-call-at”:“rpServiceResolver.go:266”}

Do I need to use different grpc ports per cluster?

If clusters are on different networks, then you don’t.
Are you using different membership ports for different services (history, matching, frontend, worker) though? Those must be different.

Yes, I am using different membership ports for different service.

Seems when I give unique grpc ports, the shard rebalancing issue disappears, but that just leave a large number of ports to manage (4 membership ports + 4 grpc ports) * cluster number.

Seems when I give unique grpc ports, the shard rebalancing issue disappears

Good. This confirms that misconfiguration of membership was the source of the sharding confusion.

but that just leave a large number of ports to manage (4 membership ports + 4 grpc ports) * cluster number.

I would argue that 8 ports per cluster is not too many, considering the 64K theoretical limit on the number of ports. A popular approach I’ve seen is to assign a base port number to cluster, e.g. 10000, 10100, 10200, and so forth, with individual port numbers being a function of the base. It makes port assignments easily automatable.

This shard rebalancing and high QPS on DB is related to grpc port and membership port setup. I also tried to use turl to verify ring pop per other post, what confuse me was response IP does not match service IP in environment where there is only one temporal cluster. In environment where there are 2 temporal clusters, response IPs come from 2 temporal clusters and response IP does not match key either. What could be the possible reason and will this affect functionality? Workflow seems running fine here.

This is what I see for 1 temporal cluster in 1 k8s cluster
[ // listing pod IP: membership port
“10.32.1.143:6933”, //fe
“10.32.36.210:6934", //history
“10.32.43.126:6935”, //matching
“10.32.52.187:6939" //worker
]
tcurl ringpop -P hosts.json /admin/lookup ‘{“key”: “worker”}’
{“ok”:true,“head”:null,“body”:{“dest”:“10.32.1.143:6933”},“headers”:{“as”:“json”},“trace”:“10d584998804bd65"}
tcurl ringpop -P hosts.json /admin/lookup ‘{“key”: “frontend”}’
{“ok”:true,“head”:null,“body”:{“dest”:“10.32.52.187:6939”},“headers”:{“as”:“json”},“trace”:“0131563ac7cf61e2"}
tcurl ringpop -P hosts.json /admin/lookup ‘{“key”: “matching”}’
{“ok”:true,“head”:null,“body”:{“dest”:“10.32.36.210:6934”},“headers”:{“as”:“json”},“trace”:“e3cbc5c3f423caa0"}
tcurl ringpop -P hosts.json /admin/lookup ‘{“key”: “history”}’
{“ok”:true,“head”:null,“body”:{“dest”:“10.32.52.187:6939”},“headers”:{“as”:“json”},“trace”:“a601c9915717d058"} (edited)

As I see, this got answered in Slack with the same recommendation to use unique membership ports for each service of each cluster when running in an environment where cross-clusters network connectivity is enabled.

Yes, thanks for your time

1 Like