Matching service high QPS on persistence

Bo_Gao · July 29, 2021, 9:03pm

Hi,

I have some problem running temporal in production. I am noticing QPS on persistence layer is 10X from matching then it normally should be(without workflow running) and history service generate a lot Queue processor pump shut down and Error updating timer ack level for shard. I wonder what could be the potential issue here.

{“level”:“info”,“ts”:“2021-07-29T19:13:33.256Z”,“msg”:“Queue processor pump shut down.”,“service”:“history”,“shard-id”:162,“address”:“10.32.11.179:7934”,“shard-item”:“0xc018b0a400”,“component”:“visibility-queue-processor”,“logging-call-at”:“queueProcessor.go:248”}

{“level”:“info”,“ts”:“2021-07-29T19:13:33.256Z”,“msg”:“Task processor shutdown.”,“service”:“history”,“shard-id”:162,“address”:“10.32.11.179:7934”,“shard-item”:“0xc018b0a400”,“component”:“visibility-queue-processor”,“logging-call-at”:“taskProcessor.go:145”}

{“level”:“info”,“ts”:“2021-07-29T19:13:33.256Z”,“msg”:“none”,“service”:“history”,“shard-id”:162,“address”:“10.32.11.179:7934”,“shard-item”:“0xc018b0a400”,“component”:“visibility-queue-processor”,“lifecycle”:“Stopped”,“component”:“transfer-queue-processor”,“logging-call-at”:“queueProcessor.go:178”}

{“level”:“info”,“ts”:“2021-07-29T19:13:33.256Z”,“msg”:“none”,“service”:“history”,“shard-id”:162,“address”:“10.32.11.179:7934”,“shard-item”:“0xc018b0a400”,“component”:“history-engine”,“lifecycle”:“Stopped”,“logging-call-at”:“historyEngine.go:375”}

{“level”:“info”,“ts”:“2021-07-29T19:13:33.256Z”,“msg”:“none”,“service”:“history”,“shard-id”:162,“address”:“10.32.11.179:7934”,“shard-item”:“0xc018b0a400”,“lifecycle”:“Stopped”,“component”:“shard-engine”,“logging-call-at”:“controller_impl.go:462”}

{“level”:“info”,“ts”:“2021-07-29T19:13:33.264Z”,“msg”:“Close shard”,“service”:“history”,“shard-id”:1124,“address”:“10.32.11.179:7934”,“shard-item”:“0xc01346f680”,“logging-call-at”:“context_impl.go:807”}

{“level”:“error”,“ts”:“2021-07-29T19:13:33.264Z”,“msg”:“Error updating timer ack level for shard”,“service”:“history”,“shard-id”:1124,“address”:“10.32.11.179:7934”,“shard-item”:“0xc01346f680”,“component”:“timer-queue-processor”,“cluster-name”:“active”,“error”:“Failed to update shard. Previous range ID: 14395; new range ID: 14396”,“logging-call-at”:“timerQueueAckMgr.go:402”,“stacktrace”:“go.temporal.io/server/common/log/loggerimpl.(*loggerImpl).Error\n\t/temporal/common/log/loggerimpl/logger.go:138\ngo.temporal.io/server/service/history.(*timerQueueAckMgrImpl).updateAckLevel\n\t/temporal/service/history/timerQueueAckMgr.go:402\ngo.temporal.io/server/service/history.(*timerQueueProcessorBase).internalProcessor\n\t/temporal/service/history/timerQueueProcessorBase.go:319\ngo.temporal.io/server/service/history.(*timerQueueProcessorBase).processorPump\n\t/temporal/service/history/timerQueueProcessorBase.go:194”}

Wenquan_Xing · July 29, 2021, 9:14pm

the error shown above only means some shard movement in history service

Wenquan_Xing · July 29, 2021, 9:15pm

any metrics / logs from matching service?

Bo_Gao · July 29, 2021, 9:25pm

I am seeing persistence high qps from temporal prom metrics persistence_requests.
Does the big range ID indicates that membership ring problem?
“error”:“Failed to update shard. Previous range ID: 14395; new range ID: 14396”,“

I do not see error msg from matching service, mostly just workflow life cycle logs.

Wenquan_Xing · July 29, 2021, 9:27pm

first this is history service log, not matching

if this kind of error alway happens, then you need to check your network configs see if your service host can still talk to each other

and maybe you are encountering this issue?

github.com/temporalio/temporal

Prevent incorrect service discovery with multiple Temporal clusters

opened 07:58PM - 27 Jan 21 UTC

closed 06:20PM - 01 Dec 22 UTC

skrul

enhancement difficulty: easy

**Is your feature request related to a problem? Please describe.** I run mult…iple separate Temporal clusters within a single k8s cluster. Each Temporal cluster has its own separate set of frontend, history, and matching services as well as persistence. Let's say I am running two Temporal clusters called "A" and "B" in a single k8s cluster. Note that in my setup, there are no networking restrictions on pods within the k8s cluster -- any pod may connect to any other pod if the IP address is known. I recently encountered a problem where it appeared that a frontend service from Temporal cluster A was talking to a matching service from Temporal cluster B. This happened during a time where the pods in both of the Temporal clusters were getting cycled a lot due to some AZ balancing automation. It also happens that this particular k8s cluster is configured in such a way that pod IP address reuse is more likely than usual. Both Temporal cluster A and B are running 3 matching nodes each. However, I saw this log line on Temporal cluster A's frontend service: ``` {"level":"info","ts":"2021-01-27T00:34:15.414Z","msg":"Current reachable members","service":"frontend","component":"service-resolver","service":"matching","addresses":"[100.123.207.80:7235 100.123.65.65:7235 100.123.120.28:7235 100.123.60.187:7235 100.123.17.255:7235 100.123.203.172:7235]","logging-call-at":"rpServiceResolver.go:266"} ``` This is saying that Temporal cluster A's frontend service is seeing 6 matching nodes, three from A and three from B. Yikes. I believe what led to this is something like: 1. A matching pod in cluster A gets replaced, releasing its IP address. This IP address remains in cluster A's `cluster_metadata` table. 2. A matching pod is created in cluster B re-using this IP address. 3. An event occurs that causes a frontend in cluster A to re-read the the node membership for its matching nodes. It finds the original matching node's IP address still in the table and it can still connect to it even though it is actually now a matching node in cluster B. 4. Through this matching node in cluster B the the other cluster B matching nodes are located. My fix for this is to make sure that each Temporal cluster has its own set of membership ports for each service. This would have prevented the discovery process in cluster A from seeing the pods in cluster B since it would be trying to connect on a different port. **Describe the solution you'd like** It may be possible to prevent this by including a check that a given node is indeed part of the correct cluster before adding it to the ring. **Describe alternatives you've considered** I don't believe our k8s environment has an easy way to prevent this using networking restrictions.

Bo_Gao · July 29, 2021, 10:09pm

Thanks for the pointer, I am seeing similar log from FE discover two history IP in different clusters.

{“level”:“info”,“ts”:“2021-07-29T21:55:41.351Z”,“msg”:“Current reachable members”,“service”:“frontend”,“component”:“service-resolver”,“service”:“history”,“addresses”:"[10.32.159.46:7934 10.32.41.15:7934",“logging-call-at”:“rpServiceResolver.go:266”}

What is the recommended way to solve, is that assigning different membership ports for all services in each temporal cluster?

How would the cross cluster IP address discovery happen in the first place?

Wenquan_Xing · July 29, 2021, 10:16pm

for the time being, try to use different sets of membership port for different Temporal clusters.

the issue can appears if k8s decide to recycle pods, so one pod can move from one Temporal cluster to another Temporal cluster (same k8s cluster, but diff Temporal clusters)

Bo_Gao · July 29, 2021, 10:25pm

In my case, it is across k8s clusters(one temporal cluster per k8s cluster).

I need to update k8s yaml file membershipPort only, or both membershipPort and temporal dockerfile expose?

Wenquan_Xing · July 29, 2021, 10:27pm

is it possible that 2 k8s clusters can talk to each other?.. not a k8s expert

membershipPort should do the job
if dockerfile expose the membershipPort, then this should also be updated?

Bo_Gao · August 2, 2021, 6:53am

After adding cluster specific membershipPort in services.service.rpc, I am still seeing service grpc ports in different cluster get discovered. Do I need to use different grpc ports per cluster?

Also tried to remove DB key space in a different region(use to be both us west and us east now us west only), that improved qps load on existing clusters from 10X to 3X in cluster 01 and 02. But the shard steal is still there. However the staging env is good without the issue. Any idea I could try?

history:
rpc:
grpcPort: 7934
membershipPort: 6734
bindOnIP: “0.0.0.0”

{“level”:“info”,“ts”:“2021-07-30T05:29:57.931Z”,“msg”:“Current reachable members”,“service”:“frontend”,“component”:“service-resolver”,“service”:“history”,“addresses”:"[10.32.187.66:7934 10.32.36.243:7934",“logging-call-at”:“rpServiceResolver.go:266”}

SergeyBykov · August 2, 2021, 11:37pm

Do I need to use different grpc ports per cluster?

If clusters are on different networks, then you don’t.
Are you using different membership ports for different services (history, matching, frontend, worker) though? Those must be different.

Bo_Gao · August 3, 2021, 3:18pm

Yes, I am using different membership ports for different service.

Seems when I give unique grpc ports, the shard rebalancing issue disappears, but that just leave a large number of ports to manage (4 membership ports + 4 grpc ports) * cluster number.

SergeyBykov · August 4, 2021, 12:25am

Seems when I give unique grpc ports, the shard rebalancing issue disappears

Good. This confirms that misconfiguration of membership was the source of the sharding confusion.

but that just leave a large number of ports to manage (4 membership ports + 4 grpc ports) * cluster number.

I would argue that 8 ports per cluster is not too many, considering the 64K theoretical limit on the number of ports. A popular approach I’ve seen is to assign a base port number to cluster, e.g. 10000, 10100, 10200, and so forth, with individual port numbers being a function of the base. It makes port assignments easily automatable.

Bo_Gao · August 4, 2021, 4:04pm

This shard rebalancing and high QPS on DB is related to grpc port and membership port setup. I also tried to use turl to verify ring pop per other post, what confuse me was response IP does not match service IP in environment where there is only one temporal cluster. In environment where there are 2 temporal clusters, response IPs come from 2 temporal clusters and response IP does not match key either. What could be the possible reason and will this affect functionality? Workflow seems running fine here.

This is what I see for 1 temporal cluster in 1 k8s cluster
[ // listing pod IP: membership port
“10.32.1.143:6933”, //fe
“10.32.36.210:6934", //history
“10.32.43.126:6935”, //matching
“10.32.52.187:6939" //worker
]
tcurl ringpop -P hosts.json /admin/lookup ‘{“key”: “worker”}’
{“ok”:true,“head”:null,“body”:{“dest”:“10.32.1.143:6933”},“headers”:{“as”:“json”},“trace”:“10d584998804bd65"}
tcurl ringpop -P hosts.json /admin/lookup ‘{“key”: “frontend”}’
{“ok”:true,“head”:null,“body”:{“dest”:“10.32.52.187:6939”},“headers”:{“as”:“json”},“trace”:“0131563ac7cf61e2"}
tcurl ringpop -P hosts.json /admin/lookup ‘{“key”: “matching”}’
{“ok”:true,“head”:null,“body”:{“dest”:“10.32.36.210:6934”},“headers”:{“as”:“json”},“trace”:“e3cbc5c3f423caa0"}
tcurl ringpop -P hosts.json /admin/lookup ‘{“key”: “history”}’
{“ok”:true,“head”:null,“body”:{“dest”:“10.32.52.187:6939”},“headers”:{“as”:“json”},“trace”:“a601c9915717d058"} (edited)

SergeyBykov · August 6, 2021, 11:13pm

As I see, this got answered in Slack with the same recommendation to use unique membership ports for each service of each cluster when running in an environment where cross-clusters network connectivity is enabled.

Bo_Gao · August 6, 2021, 11:50pm

Yes, thanks for your time

Topic		Replies	Views
Crash loop of history service in K8s cluster Community Support history , kubernetes	19	3732	April 30, 2021
Matching service start/stop loop in production deployment Community Support	2	2257	November 5, 2020
Temporal production deployment stopped working Community Support java-sdk , helm	7	1023	January 15, 2023
Is port change necessary for multiple deployments on a single cluster Community Support	3	815	November 21, 2022
Errors in temporal history and matching service logs Community Support cassandra , deployment	2	1214	July 7, 2022

Matching service high QPS on persistence

Related topics