Getting tons of DecisionTaskTimedOut after scaling out the matching service in Uber cadence

I’m tryng to run each cadence service independently so that I can scale them in and out easily.
My teams is using docker-swarm, and we’re managing everything with a Portainer UI.
So far, I’ve been able to scale the frontend service to have two replicas, but If I do the same with the matching service, I will get a lot of DecisionTaskTimedOut with a workflow execution. Eventually, the test workflow will finish but after quite a long time. To have an idea, It would take 2 minutes with two matching service replicas, vs only 7 seconds with only one.

This is a Test environment. I’m using a dockerized cassand db (we cannot use a real one due to some budget restrictions) Maybe that’s the problem? The Docker image is configured with the following enviroment variables:

RINGPOP_BOOTSTRAP_MODE=dns
KEYSPACE=cadence
BIND_ON_IP=0.0.0.0
SKIP_SCHEMA_SETUP=false
VISIBILITY_KEYSPACE=cadence_visibility
CASSANDRA_HOSTNAME=soap_cassandra
RINGPOP_SEEDS=soap_cadence_frontend:7933,soap_cadence_history:7934,soap_cadence_worker:7939
CADENCE_HOME=/etc/cadence
SERVICES=matching

You can assume the default values for any other env var you don’t see above

The RINGPOP_SEEDS are the service names assigned to every cadence service, docker-swarm will create a DNS entry out of them as well as load balancer if there is more than 1 replica declared.

The matching service seems to start correctly, Logs:

{"level":"info","ts":"2021-02-18T22:47:36.800Z","msg":"Current reachable members","component":"service-resolver","service":"cadence-history","addresses":"[[::]:7934]","logging-call-at":"rpServiceResolver.go:246"},
{"level":"info","ts":"2021-02-18T22:47:36.296Z","msg":"Created RPC dispatcher and listening","service":"cadence-matching","address":"0.0.0.0:7935","logging-call-at":"rpc.go:81"},
{"level":"warn","ts":"2021-02-18T22:47:36.321Z","msg":"Failed to fetch key from dynamic config","key":"system.advancedVisibilityWritingMode","error":"unable to find key","logging-call-at":"config.go:68"},
{"level":"info","ts":"2021-02-18T22:47:36.336Z","msg":"Add new peers by DNS lookup","address":"0.0.0.0","addresses":"[0.0.0.0:7933]","logging-call-at":"clientBean.go:321"},
{"level":"info","ts":"2021-02-18T22:47:36.321Z","msg":"Creating RPC dispatcher outbound","service":"cadence-frontend","address":"0.0.0.0:7933","logging-call-at":"clientBean.go:277"},
{"level":"info","ts":"2021-02-18T22:47:36.441Z","msg":"Starting service matching","logging-call-at":"server.go:217"},
{"level":"warn","ts":"2021-02-18T22:47:36.441Z","msg":"Failed to fetch key from dynamic config","key":"matching.throttledLogRPS","error":"unable to find key","logging-call-at":"config.go:68"},
{"level":"info","ts":"2021-02-18T22:47:36.441Z","msg":"Creating RPC dispatcher outbound","service":"cadence-frontend","address":"127.0.0.1:7933","logging-call-at":"clientBean.go:277"},
{"level":"info","ts":"2021-02-18T22:47:36.442Z","msg":"Add new peers by DNS lookup","address":"127.0.0.1","addresses":"[127.0.0.1:7933]","logging-call-at":"clientBean.go:321"},
{"level":"info","ts":"2021-02-18T22:47:36.713Z","msg":"matching starting","service":"cadence-matching","logging-call-at":"service.go:90"},
{"level":"info","ts":"2021-02-18T22:47:36.734Z","msg":"RuntimeMetricsReporter started","service":"cadence-matching","logging-call-at":"runtime.go:169"},
{"level":"info","ts":"2021-02-18T22:47:36.734Z","msg":"PProf not started due to port not set","logging-call-at":"pprof.go:64"},
{"level":"info","ts":"2021-02-18T22:47:36.799Z","msg":"Current reachable members","component":"service-resolver","service":"cadence-matching","addresses":"[[::]:7935]","logging-call-at":"rpServiceResolver.go:246"},
{"level":"info","ts":"2021-02-18T22:47:36.799Z","msg":"Current reachable members","component":"service-resolver","service":"cadence-worker","addresses":"[[::]:7939]","logging-call-at":"rpServiceResolver.go:246"},
{"level":"info","ts":"2021-02-18T22:47:36.800Z","msg":"Current reachable members","component":"service-resolver","service":"cadence-frontend","addresses":"[[::]:7933]","logging-call-at":"rpServiceResolver.go:246"},
{"level":"info","ts":"2021-02-18T22:47:36.814Z","msg":"service started","service":"cadence-matching","logging-call-at":"resourceImpl.go:383"},
{"level":"info","ts":"2021-02-18T22:47:36.814Z","msg":"matching started","service":"cadence-matching","logging-call-at":"service.go:99"}

I can see the following error in the logs when the workflow is executing:

{"level":"error","ts":"2021-02-18T22:17:07.281Z","msg":"Persistent store operation failure","service":"cadence-matching","component":"matching-engine","wf-task-list-name":"ae85d0ac1629:f8102a0f-406a-4fc7-8abf-e4b3fd66a278","wf-task-list-type":0,"store-operation":"create-task","error":"Failed to create task. TaskList: ae85d0ac1629:f8102a0f-406a-4fc7-8abf-e4b3fd66a278, taskListType: 0, rangeID: 14, db rangeID: 15","wf-task-list-name":"ae85d0ac1629:f8102a0f-406a-4fc7-8abf-e4b3fd66a278","wf-task-list-type":0,"number":1300001,"next-number":1300001,"logging-call-at":"taskWriter.go:176","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/service/matching.(*taskWriter).taskWriterLoop\n\t/cadence/service/matching/taskWriter.go:176"},
{"level":"error","ts":"2021-02-18T22:52:03.740Z","msg":"Persistent store operation failure","service":"cadence-matching","component":"matching-engine","wf-task-list-name":"8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca","wf-task-list-type":0,"store-operation":"create-task","error":"Failed to create task. TaskList: 8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca, taskListType: 0, rangeID: 16, db rangeID: 17","wf-task-list-name":"8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca","wf-task-list-type":0,"number":1500002,"next-number":1500002,"logging-call-at":"taskWriter.go:176","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/service/matching.(*taskWriter).taskWriterLoop\n\t/cadence/service/matching/taskWriter.go:176"},
{"level":"error","ts":"2021-02-18T22:10:10.971Z","msg":"Persistent store operation failure","service":"cadence-matching","component":"matching-engine","wf-task-list-name":"FeaTaskList","wf-task-list-type":1,"store-operation":"create-task","error":"Failed to create task. TaskList: FeaTaskList, taskListType: 1, rangeID: 94, db rangeID: 95","wf-task-list-name":"FeaTaskList","wf-task-list-type":1,"number":9300001,"next-number":9300001,"logging-call-at":"taskWriter.go:176","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/service/matching.(*taskWriter).taskWriterLoop\n\t/cadence/service/matching/taskWriter.go:176"},
{"level":"error","ts":"2021-02-18T22:09:53.345Z","msg":"Persistent store operation failure","service":"cadence-matching","component":"matching-engine","wf-task-list-name":"8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca","wf-task-list-type":0,"store-operation":"create-task","error":"Failed to create task. TaskList: 8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca, taskListType: 0, rangeID: 14, db rangeID: 15","wf-task-list-name":"8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca","wf-task-list-type":0,"number":1300001,"next-number":1300001,"logging-call-at":"taskWriter.go:176","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/service/matching.(*taskWriter).taskWriterLoop\n\t/cadence/service/matching/taskWriter.go:176"},
{"level":"error","ts":"2021-02-18T22:53:56.145Z","msg":"Persistent store operation failure","service":"cadence-matching","component":"matching-engine","wf-task-list-name":"8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca","wf-task-list-type":0,"store-operation":"create-task","error":"Failed to create task. TaskList: 8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca, taskListType: 0, rangeID: 17, db rangeID: 18","wf-task-list-name":"8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca","wf-task-list-type":0,"number":1600001,"next-number":1600001,"logging-call-at":"taskWriter.go:176","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/service/matching.(*taskWriter).taskWriterLoop\n\t/cadence/service/matching/taskWriter.go:176"}

Is there any way to resolve this issue?

Edit: Image version is ubercadence/server:0.15.1

above means matching service is competing to be the owner of the matching tasklist

btw, which version of cadence?

The version is: ubercadence/server:0.15.1

we no longer maintain the cadence project so there are only limited things we can do.

  1. please check if the ringpop ring is stable, e.g. matching / history / frontend hosts can talk to each other, or otherwise there can be ping pong effect, e.g. matching hosts steal a tasklist from another
  2. try to use the latest version, e.g. Handle matching task list conditional error by yux0 · Pull Request #3867 · uber/cadence · GitHub is not part of 0.15.x

First of all, thanks for taking the time to help me. I have some doubts:

  1. please check if the ringpop ring is stable, e.g. matching / history / frontend hosts can talk to each other, or otherwise there can be ping pong effect, e.g. matching hosts steal a tasklist from another

I’m sure the matching service can talk with the other 3 services, but it cannot talk with the other matching replicas. Or at least it cannot do it directly.

Is that a problem? Do I need all my matching instances to talk to each other?

at least all hosts in the ringpop need to have connectivity to another, or ringpop may give out conflicting orders, e.g. matching host 1 & 2 both owns tasklist 1

If a task queue has more than one partition then matching hosts must talk to each other. In Temporal each task queue by default is 4 partitions.

I understand! Thanks all for your help.