Hello
We’ve been deploying Temporal on Nomad + Consul (for container orchestration). The load on Temporal is small as of now, so we have been able to make do with a simple monolithic deployment: a single container with all the services enabled, and localhost declared as the broadcast IP.
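For reference, the membership-related part of that single-node config looks roughly like this (trimmed and illustrative; the ports are the usual defaults, and persistence settings are omitted):

```yaml
# Trimmed, illustrative single-node config: all services run in one
# container, bind on localhost, and advertise 127.0.0.1 to membership.
global:
  membership:
    broadcastAddress: "127.0.0.1"
services:
  frontend:
    rpc:
      grpcPort: 7233
      membershipPort: 6933
      bindOnLocalHost: true
  history:
    rpc:
      grpcPort: 7234
      membershipPort: 6934
      bindOnLocalHost: true
  matching:
    rpc:
      grpcPort: 7235
      membershipPort: 6935
      bindOnLocalHost: true
  worker:
    rpc:
      grpcPort: 7239
      membershipPort: 6939
      bindOnLocalHost: true
```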
There is an issue that we face when we try to deploy Temporal in a clustered config. Other users who have tried to deploy on Nomad have also faced a similar issue. I would like to highlight what exactly is happening, and maybe someone will be able to point us in the right direction.
We are testing a simple clustered configuration in which the services run as separate Docker containers on a bridge network.
Each of these containers uses dynamic port allocation and is reachable from the others. For example:
matching1 is able to reach the gRPC and gossip ports of all the other containers. We verify this simply with telnet; for instance, from the matching1 container, these connections are established:
10.0.0.2:
10.0.0.1:
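Per node, the membership-related part of the config looks roughly like the following (illustrative values, not our exact ones; the gRPC and gossip ports are the ones dynamically allocated by Nomad, and the broadcast address is the IP the container is reachable on):

```yaml
# Illustrative config for a single clustered node (here, a matching-only
# node). Values are examples: the ports come from Nomad's dynamic port
# allocation and the broadcast address is what peers use to reach it.
global:
  membership:
    broadcastAddress: "10.0.0.1"   # address advertised to the other nodes
services:
  matching:
    rpc:
      grpcPort: 25001              # dynamically allocated gRPC port
      membershipPort: 25002        # dynamically allocated gossip port
      bindOnIP: "0.0.0.0"          # bind on all interfaces inside the container
```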
In Cassandra, I can see that all nodes have discovered each other. The logs also look fine on startup.
The problem starts in the matching service shortly after startup. We frequently see an error in the matching client like the following:

last connection error: connection error: desc = "transport: Error while dialing dial tcp 172.12.3.126:21474: i/o timeout"

(the IP is from our test cluster). This error always follows a "Received PollActivityTaskQueue for taskQueue" message in matchingEngine. It does not happen every time, but it happens very frequently.
Sample logs
{"level":"debug","ts":"2022-08-17T08:31:38.119Z","msg":"Received PollActivityTaskQueue for taskQueue","service":"matching","component":"matching-engine","name":"/_sys/temporal-sys-processor-parent-close-policy/3","logging-call-at":"matchingEngine.go:466"}
{"level":"debug","ts":"2022-08-17T08:31:38.164Z","msg":"Received PollActivityTaskQueue for taskQueue","service":"matching","component":"matching-engine","name":"/_sys/MONEY_TRANSFER_TASK_QUEUE/4","logging-call-at":"matchingEngine.go:466"}
{"level":"info","ts":"2022-08-17T08:31:38.166Z","msg":"matching client encountered error","service":"matching","error":"last connection error: connection error: desc = \"transport: Error while dialing dial tcp 172.12.3.126:21474: i/o timeout\"","service-error-type":"serviceerror.Unavailable","logging-call-at":"metric_client.go:217"}
{"level":"debug","ts":"2022-08-17T08:31:38.440Z","msg":"Received PollActivityTaskQueue for taskQueue","service":"matching","component":"matching-engine","name":"temporal-sys-batcher-taskqueue","logging-call-at":"matchingEngine.go:466"}
{"level":"debug","ts":"2022-08-17T08:31:38.626Z","msg":"Received PollActivityTaskQueue for taskQueue","service":"matching","component":"matching-engine","name":"/_sys/temporal-sys-history-scanner-taskqueue-0/3","logging-call-at":"matchingEngine.go:466"}
{"level":"debug","ts":"2022-08-17T08:31:38.762Z","msg":"Received PollWorkflowTaskQueue for taskQueue","service":"matching","component":"matching-engine","wf-task-queue-name":"MONEY_TRANSFER_TASK_QUEUE","logging-call-at":"matchingEngine.go:361"}
{"level":"debug","ts":"2022-08-17T08:31:38.766Z","msg":"Received PollWorkflowTaskQueue for taskQueue","service":"matching","component":"matching-engine","wf-task-queue-name":"/_sys/MONEY_TRANSFER_TASK_QUEUE/4","logging-call-at":"matchingEngine.go:361"}
{"level":"info","ts":"2022-08-17T08:31:38.768Z","msg":"matching client encountered error","service":"matching","error":"last connection error: connection error: desc = \"transport: Error while dialing dial tcp 172.12.3.126:21474: i/o timeout\"","service-error-type":"serviceerror.Unavailable","logging-call-at":"metric_client.go:217"}
{"level":"debug","ts":"2022-08-17T08:31:38.829Z","msg":"Received PollWorkflowTaskQueue for taskQueue","service":"matching","component":"matching-engine","wf-task-queue-name":"/_sys/temporal-archival-tq/1","logging-call-at":"matchingEngine.go:361"}
{"level":"debug","ts":"2022-08-17T08:31:39.155Z","msg":"Received PollWorkflowTaskQueue for taskQueue","service":"matching","component":"matching-engine","wf-task-queue-name":"c083a4948dd6:a4506140-b0fb-4a02-887e-8158bf289d13","logging-call-at":"matchingEngine.go:361"}
{"level":"debug","ts":"2022-08-17T08:31:39.160Z","msg":"Received PollActivityTaskQueue for taskQueue","service":"matching","component":"matching-engine","name":"/_sys/temporal-sys-batcher-taskqueue/1","logging-call-at":"matchingEngine.go:466"}
{"level":"info","ts":"2022-08-17T08:31:39.160Z","msg":"matching client encountered error","service":"matching","error":"last connection error: connection error: desc = \"transport: Error while dialing dial tcp 172.12.3.126:21474: i/o timeout\"","service-error-type":"serviceerror.Unavailable","logging-call-at":"metric_client.go:217"}
This timeout happens when the matching container tries to reach itself over the network, and sure enough we can see the same behaviour with telnet. For example: when I log in to the matching1 container and try to telnet to 10.0.0.1:, the connection cannot be established. It seems that Docker has some trouble routing the traffic from the bridge back to the same container? Telnetting to localhost: from within the container works just fine.
So, my question is:
Why does the matching service need to reach itself over the network? What exactly is this network call for? It appears to be dialing its own gRPC port.