Clustered deployment on Nomad + Consul - guidance needed on a networking-related issue

Hello

We’ve been deploying Temporal on Nomad + Consul (for container orchestration). The load on Temporal is small as of now, so we have been able to make do with a simple monolithic deployment: a single container running all the services, with localhost declared as the broadcast IP.

We run into an issue when we try to deploy Temporal in a clustered configuration. Other users who have tried to deploy on Nomad have hit a similar problem. I would like to describe exactly what is happening, in the hope that someone can point us in the right direction.

Here is a simple clustered configuration that we are trying to test (the services run as Docker containers on a bridge network).

Each of these containers uses dynamic port allocation, and each is reachable from the others. For example:

Matching1 is able to reach the gRPC and gossip ports of all the other containers. We verified this with telnet. For instance, from the matching1 container, these connections are established:
10.0.0.2:
10.0.0.1:

In Cassandra, I can see that all the nodes have discovered each other. The logs also look fine on startup.

The problem starts shortly after startup, in the matching service. Frequently, we see an error from the matching client like the following: last connection error: connection error: desc = "transport: Error while dialing dial tcp 172.12.3.126:21474: i/o timeout" (the IP is from the test cluster).

This message always follows a "Received PollActivityTaskQueue for taskQueue" entry from the matching engine. It does not happen every time, but it happens very frequently.

Sample logs

{"level":"debug","ts":"2022-08-17T08:31:38.119Z","msg":"Received PollActivityTaskQueue for taskQueue","service":"matching","component":"matching-engine","name":"/_sys/temporal-sys-processor-parent-close-policy/3","logging-call-at":"matchingEngine.go:466"}
{"level":"debug","ts":"2022-08-17T08:31:38.164Z","msg":"Received PollActivityTaskQueue for taskQueue","service":"matching","component":"matching-engine","name":"/_sys/MONEY_TRANSFER_TASK_QUEUE/4","logging-call-at":"matchingEngine.go:466"}
{"level":"info","ts":"2022-08-17T08:31:38.166Z","msg":"matching client encountered error","service":"matching","error":"last connection error: connection error: desc = \"transport: Error while dialing dial tcp 172.12.3.126:21474: i/o timeout\"","service-error-type":"serviceerror.Unavailable","logging-call-at":"metric_client.go:217"}
{"level":"debug","ts":"2022-08-17T08:31:38.440Z","msg":"Received PollActivityTaskQueue for taskQueue","service":"matching","component":"matching-engine","name":"temporal-sys-batcher-taskqueue","logging-call-at":"matchingEngine.go:466"}
{"level":"debug","ts":"2022-08-17T08:31:38.626Z","msg":"Received PollActivityTaskQueue for taskQueue","service":"matching","component":"matching-engine","name":"/_sys/temporal-sys-history-scanner-taskqueue-0/3","logging-call-at":"matchingEngine.go:466"}
{"level":"debug","ts":"2022-08-17T08:31:38.762Z","msg":"Received PollWorkflowTaskQueue for taskQueue","service":"matching","component":"matching-engine","wf-task-queue-name":"MONEY_TRANSFER_TASK_QUEUE","logging-call-at":"matchingEngine.go:361"}
{"level":"debug","ts":"2022-08-17T08:31:38.766Z","msg":"Received PollWorkflowTaskQueue for taskQueue","service":"matching","component":"matching-engine","wf-task-queue-name":"/_sys/MONEY_TRANSFER_TASK_QUEUE/4","logging-call-at":"matchingEngine.go:361"}
{"level":"info","ts":"2022-08-17T08:31:38.768Z","msg":"matching client encountered error","service":"matching","error":"last connection error: connection error: desc = \"transport: Error while dialing dial tcp 172.12.3.126:21474: i/o timeout\"","service-error-type":"serviceerror.Unavailable","logging-call-at":"metric_client.go:217"}
{"level":"debug","ts":"2022-08-17T08:31:38.829Z","msg":"Received PollWorkflowTaskQueue for taskQueue","service":"matching","component":"matching-engine","wf-task-queue-name":"/_sys/temporal-archival-tq/1","logging-call-at":"matchingEngine.go:361"}
{"level":"debug","ts":"2022-08-17T08:31:39.155Z","msg":"Received PollWorkflowTaskQueue for taskQueue","service":"matching","component":"matching-engine","wf-task-queue-name":"c083a4948dd6:a4506140-b0fb-4a02-887e-8158bf289d13","logging-call-at":"matchingEngine.go:361"}
{"level":"debug","ts":"2022-08-17T08:31:39.160Z","msg":"Received PollActivityTaskQueue for taskQueue","service":"matching","component":"matching-engine","name":"/_sys/temporal-sys-batcher-taskqueue/1","logging-call-at":"matchingEngine.go:466"}
{"level":"info","ts":"2022-08-17T08:31:39.160Z","msg":"matching client encountered error","service":"matching","error":"last connection error: connection error: desc = \"transport: Error while dialing dial tcp 172.12.3.126:21474: i/o timeout\"","service-error-type":"serviceerror.Unavailable","logging-call-at":"metric_client.go:217"}

This timeout happens when the matching container tries to reach itself over the network, and sure enough we can see it with telnet as well. For example:

When I log in to the matching1 container and try to telnet to 10.0.0.1:, the connection cannot be established. It seems that Docker has trouble routing the traffic from the bridge back onto the same container. I am able to telnet to localhost: from within the container just fine.
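
To make this check reproducible without telnet and independent of Temporal, here is a minimal Go diagnostic sketch; the address and port below are placeholders for the container's broadcast IP and its dynamically allocated matching gRPC port:

	// hairpin_check.go - dial the matching gRPC port via localhost and via the
	// container's own bridge IP to confirm the hairpinning problem directly.
	package main

	import (
		"fmt"
		"net"
		"time"
	)

	func check(addr string) {
		conn, err := net.DialTimeout("tcp", addr, 3*time.Second)
		if err != nil {
			fmt.Printf("FAIL %s: %v\n", addr, err)
			return
		}
		conn.Close()
		fmt.Printf("OK   %s\n", addr)
	}

	func main() {
		// Placeholder values: substitute the real broadcast IP and port.
		check("localhost:21474") // succeeds from inside the container
		check("10.0.0.1:21474")  // times out when bridge hairpinning is disabled
	}

When run inside the matching1 container, the second check should fail with an i/o timeout while the first succeeds, matching what the telnet test and the Temporal logs show.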

So, my question is:

Why does the matching service need to reach itself over the network? What exactly is this network call for? It appears to be dialing its own gRPC port.

Why does the matching service need to reach itself over the network?

The matching service hosts task queues, and task queues can have many partitions.
Say you have p1 (the root partition of a task queue) and p2. If p2 had a poller that went down, and a task was then placed on p2, the matching service would forward that task, possibly to the same matching pod, so that it can be placed onto the root partition and picked up by a poller.

Currently this communication goes over the network; that may change in the future.
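
As a rough conceptual illustration (this is not Temporal's actual membership code; the hash, queue names, and addresses below are placeholders), the owner of a task queue partition is looked up through cluster membership, and the owner of a child partition and the owner of the root partition can resolve to the same matching host, which is still dialed by its advertised network address:

	// partition_owner.go - sketch of how a forward from a child partition to
	// the root partition can land on the same matching host.
	package main

	import (
		"fmt"
		"hash/fnv"
	)

	// ownerFor is a stand-in for the membership/hash-ring lookup.
	func ownerFor(partition string, hosts []string) string {
		h := fnv.New32a()
		h.Write([]byte(partition))
		return hosts[int(h.Sum32())%len(hosts)]
	}

	func main() {
		hosts := []string{"10.0.0.1:21474", "10.0.0.2:21474"} // matching1, matching2
		root := "MONEY_TRANSFER_TASK_QUEUE"          // root partition
		child := "/_sys/MONEY_TRANSFER_TASK_QUEUE/4" // a child partition

		// Nothing prevents both lookups from returning the same host; when they
		// do, that host ends up dialing its own broadcast address over the bridge.
		fmt.Println("child partition owned by:", ownerFor(child, hosts))
		fmt.Println("root partition owned by: ", ownerFor(root, hosts))
	}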

Thanks for the explanation @tihomir

For reference, for others trying to use Nomad:

I posted about this on Nomad’s forums as well. It seems this is a known issue in Nomad and a PR is in the works - allow configuring hairpinning on default nomad bridge by A-Helberg · Pull Request #13834 · hashicorp/nomad · GitHub

I tried a very ugly/hacky workaround to see if things would work. Posting it here in case someone else stumbles across this issue while trying to deploy on Nomad.

In client/clientfactory.go from line 171

	keyResolver := newServiceKeyResolver(resolver)
	clientProvider := func(clientKey string) (interface{}, error) {
		connection := cf.rpcFactory.CreateInternodeGRPCConnection(clientKey)
		return matchingservice.NewMatchingServiceClient(connection), nil
	}

we changed this to

	keyResolver := newServiceKeyResolver(resolver)
	clientProvider := func(clientKey string) (interface{}, error) {
		println("creating new matching client - inside clientProvider with key " + clientKey)
		// Build this host's own advertised address from the env vars we pass to
		// every container (requires importing "os" in clientfactory.go).
		broadcastIP := os.Getenv("TEMPORAL_BROADCAST_ADDRESS")
		matchingPort := os.Getenv("MATCHING_GRPC_PORT")
		selfClientKey := broadcastIP + ":" + matchingPort
		println("self clientKey = " + selfClientKey)
		// If the resolved client key points back at this host, dial localhost
		// instead so the connection never has to hairpin through the bridge.
		if selfClientKey == clientKey {
			localClientKey := "localhost:" + matchingPort
			println("replacing client key " + clientKey + " with " + localClientKey)
			clientKey = localClientKey
		}

		connection := cf.rpcFactory.CreateInternodeGRPCConnection(clientKey)
		return matchingservice.NewMatchingServiceClient(connection), nil
	}

Since we are already passing environment variables to all the containers, this was good enough to test Temporal on Nomad, and things have worked well since. So the only show-stopper for deploying Temporal with high availability on Nomad seems to be the network hairpinning issue referenced above.
