Worker trying to connect to old frontend?

Our worker services are struggling to start with errors like:

{"level":"fatal","ts":"2023-01-09T23:45:59.585Z","msg":"error creating sdk client","service":"worker","error":"failed reaching server: last connection error: connection error: desc = \"transport: Error while dialing dial tcp connect: connection refused\"","logging-call-

That looks like an old IP address of a frontend service. The actual running frontend is on a different IP address.

Running the following tctl command (I think) indicates that 356 services are registered (only 3 are running, heh). I think these are old registrations from when I was trial-and-erroring getting the services stood up:

tctl admin membership list_db | grep role | wc -l

Is there some way to “purge” the list of registered servers, so that the worker can connect to one that’s actually alive? Or do I need to wait for them to time out?

Many thanks

Although not 100% on that theory, because I also see this in the logs, which is the correct IP of the frontend :thinking:

{"level":"info","ts":"2023-01-10T00:16:02.876Z","msg":"Current reachable members","service":"worker","component":"service-resolver","service":"frontend","addresses":[""],"logging-call-at":"rpServiceResolver.go:283"}

Not sure what it’s missing, as matcher/history connect OK.

I think it’s trying to connect to a frontend service on its own IP (even though it’s not running a frontend) instead of using the one that’s already running :thinking:

nvm, setting the PUBLIC_FRONTEND_ADDRESS var helped steer this.
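
For anyone landing here later, a minimal sketch of setting that variable before the worker starts. The hostname below is hypothetical; substitute the address of the frontend that is actually running. As far as I understand it, the stock Docker config template feeds this variable into publicClient.hostPort, but treat that as an assumption and check your own config.

```shell
# Hypothetical frontend address; replace with the host and port of the
# frontend that is actually running. 7233 is the default frontend gRPC port.
export PUBLIC_FRONTEND_ADDRESS="temporal-frontend.example.internal:7233"

# Print the value so you can confirm what the worker process will inherit.
echo "PUBLIC_FRONTEND_ADDRESS=${PUBLIC_FRONTEND_ADDRESS}"
```

In a container setup the same value would normally go in the worker's environment section rather than an interactive shell.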

The only strange thing now is the worker says this


But refuses all TCP connections on port 7239