Worker ID Uniqueness

In our infrastructure, PIDs and hostnames are effectively static across workers: by default every worker process ends up with the same PID and reports the same hostname. To make the workers show up properly in the Temporal UI and to attribute which worker handled a task, we had to override how the various SDKs derive the worker-id. With that override in place, the correct number of workers shows up and task attribution works as expected.
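
For reference, here is a minimal sketch of that kind of override using the Go SDK (other SDKs expose an equivalent identity option under a different name; the `POD_NAME` env var and the task queue name are placeholders, not our actual setup):

```go
package main

import (
	"fmt"
	"log"
	"os"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	// Derive an identity that is unique per worker instead of the SDK default
	// (roughly pid@hostname), which collides when every process has the same
	// PID and hostname. POD_NAME is an assumed env var (e.g. Kubernetes
	// downward API); any stable, unique token would do.
	identity := fmt.Sprintf("%s-%d", os.Getenv("POD_NAME"), os.Getpid())

	c, err := client.Dial(client.Options{
		HostPort: client.DefaultHostPort,
		Identity: identity, // what the UI and poller lists will report
	})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// "example-task-queue" is a placeholder; workflows/activities would be
	// registered on w before running it.
	w := worker.New(c, "example-task-queue", worker.Options{})
	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker exited:", err)
	}
}
```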

I haven’t dug into the Temporal cluster’s task matching code, but I’m curious what type of problems arise when worker-id is not unique across workers. We recently saw a lot of workflows time out right after the initial WorkflowTaskScheduled event. By a back-of-the-envelope calculation there should have been enough worker threads for the number of running executions, so they shouldn’t have timed out; they timed out before WorkflowTaskStarted, which tells me the task poller never received the workflow task. My intuition is that the cluster gets confused when multiple workers use the same worker-id: instead of n workers, it might think only 1/n of that number are available. But given that workers poll, I would also have expected the cluster’s task queue to keep dispatching workflow tasks as long as pollers were ready to take a task off the queue.
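
One way I could imagine sanity-checking that hypothesis (just a sketch, not something I’ve verified against the matching internals) is to ask the cluster which pollers it currently sees on the task queue and what identity each one reported. If identical identities collapsed into a single poller entry, that would line up with the “fewer workers than there really are” theory. The task queue name below is a placeholder:

```go
package main

import (
	"context"
	"fmt"
	"log"

	enumspb "go.temporal.io/api/enums/v1"
	"go.temporal.io/sdk/client"
)

func main() {
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// List the pollers the cluster has recently seen on the workflow task
	// queue, including the identity each poll request carried.
	resp, err := c.DescribeTaskQueue(context.Background(),
		"example-task-queue", enumspb.TASK_QUEUE_TYPE_WORKFLOW)
	if err != nil {
		log.Fatalln("DescribeTaskQueue failed:", err)
	}
	for _, p := range resp.Pollers {
		fmt.Printf("identity=%s lastAccess=%v\n", p.Identity, p.LastAccessTime)
	}
}
```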

I’m interested in the internals to better understand what other negative impacts there might be beyond throughput. In particular, could there be wasted effort or a negative impact on the persistence layer? We did see a large load on the DB, but it’s difficult to determine whether that was caused by the worker-id non-uniqueness.

I believe worker-id is used only for visibility purposes and doesn’t affect how tasks are delivered, so overriding it doesn’t affect performance.