I want to use Temporal workers as runners for my multi-tenant platform

I have a multi-tenant system in which each tenant has multiple runners, and I need to run tasks on those runners. I am using Temporal for this, with one task queue per tenant.

But how do I authenticate a worker with the Temporal server? Can I expose the server publicly and let workers connect directly using some kind of authentication? The authentication has to be tenant-specific: one key pair per runner, where each key can be revoked.

How many tenants do you expect to have?
Temporal provides pluggable Authorizer and ClaimMapper components with defaults based on JWT.
See more info in this post. With this you could set up specific rules based on user roles and namespaces you provide to your tenants.
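
To make the idea concrete, here is a minimal sketch of the two steps (claim mapping, then authorization) for one revocable key per runner. It is a plain-Python model of the logic only, not Temporal's actual plugin interface: the `RunnerKeyRegistry` class, the `authorize` function, the key IDs, and the namespace names are all hypothetical, and verifying the key or token signature is assumed to have happened before this point.

```python
class RunnerKeyRegistry:
    """Hypothetical store mapping a runner's key ID to its tenant namespace."""

    def __init__(self):
        self._namespace_by_key = {}
        self._revoked = set()

    def register(self, key_id, namespace):
        self._namespace_by_key[key_id] = namespace

    def revoke(self, key_id):
        self._revoked.add(key_id)

    def claims_for(self, key_id):
        """Claim-mapping step: return {namespace: role}; empty if unknown or revoked."""
        if key_id in self._revoked or key_id not in self._namespace_by_key:
            return {}
        return {self._namespace_by_key[key_id]: "worker"}


def authorize(claims, target_namespace):
    """Authorization step: allow only calls aimed at a namespace the caller holds a role in."""
    return target_namespace in claims


registry = RunnerKeyRegistry()
registry.register("runner-a1", "tenant-a")  # hypothetical key ID and namespace
registry.register("runner-b1", "tenant-b")

print(authorize(registry.claims_for("runner-a1"), "tenant-a"))  # own namespace
print(authorize(registry.claims_for("runner-a1"), "tenant-b"))  # another tenant's namespace
registry.revoke("runner-a1")
print(authorize(registry.claims_for("runner-a1"), "tenant-a"))  # after revocation
```

Revoking a key only affects that one runner, so the tenant's other runners keep working.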

Thanks for the reply. I expect to have around 50 tenants at most. I went through the post, and if I am going to use the Authorizer and ClaimMapper, I have a few questions:

  1. Will I have to build my own Docker image with the custom components?
  2. How will internal services, like my own workers, access the Temporal server? Can they be exempted from the authentication and authorization logic?
  3. Can I rewrite the Authorizer and ClaimMapper to use mechanisms other than JWT?
  4. Is there an upper limit on the number of workers, and is there any overhead in using multiple namespaces?

(1) You mean if you define a custom Authorizer and ClaimMapper? If so, then yes, you would need to build your own image.

(2) ClaimMapper can translate the caller's identity (from TLS and/or an auth token) into claims.
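
As a rough illustration of how that could cover internal workers, the sketch below models a mapper that looks at the TLS certificate's common name first and falls back to the token. This is a plain-Python model, not Temporal's interface: `map_claims`, `is_allowed`, the `INTERNAL_CN_SUFFIX` value, and the role strings are hypothetical, and `token_namespace` is assumed to come from a token that has already been verified.

```python
from typing import Optional

# Hypothetical suffix identifying certificates issued to your own internal services.
INTERNAL_CN_SUFFIX = ".internal.example.com"


def map_claims(tls_common_name: Optional[str], token_namespace: Optional[str]) -> dict:
    """Translate caller identity into claims: a system-wide role plus per-namespace roles."""
    if tls_common_name and tls_common_name.endswith(INTERNAL_CN_SUFFIX):
        # Internal service: identified by its certificate, granted a system-wide role.
        return {"system_role": "admin", "namespaces": {}}
    if token_namespace:
        # Tenant runner: scoped to the single namespace named in its verified token.
        return {"system_role": None, "namespaces": {token_namespace: "worker"}}
    # Unknown caller: no claims at all.
    return {"system_role": None, "namespaces": {}}


def is_allowed(claims: dict, target_namespace: str) -> bool:
    """Allow system-wide admins everywhere, and other callers only in their own namespaces."""
    return claims["system_role"] == "admin" or target_namespace in claims["namespaces"]


internal = map_claims("billing-worker.internal.example.com", None)
tenant = map_claims(None, "tenant-a")
print(is_allowed(internal, "tenant-b"))  # internal service reaching any namespace
print(is_allowed(tenant, "tenant-b"))    # tenant runner reaching another tenant
```

With this shape, internal services are not exempted from the checks; they simply map to a broader role.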

(3) Yes, for example TLS; see here.

(4) For namespaces, Temporal server does not enforce a maximum, but the total number you can run will depend on your cluster capacity.
For workers, you typically want to start with a single worker process and saturate it. See the worker tuning guide.

There are a couple of key metrics that you can use to tell whether you need to increase your worker capacity:

a) Sync match rate (server metrics)
Useful Prometheus query:
sum(rate(poll_success_sync{}[1m])) / sum(rate(poll_success{}[1m]))

poll_success_sync measures only sync matched tasks.
poll_success measures how many total tasks are delivered.
Ideally you want sync match rate to be above 99%. If it’s too low it can mean that your workers are unable to keep up and you should consider increasing your worker capacity.
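
For anyone who wants to check this outside of Prometheus, here is a small sketch of the same ratio and the 99% guideline. The function names and the sample rates are made up for illustration; the two inputs are assumed to be the per-second rates of poll_success_sync and poll_success over the same window.

```python
from typing import Optional

SYNC_MATCH_TARGET = 0.99  # the "above 99%" guideline from this thread


def sync_match_rate(sync_per_sec: float, total_per_sec: float) -> Optional[float]:
    """Ratio of sync matched tasks to all delivered tasks; None when nothing was delivered."""
    if total_per_sec <= 0:
        return None
    return sync_per_sec / total_per_sec


def needs_more_workers(sync_per_sec: float, total_per_sec: float) -> bool:
    """True when the sync match rate falls below the target."""
    rate = sync_match_rate(sync_per_sec, total_per_sec)
    return rate is not None and rate < SYNC_MATCH_TARGET


# Hypothetical sample rates: 450 of 500 tasks/sec were sync matched.
print(sync_match_rate(450.0, 500.0))
print(needs_more_workers(450.0, 500.0))
```

An idle task queue (no deliveries at all) returns None rather than dividing by zero, so it is not reported as needing more workers.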

b) task_schedule_to_start_latency (server metric), temporal_workflow_task_schedule_to_start_latency (SDK metric)

Measures the latency between when a task is scheduled and when it is delivered to a worker. If this latency is high, it's a strong indication that you should increase your worker capacity (add more workers).

c) asyncmatch_latency

Measures the latency of async matched tasks, from when a task is created to when it’s delivered, including the time the task sits in the task queue. Large latencies can indicate that your workers are unable to pick up tasks fast enough and that you should increase worker capacity.

Thank you @tihomir, for the quick reply. It really helped.