Temporal performance and high availability

I was looking for a microservice orchestrator for one of my projects and came across Temporal. I went through the documentation and different video tutorials. I had a few queries.

  1. Most of the activities for my requirement will have to make external API calls and the workflow completion time is important. Given these requirements, should I use Local Activities to execute the API calls as they give better performance? Can I use Local Activities for long-running activities like API calls or is it used only for short-lived activities like in-memory processing?
  2. I went through the Kubernetes deployment documentation in the temporalio/helm-charts GitHub repository (Temporal Helm charts). It says we should not use it for production deployment. What changes are required (other than the number of replicas) to make it production-ready? Is there any documentation I can refer to for production deployment?
  3. In some of the video tutorials, it was mentioned that, for High Availability, we can set up Cadence Service across multiple data centers. Are Global Namespaces used to achieve this?
  4. Global Namespaces are an active-passive setup. Is there a way to set up Temporal Service in an active-active mode? For example, can I deploy Temporal Service on DC1 and DC2, backed by a multi-datacenter Cassandra setup?

  1. What latencies are you targeting? Local Activities can decrease latency, but they have some other implications, so I would avoid them unless really needed. They should not be used for long-running operations.
  2. What constitutes a production deployment is a complicated question. Multiple users are running in production using this helm chart. At the same time, we are not sure what the recommended way to run a production-worthy k8s infrastructure is. At a minimum, use an external DB instead of the one the chart deploys to k8s.
  3. Yes, global namespaces allow cross-cluster failovers.
  4. Active-active doesn’t make sense in the Temporal world. The reason is that workflows are strongly consistent, and active-active is by definition eventually consistent. Temporal supports automatic forwarding of workflow start and signal requests to the cluster in which a namespace is active. Multi-cluster Cassandra is not supported, as it is eventually consistent. Also, multi-DC Cassandra is a potential availability problem, as a single bad schema update can bring down all the regions.
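
To illustrate points 3 and 4, here is a toy Python model of an active-passive namespace. It is only a sketch of the routing idea described above, not Temporal's API: the `Cluster` and `GlobalNamespace` classes, the cluster names, and the workflow IDs are all hypothetical. It shows that a start request received by the passive cluster is forwarded to the active one, and that a failover only changes which cluster is active.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical stand-in for one independent cluster (e.g. DC1 or DC2)."""
    name: str
    started: list = field(default_factory=list)  # workflow IDs executed here


@dataclass
class GlobalNamespace:
    """Toy active-passive namespace: exactly one cluster is active at a time."""
    clusters: dict
    active: str

    def start_workflow(self, received_by: str, workflow_id: str) -> str:
        # The request may arrive at any known cluster...
        if received_by not in self.clusters:
            raise KeyError(received_by)
        # ...but it is always forwarded to the active cluster, so the
        # namespace has a single writer and stays strongly consistent.
        target = self.clusters[self.active]
        target.started.append(workflow_id)
        return target.name

    def failover(self, new_active: str) -> None:
        # A failover only switches which cluster is active.
        if new_active not in self.clusters:
            raise KeyError(new_active)
        self.active = new_active


ns = GlobalNamespace(
    clusters={"dc1": Cluster("dc1"), "dc2": Cluster("dc2")},
    active="dc1",
)

# Request lands on the passive cluster (dc2) and is forwarded to dc1.
handled_before = ns.start_workflow(received_by="dc2", workflow_id="order-1")

# After a failover, dc2 is active and handles requests received by dc1.
ns.failover("dc2")
handled_after = ns.start_workflow(received_by="dc1", workflow_id="order-2")

print(handled_before, handled_after)
```

In this model there is never a moment when both clusters accept writes for the same namespace, which is the difference from an active-active setup.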

Hi Maxim, thank you very much for the reply. I had a few follow-up questions.

  1. We are trying to orchestrate 20 microservices. Each of these services has an SLA of approximately 500 ms. We are targeting an overall latency of 15 seconds. Can we achieve this using Temporal? Also, it would be great if I could get some examples of how to create Local Activities.

  2. Regarding the high availability aspects, if I have understood correctly, we can’t use multi-cluster Cassandra with Temporal. So, if I want to run Temporal Service in DC1 and DC2, each of them will run with a Cassandra deployment local to that datacenter. Can both of them use the same Elasticsearch instance as the visibility datastore? We want to make sure that we can find all the Workflow execution details in a single place, irrespective of where they were executed. If this is possible, what is the recommended way of setting up Kafka: two Kafka instances, one each in DC1 and DC2, or a single instance in the same DC where Elasticsearch is running?

Hi Maxim, I had a couple of follow-up questions that I have posted in the thread. It would be great if you could answer them when you get a chance. This will help us select the right framework for our requirements.

Sorry for the late reply.

  1. Yes, 15 seconds in this case is achievable. If you could run some of those activities in parallel then it would speed it up even more.
  2. Each cluster is completely independent, with its own Cassandra and its own ES. The replication stack will insert each workflow's information into each ES cluster independently. This way, even the complete loss of a DC is not going to impact workflow executions and visibility.
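
To make the latency estimate in point 1 concrete, here is a rough Python sketch. It uses only the numbers from the thread (20 services, roughly 500 ms each, a 15 second target) and assumes every call takes its full SLA; it ignores orchestration overhead, retries, and network time. The function names and the scaled-down 10 ms delay in the simulation are made up for illustration, and `asyncio.sleep` stands in for the external API calls.

```python
import asyncio
import math
import time

SERVICE_COUNT = 20      # number of microservices, from the thread
SLA_SECONDS = 0.5       # approximate per-service SLA, from the thread
TARGET_SECONDS = 15.0   # end-to-end target, from the thread


def worst_case_seconds(parallelism: int) -> float:
    """Worst case if calls run in batches of `parallelism`, each taking its full SLA."""
    batches = math.ceil(SERVICE_COUNT / parallelism)
    return batches * SLA_SECONDS


async def call_service(delay: float) -> None:
    # Stand-in for one external API call.
    await asyncio.sleep(delay)


async def run_sequential(delay: float) -> None:
    # One call after another: total time is roughly the sum of the delays.
    for _ in range(SERVICE_COUNT):
        await call_service(delay)


async def run_parallel(delay: float) -> None:
    # All calls at once: total time is roughly the slowest single call.
    await asyncio.gather(*(call_service(delay) for _ in range(SERVICE_COUNT)))


def timed(coro_fn, delay: float) -> float:
    """Run one of the coroutines above and return the elapsed wall-clock time."""
    start = time.perf_counter()
    asyncio.run(coro_fn(delay))
    return time.perf_counter() - start


sequential_budget = worst_case_seconds(parallelism=1)             # 20 * 0.5 = 10.0
parallel_budget = worst_case_seconds(parallelism=SERVICE_COUNT)   # 1 * 0.5 = 0.5

print(f"sequential worst case: {sequential_budget} s (target {TARGET_SECONDS} s)")
print(f"fully parallel worst case: {parallel_budget} s")

# Scaled-down simulation: 10 ms per call instead of 500 ms.
print(f"simulated sequential: {timed(run_sequential, 0.01):.3f} s")
print(f"simulated parallel:   {timed(run_parallel, 0.01):.3f} s")
```

Even fully sequential, the worst case is about 10 seconds, which leaves roughly 5 seconds of headroom against the 15 second target. Whether calls can actually run in parallel depends on the data dependencies between the services.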

Hello Maxim,
I wanted to check on the points below, as a follow-up to the thread above:

  1. My team is also exploring an active-active type of deployment across regions. If each region is an independent cluster, how can the complete loss of a DC have no impact? We lose the visibility and executions of the instance that is down, right? Do you mean that the history and data are replicated across regions to make this possible?
  2. Would having two independent clusters, each with its own DB instance, and setting up async replication across the DB instances help make visibility and workflow executions available in the Web UI of both clusters?