Automating Temporal: A Full View of the Netflix Temporal Platform

At this year’s Replay, I fielded a number of questions about how we built a platform around Temporal Cloud at Netflix. Since this is such a popular subject, and it’s clear there’s appetite across the community to do the same thing, this post outlines the full scope of what we’ve built to make Temporal Cloud behave as if it’s just another internal service.

I’ve previously posted about our Cloud proxy service, so check that out afterwards if you want to read more about the actual implementation.

First: Why Cloud

I’ve told this story quite a few times before, but for posterity, we’ll do it again. :smile:

We started on-prem, but due to the internal requirements of our first use cases, we couldn’t use Cassandra as the backing store, so we were stuck with RDS. The first use cases we onboarded were enough to push us to the maximum size of RDS pretty quickly, so we were faced with a decision: (1) Maintain the status quo & stop onboarding new use cases, (2) Deploy many clusters, or (3) Adopt Temporal Cloud.

First, I am a firm believer in the value that Temporal provides: I think it would not be in Netflix’s best interest to prevent people from using it. Indeed, developers across the company agree with me, as shown by their ardent internal evangelism. So, option one is out the door.

Second, it’s important to understand that Temporal at Netflix does not have a dedicated team (yet?). It’s not my job / our job to run and improve Temporal at Netflix; it’s a supporting system for our greater product team, and it just happens that other teams benefit from our efforts. Deploying and managing additional cells to support these new teams would distract from what we’re supposed to be working on.

Third, we consider ourselves experts with Temporal, but we’re nowhere near the level of Temporal themselves. I trust them to operate the technology better and more efficiently than we ever will be able to, even if we fully staffed an internal team to run it.

So, option two is out, and that leaves us with Temporal Cloud.

Now, we need to figure out how to efficiently offer Temporal to anyone who wants to use it, with as little effort on our part as possible. You will see this drive to be hands-off as a recurring theme throughout this post: automating everything we can, as simply as we can, comes above all else.

Internal Lingo

  • InfraAPI: The product I work on day-to-day. It is our infrastructure management platform for Netflix, offering a single, common way of interacting with all of the control planes that Netflix has (a lot). You can think of it as AWS’s own Console and APIs, but for everything: AWS itself, our other clouds, and the systems built atop those runtimes.
  • Metatron: Our mTLS service. Each certificate carries strong per-user and per-application identity for the request being made.
  • Gandalf: Our authorization system. It’s a policy-based service that easily integrates with the identities provided by Metatron.
  • Cryptex: Our high-scale data encryption service.
  • Atlas: Our timeseries metric service.

High-level Architecture

Legend:

  • Green stuff: Netflix Temporal Platform
  • Red stuff: Temporal Cloud
  • White stuff: Other Netflix services

temporaloperator

The temporaloperator application is responsible for reconciling the desired Namespace CRDs that users create in InfraAPI into Temporal Cloud. It ensures that each namespace is configured correctly with the right mTLS CAs, and that the appropriate users have access (either Write or Read). Admin access to namespaces is not allowed except for members of the temporal-dev Group.

The operator will also seed two Gandalf Policies: TEMPORAL-ns-{namespace} and TEMPORAL-ns-{namespace}-readers, both of which are owned by temporal-dev, with the Policy Editor role assigned to the Google Group set in the Namespace CRD’s owner field. The Temporal team does not manage the Gandalf Policies beyond creation: the owner of the Namespace is expected to assign the appropriate Rules.

A Namespace resource looks like the following:

apiVersion: temporal.infra.netflix.net/v1beta1
kind: Namespace
metadata:
  name: my-fancy-namespace 
  namespace: "{myapp namespace}"
spec:
  name: fancy-test
  owner: example@netflix.com 
  regions:
    - us-east-1
  retentionDays: 30 

All configuration for Temporal Cloud is performed via this resource. When the resource is created, we also create a UsersWorkflow that runs on a 5-minute schedule; it ensures that users who need access (based on the Gandalf Policies) have the correct access, and removes users who are either no longer at the company or no longer part of the group.
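
To make that concrete, here is a minimal sketch in Go of what such a sync workflow could look like; the activity names (ListPolicyMembers, ListCloudUsers, ApplyUserDiff) are hypothetical stand-ins for the Gandalf and Temporal Cloud management integrations, not our real implementation.

package usersync

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// UsersWorkflow reconciles namespace access on each scheduled run.
func UsersWorkflow(ctx workflow.Context, namespace string) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Minute,
	})

	// Hypothetical activities: resolve who should have access from the
	// Gandalf Policies, then fetch who currently has access in Temporal Cloud.
	var desired, actual []string
	if err := workflow.ExecuteActivity(ctx, "ListPolicyMembers", namespace).Get(ctx, &desired); err != nil {
		return err
	}
	if err := workflow.ExecuteActivity(ctx, "ListCloudUsers", namespace).Get(ctx, &actual); err != nil {
		return err
	}

	// Grant missing users; revoke users who have left the company or the group.
	return workflow.ExecuteActivity(ctx, "ApplyUserDiff", namespace, desired, actual).Get(ctx, nil)
}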

temporalproxy

The temporalproxy application is the primary service for Netflix Temporal: it mediates all API and CLI access to Temporal Cloud. Its entire goal is to make Temporal Cloud appear as if it were a paved-road application within Netflix.

The proxy terminates Metatron mTLS and authorizes the identity of the request against the Gandalf Policy. Currently, only the TEMPORAL-ns-{namespace} Policy is used for proxy authorization, meaning only writers can access the API. This keeps the scope of authz simple so that we do not need to stay on top of which APIs are read access versus write. If we had more staffing, this is an area we would improve. Readers, to date, have only been interested in using the UI anyway.

After successful authorization, the payloads of the request are encrypted with Cryptex. The only information that is not encrypted from Temporal Cloud’s perspective is operational metadata such as Workflow Types, Workflow IDs, Task Queue names, and the like. Requests sent to Temporal Cloud use mTLS via a package called jit-pki, which manages short-lived certificate rotation.
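
For reference, here is roughly what that encryption hook could look like expressed as a Temporal PayloadCodec in Go (whether it runs in the proxy or in an SDK, the shape is the same). The cryptexClient interface is a hypothetical stand-in for our Cryptex client, and a real codec would also tag the payload metadata so Decode knows which payloads are encrypted.

package codec

import (
	commonpb "go.temporal.io/api/common/v1"
	"go.temporal.io/sdk/converter"
)

// cryptexClient is an assumed interface over our internal Cryptex service.
type cryptexClient interface {
	Encrypt(plaintext []byte) ([]byte, error)
	Decrypt(ciphertext []byte) ([]byte, error)
}

// CryptexCodec encrypts payload data before it leaves Netflix.
type CryptexCodec struct {
	client cryptexClient
}

var _ converter.PayloadCodec = (*CryptexCodec)(nil)

// Encode encrypts each payload’s data blob; payload metadata stays in the clear.
func (c *CryptexCodec) Encode(payloads []*commonpb.Payload) ([]*commonpb.Payload, error) {
	out := make([]*commonpb.Payload, len(payloads))
	for i, p := range payloads {
		ciphertext, err := c.client.Encrypt(p.Data)
		if err != nil {
			return nil, err
		}
		out[i] = &commonpb.Payload{Metadata: p.Metadata, Data: ciphertext}
	}
	return out, nil
}

// Decode reverses Encode so workers (and the UI’s data converter) see plaintext.
func (c *CryptexCodec) Decode(payloads []*commonpb.Payload) ([]*commonpb.Payload, error) {
	out := make([]*commonpb.Payload, len(payloads))
	for i, p := range payloads {
		plaintext, err := c.client.Decrypt(p.Data)
		if err != nil {
			return nil, err
		}
		out[i] = &commonpb.Payload{Metadata: p.Metadata, Data: plaintext}
	}
	return out, nil
}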

Each namespace in Temporal Cloud has its own endpoint (e.g. example.xxxxx.tmprl.cloud): xxxxx is the customer ID for Netflix, and example is the actual namespace. Because of this, temporalproxy must maintain a client for each individual namespace. To that end, the proxy watches (informs on) the Namespace CRDs in InfraAPI, creating a client when a Namespace is provisioned.
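
As a rough illustration of the per-namespace client handling (the real proxy builds clients from the CRD informer rather than lazily on request), here is a hedged sketch using the standard Go SDK; the account ID, endpoint pattern, and TLS wiring are assumptions based on the description above.

package proxy

import (
	"crypto/tls"
	"fmt"
	"sync"

	"go.temporal.io/sdk/client"
)

// clientCache keeps one Temporal Cloud client per namespace.
type clientCache struct {
	accountID string      // Netflix’s Temporal Cloud customer ID (the "xxxxx" above)
	tlsConfig *tls.Config // short-lived certs, rotated by jit-pki in our setup

	mu      sync.Mutex
	clients map[string]client.Client
}

// forNamespace returns (dialing if needed) the client for one namespace,
// e.g. "example" -> example.<accountID>.tmprl.cloud:7233.
func (c *clientCache) forNamespace(ns string) (client.Client, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if existing, ok := c.clients[ns]; ok {
		return existing, nil
	}
	cl, err := client.Dial(client.Options{
		HostPort:          fmt.Sprintf("%s.%s.tmprl.cloud:7233", ns, c.accountID),
		Namespace:         fmt.Sprintf("%s.%s", ns, c.accountID),
		ConnectionOptions: client.ConnectionOptions{TLS: c.tlsConfig},
	})
	if err != nil {
		return nil, err
	}
	if c.clients == nil {
		c.clients = map[string]client.Client{}
	}
	c.clients[ns] = cl
	return cl, nil
}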

temporalmetricsscraper

Temporal Cloud exposes a Prometheus metrics endpoint that we scrape with temporalmetricsscraper and ingest into Atlas. These include some internal metrics that are helpful for operators and users alike in understanding whether there are issues with their namespace.
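
A minimal sketch of what one scrape pass could look like, assuming the standard Prometheus text-format parser and a hypothetical publish callback in place of our Atlas client:

package scraper

import (
	"crypto/tls"
	"net/http"

	"github.com/prometheus/common/expfmt"
)

// scrapeOnce pulls the Temporal Cloud metrics endpoint (mTLS-protected) and
// hands each sample to a publish callback, which would write to Atlas.
func scrapeOnce(endpoint string, tlsConfig *tls.Config, publish func(name string, value float64, tags map[string]string)) error {
	httpClient := &http.Client{Transport: &http.Transport{TLSClientConfig: tlsConfig}}
	resp, err := httpClient.Get(endpoint)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	// Parse the Prometheus text exposition format.
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		return err
	}
	for name, family := range families {
		for _, m := range family.GetMetric() {
			tags := map[string]string{}
			for _, l := range m.GetLabel() {
				tags[l.GetName()] = l.GetValue()
			}
			// Gauges and counters only, for brevity; histograms need more handling.
			switch {
			case m.GetGauge() != nil:
				publish(name, m.GetGauge().GetValue(), tags)
			case m.GetCounter() != nil:
				publish(name, m.GetCounter().GetValue(), tags)
			}
		}
	}
	return nil
}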

UI & temporalrde

Since all data stored in Temporal Cloud is encrypted, we integrate with their UI’s Remote Data Converter support via a service, temporalrde, which is called by the user’s browser to decrypt payloads for viewing.
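
The heavy lifting for a remote data converter endpoint lives in the SDK itself; here is a hedged sketch of the serving side in Go, with auth, CORS for the Temporal Cloud UI origin, and the Cryptex codec construction omitted (the zlib codec below is just a placeholder so the sketch compiles).

package main

import (
	"log"
	"net/http"

	"go.temporal.io/sdk/converter"
)

// serveCodec exposes a payload codec over HTTP in the shape the Temporal
// UI’s remote data converter integration expects (/encode and /decode).
func serveCodec(codec converter.PayloadCodec) error {
	handler := converter.NewPayloadCodecHTTPHandler(codec)
	// The real temporalrde also authenticates the browser session and sets
	// CORS headers for the Temporal Cloud UI origin; omitted here.
	return http.ListenAndServe(":8080", handler)
}

func main() {
	// Placeholder codec so the sketch compiles; the real service plugs in
	// the Cryptex-backed codec used for encryption.
	log.Fatal(serveCodec(converter.NewZlibCodec(converter.ZlibCodecOptions{})))
}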

Cloud Audit Log

Temporal Cloud offers audit logs for any changes that occur via the TCloud management API or UI; these are exported to a Kinesis Firehose stream that dumps into an S3 bucket.

Event History Export

All Temporal Workflows are effectively just an event-sourced log. For all completed workflows, Temporal will export the state of the workflow to an S3 bucket. Only the Netflix Temporal team has access to this bucket; we have admin tooling to grant users access, but it is not yet self-service (it will be eventually).

SDKs

We explicitly do not offer higher-level abstractions for the SDKs. Originally, our Java SDK did have high-level SDK concepts (which I’ve blogged about here previously), but the cost of maintaining these did not pass the simplicity test. Instead, we have small factory libraries that ensure the SDKs are configured correctly for the paved-road systems (logs, telemetry, mTLS, service discovery, etc.).
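
As an example of the scale of these shims, here is a hedged Go sketch of a paved-road client factory; the proxy address and the metatronTLSConfig / atlasMetricsHandler helpers are illustrative placeholders, not our real internal APIs.

package nflxtemporal

import (
	"crypto/tls"

	"go.temporal.io/sdk/client"
)

// Options carries the only things a team should need to supply.
type Options struct {
	Namespace string
}

// NewClient wires up mTLS, metrics, and the proxy endpoint so teams get a
// correctly configured Temporal client with one call.
func NewClient(opts Options) (client.Client, error) {
	tlsConfig, err := metatronTLSConfig() // hypothetical Metatron helper
	if err != nil {
		return nil, err
	}
	return client.Dial(client.Options{
		HostPort:          "temporalproxy.example.netflix.net:7233", // illustrative proxy address
		Namespace:         opts.Namespace,
		ConnectionOptions: client.ConnectionOptions{TLS: tlsConfig},
		MetricsHandler:    atlasMetricsHandler(), // hypothetical Atlas adapter
	})
}

// These stand in for internal libraries and are not real APIs.
func metatronTLSConfig() (*tls.Config, error)    { return &tls.Config{}, nil }
func atlasMetricsHandler() client.MetricsHandler { return client.MetricsNopHandler }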

Observability

We offer pre-built views that users can import into their own metrics dashboards, exposing a combination of SDK, Service, & Cloud metrics. Similarly, we have Alert Templates that users can adopt for common baseline alerts.

Level of Effort

So, that’s everything. Now let’s talk effort. All of our infrastructure was built in a quarter, by 1.5 engineers. The Kubernetes operator took 1 part-time engineer 3 months to complete.

Ongoing effort is minimal, and is basically just version upgrades for the proxy & SDKs. If we get dedicated staffing, we’ll likely transition to a sidecar model instead of a centrally deployed global service, as well as further improve our self-service tooling.

Questions?

I didn’t want this to go too long, but hopefully it gets some wheels turning, and provides clarity. Happy to answer whatever.


Thank you for sharing what your team has built at Netflix. Could you elaborate on what your team built for the tctl CLI and SDK? Also, is only the Java SDK supported, or other languages as well?

Sure! Today, we don’t offer much on top of the SDKs. What we do have are thin shims for the most common initialization: configuring mTLS, metrics, loggers, OTEL, worker service discovery integration, and the like; the plumbing that is common across projects for any given SDK and is high-leverage.

We support Go, Java, and Python, and there’s minimal community support for TypeScript. I’ll onboard TS as an officially-supported language once more than one team is using it, which isn’t the case right now. Any other language follows the same adoption model: I have a doc that outlines the minimal requirements for an SDK, I help the team get started if they need it, and if that SDK gets popular I’ll bring the foundations under my wing.

This also goes for the CLI: we configure mTLS to talk to our proxy on behalf of users so they can just call temporal without worrying about wiring that stuff up. We do have one additive command, temporal nflx batch-terminate, that people use for nuking all workflows of a given type, but now that the UI supports a similar action, it’s likely to be removed, as its use is exceedingly rare and an anti-pattern.

Early in our journey, we did build a bunch of abstractions atop the SDKs, but since we don’t have a dedicated team (it’s basically just me), I walked this back as we grew beyond Java. Things we built, like our task queue traffic-shaping library for Java, wound up being used by only one org, and Temporal built a similar feature for the Go SDK based on this, called Worker Sessions. So that Java library is no longer maintained by me, nor offered in the other internal official SDKs.

So, at this point, everyone gets the OSS SDK and can do whatever they desire with it. There have been common themes in what people want, however, like triggering alerts when activities retry beyond some threshold, so it’s likely I’ll be building & maintaining such interceptors for all of our SDKs.
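
For the Go SDK, such an interceptor would be a thin wrapper around the SDK’s interceptor interfaces; here is a hedged sketch where the Alert hook and threshold handling are placeholders for whatever alerting integration we would actually use.

package interceptors

import (
	"context"

	"go.temporal.io/sdk/activity"
	"go.temporal.io/sdk/interceptor"
)

// RetryAlertInterceptor fires an alert hook once an activity’s attempt
// count crosses a threshold. Alert is a stand-in for a real integration.
type RetryAlertInterceptor struct {
	interceptor.WorkerInterceptorBase
	Threshold int32
	Alert     func(ctx context.Context, info activity.Info)
}

func (r *RetryAlertInterceptor) InterceptActivity(
	ctx context.Context,
	next interceptor.ActivityInboundInterceptor,
) interceptor.ActivityInboundInterceptor {
	return &retryAlertActivityInbound{
		ActivityInboundInterceptorBase: interceptor.ActivityInboundInterceptorBase{Next: next},
		root:                           r,
	}
}

type retryAlertActivityInbound struct {
	interceptor.ActivityInboundInterceptorBase
	root *RetryAlertInterceptor
}

func (a *retryAlertActivityInbound) ExecuteActivity(
	ctx context.Context,
	in *interceptor.ExecuteActivityInput,
) (interface{}, error) {
	if info := activity.GetInfo(ctx); info.Attempt > a.root.Threshold {
		a.root.Alert(ctx, info) // flag the noisy retries before running the attempt
	}
	return a.Next.ExecuteActivity(ctx, in)
}

Workers would pick this up via the Interceptors field on worker.Options when the worker is constructed.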

Should I get funding for a team, I don’t plan to alter our approach on abstractions, etc. The use cases within our company are so diverse that I’d prefer innovations start locally and only get promoted to a central offering based on demand, or if we can see a clear global need for something.

Hope this helps!
