At this year’s Replay, I fielded a steady stream of questions about how we built a platform around Temporal Cloud at Netflix. Since the subject is so popular, and there’s clearly appetite across the community to do the same thing, this post outlines the full scope of what we’ve built to make Temporal Cloud behave as if it’s just another internal service.
I’ve previously posted about our Cloud proxy service, so check that out afterwards if you want to read more on actual implementations there.
First: Why Cloud
I’ve told this story quite a few times before but for posterity, we’ll do it again.
We started on-prem, but due to internal requirements of our first use cases, we couldn’t use Cassandra as the backing store, so we were stuck with RDS. The first use cases we onboarded were enough to scale us to the maximum size of RDS pretty quickly, so we were faced with a decision: (1) maintain the status quo & stop onboarding new use cases, (2) deploy many clusters, or (3) adopt Temporal Cloud.
First, I am a firm believer in the value that Temporal provides: I think it would not be in Netflix’s best interest to prevent people from using it. Indeed, developers across the company agree with me, as shown by their ardent internal evangelism. So, option one is out the door.
Second, it’s important to understand that Temporal at Netflix does not have a dedicated team (yet?). It’s not my job / our job to be running and improving Temporal at Netflix; it’s a supporting system of our greater product team and it just happens that other teams benefit from our efforts. Deploying and managing additional cells to support these new teams would distract from what we’re supposed to be working on.
Third, we consider ourselves experts with Temporal, but we’re not nearly on the level of Temporal themselves. I trust them to operate the technology better and more efficiently than we ever could, even if we fully staffed an internal team to run it.
So, option two is out, and that leaves us with Temporal Cloud.
Now, we need to figure out how to efficiently offer Temporal to anyone who wants to use it, with as little effort on our part as possible. You’ll see this drive to stay hands-off as the primary recurring theme throughout this post: automating everything we can, as simply as we can, comes before everything else.
Internal Lingo
- InfraAPI: The product I work on day-to-day. It is our infrastructure management platform for Netflix, offering a single, common way of interacting with every control plane Netflix has (there are a lot). You can think of it as AWS’s own Console and APIs, but for everything: AWS itself, our other clouds, and the systems built atop those runtimes.
- Metatron: Our mTLS service. Each certificate carries strong identity information about the user and application making any individual request.
- Gandalf: Our authorization system. It’s a policy-based service that easily integrates with the identities provided by Metatron.
- Cryptex: Our high-scale data encryption service.
- Atlas: Our timeseries metric service.
High-level Architecture
Legend:
- Green stuff: Netflix Temporal Platform
- Red stuff: Temporal Cloud
- White stuff: Other Netflix services
temporaloperator
The temporaloperator application is responsible for reconciling the desired Namespace CRDs that users create in InfraAPI into Temporal Cloud. It ensures that each namespace is configured correctly with the right mTLS CAs, and that the appropriate users have access (either Write or Read). Admin access to namespaces is not allowed, except for members of the temporal-dev Google Group.
The operator will also seed two Gandalf Policies, TEMPORAL-ns-{namespace} and TEMPORAL-ns-{namespace}-readers, both owned by temporal-dev, with Policy Editor assigned to the Google Group set in the Namespace CRD’s owner field. The Temporal team does not manage the Gandalf Policies beyond creation: the owner of the Namespace is expected to assign the appropriate Rules.
A Namespace resource looks like the following:
```yaml
apiVersion: temporal.infra.netflix.net/v1beta1
kind: Namespace
metadata:
  name: my-fancy-namespace
  namespace: "{myapp namespace}"
spec:
  name: fancy-test
  owner: example@netflix.com
  regions:
    - us-east-1
  retentionDays: 30
```
All configuration for Temporal Cloud is performed via this resource. When the resource is created, we also create a UsersWorkflow that runs on a 5-minute schedule, ensuring users who need access (based on the Gandalf Policies) have the correct access, and removing users who are either no longer at the company or no longer part of the group.
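To make that concrete, here’s a minimal sketch (not our actual code) of what such a reconciliation workflow could look like with the Temporal Java SDK. The NamespaceUserActivities interface and its method names are hypothetical stand-ins for the Gandalf and Temporal Cloud calls, and the 5-minute cadence comes from whatever schedule starts the workflow:

```java
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityOptions;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;
import java.time.Duration;
import java.util.List;

// Hypothetical activities hiding the real Gandalf and Temporal Cloud calls.
@ActivityInterface
interface NamespaceUserActivities {
  List<String> listPolicyMembers(String policyName);   // who should have access
  List<String> listNamespaceUsers(String namespace);   // who currently has access
  void grantAccess(String namespace, String user, String level);
  void revokeAccess(String namespace, String user);
}

@WorkflowInterface
interface UsersWorkflow {
  @WorkflowMethod
  void reconcile(String namespace);
}

class UsersWorkflowImpl implements UsersWorkflow {
  private final NamespaceUserActivities activities =
      Workflow.newActivityStub(
          NamespaceUserActivities.class,
          ActivityOptions.newBuilder().setStartToCloseTimeout(Duration.ofMinutes(1)).build());

  @Override
  public void reconcile(String namespace) {
    List<String> desired = activities.listPolicyMembers("TEMPORAL-ns-" + namespace);
    List<String> actual = activities.listNamespaceUsers(namespace);

    // Grant access to anyone in the policy who is missing it...
    for (String user : desired) {
      if (!actual.contains(user)) {
        activities.grantAccess(namespace, user, "Write");
      }
    }
    // ...and remove anyone who no longer qualifies (left the company or the group).
    for (String user : actual) {
      if (!desired.contains(user)) {
        activities.revokeAccess(namespace, user);
      }
    }
  }
}
```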
temporalproxy
The temporalproxy application is the primary service for Netflix Temporal: it mediates all API and CLI access to Temporal Cloud. Its entire goal is to make Temporal Cloud appear as if it were a paved-road application within Netflix.
The proxy terminates Metatron mTLS and authorizes the identity of the request against the Gandalf Policy. Currently, only the TEMPORAL-ns-{namespace} Policy is used for authorization at the proxy, meaning only writers can access the API. This simplifies the scope of authz so that we don’t need to stay on top of which APIs are read access versus write. If we had more staffing, this is an area we would improve; readers, to date, have only been interested in using the UI anyway.
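For a flavor of what that authorization step looks like, here’s a minimal sketch of a gRPC server interceptor that rejects any caller who isn’t a writer on the namespace’s Policy. The MetatronIdentityExtractor and GandalfClient interfaces are hypothetical stand-ins for our internal clients, and the namespace is pulled from a made-up header here rather than from the request message:

```java
import io.grpc.Grpc;
import io.grpc.Metadata;
import io.grpc.ServerCall;
import io.grpc.ServerCallHandler;
import io.grpc.ServerInterceptor;
import io.grpc.Status;
import javax.net.ssl.SSLSession;

// Hypothetical stand-ins for our internal Metatron and Gandalf clients.
interface MetatronIdentityExtractor {
  String identityFrom(SSLSession session);
}

interface GandalfClient {
  boolean isWriter(String identity, String policyName);
}

class GandalfAuthInterceptor implements ServerInterceptor {
  // Hypothetical header carrying the target namespace; in practice the namespace comes
  // from the request message itself, which is glossed over in this sketch.
  private static final Metadata.Key<String> NAMESPACE =
      Metadata.Key.of("temporal-namespace", Metadata.ASCII_STRING_MARSHALLER);

  private final MetatronIdentityExtractor metatron;
  private final GandalfClient gandalf;

  GandalfAuthInterceptor(MetatronIdentityExtractor metatron, GandalfClient gandalf) {
    this.metatron = metatron;
    this.gandalf = gandalf;
  }

  @Override
  public <ReqT, RespT> ServerCall.Listener<ReqT> interceptCall(
      ServerCall<ReqT, RespT> call, Metadata headers, ServerCallHandler<ReqT, RespT> next) {
    // The Metatron mTLS session is terminated by the proxy, so the peer identity is on the transport.
    SSLSession session = call.getAttributes().get(Grpc.TRANSPORT_ATTR_SSL_SESSION);
    String identity = metatron.identityFrom(session);
    String namespace = headers.get(NAMESPACE);

    // Only writers are authorized; we don't distinguish read APIs from write APIs.
    if (namespace == null || !gandalf.isWriter(identity, "TEMPORAL-ns-" + namespace)) {
      call.close(Status.PERMISSION_DENIED.withDescription("not a writer on " + namespace),
          new Metadata());
      return new ServerCall.Listener<ReqT>() {};
    }
    return next.startCall(call, headers);
  }
}
```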
After successful authorization, the payloads of the request are encrypted with Cryptex. The only information that is not encrypted, from Temporal Cloud’s perspective, is operational data like WorkflowType, Workflow ID, Task Queue names, etc. Requests sent to Temporal Cloud use mTLS via a package called jit-pki, which manages short-lived certificate rotation.
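In Temporal terms, payload encryption is a codec. Here’s a minimal sketch of what a Cryptex-backed PayloadCodec could look like; CryptexClient is a hypothetical stand-in for our encryption service, and in our setup this logic lives in the proxy rather than in each SDK. Only payload bodies are encrypted, which is why the operational fields mentioned above stay visible:

```java
import com.google.protobuf.ByteString;
import com.google.protobuf.InvalidProtocolBufferException;
import io.temporal.api.common.v1.Payload;
import io.temporal.payload.codec.PayloadCodec;
import io.temporal.payload.codec.PayloadCodecException;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for the Cryptex client.
interface CryptexClient {
  byte[] encrypt(byte[] plaintext);
  byte[] decrypt(byte[] ciphertext);
}

class CryptexPayloadCodec implements PayloadCodec {
  private static final String ENCODING_KEY = "encoding";
  private static final ByteString ENCODING = ByteString.copyFromUtf8("binary/encrypted");

  private final CryptexClient cryptex;

  CryptexPayloadCodec(CryptexClient cryptex) {
    this.cryptex = cryptex;
  }

  @Override
  public List<Payload> encode(List<Payload> payloads) {
    // Wrap each original Payload in a new one whose body is the Cryptex-encrypted bytes.
    return payloads.stream()
        .map(p -> Payload.newBuilder()
            .putMetadata(ENCODING_KEY, ENCODING)
            .setData(ByteString.copyFrom(cryptex.encrypt(p.toByteArray())))
            .build())
        .collect(Collectors.toList());
  }

  @Override
  public List<Payload> decode(List<Payload> payloads) {
    return payloads.stream()
        .map(p -> {
          try {
            // Decrypt and restore the original Payload, including its original metadata.
            return Payload.parseFrom(cryptex.decrypt(p.getData().toByteArray()));
          } catch (InvalidProtocolBufferException e) {
            throw new PayloadCodecException(e);
          }
        })
        .collect(Collectors.toList());
  }
}
```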
Each namespace in Temporal Cloud has its own endpoint (e.g. example.xxxxx.tmpl.cloud): xxxxx is the customer ID for Netflix, and example is the actual namespace. Because of this, temporalproxy must maintain a client for each individual namespace. To that end, the proxy informs on the Namespace CRD in InfraAPI, creating a client when the Namespace is provisioned.
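A minimal sketch of that per-namespace client bookkeeping, assuming the endpoint convention above, the standard Temporal Cloud gRPC port, and jit-pki wiring that’s elided here:

```java
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class NamespaceClientRegistry {
  private final String accountId; // the xxxxx customer ID
  private final Map<String, WorkflowServiceStubs> clients = new ConcurrentHashMap<>();

  NamespaceClientRegistry(String accountId) {
    this.accountId = accountId;
  }

  /** Called from the informer when a Namespace CRD is provisioned. */
  WorkflowServiceStubs clientFor(String namespace) {
    return clients.computeIfAbsent(namespace, ns ->
        WorkflowServiceStubs.newServiceStubs(
            WorkflowServiceStubsOptions.newBuilder()
                .setTarget(ns + "." + accountId + ".tmpl.cloud:7233") // 7233 assumed here
                // .setSslContext(...) would wire in the short-lived jit-pki client certs
                .build()));
  }
}
```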
temporalmetricsscraper
Temporal Cloud exposes a Prometheus metrics endpoint that we scrape using temporalmetricsscraper to ingest into Atlas. These include some internal metrics that are helpful for operators and users alike in understanding whether there are internal issues with their namespace.
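Roughly, the scraper is just a loop over that endpoint. Here’s a deliberately naive sketch that assumes the endpoint returns the Prometheus text exposition format and that AtlasPublisher stands in for our real Atlas client; a real implementation needs the mTLS client certificate and a proper Prometheus parser:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical stand-in for publishing into Atlas.
interface AtlasPublisher {
  void publish(String series, double value);
}

class TemporalMetricsScraper {
  private final HttpClient http = HttpClient.newHttpClient(); // the real scrape needs mTLS
  private final AtlasPublisher atlas;

  TemporalMetricsScraper(AtlasPublisher atlas) {
    this.atlas = atlas;
  }

  /** Fetches the endpoint once and republishes each sample it can parse. */
  void scrapeOnce(String endpoint) throws Exception {
    HttpResponse<String> response = http.send(
        HttpRequest.newBuilder(URI.create(endpoint)).GET().build(),
        HttpResponse.BodyHandlers.ofString());

    for (String line : response.body().split("\n")) {
      if (line.isBlank() || line.startsWith("#")) {
        continue; // skip HELP/TYPE and comment lines
      }
      int lastSpace = line.lastIndexOf(' ');
      if (lastSpace < 0) {
        continue;
      }
      // Naive parse of "<name{labels}> <value>"; quoted labels and timestamps would
      // trip this up, so use a real exposition-format parser in practice.
      String series = line.substring(0, lastSpace);
      double value = Double.parseDouble(line.substring(lastSpace + 1));
      atlas.publish(series, value);
    }
  }
}
```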
UI & temporalrde
Since all the data stored in Temporal Cloud is encrypted, we integrate with their UI by configuring a Remote Data Converter service, temporalrde, which is called by a user’s browser to decrypt payloads for viewing.
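For orientation, this is the rough shape of such a remote data converter (what Temporal calls a codec server): an HTTP endpoint the UI can POST encoded payloads to and get readable ones back. The sketch below reuses the hypothetical codec from earlier and elides the JSON plumbing and, importantly, authentication of the browser caller:

```java
import com.sun.net.httpserver.HttpServer;
import io.temporal.payload.codec.PayloadCodec;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;

class TemporalRdeServer {
  private final PayloadCodec codec; // e.g. the CryptexPayloadCodec sketched earlier

  TemporalRdeServer(PayloadCodec codec) {
    this.codec = codec;
  }

  void start(int port) throws IOException {
    HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
    // The Temporal UI POSTs {"payloads": [...]} to /decode and expects the same shape back.
    server.createContext("/decode", exchange -> {
      byte[] request = exchange.getRequestBody().readAllBytes();
      byte[] response = decodePayloadsJson(request); // JSON <-> Payload plumbing elided
      exchange.getResponseHeaders().add("Content-Type", "application/json");
      exchange.sendResponseHeaders(200, response.length);
      try (OutputStream out = exchange.getResponseBody()) {
        out.write(response);
      }
    });
    server.start();
  }

  // Parse the JSON body into Payload protos, run codec.decode(...), and serialize back to
  // JSON. Elided in this sketch; protobuf's JsonFormat is the usual way to do the conversion.
  private byte[] decodePayloadsJson(byte[] requestBody) {
    throw new UnsupportedOperationException("sketch only");
  }
}
```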
Cloud Audit Log
Temporal Cloud offers audit logs for any changes that occur via the TCloud management API or UI; these are exported to a Kinesis Firehose stream that dumps into an S3 bucket.
Event History Export
All Temporal Workflows are effectively just an event-sourced log. For every completed workflow, Temporal will export the workflow’s state to an S3 bucket. Only the Netflix Temporal team has access to this bucket; we have admin tooling to grant users access to it, but this is not yet self-service, though it will be eventually.
SDKs
We explicitly do not offer higher-level abstractions for the SDKs. Originally, our Java SDK did have high-level concepts (which I’ve blogged about here previously), but the cost of maintaining these did not pass the simplicity test. Instead, we have small factory libraries that ensure the SDKs are configured correctly for the paved-road systems (logs, telemetry, mTLS, service discovery, etc.).
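To give a sense of how thin these factories are, here’s a sketch of roughly what one could look like for the Java SDK. The proxy address and the commented-out paved-road wiring are placeholders, not real Netflix APIs:

```java
import io.temporal.client.WorkflowClient;
import io.temporal.client.WorkflowClientOptions;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;

public final class NetflixTemporalClients {
  private NetflixTemporalClients() {}

  /** Returns a WorkflowClient pointed at the proxy with paved-road defaults applied. */
  public static WorkflowClient forNamespace(String namespace) {
    WorkflowServiceStubs stubs = WorkflowServiceStubs.newServiceStubs(
        WorkflowServiceStubsOptions.newBuilder()
            .setTarget("temporalproxy.netflix.example:7233") // placeholder proxy address
            // .setSslContext(...)   Metatron mTLS would be wired in here
            // .setMetricsScope(...) paved-road telemetry (Atlas) would be wired in here
            .build());

    return WorkflowClient.newInstance(
        stubs,
        WorkflowClientOptions.newBuilder()
            .setNamespace(namespace) // e.g. "fancy-test" from the Namespace CRD above
            .build());
  }
}
```

A worker or starter then just calls NetflixTemporalClients.forNamespace("fancy-test") and gets a client that talks to the proxy with everything else already configured.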
Observability
We offer pre-built views that users can import into their own metrics dashboards, exposing a combination of SDK, Service, & Cloud metrics. Similarly, we have Alert Templates that users can adopt for common baseline alerts.
Level of Effort
So, that’s everything. Now let’s talk effort. All of our infrastructure was built in a quarter, by 1.5 engineers. The Kubernetes operator took 1 part-time engineer 3 months to complete.
Ongoing effort is minimal: basically just version upgrades for the proxy & SDKs. If we get dedicated staffing, we’ll likely transition to a sidecar model instead of a central, globally deployed service, as well as further improve our self-service tooling.
Questions?
I didn’t want this to go too long, but hopefully it gets some wheels turning and provides clarity. Happy to answer whatever.