Part of our Spooky Stories series, this talk was given by @tihomir originally as a Replay 2024 lunch and learn titled “Develop Your Temporal Applications with Confidence.”
- Video: Spooky Stories: Terrifying Tales from the Temporal Trenches
- Slides: 2024-10-31 - Terrifying Tales from the Temporal Trenches (Tihomir Surdilovic).pdf - Google Drive
As you start writing your Temporal applications, whether you are brand new to Temporal or have used it for years, there are always new things to learn. Here at Temporal, my job is to help answer questions from Temporal Cloud customers and our community members, and over time certain patterns and best practices have emerged that come up again and again in those questions.
This talk is not intended to tell you explicitly what to do, but to build up your confidence and add to your experience a little: how you might approach your Temporal applications, or at least what to look at and consider when you do.
Let’s take a look at a sample use case.
Use case: Resource Provisioning
While this could be any use case, one that comes up a lot is some sort of resource provisioning. The idea is that you usually have some customers that through various UIs interact with some of the existing API services that you might have on the left-hand side.
These customers for this use case do some resource provisioning requests, whether it’s provisioning hardware, provisioning infrastructure, provisioning payments, provisioning account setups, whatever that provisioning might be. But the idea is that this provisioning request coming from the customers goes to an existing API and gets translated into Workflow executions that Temporal is scheduling and executing for us.
Then, the Temporal Client API talks to the Temporal Service (or Temporal Cloud, in this case), which then communicates with your Workers through your Workflow definitions to run business logic that then ends up calling different resources, and provisioning those resources. These are Activities that need to be invoked during this provisioning process for this customer.
So this is essentially a fan-out, batch-processing use case, which comes up a lot. But the lessons in this talk are more general than this one specific use case. Have your own use case in mind as we go, and you’ll probably find bits and pieces that are relatable to you as well.
During a design review, as we start looking at how we’re going to implement this, it usually starts without even looking at Temporal at all, or at least not directly.
Focusing on the existing Provisioning API Service
First, we actually look at this Provisioning API that already exists, and also look at some of the existing services that are going to interact with the Temporal Server, whether that is on-prem or Temporal Cloud.
So what kind of questions can we ask there?
RPS Limits
The first thing we are probably going to ask is the total rate of invocation, e.g. “How many workflow starts are you expecting during a given period of time?” We need to assess the workload you’re trying to put on your Temporal system, whether that’s an on-prem deployment or, if you’re a Cloud user, the Cloud namespaces that you have provisioned already or other Cloud namespaces that you might have in the future.
This is a very important question. Why?
Every namespace in Temporal has a requests per second (RPS) limit, and we have to understand what value to set that to, either in dynamic configuration or in Cloud. (For Cloud, we can help you determine this value through a support ticket.)
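On OSS, that limit lives in the server’s dynamic configuration. As a sketch, an override can look like the following (frontend.namespaceRPS is the commonly used key; verify the exact key, value, and format against the dynamic configuration reference for your server version, and the namespace name here is just an example):

```yaml
# Dynamic configuration override (sketch) raising the per-namespace RPS limit.
frontend.namespaceRPS:
  - value: 2400
    constraints:
      namespace: "provisioning"
```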
Key values to track on your dashboards and server metrics are:

- Server (OSS): service_errors_resource_exhausted
- Cloud: temporal_cloud_v0_resource_exhausted_error_count
- Worker (SDK): temporal_long_request_failure, temporal_request_failure
These metrics let you see when you are starting to hit your RPS limit. Say a bunch of customers all request provisioning workflows at the same time: you want to know when that is happening. This might not cause a big problem (being rate limited does not necessarily mean the execution will fail to start), but it could, so it helps to understand when it occurs.
The temporal_long_request_failure and temporal_request_failure metrics in particular are two of the more important metrics for troubleshooting Temporal in general.
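For example, on OSS you could graph when rate limiting kicks in with a Prometheus query along these lines (the resource_exhausted_cause label and its RpsLimit value are assumptions to verify against the metric labels your server version emits):

```promql
sum(rate(service_errors_resource_exhausted{resource_exhausted_cause="RpsLimit"}[1m])) by (operation)
```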
Workflow ID Reuse Policy and connection errors
The second thing we are going to look at is, once again, still not on the Temporal side, but it is related to how Temporal works.
If I’m starting Workflow executions from my Provisioning API, I want to understand: what is the Workflow ID or the identity of my Workflow execution that I’m going to start? And also, how am I planning to deal with situations where a single customer (or multiple customers) requests provisioning of a single entity in my system, whether that be with a “Customer ID” or something else.
I also want to understand how to detect and deal with connection errors, because my Provisioning API may talk to Temporal either directly or through proxies that sit in between, such as a load balancer or a private link network, or whatever else could be between your API service and your Temporal instance. So you want to make sure you understand when issues arise between those two sides.
Here are some of the useful metrics around these pieces:
- Server (OSS): service_errors_with_type (filter by operation, error_type, service_type)
- Cloud: temporal_cloud_v0_frontend_service_error_count
- Worker (SDK): temporal_request_failure, plus gRPC client metrics (these need to be configured)
These metrics allow us to understand when the service is reporting errors. On OSS, service_errors_with_type allows filtering by things like operation, error type, and service type. An example of an “operation” could be “start workflow execution,” and for service type we could focus on the frontend service to see all the frontend service or client errors coming through. On Cloud, we also have temporal_cloud_v0_frontend_service_error_count, and on the Worker side you will see the temporal_request_failure metric over and over again.
In some cases, you can also add gRPC client metrics, which is fairly useful; there is a Java SDK metrics sample (with more languages to come). The Temporal Client API allows you to plug in a gRPC client metrics library to report gRPC-specific errors on top of the client metrics that Temporal provides itself.
Review how workflow executions are started
Here is some typical client code (in Java, but the logic can be translated to other SDKs):
@PostMapping(value = "/provision")
void provisionCustomerResources(@RequestBody Customer customer) {
  ResourceProvisioning workflow = client.newWorkflowStub(
      ResourceProvisioning.class,
      WorkflowOptions.newBuilder()
          .setTaskQueue("ProvisioningTaskQueue")
          .setWorkflowId(customer.getId())
          .build());
  // Start exec async
  WorkflowClient.start(workflow::provision, customer);
}
In code reviews, we see this a lot: some function or HTTP library mapping a POST at “/provision” that the client invokes. This is pretty typical; a user finds something like it in our samples and uses the API to start the execution asynchronously. And this is fine, but whenever we see it in a code review, we tend to stop the user right there.
Why? Because there are a couple of things to talk about in this case.
Add error handling
// Start exec async
try {
WorkflowClient.start(workflow::provision, customer);
} catch (WorkflowExecutionAlreadyStarted | StatusRuntimeException e) {
// …
}
The number one thing is that, more often than not, you are not handling any exceptions, especially around WorkflowClient.start() in Java, ExecuteWorkflow() in Go, and their equivalents. Make sure you handle the returned error, or try/catch, in whichever SDK you use.
You have to make sure you can handle the WorkflowExecutionAlreadyStarted error, for example. In Temporal, by default you cannot have two executions with the same Workflow ID running on the same namespace at the same time. That’s something Temporal guarantees, and if you try to start a second one, you will get this exception. Also, StatusRuntimeException is a Java gRPC exception, so make sure you handle gRPC or client errors, as we discussed, in whatever language you use.
Log / Persist failures for Workflow starts
// Start exec async
try {
WorkflowClient.start(workflow::provision, customer);
} catch (WorkflowExecutionAlreadyStarted | StatusRuntimeException e) {
persistentLogger.log(client.getOptions().getNamespace(),
    customer.getId(), e.getMessage());
}
Another very important thing is to log these failures. Oftentimes our users and customers say, “Hey, I was hitting very high RPS for a long time and was being rate limited for 10, 15, 20 minutes.” For many of these APIs, the SDK will retry for up to a minute, so if you are being throttled for a very long time, or having serious connection issues, the SDK will eventually give up and you will get these failures. At that point, if you don’t log the Workflow ID that failed to start, you can’t go back and try to start those executions again. A lot of times the user asks, “Can you tell me which Workflow IDs I tried to start?” and unfortunately, we can’t.
And then it becomes a problem for them to get hold of all the Workflow IDs the client requested that never actually resulted in a Workflow execution. So try to log these Workflow IDs on your clients, especially in case of failures.
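The persistentLogger in the snippet above can be anything durable. Here is a minimal sketch of the bookkeeping (the FailedStartLog class is hypothetical and not a Temporal API; a real implementation would write to a database table rather than an in-memory map, so the IDs survive a client restart):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: record Workflow IDs whose start failed so a retry job can replay them.
// In production, the map would be a durable store (e.g. a PostgreSQL table).
class FailedStartLog {
    private final Map<String, String> failedStarts = new LinkedHashMap<>();

    // Called from the catch block around WorkflowClient.start(...)
    void record(String namespace, String workflowId, String error) {
        failedStarts.put(namespace + "/" + workflowId, error);
    }

    // A retry job reads the pending IDs and attempts the starts again.
    Map<String, String> pending() {
        return new LinkedHashMap<>(failedStarts);
    }

    // Remove an entry once the start has been successfully retried.
    void markRetried(String namespace, String workflowId) {
        failedStarts.remove(namespace + "/" + workflowId);
    }
}
```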
Select Workflow ID Reuse Policy if needed
ResourceProvisioning workflow =
    client.newWorkflowStub(
        ResourceProvisioning.class,
        WorkflowOptions.newBuilder()
            .setTaskQueue("ProvisioningTaskQueue")
            .setWorkflowId(customer.getId())
            // Allow start with customer.id wfid only when the last
            // execution's final state was one of [terminated, cancelled, timed out, failed]
            .setWorkflowIdReusePolicy(
                WorkflowIdReusePolicy.WORKFLOW_ID_REUSE_POLICY_ALLOW_DUPLICATE_FAILED_ONLY)
            .build());
Another thing that often comes up when you start Workflow executions is the Workflow ID reuse policy. We said earlier that, by default, you cannot start two Workflow executions with the same Workflow ID, but there are other options. In this example, we only allow clients to start a Workflow execution with a given Workflow ID if the last execution with that ID ended in one of the “failed” states: terminated, canceled, timed out, or failed itself. So start thinking about which reuse policy makes the most sense for your use case, and look at the WorkflowOptions in your SDK to read further on that.
Workflow ID Reuse Policy and Namespace Retention
Another thing that comes up is the Workflow ID reuse policy and namespace retention. Temporal, whether you run it on-prem or use Temporal Cloud, is not a long-term storage solution. Once a Workflow execution completes, it is subject to what we call a namespace retention period. When you create a namespace (or update it later on), you set a retention period: the amount of time that completed executions stick around in that particular namespace. So understand that the Workflow ID reuse policy can only be applied within the retention period that you set on the namespace.
So if for your use case, this workflow ID reuse policy should apply for much longer than the namespace retention, you need to actually handle it differently.
Now, why would you have this scenario? For OSS, a lot of users set the namespace retention to five years, which is fine. But understand that Cloud users, for example, are paying for storage, so cost becomes a factor. (This is why Cloud namespaces max out at 90 days.) So sometimes it makes more sense to write the Workflow IDs you start from your clients to a custom store, say a PostgreSQL database, and check that database to determine whether a Workflow ID falls within your reuse policy, if for your use case the policy has to be maintained over very long periods of time.
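A sketch of that check against a custom store follows. All names here are illustrative, not a Temporal API: an in-memory map stands in for the PostgreSQL table, and the logic mirrors the “allow duplicate failed only” semantics described above, but over a horizon you choose independently of namespace retention.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Sketch: enforce "allow duplicate failed only" reuse semantics over a horizon
// longer than the namespace retention, backed by your own store.
class WorkflowIdHistoryStore {
    enum FinalState { COMPLETED, FAILED, TERMINATED, CANCELED, TIMED_OUT }

    private static final class Record {
        final FinalState state;
        final Instant completedAt;
        Record(FinalState state, Instant completedAt) {
            this.state = state;
            this.completedAt = completedAt;
        }
    }

    private final Map<String, Record> byWorkflowId = new HashMap<>();
    private final Duration reuseHorizon; // e.g. 5 years, independent of retention

    WorkflowIdHistoryStore(Duration reuseHorizon) {
        this.reuseHorizon = reuseHorizon;
    }

    // Record the final state of each execution as it completes.
    void recordCompletion(String workflowId, FinalState state, Instant when) {
        byWorkflowId.put(workflowId, new Record(state, when));
    }

    // Allow a new start only if there is no prior execution within the horizon,
    // or the prior execution ended in a non-successful state.
    boolean canStart(String workflowId, Instant now) {
        Record prior = byWorkflowId.get(workflowId);
        if (prior == null) return true;
        if (now.isAfter(prior.completedAt.plus(reuseHorizon))) return true;
        return prior.state != FinalState.COMPLETED;
    }
}
```

The client would consult canStart before calling WorkflowClient.start, and record completions from a Workflow-completion hook or a periodic visibility query.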
Select Workflow ID Conflict Policy if needed
ResourceProvisioning workflow =
    client.newWorkflowStub(
        ResourceProvisioning.class,
        WorkflowOptions.newBuilder()
            .setTaskQueue("ProvisioningTaskQueue")
            .setWorkflowId(customer.getId())
            // Don't start a new workflow;
            // instead return a workflow handle for the running workflow.
            .setWorkflowIdConflictPolicy(
                WorkflowIdConflictPolicy.WORKFLOW_ID_CONFLICT_POLICY_USE_EXISTING)
            .build());
A Workflow Id Conflict Policy determines how to resolve a conflict when spawning a new Workflow Execution with a particular Workflow Id used by an existing Open Workflow Execution.
This example prevents the Workflow Execution from spawning and returns a successful response with the Open Workflow Execution’s Run Id.
Think about your client certs expiration
Another thing you have to think about, which a lot of users and customers don’t, is certificates. On Cloud especially, you have to use TLS and connect using certificates. Think about client cert expiration and how you are going to configure or refresh your client certificates; the same applies to your Workers. Don’t let your certs expire so that your clients are suddenly unable to start or Signal executions. Then you have a big problem, and this happens more often than you’d think.
To address this, here are some tips:
- Look to APIs such as io.grpc.util.AdvancedTlsX509KeyManager in Java, which can explicitly refresh your certificates on a period shorter than their expiration time. Every SDK has some way to manage this. While you can manually update your certs and restart your client application and Workers, automatic refreshes are better, because Worker restarts can affect your Workflow executions and performance.
- If this is not an option, fall back to manually configuring your SSL context (see sample).
- Register your custom SSL context with WorkflowServiceStubsOptions.
Recommendations For Workflow Options
The last thing I’m going to talk about is Workflow Options which are set when you start Workflow executions. These include the Workflow ID, the Task Queue where the execution should be running, the Workflow ID reuse policy, and things like that. Here are some lessons learned:
- Do not explicitly define a Workflow Retry Policy. By default, Workflows in Temporal don’t have a retry policy, the use cases for Workflow retries are very rare, and it’s generally better not to set one. If you have this in your code, reach out on Slack and explain your use case so we can discuss it.
- Do not set Workflow Run Timeouts and Workflow Execution Timeouts (possibly even more important). These are two timeouts you can set on the client side when you start a Workflow execution, and both regulate how long the execution can run before the service stops it no matter what: a “timed out” termination of the execution. The reason these settings are not recommended is that your Workers and your business logic are never notified when these service-side timeouts fire. If you set a run timeout of 10 minutes, the service will time out the execution after 10 minutes regardless of whether your Workers are up. So imagine you set this to 10 minutes, but you have a Worker failure: your executions get created, but your Workflow code never runs, and after 10 minutes the execution times out without your Workflow ever having run at all.
- Do not change the Workflow Task Timeout unless needed. This is the amount of time the service will wait for your Worker to process a single Workflow Task, i.e., to run the next chunk of your Workflow code (the default is 10 seconds). While there are reasons to change this on a case-by-case basis, it is typically not recommended unless you have a specific reason.
Focusing on Resource Provisioning Services
Ok, back to our use case diagram. The second thing we usually look at is, again, not your Temporal Workflow or Activity code, but, regardless of use case, the downstream resources we need to invoke during the provisioning business logic: the APIs that our Activities connect to.
There are two questions to ask:
- Are these services rate limited? This is a very important question, because we end up in code reviews where people say, “We’ve got these 20 Activities, they all talk to 20 different services, and we’re just going to run them one after another at very large scale,” without thinking about this. So when you build your applications, think about the rate limits of your downstream services: how many concurrent calls are you making, and how many calls over a period of time?
- Are these services flaky? Even if you don’t know whether a service is rate limited, you might want to impose rate limits of your own if you know the dependent API is not going to be up 24/7, just to protect yourself. (This also applies to a database, to keep it from becoming overloaded.)
Temporal gives you some out-of-the-box ways to rate limit these calls. There are two methods.
API Rate Limit on concurrent calls
Let’s say Activity A (or the downstream service that Activity A invokes) is rate limited on the number of concurrent calls, for example, no more than 10 calls at the same time.
In this case, you need to do two things:
- Schedule Activity A on an activity-specific Task Queue
private final ActivityA activityA = Workflow.newActivityStub(ActivityA.class,
    ActivityOptions.newBuilder()
        .setTaskQueue("ActivityATaskQueue")
        // set timeouts…
        .build());
Here, we are setting a custom Task Queue ID that’s unique for that Activity.
- Configure one or more Activity Worker(s) that has this Activity registered, and set its maximum concurrent execution size.
Worker worker = factory.newWorker("ActivityATaskQueue",
    WorkerOptions.newBuilder()
        .setMaxConcurrentActivityExecutionSize(10)
        .build());
worker.registerActivitiesImplementations(new ActivityAImpl()); // only register activity A
// …
Make sure setMaxConcurrentActivityExecutionSize is set to correspond with the API limit. You can still scale with this approach: say an API tells us, “Do not call me more than a hundred times concurrently.” You can then set up 10 Workers, each with setMaxConcurrentActivityExecutionSize set to 10.
This actually comes up more often than you might think when you have very CPU- or GPU-intensive Activities, where a single Activity execution can take 80, 90, even up to a hundred percent of a Worker’s CPU. In that case, some customers and users we work with set this to 1, allowing just a single Activity to run on an Activity Worker at any time, and then control the max concurrency by the number of Workers.
API Rate Limit on concurrent calls with small workloads
Sometimes customers and users say, “I don’t really feel like doing this. I don’t want to offload this Activity to a separate Task Queue. I have to set up new Workers for it, I have to maintain them, but I don’t really have a large workload that I’m expected to grow for a long period of time.”
In this case, consider actually having a Workflow execution that receives this request for Activity invocation through Signals from other Workflows.
The idea: say we have four Workflows that all want to execute this Activity. Instead of executing it themselves, they each send a Signal to an “Activity Executor” Workflow, which can then control in code how many of these Activities execute in parallel or concurrently. The ones it cannot execute yet, it holds onto for batch processing. And each caller Workflow, once it sends its Signal, awaits a Signal back from the Activity Executor Workflow with the result of the Activity.
But understand that this really only works for small workloads. A single Workflow execution in Temporal is limited; Temporal scales via the number of Workflow executions. Once you reach large workloads, the much better approach is a number of Workers on an Activity-specific Task Queue, where you can scale out.
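To make the executor pattern concrete, here is the coordination logic sketched in plain Java. This is not Temporal Workflow code: a Semaphore plus a thread pool stand in for the executor Workflow’s bookkeeping, and in the real pattern submit would be a Signal to the executor Workflow with the result coming back as a Signal to the caller.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

// Plain-Java sketch of the "Activity Executor" idea: callers submit requests,
// the executor runs at most maxConcurrent at a time and hands results back.
class ActivityExecutorSketch {
    private final Semaphore slots;
    private final ExecutorService pool;

    ActivityExecutorSketch(int maxConcurrent) {
        this.slots = new Semaphore(maxConcurrent);
        this.pool = Executors.newCachedThreadPool();
    }

    CompletableFuture<String> submit(String request) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                slots.acquire(); // wait for a free slot, like the executor holding requests
                try {
                    return "provisioned:" + request; // stand-in for the Activity call
                } finally {
                    slots.release();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new CompletionException(e);
            }
        }, pool);
    }

    void shutdown() { pool.shutdown(); }
}
```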
API Rate Limit on number of calls over a period of time
In this scenario, your downstream API or whatever you’re calling in your Activities tells you, “Do not call me more than X number over Y period of time.” This is another rate limit that Temporal gives you out of the box.
In this case, a Worker tells the service to dispatch Activity Tasks to it (and to other Activity Workers polling on this Task Queue) at a particular rate. The previous setting was per-Worker; this one is at the Task Queue level. The service rate-limits the dispatch of tasks from the matching Task Queue: Activity Tasks at the defined rate.
- Try to Schedule this Activity on its separate activity-specific Task Queue
- Set max task queue activities per second in worker options
Worker worker = factory.newWorker("ActivityATaskQueue",
    WorkerOptions.newBuilder()
        .setMaxTaskQueueActivitiesPerSecond(10d / 60d)
        .build());
worker.registerActivitiesImplementations(new ActivityAImpl()); // only register activity A
// …
But, there are a couple of important gotchas:
Make sure all workers polling on ActivityATaskQueue have the same setting for setMaxTaskQueueActivitiesPerSecond
When you take this approach, it’s important that if you have multiple workers, they all have the same setting because the last one is going to win. If you have nine Workers that say to dispatch one per minute and the 10th Worker comes up and says to dispatch two per minute, the service is going to start dispatching two per minute because the service is going to listen to whatever the last request is. So make sure that all your Workers, Worker pods, or containers don’t change the setting for the specific Task Queue that the Activity Worker is polling on.
Be aware of the number of Task Queue partitions on ActivityATaskQueue
Many users set setMaxTaskQueueActivitiesPerSecond and expect it to just work. But bear in mind that Temporal Task Queues have what we call “partitions,” to allow for scaling; by default there are four. An Activity Task can be dispatched (or, when it is created, moved by the History service) to any of those four Task Queue partitions, and each partition determines what we call the “burst” of Activities. The idea behind setMaxTaskQueueActivitiesPerSecond is that over a period of time, the service makes sure the rate limit is applied. Say the limit is 10 per minute: the service will make sure there are 10 Activity Task dispatches over a period of 60 seconds. But at time T0, the burst determines how many Activities can be dispatched at the same time, per second.

With the four default Task Queue partitions, in the first second you could dispatch four Activity Tasks, and in the second second another burst of four. But then for the remaining 58 seconds you can only dispatch two. You still get 10 over a minute, but this burst factor, which is tied to the number of Task Queue partitions, governs how many can be dispatched at once. So you have to understand that as well. If you want to dispatch 10 a minute but also never exceed one Activity dispatched per second, start playing with the number of Task Queue partitions. That’s in the Dynamic Configuration: numTaskqueueReadPartitions and numTaskqueueWritePartitions. (If you’re using Temporal Cloud, you can open a support ticket and we can change these values for you.)
Let’s go through an example.
A practical example
Let’s say we set:
MaxTaskQueueActivitiesPerSecond = 0.1
That is one Activity Task every 10 seconds, or 6 per minute. How will that work in practice?
- As expected, the service will over time throttle the Activity Task dispatch rate on this Task Queue to 6 per minute.
- The “burst” per Task Queue partition is calculated as: ceiling(rate / tq-partitions) = ceiling(0.1 / 4) = 1
- Since the burst of 1 is per Task Queue partition, the burst across all partitions is 4.
- Therefore, we can see up to 4 Activity Tasks dispatched in the first second, up to a maximum of 6 within the minute.
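The burst arithmetic can be checked in a couple of lines. This is a plain-Java sketch of the ceiling formula from the bullets above, assuming the default partition count of 4:

```java
// Sketch of the burst formula: ceiling(rate / partitions), per Task Queue partition.
class BurstMath {
    // Burst allowed per Task Queue partition for a given dispatch rate.
    static long burstPerPartition(double ratePerSecond, int partitions) {
        return (long) Math.ceil(ratePerSecond / partitions);
    }

    // Worst-case initial burst across all partitions combined.
    static long burstAcrossPartitions(double ratePerSecond, int partitions) {
        return burstPerPartition(ratePerSecond, partitions) * partitions;
    }
}
```

Note that the per-partition burst stays at 1 for any rate up to 4 per second here, which is why lowering the rate alone does not reduce the initial burst; only reducing the partition count does.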
So it is important to think about this formula when you start using setMaxTaskQueueActivitiesPerSecond.
MaxTaskQueueActivitiesPerSecond and Eager Activities
Something that can come up as a surprise is Eager Activities. A lot of users follow the advice of calculating the burst rate, changing the number of task queue partitions, but they don’t see any difference whatsoever.
The reason could be that you have enabled what we call Eager Activities, and in some SDKs, especially the newer ones, they are enabled by default. Eager Activities are an optimization introduced to Temporal a few months ago: if a Worker says it wants to accept eager Activity Tasks, then as soon as the Task is created in the History service, the History service dispatches that Activity Task straight back to the Worker. It never gets moved from the History service to a matching Task Queue. So if you use Eager Activities, your Activity Tasks will never go onto a matching Task Queue, and the MaxTaskQueueActivitiesPerSecond setting is not applied at all. These rate limits are currently only applied at the matching Task Queue level (changing this is being worked on).
In the newer SDKs, setting MaxTaskQueueActivitiesPerSecond sometimes automatically disables Eager Activities, but not always, depending on the version you use. So check with your SDK and version, and if you don’t know, just disable Eager Activities; it’s a setting in your Worker options.
MaxTaskQueueActivitiesPerSecond Useful Metrics
If you’re using OSS, there is a metric task_dispatch_latency_count (container=“matching”, task_type=“Activity”). If you look at its rate over the window from our previous example (one minute), you should see that only the configured number of dispatches per minute is actually applied.
You can also check for Eager Activity execution with the activity_eager_execution metric, which lets you see whether the count of Eager Activities is going up.
(Continues in Part 2)