Part of our Spooky Stories series, this talk was given by @tihomir originally as a Replay 2024 lunch and learn titled “Develop Your Temporal Applications with Confidence.”
- Video: Spooky Stories: Terrifying Tales from the Temporal Trenches
- Slides: 2024-10-31 - Terrifying Tales from the Temporal Trenches (Tihomir Surdilovic).pdf - Google Drive
(Continues from Part 1)
Handling Resource-Intensive Provisioning
The next thing we’re going to look at is not only whether any of our downstreams are rate limited, but also whether any of the downstream calls or computations our Activities do are resource intensive. Do any Activities use a lot of resources on your compute? You have to think about that too: it’s not only about downstream services and their rate limits, it’s also about what your Activities do.
For example, a lot of use cases process large files, sometimes with large fan-outs even inside Activities, or run computations that are very resource intensive. Or they may not be especially resource intensive, but they’re running on very small compute where a single activity execution is using 90% of the CPU on the Worker. Memory-intensive code can also be a thing.
Offloading Resource-Intensive Activities to their own Task Queues
Let’s say Activity B is very resource intensive. Again, it makes sense to offload that Activity to its own Task Queue so we can optimize Worker resources for that Activity. We typically don’t want this single Activity, or a few resource-intensive Activities, to start blocking other Activities or Workflows, especially if they’re running on the same Workers.
So:
- Schedule Activity B on an Activity-specific Task Queue
- Optimize Activity B workers’ resources (num cores/heap size)
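Here’s a rough sketch of what that can look like in the Java SDK (the interface names, implementation class, Task Queue name, and timeout value are illustrative, not from the talk): the Workflow creates the Activity stub pointed at a dedicated Task Queue, and a separate Worker process polls only that queue with its own resource settings.
// In the Workflow: route Activity B to its own Task Queue
private final ActivitiesB activityB =
    Workflow.newActivityStub(
        ActivitiesB.class,
        ActivityOptions.newBuilder()
            .setTaskQueue("activity-b-task-queue")          // dedicated queue for the heavy Activity
            .setStartToCloseTimeout(Duration.ofMinutes(30)) // sized for the expensive computation
            .build());

// In a separate Worker process: poll only the dedicated queue, tuned for Activity B
WorkerFactory factory = WorkerFactory.newInstance(client);
Worker worker = factory.newWorker("activity-b-task-queue");
worker.registerActivitiesImplementations(new ActivitiesBImpl());
factory.start();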
What are some of the important things to keep in mind?
- Monitor activity timeouts. Why? Because you usually don’t want resource-intensive Activities to time out: if they time out, they might be retried, and then we might run this very expensive computation again, which we want to avoid.
- Useful metrics to monitor:
  - Worker metric: temporal_request_failure
  - Server metric: start_to_close_timeout, for operation TimerActiveTaskActivityTimeout
You can actually check activity timeouts on the server side as well. It’s not going to give you the Activity type, but it will at least tell you that these Activities have been retried.
Another thing to never do (and if you take just a few things from this talk, take this one :-)): please don’t disable activity retries. Never set the maximum attempt count in the activity retry options to 1. Always set it to 2 or 3 or some small number, because if you set it to 1, Temporal unfortunately cannot 100% guarantee that the activity will ever run on your workers. This unfortunately comes up a lot.
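As a minimal sketch in the Java SDK (the timeout and attempt values are illustrative), you keep retries enabled but bounded instead of disabling them:
// Keep retries enabled but bounded; never set maximumAttempts to 1
ActivityOptions options =
    ActivityOptions.newBuilder()
        .setStartToCloseTimeout(Duration.ofMinutes(10))
        .setRetryOptions(
            RetryOptions.newBuilder()
                .setMaximumAttempts(3) // small, but greater than 1
                .build())
        .build();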
Prevent Activity Timeouts in Activity Code
Regardless of what you do—whether you call downstream APIs, or if you have resource intensive activities—preventing Activity timeouts should be a priority in your case.
Let’s take a look at a couple of our options.
Set a timeout on your Activity that’s less than StartToCloseTimeout
One thing that a lot of our users and customers have started doing is getting the StartToCloseTimeout of the Activity. Once you have it (let’s say it’s 10 seconds), when we start connecting to things such as a database, downstream connections, or an HTTP client, we want to make sure that we set a timeout on that connection that is less than the 10 seconds we configured for the Activity StartToCloseTimeout.
Here’s an example, in Java:
@Override
public void doA() {
  // Derive the connection timeout from this Activity's configured StartToCloseTimeout
  Duration startToCloseTimeout =
      Activity.getExecutionContext().getInfo().getStartToCloseTimeout();
  int timeoutMillis = (int) startToCloseTimeout.minusSeconds(1).toMillis();
  RequestConfig requestConfig = RequestConfig.custom()
      .setConnectTimeout(timeoutMillis)
      .setSocketTimeout(timeoutMillis)
      .build();
  HttpClient httpClient =
      HttpClientBuilder.create().setDefaultRequestConfig(requestConfig).build();
  try {
    // "request" is the HTTP request this Activity prepares for the downstream call
    httpClient.execute(request);
  } catch (IOException e) {
    // Fail the Activity explicitly instead of letting the call block until the Activity times out
    throw ApplicationFailure.newFailure("connection timed out: " + e.getMessage(), "ConnectionTimeout");
  }
}
If just connecting to this downstream service starts taking longer than our StartToCloseTimeout, that will cause an exception, and we want to explicitly fail this Activity.
The perils of “Zombie Activities”
What this prevents is the case where the HttpClient call (or whatever you might be calling downstream) starts blocking for hours. This can happen; we even have a name for it: “Zombie Activities.” What it means is that if this call starts blocking for hours:
httpClient.execute(request);
this Activity execution will sit in an Activity executor slot on your Worker for an hour, two hours, 10 hours… as long as it blocks. It will never release the slot, which eats into the capacity of your Worker, even to the point where your Worker stops picking up Activity Tasks at all, and that Worker is basically out of commission.
Regardless of what your Activity does, whether it’s resource intensive or not: as long as it calls something, or connects to something over a protocol, make sure you set a timeout on that call.
Q: How do you know if you are suffering from “Zombie Activities”?
You can go to the UI and click on a Workflow that’s not progressing (especially if it normally would complete in seconds but has been hanging around for hours). At the top of that execution’s page there is a Workers link, a link to the Workers that are polling on this Task Queue. In that section, each Worker has a row showing whether it’s polling for Workflow Tasks and for Activity Tasks, typically with a blue check mark for each. If there is no check mark in the Activity column for a specific Worker that’s polling on this Task Queue, but you know you have Activities registered with that Worker, that’s usually how a ticket gets started on Cloud, or a community post saying, “Hey, I know I have an Activity registered with my Worker, but the UI says it’s not polling.”
The Worker also emits a metric called worker_task_slots_available, and you can monitor and alert on it to detect this as well. Another way to do it is through the API: task-queue describe. That also includes Worker information (basically the same Worker information displayed in the UI).
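As a rough sketch of that API call in Java (the namespace and Task Queue name are illustrative, and service is assumed to be an existing WorkflowServiceStubs instance), you can list the Activity pollers for a Task Queue programmatically:
// List Activity pollers for a Task Queue (same information the UI Workers section shows)
DescribeTaskQueueResponse resp =
    service.blockingStub().describeTaskQueue(
        DescribeTaskQueueRequest.newBuilder()
            .setNamespace("default")
            .setTaskQueue(TaskQueue.newBuilder().setName("provisioning-task-queue").build())
            .setTaskQueueType(TaskQueueType.TASK_QUEUE_TYPE_ACTIVITY)
            .build());
for (PollerInfo poller : resp.getPollersList()) {
  System.out.println(poller.getIdentity() + " last polled at " + poller.getLastAccessTime());
}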
If this happens and you don’t have a timeout set on the connection, the easiest thing to do is restart your Workers. However, in almost all cases, if the downstream keeps blocking, it will happen again. We’ve had those cases as well.
Don’t disable activity retries
Even if retries are expensive, still don’t disable activity retries:
- If activity attempt > 1, check if action was performed
- If action was performed, throw non-retryable application failure
Example in Java:
@Override
public void doA() {
  int attemptCount = Activity.getExecutionContext().getInfo().getAttempt();
  if (attemptCount > 1) {
    // perform checks to make sure the operation was not already performed
    // (actionAlreadyPerformed() stands in for your idempotency check)
    if (actionAlreadyPerformed()) {
      throw ApplicationFailure.newNonRetryableFailure(
          "action already performed", "action already performed error");
    }
  }
  // run activity code
}
For Long-Running Activities, Heartbeat
If you have long-running Activities, which can happen, make sure to heartbeat. The reason for this is to detect a Worker restart or crash sooner than it would normally take for the Activity to time out.
Let’s say we have a StartToCloseTimeout of three hours. If the Activity Worker picks up an Activity Task and then for some reason goes down in the first minute, we don’t want to wait two hours and 59 minutes for a retry of this Activity. Heartbeats can help with that.
An example in Java:
@Override
public void doA() {
  // Capture the Activity context on the Activity thread so the scheduled task can use it
  ActivityExecutionContext activityContext = Activity.getExecutionContext();
  ScheduledExecutorService scheduledExecutor = Executors.newSingleThreadScheduledExecutor();
  try {
    // Heartbeat every 20 seconds while the actual work runs
    scheduledExecutor.scheduleAtFixedRate(() -> activityContext.heartbeat(null), 0, 20, TimeUnit.SECONDS);
    doTheActualActivity();
  } finally {
    scheduledExecutor.shutdown();
  }
}
Sometimes, depending on your code, it’s hard to know where to put this heartbeat API call, so we do have samples, which vary by SDK. For example, in Python you apply a decorator to the activity and it will auto-heartbeat; in Java you can auto-heartbeat with a thread pool (as above); and in Go you heartbeat from a separate goroutine. There are ways to do this automatically, so you don’t have to figure out where to call heartbeat in your Activity code, because while it’s easy in loops, sometimes it’s harder.
Fetching Customer Data
Finally, we get to the part where we start talking about Temporal for our use case. Except… we’re actually going to be talking about something else: data. Our Workers are going to run Activities that pass data, probably to communicate with the Resource Provisioning Service APIs, and those APIs and downstream services also return results.
Something that’s important to ask yourself is, “What is the size of the data payloads that we’re passing between Workflows, Activities, and downstream services or databases?” because there are payload size limits (which max out at 2MB). So before we start writing our code and start talking to Service A, we have to understand what kind of data we’re dealing with. Are we talking about bytes, megabytes, or what? You have to understand that.
The payload size limits are important because they impact our implementation. If we’re passing small amounts of data, we can just pass it as a payload. Large amounts of data are normally passed by reference, which is the recommendation. But there can be more to it, especially when we start talking about async Activities or concurrently running Activities.
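Here’s a minimal sketch of the pass-by-reference idea (the S3 client, bucket name, and helper method are illustrative, not part of the talk): the Activity writes the large result to blob storage and returns only a URI, so the Temporal payload stays small.
@Override
public String fetchCustomerResources(String customerId) {
  // hypothetical helper that returns a large serialized result from the downstream service
  byte[] largeResult = fetchFromResourceProvisioningService(customerId);
  String key = "customers/" + customerId + "/resources.json";
  // AWS SDK v2 S3 client, assumed to be configured elsewhere
  s3Client.putObject(
      PutObjectRequest.builder().bucket("provisioning-results").key(key).build(),
      RequestBody.fromBytes(largeResult));
  // the Workflow only ever sees this small reference
  return "s3://provisioning-results/" + key;
}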
Monitoring for exceeded payload limits
A situation that can happen: a Workflow fetches all customer resources, and for a specific customer the result of the activity exceeds the payload size limit (or even the gRPC message size limit).
To monitor for this situation, look at the extremely useful SDK metric temporal_request_failure (make sure you write it down!) for the operation RespondActivityTaskCompleted.
Some approaches to deal with this situation:
- Write a custom Data Converter for data compression
- Have the Activity return a “paginated” response; testing is needed to determine the page size
- Make sure your workflow handles the event history size limit
  - Monitor via server / cloud metrics: workflow_terminate, temporal_cloud_v0_workflow_terminate_count
- Have the activity write this data to, say, a blob store and return only a reference or a URL (e.g. an S3 URL)
Here’s a code sample for paginated Activity result:
do {
  res = activities.getCustomerResources(customer, page);
  List<CustomerResource> customerResources = res.getCustomerResources();
  // process customer resources for this page…
  page = res.getNextPage();
} while (res.hasMorePages() && !Workflow.getInfo().isContinueAsNewSuggested());

if (Workflow.getInfo().isContinueAsNewSuggested()) {
  // continue-as-new, passing res.getNextPage() as input
}
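For the custom Data Converter option above, here’s a minimal sketch in Java (assuming the SDK’s built-in ZlibPayloadCodec; the exact wiring depends on how you build your client):
// Wrap the default data converter with a zlib codec so payloads are compressed on the wire
WorkflowClient client =
    WorkflowClient.newInstance(
        service,
        WorkflowClientOptions.newBuilder()
            .setDataConverter(
                new CodecDataConverter(
                    DefaultDataConverter.newDefaultInstance(),
                    Collections.singletonList(new ZlibPayloadCodec())))
            .build());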
The Need for Parallelism in Fan-Out Batch Processing
The next question we’re going to look at is the maximum number of resources per customer, or the need for parallelism. Especially in batch or fan-out use cases, we have to understand the parallelism of the use case. Are we concurrently provisioning 10 items max, or are we provisioning possibly hundreds of thousands or millions of items?
Sometimes we have customers, such as banks, that provision millions of transactions, let’s say at 6:00 PM after the day is over. So this is a huge fan-out, but we have to understand: out of the million records we might need to process (or in this case, the many resources we need to provision for many customers), how many do we have capacity for at the Worker level, the Service level, and the API level with all those rate limits? How much can we, or do we want to, process in parallel?
This will highly depend on whether we care if our process takes 10 hours to finish, or whether there is a deadline. Sometimes there is, and you have to provision a million customers within an hour.
Here are different approaches for either smaller or larger fan outs or batch processing use cases:
- Parallel processing of “batches” of activities/local activities
- Fan-out based on child workflows
- Fan-out in long-running activity
Usually when you have this type of batch processing or fan-out use case, you’re going to pick one of these three options, and there are some best practices and criteria for which option to pick.
Parallel processing of “batches” of activities/local activities
The first approach is parallel processing of batches of Activities. The idea is that you work in batches: let’s say we’re provisioning 100 customers, we pick 10 at a time, and we have a loop in our workflow code that processes 10 at a time. If this is fine for your use case and you don’t care whether it takes a long time, or possibly forever, to finish, this is a good approach.
Here’s a code sample demonstrating this:
// process resources in each "batch"/partition in parallel
for (List<CustomerResource> batch : customerResourceBatches) {
  List<Promise<Void>> batchPromises = new ArrayList<>();
  for (CustomerResource resource : batch) {
    batchPromises.add(Async.procedure(this::provisionResource, resource));
  }
  // Wait for all promises to complete
  try {
    Promise.allOf(batchPromises).get();
  } catch (ActivityFailure af) {
    // .. handle failure if needed here
  }
  // need to iterate through all promises
  for (Promise<Void> promise : batchPromises) {
    if (promise.getFailure() == null) {
      promise.get(); // will unblock if already completed
    }
  }
  // make sure we check if we need to continue-as-new
  if (Workflow.getInfo().isContinueAsNewSuggested()) {
    // continue-as-new, pass unprocessed batches to the continued execution
  }
}
The only thing, if you do this in Workflow code, is to make sure that you handle ContinueAsNew, because a single execution does have limits on event history count. So you will have to check whether you need to continue into a new execution or not.
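As a small sketch of that check in the Java SDK (the workflow interface, method, and argument names are illustrative):
// When the server suggests it, continue-as-new and hand the remaining work to the next run
if (Workflow.getInfo().isContinueAsNewSuggested()) {
  ProvisioningWorkflow next = Workflow.newContinueAsNewStub(ProvisioningWorkflow.class);
  next.provision(remainingBatches); // starts the continued execution with the unprocessed batches
}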
The size of the combined inputs of activities in a batch is another important thing. At this point:
Promise.allOf(batchPromises).get();
…we’re waiting for 10 activities to complete. But the way Temporal works, this is actually the point at which your Worker sends a request to the service saying, “Hey, please schedule my 10 activities,” in a single gRPC request. So if the combined input of the Activities that I’m processing in parallel exceeds the 4MB gRPC limit, we’re going to have a problem.
And this comes up a lot, because the documentation and samples mention a limit of 2000 parallel (or what we call “pending”) activities, meaning that in a single Workflow execution this batch size can actually be up to 2000 Activities. However, if the combined inputs of even just 10 of those 2000 Activities exceed the 4MB gRPC limit, we can’t invoke even those 10.
So when you start running anything concurrently or in parallel (Child Workflows, Activities, etc.), please make sure you understand this 4MB gRPC limit and how many of X you can run concurrently before you reach it. This can bite even much smaller workloads running in parallel if the inputs are large enough payloads. (We are also working on solutions for this, so the Worker can detect this situation proactively and break the work into several requests rather than just fail, since it comes up quite often.)
Some additional tips:
- Test the size of your combined inputs of activities in a batch before you go to prod
- Remember that the gRPC message size limit is 4MB
  - Worker metric: temporal_request_failure
- Don’t forget to partition event history (ContinueAsNew)
  - Server metric: workflow_terminate
Pros of this approach:
- Useful for a small-to-large number of customer resources
Cons:
- Can potentially create a large number of executions (ContinueAsNew)
- One batch is processed at a time
- Long provisioning of a single customer in a batch can delay the whole batch
Fan-out based on Child Workflows
Another approach for fan-outs is fan-out based on Child Workflows. The limits we discussed above around event history size and count are per single execution, and you can nest Child Workflows. Sometimes these fan-out or batch-processing use cases can be handled by a small number of Child Workflows that themselves create a number of Child Workflows, and you can nest them several layers deep.
Pros of this approach:
- Can support larger parallel fan-outs
- Can eliminate the need to worry about event history limits and calling ContinueAsNew
Cons of this approach:
- Creates a larger number of workflow executions
- Can be harder to debug, because if there are failures you have to find issues across a larger set of executions
- Need a Workflow ID strategy to determine which child workflow execution belongs to which parent, and which parent that parent belongs to
- Can require more visibility/insights/testing
  - Server metrics: workflow_terminate, workflow_continued_as_new
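Here’s a rough sketch of that pattern in Java (the child workflow interface, batch type, and Workflow ID scheme are illustrative):
// Fan out one Child Workflow per batch, with a Workflow ID that ties the child to its parent
List<Promise<Void>> children = new ArrayList<>();
for (CustomerBatch batch : batches) {
  ProvisionBatchWorkflow child =
      Workflow.newChildWorkflowStub(
          ProvisionBatchWorkflow.class,
          ChildWorkflowOptions.newBuilder()
              .setWorkflowId(Workflow.getInfo().getWorkflowId() + "-batch-" + batch.getId())
              .build());
  children.add(Async.procedure(child::provision, batch));
}
Promise.allOf(children).get(); // wait for all children to complete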
Fan-out in long-running Activity
This approach is for very, very large fan-outs, when you start processing millions and millions of transactions or batch items. Sometimes just doing them all in a single long-running Activity is something a lot of users do.
This is a single Activity that:
- Reads all customer resources
- Processes all in parallel inside activity
Pros:
- Don’t have to worry about event history restrictions, payload, gRPC message size limits
- Smaller number of Actions (for cost-consciousness on Cloud)
Cons:
- Have to set a StartToCloseTimeout on the activity, which will often be large.
  - This can be hard, because if you want to start 10 million executions in an Activity, it’s hard to know how long that’s going to take; you might get rate limited on the frontend service.
  - This could also take longer sometimes than others depending on how the cluster is set up, what the RPS limits are, etc.
- Activity has to heartbeat, because it’s potentially a long-running activity.
  - The heartbeat needs to include the last processed resource in the payload.
  - This allows us to “resume” from the last item processed in case of an Activity timeout or the Activity Worker going down, which would cause an Activity retry.
- Important to detect and try to prevent Activity timeouts
- Cannot 100% guarantee a resume from the last processed resource after an Activity timeout
- Need to implement rate limiting yourself
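For the heartbeat points above, here’s a minimal sketch in Java (the resource list and provisionResource call are illustrative) of recording progress in heartbeat details and resuming from it on a retry:
// Resume from the last heartbeated index if this is a retry
ActivityExecutionContext ctx = Activity.getExecutionContext();
int startIndex = ctx.getHeartbeatDetails(Integer.class).orElse(0);
for (int i = startIndex; i < resources.size(); i++) {
  provisionResource(resources.get(i)); // process one resource
  ctx.heartbeat(i + 1);                // record progress so a retry can resume here
}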
Which Fan-Out Strategy should I choose?
It will always depend on the number of items you want to iterate over. The second thing we usually recommend is to look at your resources: how much you actually want to pay for compute power on the Worker side to process these millions or thousands of executions as fast as possible.
Each use case might be different, but generally:
- Med-to-Large (hundreds to thousands): fan-out using single Workflow and batches of Activities
- Large-to-XL (thousands to tens/hundreds of thousands): fan-out using Child Workflow approach
- XL-to-?XL (hundreds of thousands to ?): fan-out using single Activity approach
In Closing…
That’s it for the slides. I know this was a lot. But again, if you’re a customer, reach out, here are links for use case and code reviews, and for community sessions:
- Customer Use Case and Code Reviews:
  - support.temporal.io
  - t.mp/slack #support-cloud
  - Or reach out to your Temporal Sales Contact
- Community Sessions:
And speaking of Spooky Stories, another thing that’s extremely spooky is ERRORS. So check out our new free hands-on courses on Crafting an Error Handling Strategy.
Questions
Q: It seems like a large percentage of the things you look into don’t really have anything to do with Temporal at all, but are downstream services, or the database it’s talking to, etc. Is this pretty typical in your experience?
Temporal is an orchestration engine at the end of the day, and the things that you orchestrate are important to understand and will affect a lot of what you do. Also, your Workflow code and Activity code run on your premises, on your own compute, not on Temporal Cloud (Temporal does not run your code), so that is still a factor, and it’s an integral part of the Workflow executions that you run and control. So those two things combine: you’re orchestrating downstream services, but that’s what you get: a highly scalable, fast orchestration engine that’s also polyglot and gives you all the resilience to go with it. Even if you didn’t use Temporal, you would have to think about a lot of these things, so it’s not really just about Temporal. But at the same time, if you do use Temporal to orchestrate these services, it gives you all the insights: you get visibility, you get alerting power, you get logs. You saw all those different rate limits that you get out of the box, all these features that you might not have otherwise.
Q: How much of what you talked about goes away with Temporal Cloud?
We talked about a lot of the server metrics and about making sure you scale up your cluster so you don’t hit your RPS limits. If you’re on Cloud, that is done for you. So if you hit any of those limitations, or run into issues in a lot of those specific areas, you don’t have to worry about it: you open a support ticket and we will take a look and deal with it.
Another thing is that we have a lot more resources, for example the database itself. You can run into a database limitation on your on-prem deployment and start having issues; we can auto-scale a lot bigger on the Cloud side. So as your use case grows over time, it’s a lot easier to maintain that on Cloud. And I’m a big supporter: I actually support our community and I’ve helped so many customers, so I’m not coming at this from a sales perspective, I’m coming from a maintenance perspective as your use cases grow. We’ve seen some companies grow 1000%, and what used to be, say, 100 workflows a day became 100 million workflows a day. Maintaining your on-prem cluster, your databases, your version upgrades, and any issues is an additional load for you. On Cloud, I’m not saying there are no problems: you still have to tune your Workers and understand the rate limits of your downstream APIs, so maybe 80% of the things we talked about still apply on both sides. But you don’t have to worry about probably the hardest part, which is the service maintenance and growth over time, updates, database issues, data loss. So it is easier, but to each their own; at the end it’s your decision how you want to approach it.
Q: What is the “spookiest” situation you have run into as a developer success engineer? Like a bug that still haunts you or something that was particularly gnarly that led to kind of a cool outcome?
We talked about Activity timeouts, and there’s one situation I will never forget. We had a large customer a couple of years ago who ran into it, and we didn’t fully understand why yet; we didn’t know what metrics to look at. We learned a lot along the way from when Cloud started. And okay, it could just be me, because I learned over time during those three years at Temporal, too. But we had a situation similar to what we talked about, where a very large customer said, “Our workers stopped picking up activities.” And we were scrambling left and right to figure out why. It ended up being exactly this: their downstream service started blocking, their Workers stopped picking up Activity Tasks, and their Workers stopped doing anything. Basically a Worker like that is no longer alive; it’s basically out of commission. Figuring that out is why we even call them “Zombie Activities.” There’s a spooky name for you.
—
Resources
- System limits - Temporal Cloud
- Metrics:
- Certificate rotation: