Functional requirement definition

Hi there,

I’m a Business Analyst who has recently joined a squad with the goal of orchestrating a number of API calls for both downstream and upstream processes.

I’m new to Temporal, so I’d like your guidance on how to effectively document functional requirements for my development team.

I’d like a template recommendation that covers all the possible curve balls Temporal could throw at us:

  • Workflow definition

  • Workflow initialization

  • Namespacing

  • Encryption?

  • lifetime of workers

  • Activity

  • Conditional logic?

  • task queue

  • retry policy

  • What are the best practices for documenting functional requirements for workflows in Temporal?

  • Are there specific templates or formats that work best for Temporal-based projects?

  • How should I document the interactions between different API calls within Temporal workflows?

  • What details are crucial to include for both upstream and downstream API calls?

  • How should I document error handling and retry mechanisms within Temporal workflows?

  • Are there common patterns or strategies for managing failures and retries that I should be aware of?

  • What security considerations should I keep in mind when documenting requirements for Temporal workflows?

  • Are there any compliance requirements or best practices that are specific to Temporal?

Your insights on these questions would be incredibly helpful. Thank you!

@Cristian_Sequera, I’m writing from a developer perspective.

  • Workflow definition

The workflow defines the logic of what you want to do, when, and under what conditions. For example, “after a month, send the reminder email prompting the customer to sign up for pro, unless they already have”. Or, “perform the two basic kinds of background checks if the standard package was purchased, or all five background checks if the full package was purchased”.

See What are the requirements of the Background Check application? | Learn Temporal for an example of requirements.

When choosing how to divide up work between workflows, one consideration is that millions of workflows can execute in parallel but a single workflow instance can only process around 10 events per second. Thus, for example, while it would be technically possible to have a workflow representing a particular bank account balance and to have the workflow respond to signals or updates to implement all the operations on that account, this would be impractical for a busy account. Instead, it’s typically better to implement each operation being performed as a workflow. (See for example Run your first Temporal application with the TypeScript SDK | Learn Temporal, which transfers money between accounts, refunding the first account if the deposit to the second account fails).

Now that the work of the application is divided up between workflows, you need to consider the implications of having workflow executions proceed in parallel. For example, suppose I start one workflow execution to deposit $1,000 and then some seconds later start another workflow execution to withdraw $800. It might be that 99.9% of the time the deposit will complete before the withdrawal is processed. But it’s always possible that the first workflow execution is delayed, whether by worker availability, network glitches, or whatever; so on occasion the withdrawal might be attempted first. (A solution in this case might be to wait for the deposit to complete before initiating the withdrawal, which itself could be implemented as a workflow). As part of your functional requirements you need to determine when it’s important that operations occur in a particular order, so that ordering can be part of the design.

Your functional requirements should also reflect that signals to a workflow are not guaranteed to be delivered in order. A Temporal client makes a call to the Temporal SDK to send a signal, and the call successfully returns. That signal is now guaranteed to be delivered. The client then calls the SDK to send a second signal, and that call also successfully returns. The second signal is also guaranteed to be delivered. However, it’s not guaranteed that the first signal will be delivered to the workflow before the second signal.

“Deposit $1,000; Withdraw $750” may have a different outcome than “Withdraw $750; Deposit $1,000”.

Developers can sometimes be blasé about such timing considerations. “Oh, that would be quite rare,” they might say. Perhaps in practice signals would end up being delivered in order 99.9% of the time.

Well, 99.9% sounds good, but if you were doing a million operations, that would mean that the “rare” occurrence would happen 1,000 times. Or, perhaps, you’re doing something important enough that you’d prefer your system to always operate correctly, instead of just usually correctly.

If operations need to be performed in a particular order, there are ways to design for that. For example, a list of operations might be sent to the workflow in a single signal. Or, perhaps you might label the operations 1, 2, 3; and if the workflow receives 2 before 1, it would hang on to 2, wait for 1, and then process them in order.
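
The buffering approach above can be sketched as plain logic, independent of the Temporal SDK (the names here are illustrative, not SDK types): each signal carries a sequence number, and the handler holds on to any signal that arrives ahead of its turn.

```typescript
// Illustrative reordering logic for numbered signals: a signal that
// arrives before its predecessors is buffered, then drained in order.
type Operation = { seq: number; payload: string };

class OrderedProcessor {
  private nextSeq = 1;
  private buffered = new Map<number, Operation>();
  public processed: string[] = [];

  // Called for each incoming signal, possibly out of order.
  handleSignal(op: Operation): void {
    this.buffered.set(op.seq, op);
    // Drain every buffered operation whose turn has come.
    while (this.buffered.has(this.nextSeq)) {
      const next = this.buffered.get(this.nextSeq)!;
      this.buffered.delete(this.nextSeq);
      this.processed.push(next.payload);
      this.nextSeq++;
    }
  }
}
```

With this in place, delivering signal 2 (“withdraw $750”) before signal 1 (“deposit $1,000”) still results in the deposit being processed first.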

Note that unit tests are a good place to test that workflows operate correctly with such timing and ordering considerations. When QA testing a live system in staging you might never see signals being delivered out of order; but it’s easy to unit test a workflow with signals being delivered in a rare order and check that the workflow handles it correctly.

Another consideration is that each step that a workflow performs adds to the workflow’s event history, which is limited to 51,200 events or 50MB of event history data; you’ll get a warning if you hit 10,240 events or 10MB. You’ll want to design workflows to avoid getting the warning (which should alert your ops team); a common approach is to use continue-as-new.

  • Workflow initialization

Because the workflow simply implements the logic of what the application should do, workflow initialization is usually pretty simple: you specify the initial state to start with. “Customers start with no subscriptions”. “Points award starts at 1,000”.

  • Namespacing

Namespaces are used to isolate resources. For example, you might have “dev”, “staging”, and “production” namespaces. This ensures that development and testing in dev or staging won’t affect production. Thus namespaces, at least when used for this purpose, are an operational requirement rather than a functional one.

  • Encryption

If you use Temporal Cloud to run the Temporal service, workflow and activity workers still execute in your own environment. Data sent between clients, workflows, and activities will be seen by the Temporal service, but Temporal Cloud doesn’t have access to clients, workflows, or activities themselves. It is best practice to encrypt any sensitive data which is being routed through the Temporal service. See Security model - Temporal Cloud | Temporal Documentation.

  • lifetime of workers

Workers are long-lived but disposable. Any particular instance of a worker can crash or be shut down without affecting the execution of the application, as long as there are sufficient other workers available to handle the load. Workers may be intentionally shut down to reduce costs if system load has decreased, or to incrementally roll out a code update.

Workers are an operational consideration. On occasion I’ve seen developers try to implement functional requirements by restricting the execution of workers. For example, they might need something to happen one at a time, so they try having only a single worker executing. In my opinion this is generally a bad idea. Restricting worker execution means losing the advantages of availability, scaling, and being able to incrementally roll out code updates. It’s usually better to implement functional requirements in workflows.

  • Activity

Activities are where the actions of a workflow are performed, such as making API calls.

The functional requirements for the activity should document what are the normal and expected failures of the task, and what the response should be. For example, when making an API call, the call might fail because the service is temporarily down. This is a hopefully rare but normal and expected failure. The response might be to periodically retry until the API call succeeds. Perhaps if the failure continues for a long time you might want to do something further, such as returning an error to the user.

  • conditional logic

Generally conditional logic is performed in a workflow. You’ll want to specify all the possible branches of the logic.

  • task queue

Generally you’ll only need one task queue unless you have different operational requirements for workers. For example, suppose an application is making API calls and is also doing video processing. You might want to deploy as many workers as you need to make API calls but have a limit to how much video processing you can do. Putting these in different task queues means that a video processing backlog won’t delay API calls.

  • retry policy

It’s common to want to retry activities that are temporarily failing. Workflows could implement retries themselves, but this would grow the event history for each retry. To avoid this, you can specify a retry policy so that the activity is automatically retried, allowing you to set timeouts, retry interval, backoff, maximum attempts, and any errors that you need to be reported immediately and not retried. (For example, you might expect that an “invalid password” error from an external service wouldn’t be helped by retrying it a bunch of times). The workflow then gets the final result, whether the activity (eventually) succeeded or failed every retry attempt.
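
The arithmetic behind such a policy can be sketched as follows (the field names are illustrative, not the SDK’s exact types): exponential backoff capped at a maximum interval, a maximum number of attempts, and a list of error types that should never be retried.

```typescript
// Illustrative retry-policy calculation: returns the delay before the
// given attempt (1-based), or null when the policy says to give up.
interface RetryPolicy {
  initialIntervalMs: number;
  backoffCoefficient: number;
  maximumIntervalMs: number;
  maximumAttempts: number;
  nonRetryableErrors: string[];
}

function nextDelayMs(policy: RetryPolicy, attempt: number, errorType: string): number | null {
  if (policy.nonRetryableErrors.includes(errorType)) return null; // fail immediately
  if (attempt >= policy.maximumAttempts) return null; // out of attempts
  const delay = policy.initialIntervalMs * Math.pow(policy.backoffCoefficient, attempt - 1);
  return Math.min(delay, policy.maximumIntervalMs); // cap the backoff
}

const policy: RetryPolicy = {
  initialIntervalMs: 1000,
  backoffCoefficient: 2,
  maximumIntervalMs: 10000,
  maximumAttempts: 5,
  nonRetryableErrors: ["InvalidPassword"],
};
```

So a “ServiceUnavailable” error after the first attempt retries after 1 second, after the fourth attempt retries after 8 seconds, and after the fifth attempt stops; an “InvalidPassword” error fails immediately with no retry.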

  • What details are crucial to include for both upstream and downstream API calls?

For an API call being made to the application, you have a choice of communicating with the workflow using a signal, query, or update. You should document which method you’re using and why.

Along with how you want successful API calls to work, you should also document how normal and expected failures should be handled.

For example, the throughput of a single workflow instance is limited. If your design routes multiple API calls to the same workflow, do you need to handle the situation where the workflow falls behind in responding to those calls, and if so, how? For example, would you want your API call to return a “503 Service Unavailable” error?

Workflows make outgoing API calls through activities. Activities need to be idempotent: executing an activity twice needs to have the same result as executing it once. Often this will require cooperation with the API that you’re calling. For example, a payment processing API will typically allow you to send a unique token with the payment request. A second request with the same token is ignored. But note that the details depend on the API that you’re calling. You need to document, for the particular API that you’re using, how that particular API call will be made idempotent.
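
The token mechanism can be sketched from the receiving service’s point of view (the field name and behavior here are illustrative assumptions; real payment APIs vary): a retried request carrying the same token returns the original result instead of charging twice.

```typescript
// Illustrative idempotent payment endpoint: duplicate tokens are
// detected and the original payment id is returned, charging nothing.
interface PaymentRequest { idempotencyToken: string; amountCents: number }

class PaymentService {
  private seen = new Map<string, string>(); // token -> payment id
  private counter = 0;

  charge(req: PaymentRequest): { paymentId: string; duplicate: boolean } {
    const existing = this.seen.get(req.idempotencyToken);
    if (existing !== undefined) {
      // Retried request: return the original result, no second charge.
      return { paymentId: existing, duplicate: true };
    }
    const paymentId = `pay-${++this.counter}`;
    this.seen.set(req.idempotencyToken, paymentId);
    return { paymentId, duplicate: false };
  }
}
```

This is why a retried activity is safe: even if the activity runs twice, the second call is a no-op from the payment service’s perspective.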

You should also document how your workflow will respond to normal and expected failures. For example, you make an outgoing API call, but the service is temporarily down. How do you need the workflow to respond? Will you retry? For which errors does it make sense to retry, and for which would retrying likely be pointless? If the API call continues to fail, is there a point where the workflow should stop retrying and do something else, such as perhaps returning an error to the user?

Is the API that you’re calling rate limited? If so, document the limit and how the workflow should respond when it’s hit.

  • Are there common patterns or strategies for managing failures and retries that I should be aware of?

We’ve already covered retrying temporary failures. Another consideration is that if you’re doing multiple steps and a later step fails, would you need to undo an earlier step?

For example, if you’re transferring money between accounts, you withdraw money from the first account, but you’re unable to deposit to the second account, would you need to go back and refund the first account?

Or, for another example, you’re booking a trip. You reserve a hotel room and then attempt to book a flight. If you’re unable to book the flight, would you need to go back and cancel the hotel room reservation?

The saga pattern allows you to add compensation steps that are automatically run if a later step fails.
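The pattern can be sketched in a few lines (this is an illustrative sketch, not the Temporal SDK’s API): each completed step registers a compensation, and on a later failure the compensations run in reverse order.

```typescript
// Minimal saga sketch: run steps, recording how to undo each one;
// compensate() unwinds completed steps in reverse order.
type Compensation = () => void;

class Saga {
  private compensations: Compensation[] = [];

  run(step: () => void, compensation: Compensation): void {
    step(); // may throw; compensation is recorded only on success
    this.compensations.push(compensation);
  }

  compensate(): void {
    // Undo completed steps, most recent first.
    for (const c of this.compensations.reverse()) c();
    this.compensations = [];
  }
}
```

In the trip-booking example, if reserving the hotel succeeds but booking the flight throws, calling compensate() cancels the hotel reservation; the flight’s compensation never runs because that step never completed.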