Preliminary investigation into idempotent signals

Temporal recommends making activities idempotent.

An inherent concern of any distributed system is that we might make a request to an external system, but then have a crash or a network failure before we get a response telling us whether the request was received. At this point we don’t know whether the request went through or not: the crash or failure might have occurred before or after the request was received. We can retry on failures (and in general will need to), but that might cause the request to be received twice.

If the request we’re making is to pay someone $1,000, we’d rather not have that be processed twice :wink:

One way that APIs handle this, and payment APIs in particular, is to support an idempotency key. For an example, see “Idempotent requests” in the Stripe API Reference.

If the service receives another request with the same idempotency key as a previous request, the duplicate request can be simply ignored. Then we get “exactly once delivery”: we retry on failures so that a temporary failure doesn’t cause a request to be lost, but retries don’t cause any requests to be duplicated either.
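To make the idea concrete, here is a minimal sketch of how a service might honor an idempotency key. `PaymentService`, `pay`, and the receipt format are made up for illustration; this is not Stripe's implementation, just the general technique of remembering the result for each key and replaying it on a duplicate.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical payment service that remembers the result of each idempotency key.
class PaymentService {
    private final Map<String, String> receiptsByKey = new HashMap<>();
    private int paymentsProcessed = 0;

    // Processes the payment once per idempotency key. A repeated key returns the
    // stored receipt without processing the payment again.
    synchronized String pay(String idempotencyKey, String payee, long amountCents) {
        String existing = receiptsByKey.get(idempotencyKey);
        if (existing != null) {
            return existing; // duplicate request: replay the earlier result
        }
        paymentsProcessed++;
        String receipt = "receipt-" + paymentsProcessed + ":" + payee + ":" + amountCents;
        receiptsByKey.put(idempotencyKey, receipt);
        return receipt;
    }

    synchronized int paymentsProcessed() {
        return paymentsProcessed;
    }

    public static void main(String[] args) {
        PaymentService service = new PaymentService();
        String first = service.pay("key-1", "alice", 100_000);
        // The client never saw the response, so it retries with the same key.
        String retry = service.pay("key-1", "alice", 100_000);
        System.out.println(first.equals(retry));         // true
        System.out.println(service.paymentsProcessed()); // 1
    }
}
```

The client side of the bargain is to generate the key once per logical request and reuse it on every retry of that request.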

What if we’re going the other way? We have an API, and we want to offer the client of our API the opportunity to pass us an idempotency key. Suppose our API sends a signal to one of our workflows. If a second request comes in with the same idempotency key we want it to be ignored.

@maxim pointed me to api/temporal/api/workflowservice/v1/request_response.proto at b4bdd8035cd1883aa96cbad5bb0e582850feea5f · temporalio/api · GitHub. The request_id can be used to deduplicate signals.

My understanding is that the Temporal SDKs can use the request_id to automatically deduplicate retries from the same client.

I don’t think the Java SDK is doing this yet, at least as of 1.25.2, as far as I can tell.

Down in sdk-java/temporal-sdk/src/main/java/io/temporal/internal/client/RootWorkflowClientInvoker.java at master · temporalio/sdk-java · GitHub I can set the requestId, and over in the Temporal service temporal/service/frontend/workflow_handler.go at main · temporalio/temporal · GitHub I can see the request id coming through. I’m not seeing the unmodified Java SDK send a request id though.

The request id appears to be working as expected: if I send two signals with the same request id, only one signal is delivered to the workflow.

If I’m in the right place, the deduplication code appears to be here: temporal/service/history/api/signalworkflow/api.go at main · temporalio/temporal · GitHub

The request id doesn’t appear to be saved in the workflow event history; at least I’m not seeing it in the Temporal UI or with “temporal workflow show --output json --workflow-id …” In the code, the request id isn’t stored in the Workflow Execution Signaled Event, but is stored in the workflow’s “mutable state” (I’m not sure what that is): temporal/service/history/api/signalworkflow/api.go at main · temporalio/temporal · GitHub
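As a way of picturing the behavior described above, here is a toy model. It is not Temporal's code: `WorkflowModel` and its fields are invented, with a set standing in for the "mutable state" and a list standing in for the event history. It only illustrates why a duplicate is dropped while the request id never shows up in the history.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified, hypothetical model of signal deduplication by request id.
class WorkflowModel {
    // Stands in for the request ids kept in the workflow's "mutable state".
    private final Set<String> signalRequestIds = new HashSet<>();
    // Stands in for the workflow event history.
    private final List<String> history = new ArrayList<>();

    // Returns true if the signal was delivered, false if it was dropped as a duplicate.
    boolean signal(String requestId, String signalName) {
        boolean hasRequestId = requestId != null && !requestId.isEmpty();
        // Set.add returns false when the id was already present.
        if (hasRequestId && !signalRequestIds.add(requestId)) {
            return false;
        }
        // The recorded event carries the signal name but not the request id.
        history.add("WorkflowExecutionSignaled:" + signalName);
        return true;
    }

    List<String> history() {
        return history;
    }

    public static void main(String[] args) {
        WorkflowModel workflow = new WorkflowModel();
        System.out.println(workflow.signal("req-1", "approve")); // true
        System.out.println(workflow.signal("req-1", "approve")); // false
        System.out.println(workflow.history().size());           // 1
    }
}
```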

I don’t know what the preferred API would be in the client SDK to send an idempotent signal. In the Java SDK, when using an untyped workflow stub, the WorkflowStub interface has a signal method. We could perhaps add an idempotentSignal method that takes a signal name, the idempotency key, and then the signal arguments. It seems like it would be an easy task to thread that through to the internals where the request id would be set. I don’t know what we’d want to do with the typed workflow stubs to include an idempotency key option.
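Here is a rough sketch of the shape such a method could have. Everything in it is hypothetical: `IdempotentSignalStub` is not the SDK's `WorkflowStub`, `idempotentSignal` does not exist in the Temporal Java SDK, and `RecordingStub` only records what would be sent instead of calling the service.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class IdempotentSignalDemo {
    // Hypothetical interface; idempotentSignal is NOT an existing Temporal SDK method.
    interface IdempotentSignalStub {
        void signal(String signalName, Object... args);

        void idempotentSignal(String signalName, String idempotencyKey, Object... args);
    }

    // Stand-in for the SDK internals: records the request that would go to the service.
    static final class RecordingStub implements IdempotentSignalStub {
        static final class SignalRequest {
            final String signalName;
            final String requestId; // null when the caller supplied no idempotency key
            final List<Object> args;

            SignalRequest(String signalName, String requestId, List<Object> args) {
                this.signalName = signalName;
                this.requestId = requestId;
                this.args = args;
            }
        }

        final List<SignalRequest> sent = new ArrayList<>();

        @Override
        public void signal(String signalName, Object... args) {
            send(signalName, null, args);
        }

        @Override
        public void idempotentSignal(String signalName, String idempotencyKey, Object... args) {
            if (idempotencyKey == null || idempotencyKey.isEmpty()) {
                throw new IllegalArgumentException("idempotencyKey must be non-empty");
            }
            // The idempotency key is threaded through as the request id.
            send(signalName, idempotencyKey, args);
        }

        private void send(String signalName, String requestId, Object[] args) {
            sent.add(new SignalRequest(signalName, requestId, Arrays.asList(args)));
        }
    }

    public static void main(String[] args) {
        RecordingStub stub = new RecordingStub();
        stub.idempotentSignal("approve", "order-42-approve", "alice");
        System.out.println(stub.sent.get(0).requestId); // order-42-approve
    }
}
```

Keeping the key as a separate parameter (rather than part of the signal arguments) means the workflow's signal handler signature doesn't change.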

I made some minimal changes to the Java SDK to try out setting the request id when sending a signal: minimal patch to support idempotentSignal · awwx/temporal-sdk-java-idempotent-signal@29bc619 · GitHub

In my testing, it does seem to work:

  • Additional signals after the first with the same request id are ignored.
  • I can send a signal with a particular request id, shut down the Temporal service and restart, send the signal again with the same request id, and the second signal is again ignored.
  • I can send multiple signals (each with a different request id), and then the same signals again (each with the same request id as the original signal), and the second set of signals is ignored.

Originally we didn’t want to treat the request id as a business dedupe id, so that we wouldn’t have to store request ids for the duration of the workflow. The current implementation does store them, but I’m not sure we want to keep doing this.

AWS SQS for example has a deduplication interval of 5 minutes; my opinion is that this might be sufficient for network glitches but is too short for durable execution.

Suppose we have a system making API calls to the Temporal service through the Temporal SDK and our system crashes; perhaps it might need manual intervention or a bug fix to get it going again. Maybe we have 24-hour ops on call and can get it fixed in an hour or two; maybe it needs a bug fix and it takes a couple of days.

My thought is that a deduplication interval of a week is a reasonable amount of time for durable execution; enough time to roll out a bug fix if needed, but not keeping old request ids around forever.
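For illustration, a time-bounded dedupe store along these lines might look like the sketch below. `ExpiringDedupeStore` and `markIfNew` are invented names, the store is in-memory only, and nothing here reflects how the Temporal server would actually implement retention.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory dedupe store that forgets request ids after a retention period.
class ExpiringDedupeStore {
    private final Duration retention;
    private final Map<String, Instant> firstSeen = new HashMap<>();

    ExpiringDedupeStore(Duration retention) {
        this.retention = retention;
    }

    // Returns true if requestId is new as of `now` (or its earlier record has expired),
    // false if it is a duplicate seen within the retention period.
    boolean markIfNew(String requestId, Instant now) {
        // Drop every id whose retention period has elapsed.
        firstSeen.values().removeIf(seen -> !seen.plus(retention).isAfter(now));
        return firstSeen.putIfAbsent(requestId, now) == null;
    }

    int size() {
        return firstSeen.size();
    }

    public static void main(String[] args) {
        ExpiringDedupeStore store = new ExpiringDedupeStore(Duration.ofDays(7));
        Instant t0 = Instant.parse("2024-01-01T00:00:00Z");
        System.out.println(store.markIfNew("req-1", t0));                             // true
        System.out.println(store.markIfNew("req-1", t0.plus(Duration.ofDays(2))));    // false
        System.out.println(store.markIfNew("req-1", t0.plus(Duration.ofDays(8))));    // true
    }
}
```

With a one-week retention, a retry two days after a crash is still recognized as a duplicate, while ids older than a week are dropped.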

To be really useful, we would need to allow a user-provided uniqueness id that is not limited to a UUID. While this is possible, it requires a server-side change. Implementing retention for these ids adds even more complexity.

I’ll make sure that this is included in the list of improvements.

Do we actually need to remove older dedupe ids?

Some request ids are already stored in the workflow history.

The amount of data taken up by a request id is small.

If someone were really concerned about history size, they could use nanoids, which are even smaller.
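To put rough numbers on that: a UUID in its usual string form is 36 characters, while Nano ID's default is 21 characters drawn from a 64-symbol URL-safe alphabet. The sketch below is a hand-rolled nanoid-style generator for comparison only (`nanoidStyle` is a made-up helper, not the Nano ID library).

```java
import java.security.SecureRandom;
import java.util.UUID;

class IdSizes {
    // 64 URL-safe symbols, as in Nano ID's default alphabet (order here is arbitrary).
    private static final String ALPHABET =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_-";
    private static final SecureRandom RANDOM = new SecureRandom();

    // Hypothetical helper: builds a random id of the given length from ALPHABET.
    static String nanoidStyle(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHABET.charAt(RANDOM.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(UUID.randomUUID().toString().length()); // 36
        System.out.println(nanoidStyle(21).length());              // 21
    }
}
```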

It breaks deterministic replay of workflow history to delete older dedupe ids.

The request id or dedupe id used is an important part of the history to understand the operation of the system: “I sent a signal, the workflow ignored it, why?” “Look at the request id, it was the same in a previously sent signal.”

Perhaps it would be both simpler and more effective to make the request id or dedupe id part of the workflow history.

By the way, are request ids actually limited to UUIDs? In my testing I wasn’t using UUIDs and didn’t run into a problem (though there might have been some other part of the system that would have complained if a non-UUID request id had gotten to it).

Unfortunately, it is not enforced in the same way with every database. With Cassandra, it must be a UUID.
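One possible client-side workaround, which nobody in this thread has proposed or tested against Temporal, would be to derive a UUID deterministically from an arbitrary idempotency key, so that the same key always maps to the same request id. The sketch below uses the JDK's `UUID.nameUUIDFromBytes`, which produces a name-based (version 3) UUID; `RequestIds` and `fromKey` are made-up names.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

class RequestIds {
    // Hypothetical helper: maps any idempotency key to a stable, UUID-formatted request id.
    static String fromKey(String idempotencyKey) {
        byte[] bytes = idempotencyKey.getBytes(StandardCharsets.UTF_8);
        return UUID.nameUUIDFromBytes(bytes).toString();
    }

    public static void main(String[] args) {
        String a = fromKey("order-42-approve");
        String b = fromKey("order-42-approve");
        System.out.println(a.equals(b)); // true
        System.out.println(a.length());  // 36
    }
}
```

The trade-off is that the original key is no longer visible on the server side, only its UUID form.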