Signal persistence: if the worker handling a signal crashes and recovers, what is the signal resume story?

How is a signal persisted? Should a signal be acknowledged? I am wondering: if a worker crashes while handling a signal, has the signal already been removed from the queue, and is it lost?

All data inside the workflow is durable and protected against worker crashes. So if a signal is removed from the queue and assigned to a local variable, the local variable keeps it. But if your code just removes the signal from the queue and ignores its value, then it is essentially lost. In that case it is lost not because of a worker crash, but because your code chose to ignore it.
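To make the durability point concrete, here is a minimal toy sketch in plain Python. It is not the Temporal SDK: `ToyHistory` and `replay` are made-up names, and the model assumes only that signals are recorded in a durable history and that a recovered worker rebuilds local state by re-running the workflow logic over that history.

```python
from dataclasses import dataclass, field


@dataclass
class ToyHistory:
    """Hypothetical stand-in for a durable event history kept by the service."""
    events: list = field(default_factory=list)

    def append_signal(self, payload):
        # The signal is recorded durably before any worker sees it.
        self.events.append(("signal", payload))


def replay(history, keep_value=True):
    """Rebuild workflow-local state by re-running the logic over the history."""
    received = []  # plays the role of the workflow's local variable
    for kind, payload in history.events:
        if kind == "signal" and keep_value:
            received.append(payload)
        # keep_value=False models code that takes the signal and ignores it
    return received


history = ToyHistory()
history.append_signal("reboot")
history.append_signal("update-firmware")

state_before_crash = replay(history)
del state_before_crash  # worker crash: in-memory state is gone, the history is not
state_after_recovery = replay(history)
ignored = replay(history, keep_value=False)
```

In this toy model `state_after_recovery` holds both signals again after the simulated crash, while `ignored` stays empty only because the code chose to drop the values.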

The signal is posted by an external party to a workflow. I assume the receiving workflow either actively checks for the signal or is woken up to handle it. Does the signal handler run in the workflow task context, or is it treated as an activity and dispatched as a separate task to a worker (do we have signal workers)? I guess I would like to know when to model certain functionality as signal handlers vs. activities.

The workflow is woken up to process a newly received signal. How the signal is handled depends on the business logic. The handling logic can be as simple as updating a counter variable, or it can be more complex, involving activity and child workflow invocations.

I guess I would like to know when to model certain functionality as signal handlers vs activities?

An activity is a way for a workflow to act upon the external world. A signal is a way for the external world to notify a workflow about unexpected events. Think of activities as outbound and signals as inbound.

So when a workflow has to execute some action, use an activity. When a workflow needs to be notified about an external event, use a signal.
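The inbound/outbound split can be pictured with a small toy model in plain Python. Nothing here is Temporal API: `ToyOrderWorkflow`, `on_cancel_signal`, and `send_email_activity` are hypothetical names chosen only to show which direction each call goes.

```python
def send_email_activity(outbox, address):
    """Outbound: the workflow acts on the external world (hypothetical activity)."""
    outbox.append(address)
    return "sent"


class ToyOrderWorkflow:
    def __init__(self, outbox):
        self.outbox = outbox
        self.cancel_requested = False

    def on_cancel_signal(self):
        """Inbound: the external world notifies the workflow."""
        self.cancel_requested = True

    def run(self):
        if self.cancel_requested:
            return "cancelled"
        return send_email_activity(self.outbox, "user@example.com")


outbox = []
normal_result = ToyOrderWorkflow(outbox).run()

cancelled = ToyOrderWorkflow(outbox)
cancelled.on_cancel_signal()  # an external party sends the signal first
cancelled_result = cancelled.run()
```

The signal only changes workflow state; any side effect on the outside world still goes through the activity.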

Thank you, that definitely explains it very well.

Here is how I am thinking of modeling a system. We have hundreds of millions of actors of different types (IoT devices), and each could have its own workflow (life cycle).

On top of that, each of these actor types could accept a set of external commands. We could issue the commands to a single actor, a set of homogeneous actors, or a set of heterogeneous actors.

It looks like a good approach is to model each actor as a workflow instance according to its actor type, and to model the set of commands (a series of commands, referred to as an operation) that the command issuer sends to a large set of actors as signals to those actors. Ideally, we hope those commands could be executed as a “transaction”, so that it is easy to say whether my request succeeded and, if it failed, which actors failed and so on; or we may abort the processing of the operation altogether. I am thinking of modeling the series of commands, plus the compensation in case of failure, as a workflow itself as well.

Is this a good design? Are there better ways?
I do have a number of questions on this approach:

  1. How many actors could the system support (reference deployment)?
  2. How many actors (workflows) could the system create/delete per second?
    For the above two scenarios, what kind of configuration are we talking about?
  3. What would be the end-to-end latency from the time a signal is posted to the time the workflow responds to it?
  4. Similarly, how many task switches/dispatches, and what would be the configuration for that? (Do we have reference performance numbers?)
  5. Can you elaborate a little more on the signal implementation? Let us say an external party posted a signal:
    1. Where does it get stored?
    2. The workflow (active/sleeping) is delivered the signal.
      2.1 When does the active flow stop and check/run the signal handler?
      2.2 What if the flow is sleeping?
      2.3 When is the signal removed from the signal queue, and by whom?
      2.4 Can you elaborate on what happens during signal handling? What if
      2.4.1 the workflow system (like Temporal) crashed
      2.4.2 the signal handler crashed
      2.4.3 Under which context is the signal handler executed?
      (I heard of an activity worker and a workflow worker; is it one of them or something else?)

Sorry for the many questions. I am not sure how a signal is supposed to be processed reliably: no duplicate processing, and no signal lost or left unprocessed.

  1. How many actors could the system support (reference deployment)?

I believe it can go to billions. We tested up to 200 million. Given enough DB disk capacity, the limitation is not the number of open workflows but the number of state transitions the system executes.

  2. How many actors (workflows) could the system create/delete per second?
    For the above two scenarios, what kind of configuration are we talking about?

It depends on the provisioned DB and cluster capacity. 1k per second is easy to achieve. Around 10k is possible but requires a large Cassandra cluster.

  3. What would be the end-to-end latency from the time a signal is posted to the time the workflow responds to it?

If everything is provisioned correctly, it is in the hundreds of milliseconds.

  4. Similarly, how many task switches/dispatches, and what would be the configuration for that? (Do we have reference performance numbers?)

We don’t have reference performance numbers yet.

  5. Can you elaborate a little more on the signal implementation? Let us say an external party posted a signal:
    1. Where does it get stored?

It is stored in the workflow event history.

    2. The workflow (active/sleeping) is delivered the signal.
      2.1 When does the active flow stop and check/run the signal handler?

How to react to a signal is up to your business logic. You can check for it at specific points in your workflow or react to it immediately, in a different thread for example.

2.2 What if the flow is sleeping?

In such cases, the workflow usually waits on both the timer and the signal.
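As a rough illustration of waiting on both at once, here is a sketch using plain `asyncio` rather than any Temporal SDK. The helper `sleep_or_signal` is a hypothetical name, and an `asyncio.Event` stands in for the signal.

```python
import asyncio


async def sleep_or_signal(signal_event, timeout_seconds):
    """Return "signal" if the event fires first, otherwise "timer"."""
    try:
        await asyncio.wait_for(signal_event.wait(), timeout=timeout_seconds)
        return "signal"
    except asyncio.TimeoutError:
        return "timer"


async def demo():
    # Case 1: a signal arrives while the workflow is "sleeping".
    signalled = asyncio.Event()
    asyncio.get_running_loop().call_later(0.01, signalled.set)
    first = await sleep_or_signal(signalled, timeout_seconds=1.0)

    # Case 2: nobody sends a signal, so the timer wins.
    quiet = asyncio.Event()
    second = await sleep_or_signal(quiet, timeout_seconds=0.05)
    return first, second


results = asyncio.run(demo())
```

Whichever of the two completes first unblocks the wait, so a sleeping workflow does not have to finish its timer before it can react to a signal.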

2.3 When is the signal removed from the signal queue, and by whom?

It depends on the SDK. In the Java SDK, a signal invokes a workflow method in a separate thread. In the Go SDK, a signal is consumed from a signal channel by the application code.
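The difference between the two styles can be sketched as a toy model in plain Python. These classes only imitate the shape of each approach and are not the Java or Go SDK APIs; all names are hypothetical.

```python
from collections import deque


class CallbackStyleWorkflow:
    """Java-SDK-like shape: delivering a signal invokes a workflow method."""

    def __init__(self):
        self.count = 0

    def on_signal(self, value):
        self.count += value


class ChannelStyleWorkflow:
    """Go-SDK-like shape: signals queue up until application code receives them."""

    def __init__(self):
        self.channel = deque()
        self.count = 0

    def deliver(self, value):
        self.channel.append(value)  # buffered, not yet consumed

    def receive_one(self):
        # The application code decides when to take the next signal.
        self.count += self.channel.popleft()


cb = CallbackStyleWorkflow()
cb.on_signal(2)
cb.on_signal(3)

ch = ChannelStyleWorkflow()
ch.deliver(2)
ch.deliver(3)
ch.receive_one()  # only the first signal has been consumed so far
```

In the channel style, a delivered signal stays buffered until the workflow code explicitly receives it, which is why the answer to "who removes it" is the application code.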

2.4 Can you elaborate on what happens during signal handling? What if
2.4.1 the workflow system (like Temporal) crashed

Nothing besides a short delay in processing.

2.4.2 the signal handler crashed

Nothing besides a short delay in processing.

2.4.3 Under which context is the signal handler executed?
(I heard of an activity worker and a workflow worker; is it one of them or something else?)

It is executed as part of the workflow code.