When startToCloseTimeout happens: will Temporal restart the worker server on a system crash?

I’ve been digging into the retries tag, watched Maxim’s excellent YouTube video on the 4 timeouts, and read the accompanying blog post.

I’m running Temporal on a single server that handles both the Workflow and Worker.

I have a few questions:

  1. If an activity fails because scheduleToCloseTimeout is too short, does Temporal abort the activity? Or does the server keep executing it in the background?
  2. If my worker crashes:
  • a) due to a CPU spike crashing the Node.js process, or
  • b) because it hangs (e.g. an infinite loop), and scheduleToCloseTimeout is hit, will Temporal:
  • re-execute the activity entirely, or
  • just re-enqueue it to the task queue?

As I write this, I realize: the Temporal SDK doesn’t manage the Node.js process lifecycle, right?

So if the worker crashes, I need to have a recovery mechanism in place, correct?

Like a restart on system crash (if possible). This means that in more complex scenarios, like a VPS or bare metal, I need a mechanism that restarts the server and boots up the worker so it resumes polling the task queue.

Edit:

Another question. Given that Node.js is single-threaded (it runs on a single thread, although we know some operations, like certain I/O, spin off other threads):

Does Temporal know how many processes it can pull and how many activities a given server can handle?

I think the answers to these questions are: depends, and no.

  1. It depends. I know that you can set a maxConcurrentActivityTaskExecutions number, but it really doesn’t know how long a task will take.
  2. No, because that’s the server’s responsibility, and that’s why we set maxConcurrentActivityTaskExecutions.

So, if I don’t set this maxConcurrentActivityTaskExecutions, does this mean that Temporal will keep polling as long as the polling event is queued in the event loop?

I’m bumping this!

I’m not clear what you mean by this… There’s the Temporal Service, which you can self-host or use through Temporal Cloud. Then you also run workers (which always run on your own infrastructure, whether you self-host the Temporal Service or not), and there are two kinds of workers: workflow workers and activity workers. When you say that your server handles both “Workflow and Worker”, do you mean that you’re running both workflow workers and activity workers on the same server? (Which is fine.)

If an activity fails because scheduleToCloseTimeout is too short, does Temporal abort the activity? Or does the server keep executing it in the background?

The Temporal Service cancels the activity (which fundamentally means that the Temporal Service records in its database that the activity has been canceled). If the activity heartbeats (which contacts the Temporal Service), it will be notified that it has been canceled. The activity can continue to run if it doesn’t know that it’s been canceled because it doesn’t heartbeat, or it can continue to run after being notified, for example because it wants to do some cleanup. In any case, if the activity then tells the Temporal Service “I’m done! I have a result!”, the Temporal Service will ignore it, because it looks in its database and sees that the activity has already been canceled.
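To make the heartbeat point concrete, here is a minimal sketch of a long-running loop that heartbeats after each unit of work and stops once it learns it has been canceled. It deliberately has no SDK dependency: `processInChunks`, `handleChunk`, and the injected `heartbeat` and `signal` parameters are hypothetical names for illustration. In a real TypeScript activity you would presumably pass in `heartbeat` from `@temporalio/activity` and `Context.current().cancellationSignal`.

```typescript
// Hypothetical helper: `heartbeat` and `signal` are injected so this sketch
// runs without the Temporal SDK. In an activity they would come from the SDK.
export async function processInChunks(
  chunks: string[],
  heartbeat: (completed: number) => void,
  signal: AbortSignal,
): Promise<number> {
  let done = 0;
  for (const chunk of chunks) {
    // Without this check the loop would keep running after cancellation,
    // and its eventual result would simply be ignored by the service.
    if (signal.aborted) {
      throw new Error(`stopped after ${done} of ${chunks.length} chunks`);
    }
    await handleChunk(chunk);
    done += 1;
    // Report progress; heartbeating is also how cancellation gets delivered.
    heartbeat(done);
  }
  return done;
}

// Stand-in for real work (assumption: each chunk is an async operation).
async function handleChunk(_chunk: string): Promise<void> {
  await new Promise<void>((resolve) => setTimeout(resolve, 5));
}
```

The design choice here is to check for cancellation between units of work, so the activity stops at a safe point and can clean up before throwing.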

So if the worker crashes, I need to have a recovery mechanism in place, correct?

Yes, you run your workers yourself, and the workers then poll the Temporal Service for work to do. So you need to restart workers that crash (and, for example, add more workers if you need to scale up, shut down some workers if you need to scale down, etc.).
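As a rough illustration of “restart workers that crash”, here is a minimal supervisor sketch using only Node’s `child_process` module. The names `supervise`, `runOnce`, `maxRestarts`, and `delayMs` are hypothetical. In practice a process manager (systemd, PM2, a container restart policy, and so on) usually does this job; the sketch only shows the idea.

```typescript
import { spawn } from 'node:child_process';

export interface SuperviseOptions {
  maxRestarts: number; // give up after this many restarts
  delayMs: number;     // pause between restarts
}

// Runs the command, restarting it whenever it exits abnormally.
// Resolves with the total number of times the process was started.
export async function supervise(
  command: string,
  args: string[],
  opts: SuperviseOptions,
): Promise<number> {
  let starts = 0;
  while (true) {
    starts += 1;
    const code = await runOnce(command, args);
    if (code === 0) return starts;                // clean shutdown: stop supervising
    if (starts > opts.maxRestarts) return starts; // restart budget exhausted
    await new Promise<void>((resolve) => setTimeout(resolve, opts.delayMs));
  }
}

// Resolves with the exit code, or null if the process was killed by a signal.
function runOnce(command: string, args: string[]): Promise<number | null> {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args, { stdio: 'inherit' });
    child.once('error', reject); // e.g. the command could not be started
    child.once('exit', (code) => resolve(code));
  });
}
```

A call such as `supervise('node', ['worker.js'], { maxRestarts: 5, delayMs: 1000 })` (with `worker.js` standing in for your worker entry point) would bring the worker back up so it can resume polling the task queue.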

Node.js is single-threaded (it runs on a single thread, although we know some operations, like certain I/O, spin off other threads)

In your activity worker, if you’re using promises to do I/O with await etc., you’ll be fine (Node will run other code while the I/O is pending). If you’re using blocking I/O (for example, the “sync” versions of the Node fs module functions), then you’d only want to run one activity at a time per Node instance (which you can set in your worker configuration).
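For reference, a worker configuration along these lines might look like the fragment below. It is a sketch, not a tested setup: the task queue name and the `./activities` module are hypothetical, and it assumes the TypeScript SDK’s `Worker.create` with the `maxConcurrentActivityTaskExecutions` option mentioned earlier in the thread.

```typescript
import { Worker } from '@temporalio/worker';
// Hypothetical module exporting your activity functions.
import * as activities from './activities';

async function main(): Promise<void> {
  const worker = await Worker.create({
    taskQueue: 'playwright-tasks', // hypothetical task queue name
    activities,
    // Run at most one activity at a time in this Node.js process,
    // so a blocking activity cannot starve other activities here.
    maxConcurrentActivityTaskExecutions: 1,
  });
  // Polls the task queue until the worker is shut down.
  await worker.run();
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```

With a limit like this in place, throughput comes from running more worker processes rather than from more concurrency inside one process.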

Thanks awwx for the response.

I wasn’t clear with the first statement. It’s as you’ve said: I’m running the workflow and activity workers on the same server.

Thanks for the clarification. Everything makes sense now.

I’m using Temporal to manage a fleet of Playwright instances for testing and discovery.

I had crashes and hang-ups whenever a worker had to go through a long-running task.

I didn’t understand why Temporal wasn’t restarting them: that is outside its capabilities.

I will need to reengineer my approach as I have processes with unknown duration and complexity.

I don’t know if this would do what you’re looking for, but for running Playwright tests perhaps your activity worker could run “npx playwright test” in a subprocess (maybe using child_process.spawn or some such). You’d still need to restart your activity worker instance if it died, as you do for all workers, but now Playwright crashes wouldn’t crash the activity worker. For example, the activity worker could report back to the workflow “the Playwright process crashed” instead of simply failing.
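A minimal sketch of that subprocess idea, using only `child_process.spawn`, could look like this. `runIsolated`, `SubprocessResult`, and the three status labels are hypothetical names, and the mapping assumes that a nonzero exit code means the tests failed while a signal or a spawn error means the process crashed or was killed on timeout.

```typescript
import { spawn } from 'node:child_process';

export interface SubprocessResult {
  status: 'passed' | 'failed' | 'crashed';
  exitCode: number | null;
  signal: string | null;
}

// Runs a command in its own process and reports how it ended instead of throwing,
// so a crash or hang in the child does not take the calling process down with it.
export function runIsolated(
  command: string,
  args: string[],
  timeoutMs: number,
): Promise<SubprocessResult> {
  return new Promise((resolve) => {
    const child = spawn(command, args, { stdio: 'ignore' });
    // Kill the child if it hangs past the deadline.
    const timer = setTimeout(() => child.kill('SIGKILL'), timeoutMs);

    child.once('error', () => {
      // The command could not be started at all.
      clearTimeout(timer);
      resolve({ status: 'crashed', exitCode: null, signal: null });
    });
    child.once('exit', (code, signal) => {
      clearTimeout(timer);
      // A null exit code means the process was terminated by a signal.
      const status = code === 0 ? 'passed' : code === null ? 'crashed' : 'failed';
      resolve({ status, exitCode: code, signal });
    });
  });
}
```

An activity could then call something like `runIsolated('npx', ['playwright', 'test'], 10 * 60_000)` and return the result to the workflow, which can decide whether a “crashed” run should be retried.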