Inconsistent Sleep Durations in Temporal Workflows

Hello Temporal community,

I’m working on a Temporal workflow, and I’ve noticed some inconsistencies in sleep durations when I use different methods for introducing delays. Specifically, I’m expecting a sleep duration of 33 milliseconds, but the actual sleep times sometimes vary significantly.

Here’s a brief summary of what I’ve observed:

  1. I’m using workflow.Sleep(ctx, 33*time.Millisecond) to introduce a delay, but I’m seeing sleep times that occasionally deviate significantly from the expected 33 milliseconds.
  2. I’ve also tried workflow.AwaitWithTimeout(ctx, 33*time.Microsecond) and workflow.Await with a timer, but I still encounter similar inconsistencies in sleep durations.

Here are some code snippets:

for {
	start := workflow.Now(ctx)
	fmt.Printf("[%s] sleeping start\n", start)
	workflow.Sleep(ctx, 33*time.Millisecond)
	end := workflow.Now(ctx)
	fmt.Printf("[%s] done sleeping (took %v ms)\n", end, end.Sub(start).Milliseconds())
}

Output:
[2023-11-13 04:42:32.960026124 +0000 UTC] sleeping start
[2023-11-13 04:42:33.003890779 +0000 UTC] done sleeping (took 43 ms)
[2023-11-13 04:42:33.003890779 +0000 UTC] sleeping start
[2023-11-13 04:42:34.006123603 +0000 UTC] done sleeping (took 1002 ms)
[2023-11-13 04:42:34.006123603 +0000 UTC] sleeping start
[2023-11-13 04:42:35.010955423 +0000 UTC] done sleeping (took 1004 ms)
[2023-11-13 04:42:35.010955423 +0000 UTC] sleeping start
[2023-11-13 04:42:36.007020711 +0000 UTC] done sleeping (took 996 ms)
[2023-11-13 04:42:36.007020711 +0000 UTC] sleeping start
[2023-11-13 04:42:37.009073729 +0000 UTC] done sleeping (took 1002 ms)

func (a *DialerActivities) PreventCpsThrottle(ctx context.Context, cpsLimit int) error {
	timeToSleep := time.Second / time.Duration(cpsLimit)
	time.Sleep(timeToSleep)

	return nil
}

for {
	start := workflow.Now(ctx)
	_ = workflow.ExecuteActivity(ctx, activity.Dialer.PreventCpsThrottle, 30).Get(ctx, nil)
	end := workflow.Now(ctx)
	fmt.Printf("[%s] done sleeping (took %v ms)\n", end, end.Sub(start).Milliseconds())
}

Output:
[2023-11-13 04:43:53.361941691 +0000 UTC] sleeping start
[2023-11-13 04:43:53.409735008 +0000 UTC] done sleeping (took 47 ms)
[2023-11-13 04:43:53.409735008 +0000 UTC] sleeping start
[2023-11-13 04:43:53.491488656 +0000 UTC] done sleeping (took 81 ms)
[2023-11-13 04:43:53.491488656 +0000 UTC] sleeping start
[2023-11-13 04:43:53.591618227 +0000 UTC] done sleeping (took 100 ms)
[2023-11-13 04:43:53.591618227 +0000 UTC] sleeping start
[2023-11-13 04:43:53.691441616 +0000 UTC] done sleeping (took 99 ms)
[2023-11-13 04:43:53.691441616 +0000 UTC] sleeping start
[2023-11-13 04:43:53.79112727 +0000 UTC] done sleeping (took 99 ms)

for i := 0; i < 5; i++ {
	start := workflow.Now(ctx)
	fmt.Printf("[%s] sleeping start\n", start)

	timer := workflow.NewTimer(ctx, time.Duration(33*time.Microsecond))
	workflow.Await(ctx, func() bool {
		return timer.IsReady()
	})

	end := workflow.Now(ctx)
	fmt.Printf("[%s] done sleeping (took %v ms)\n", end, end.Sub(start).Milliseconds())
}

Output:
[2023-11-13 05:03:26.887772415 +0000 UTC] sleeping start
[2023-11-13 05:03:26.906969428 +0000 UTC] done sleeping (took 19 ms)
[2023-11-13 05:03:26.906969428 +0000 UTC] sleeping start
[2023-11-13 05:03:27.905998517 +0000 UTC] done sleeping (took 999 ms)
[2023-11-13 05:03:27.905998517 +0000 UTC] sleeping start
[2023-11-13 05:03:28.907473299 +0000 UTC] done sleeping (took 1001 ms)
[2023-11-13 05:03:28.907473299 +0000 UTC] sleeping start
[2023-11-13 05:03:29.907984135 +0000 UTC] done sleeping (took 1000 ms)
[2023-11-13 05:03:29.907984135 +0000 UTC] sleeping start
[2023-11-13 05:03:30.909456751 +0000 UTC] done sleeping (took 1001 ms)

I’m curious about why this might be happening and if there are any insights or best practices for achieving more consistent timing in Temporal workflows. Is there something I should be aware of regarding how Temporal handles sleep and timers that could explain these variations?

Any guidance or suggestions would be greatly appreciated. Thank you for your help!

Hi @PierreVdw, thanks for your post. Yes, I’m able to reproduce this behaviour. Let me dig a little deeper and circle back. Cheers!

Hi @taonic, thank you. I look forward to your feedback.

Hi @PierreVdw sorry about the wait.

After some conversations with our team, I was made aware that sub-second timer (your 1st and 3rd scenario) cannot be achieved reliably due to some implementation details at the queue level. Besides, given how Activity and Workflow Tasks are scheduled, polled, executed and responded between the worker and server there will always be some level of network latency contribute to the overall sleeping time as observed in your 2nd scenario.

To manage rate limit, we typically recommend setting TaskQueueActivitiesPerSecond. You can also refer to other posts for more context on this topic.

To expand on @taonic’s answer. The latency between workflow steps depends on a specific deployment of the Temporal cluster. Your latencies are on the high side. The same test executed against the Temporal cloud would give much lower latencies. But they it still be higher than the 30ms sleep.

Hi @taonic and @maxim,

We had a hunch that this would be due to how activities and workflows work behind the scenes as well as the possible network latency. Just thought we’d post here to confirm.

Thank you kindly for the feedback and time you took to investigate this topic.