Is it a good idea to use Temporal to manage provisioning of its own workers and infrastructure?

I am in the process of implementing a new workflow system, and I am considering using Temporal to manage the provisioning of both the workers and the infrastructure they run on. However, I have been advised against it, and I would like to ask the community for their thoughts on this matter.

My initial thoughts were that since Temporal is designed to handle workflows, it would be a good fit for managing the provisioning of workers and the infrastructure they run on. I was considering using workflows and activities to drive tools like Terraform, Waypoint and other infrastructure provisioning tools.

However, I have been advised against using Temporal in this way for a few reasons, mainly that it is generally not recommended to use a system to manage itself.

I would like to know the community’s thoughts on this. Have you used Temporal to manage the provisioning of workers and infrastructure? What has been your experience? Do you have any advice or best practices to share?

I want to clarify my thought process a bit more. I have already created manager workflows to handle the creation of workflows through signals sent by our CLI or API. My idea is that, since Temporal allows us to code workflows as if failure doesn’t exist, even starting a workflow or a worker should be an activity. Starting these processes manually can introduce potential failure points, race conditions, and duplication. Instead, using Temporal’s activities to manage the provisioning of workers and infrastructure could provide a more seamless and reliable solution.
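To make the duplication concern concrete, here is a minimal sketch in plain Go (no Temporal SDK) of what "start a worker at most once" could look like inside an activity. `Provisioner`, `EnsureWorker` and the worker ID are hypothetical names for illustration only; a real activity would call out to Terraform or a similar tool instead of updating a map.

```go
package main

import (
	"fmt"
	"sync"
)

// Provisioner is a hypothetical stand-in for an activity that starts workers.
// It remembers which worker IDs were already started so that retries and
// concurrent requests become no-ops.
type Provisioner struct {
	mu      sync.Mutex
	started map[string]bool
}

// NewProvisioner returns an empty Provisioner.
func NewProvisioner() *Provisioner {
	return &Provisioner{started: make(map[string]bool)}
}

// EnsureWorker starts the worker identified by id at most once.
// It returns true if this call started it and false if it already existed.
func (p *Provisioner) EnsureWorker(id string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.started[id] {
		return false
	}
	p.started[id] = true
	return true
}

// Count reports how many distinct workers have been started.
func (p *Provisioner) Count() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	return len(p.started)
}

func main() {
	p := NewProvisioner()
	var wg sync.WaitGroup
	// Ten concurrent requests for the same worker ID.
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			p.EnsureWorker("campaign-worker")
		}()
	}
	wg.Wait()
	fmt.Println("workers started:", p.Count())
}
```

The point is only that the start operation is keyed by an ID and is safe to repeat, which is the property I would want from an activity that Temporal may retry.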

This approach makes a lot of sense. Using Temporal for infrastructure provisioning is a very common use case. Companies like Datadog and HashiCorp rely on it.

Thank you for your guidance.

Does it make sense to add a future to the selector from within the receive callback function? I had assumed that not doing so would block the callback, but I’m not sure whether it is an anti-pattern.

Is there an issue with accessing the campaign variable from within the future function?

	workflow.Go(ctx, func(ctx workflow.Context) {
		for {
			selector := workflow.NewSelector(ctx)
			selector.AddReceive(
				workflow.GetSignalChannel(ctx, CampaignCreateSignal),
				func(c workflow.ReceiveChannel, _ bool) {
					var campaign Campaign
					c.Receive(ctx, &campaign)
					log := workflow.GetLogger(ctx)

					var childID string
					if err := workflow.SideEffect(ctx, func(ctx workflow.Context) interface{} {
						return fmt.Sprintf("campaign:%v", uuid.New().String())
					}).Get(&childID); err != nil {
						return
					}

					future := workflow.ExecuteChildWorkflow(
						workflow.WithChildOptions(ctx, workflow.ChildWorkflowOptions{
							WorkflowID: childID,
							TaskQueue:  CampaignTaskQueue,
						}),
						CampaignWorkflow,
						campaign,
					)
					log.Info("Creating campaign workflow", "name", campaign.Name)

					selector.AddFuture(future, func(f workflow.Future) {
						name := campaign.Name
						log := workflow.GetLogger(ctx)
						log.Info("Getting campaign workflow", "name", name)
						err := f.Get(ctx, nil)
						if err != nil {
							log.Error("Campaign creation failed", "name", name, "error", err)
							return
						}
					})
				})
			selector.Select(ctx)
		}
	})

Check out the Await Signals sample for a simpler pattern.
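On the variable question: `campaign` is declared inside the receive callback, so every invocation of that callback gets its own variable, and the future callback captures that specific one. The sketch below shows this with plain Go closures and no Temporal SDK; `handle` is a hypothetical stand-in for the receive callback, and the campaign names are made up. It only illustrates Go's closure semantics and says nothing about how the selector schedules callbacks.

```go
package main

import "fmt"

// Campaign mirrors only the Name field used in the snippet above.
type Campaign struct {
	Name string
}

// handle stands in for the receive callback: it declares a fresh campaign
// variable on every call and returns a closure that reads it later, the way
// the callback passed to AddFuture does.
func handle(name string) func() string {
	var campaign Campaign
	campaign.Name = name
	return func() string {
		return campaign.Name
	}
}

func main() {
	// Register two deferred callbacks before running either of them.
	callbacks := []func() string{handle("spring-sale"), handle("summer-sale")}
	for _, cb := range callbacks {
		// Each closure still sees the campaign it was created with.
		fmt.Println(cb())
	}
}
```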

I am not sure whether you are talking about workflows that handle (create, maintain, update) the Temporal infrastructure itself, driven by a Temporal workflow?

I was also thinking about such a setup, but came to the conclusion that I will run two instances of Temporal, to avoid a “deadlock” if something fails during the rollout of a change:

1. A small Temporal instance that only manages the infrastructure and workers of the second instance.
2. A bigger Temporal cluster that runs my actual workload (managing infrastructure for other IT systems) and also manages the infrastructure and workers of the first instance.

So an update or change to our overall Temporal setup would start on the first instance, which is managed by the second. Once that update is complete, the workflows that run on the first instance roll out the change to the second instance.

By the way: the workers of instance 2 that are managed by instance 1 are only those needed for managing instance 1 (so the reverse direction). Every additional worker needed to manage foreign infrastructure is managed by instance 2 itself or, in some cases, manually.
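To check my own reasoning, I wrote the rollout order down as a small plain Go sketch (no Temporal SDK). All names here (`Instance`, `Step`, `rolloutPlan`, "small", "big") are hypothetical; the sketch only models "who updates whom, in which order" and rejects a setup where an instance manages itself.

```go
package main

import "fmt"

// Instance is a hypothetical model of one Temporal installation.
type Instance struct {
	Name      string
	ManagedBy string // instance whose workflows provision this one
}

// Step is one rollout step: Driver applies the change to Target.
type Step struct {
	Target string
	Driver string
}

// rolloutPlan starts at the instance named first and then follows the
// "manages" relation: once an instance is updated, it updates the
// instance it manages. Self-managed instances are rejected.
func rolloutPlan(instances []Instance, first string) ([]Step, error) {
	byName := make(map[string]Instance)
	for _, in := range instances {
		if in.ManagedBy == in.Name {
			return nil, fmt.Errorf("instance %q manages itself", in.Name)
		}
		byName[in.Name] = in
	}

	var plan []Step
	done := make(map[string]bool)
	current, ok := byName[first]
	for ok && !done[current.Name] {
		plan = append(plan, Step{Target: current.Name, Driver: current.ManagedBy})
		done[current.Name] = true

		// Next target: an instance managed by the one just updated.
		ok = false
		for _, in := range instances {
			if in.ManagedBy == current.Name && !done[in.Name] {
				current, ok = in, true
				break
			}
		}
	}
	return plan, nil
}

func main() {
	instances := []Instance{
		{Name: "small", ManagedBy: "big"},
		{Name: "big", ManagedBy: "small"},
	}
	plan, err := rolloutPlan(instances, "small")
	if err != nil {
		fmt.Println("invalid setup:", err)
		return
	}
	for i, s := range plan {
		fmt.Printf("step %d: %s updates %s\n", i+1, s.Driver, s.Target)
	}
}
```

With the two instances above, the plan has two steps: the big cluster updates the small instance first, and then the small instance updates the big cluster.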

Now I again feel like a character in the movie Inception. Or like Baron Münchhausen in the swamp scene.

Any thoughts on this are much appreciated.