RuntimeError: Failed validating workflow <workflow name>

sourcetransformer · March 22, 2023, 7:45pm

Hi there, I have an asynchronous python worker working - calling a very simple pytorch python script and running it.

However, I’ve also provided a synchronous worker. I was able to get my synchronous worker working eventually by bypassing the sandbox validation on the import of my python script which has a dependency on pytorch.

While setting up the synchronous worker - I noticed two things:

1. It seems that in the case of a synchronous worker - where you provide “activity_executor” (AKA runner) - the sandbox validation step is triggered. Is it intended that this validation only occurs for synchronous workers that provide an activity_executor? I was a bit surprised that I didn’t run into this issue in either of the asynchronous workers I created (which import the same python script with a dependency on pytorch).
2. The sandbox validation code seems to fail on circular dependencies when it encounters the same instance of a docstring - I presume the validation code just needs to deal with cycles in dependencies or is this “as designed”?

I’m not certain why this error is happening - possibly the result of some kind of a circular dependency in pytorch - but as I say - I’m also not certain why this only crops up in the synchronous worker. It appears to be due to the following validation step:

class _WorkflowWorker:
    def __init__(
        self,
[SNIP]
    ) -> None:
[SNIP] 
            # Prepare the workflow with the runner (this will error in the
            # sandbox if an import fails somehow)
            try:
                if defn.sandboxed:
                    workflow_runner.prepare_workflow(defn)
                else:
                    unsandboxed_workflow_runner.prepare_workflow(defn)
            except Exception as err:
                raise RuntimeError(f"Failed validating workflow {defn.name}") from err
            self._workflows[defn.name] = defn

As I say - I’m able to execute the same python script from within my asynchronous temporal worker - the difference appears to be that the synchronous worker needs to provide an “activity_executor” (AKA runner). I did try testing whether I could repro the same error in my asynchronous worker by unnecessarily providing an activity_executor - but I wasn’t able to get the same error. The callstack that points to the above code is here:

Traceback (most recent call last):
  File "/workspaces/go-run-ml/python/run_ml_worker_manager2.py", line 126, in <module>
    asyncio.run(main2())
  File "/usr/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.9/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "/workspaces/go-run-ml/python/run_ml_worker_manager2.py", line 98, in main2
    worker = Worker(
  File "/usr/local/lib/python3.9/dist-packages/temporalio/worker/_worker.py", line 263, in __init__
    self._workflow_worker = _WorkflowWorker(
  File "/usr/local/lib/python3.9/dist-packages/temporalio/worker/_workflow.py", line 112, in __init__
    raise RuntimeError(f"Failed validating workflow {defn.name}") from err
RuntimeError: Failed validating workflow MachineLearningWorkflow

The RuntimeError above is chained from (raised “from”) the following underlying error, which occurs while the sandbox runs the imports (I presume the code is traversing all imports to perform its validation):

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/temporalio/worker/_workflow.py", line 108, in __init__
    workflow_runner.prepare_workflow(defn)
  File "/usr/local/lib/python3.9/dist-packages/temporalio/worker/workflow_sandbox/_runner.py", line 53, in prepare_workflow
    self.create_instance(
  File "/usr/local/lib/python3.9/dist-packages/temporalio/worker/workflow_sandbox/_runner.py", line 87, in create_instance
    return _Instance(det, self._runner_class, self._restrictions)
  File "/usr/local/lib/python3.9/dist-packages/temporalio/worker/workflow_sandbox/_runner.py", line 107, in __init__
    self._create_instance()
  File "/usr/local/lib/python3.9/dist-packages/temporalio/worker/workflow_sandbox/_runner.py", line 118, in _create_instance
    self._run_code(
  File "/usr/local/lib/python3.9/dist-packages/temporalio/worker/workflow_sandbox/_runner.py", line 160, in _run_code
    exec(code, self.globals_and_locals, self.globals_and_locals)
  File "<string>", line 2, in <module>
  File "/usr/local/lib/python3.9/dist-packages/temporalio/worker/workflow_sandbox/_importer.py", line 441, in __call__
    return self.current(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/temporalio/worker/workflow_sandbox/_importer.py", line 232, in _import
    new_spec.loader.exec_module(new_mod)
  File "<frozen importlib._bootstrap_external>", line 790, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/workspaces/go-run-ml/python/run_ml_worker_manager2.py", line 22, in <module>
    from ml_activity import MachineLearningActivity
  File "/usr/local/lib/python3.9/dist-packages/temporalio/worker/workflow_sandbox/_importer.py", line 441, in __call__
    return self.current(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/temporalio/worker/workflow_sandbox/_importer.py", line 234, in _import
    mod = importlib.__import__(name, globals, locals, fromlist, level)
  File "<frozen importlib._bootstrap>", line 1109, in __import__
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 790, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/workspaces/go-run-ml/python/ml_activity.py", line 4, in <module>
    import ml.pytorch.char_rnn2.CharRnnTrain as CharRnnTrain
  File "/usr/local/lib/python3.9/dist-packages/temporalio/worker/workflow_sandbox/_importer.py", line 441, in __call__
    return self.current(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/temporalio/worker/workflow_sandbox/_importer.py", line 234, in _import
    mod = importlib.__import__(name, globals, locals, fromlist, level)
  File "<frozen importlib._bootstrap>", line 1109, in __import__
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 790, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/workspaces/go-run-ml/python/ml/pytorch/char_rnn2/CharRnnTrain.py", line 4, in <module>
    import torch
  File "/usr/local/lib/python3.9/dist-packages/temporalio/worker/workflow_sandbox/_importer.py", line 441, in __call__
    return self.current(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/temporalio/worker/workflow_sandbox/_importer.py", line 234, in _import
    mod = importlib.__import__(name, globals, locals, fromlist, level)
  File "<frozen importlib._bootstrap>", line 1109, in __import__
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 790, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/usr/local/lib/python3.9/dist-packages/torch/__init__.py", line 675, in <module>
    from ._tensor import Tensor
  File "/usr/local/lib/python3.9/dist-packages/temporalio/worker/workflow_sandbox/_importer.py", line 441, in __call__
    return self.current(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/temporalio/worker/workflow_sandbox/_importer.py", line 234, in _import
    mod = importlib.__import__(name, globals, locals, fromlist, level)
  File "<frozen importlib._bootstrap>", line 1113, in __import__
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 790, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/usr/local/lib/python3.9/dist-packages/torch/_tensor.py", line 21, in <module>
    from torch.overrides import (
  File "/usr/local/lib/python3.9/dist-packages/temporalio/worker/workflow_sandbox/_importer.py", line 441, in __call__
    return self.current(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/temporalio/worker/workflow_sandbox/_importer.py", line 234, in _import
    mod = importlib.__import__(name, globals, locals, fromlist, level)
  File "<frozen importlib._bootstrap>", line 1109, in __import__
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 790, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/usr/local/lib/python3.9/dist-packages/torch/overrides.py", line 1548, in <module>
    has_torch_function = _add_docstr(
RuntimeError: function '_has_torch_function' already has a docstring
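
The two tracebacks above are linked because the worker code uses `raise ... from err`. As a minimal sketch of that chaining (the function names `prepare_workflow`, `register` and `root_cause` here are hypothetical stand-ins, not Temporal's API), the original error stays reachable through `__cause__`:

```python
def prepare_workflow(name: str) -> None:
    """Hypothetical stand-in for a validation step that fails during import."""
    raise RuntimeError("function '_has_torch_function' already has a docstring")


def register(name: str) -> None:
    try:
        prepare_workflow(name)
    except Exception as err:
        # Same chaining form as the worker code quoted earlier in this post.
        raise RuntimeError(f"Failed validating workflow {name}") from err


def root_cause(exc: BaseException) -> BaseException:
    """Follow __cause__ links down to the original exception."""
    while exc.__cause__ is not None:
        exc = exc.__cause__
    return exc


try:
    register("MachineLearningWorkflow")
except RuntimeError as outer:
    outer_message = str(outer)
    cause_message = str(root_cause(outer))

print(outer_message)
print(cause_message)
```

So when reading a "Failed validating workflow" error, the traceback of the chained cause is usually the one that names the real problem.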

I found a way to manage this when I read through the documentation here:

I initially tried dealing with this by using this argument to the Worker:

        workflow_runner=SandboxedWorkflowRunner(
            restrictions=SandboxRestrictions.default.with_passthrough_modules("torch")
        ),

But, this causes the error:

Cannot access pathlib.Path.mkdir.call from inside a workflow. If this is code from a module not used in a workflow or known to only be used deterministically from a workflow, mark the import as pass through.

So I just imported my python script that has a dependency on pytorch like this:

with workflow.unsafe.imports_passed_through():
    import ml.pytorch.char_rnn2.CharRnnTrain as CharRnnTrain

and it seems to be working now.

Chad_Retz · March 22, 2023, 9:08pm

Sandbox validation occurs on every file a workflow is in by default. You should make sure to pass through any third party modules (or even better, make your workflow a separate module/file from the rest of your code and pass through imports for the activities that you reference in the workflows).

Can you provide a complete standalone replication? Also are you passing through imports for non-stdlib/non-temporalio modules in files that contain workflows? This would include libraries like Pytorch and others. We strongly recommend you do this so they are not loaded in the sandbox. See related README sections here and here.

Yes! This is the recommended way to import third party dependencies. You can also make your workflows separate files and only import the limited things you need (still passing them through).

sourcetransformer · March 22, 2023, 11:36pm

Can you provide a complete standalone replication? Also are you passing through imports for non-stdlib/non-temporalio modules in files that contain workflows? This would include libraries like Pytorch and others. We strongly recommend you do this so they are not loaded in the sandbox. See related README sections here (had to delete one link because new users can only post two links) and here.

Sure - I created a branch from a fork here:
https://github.com/source-transformer/temporal-samples-python/tree/issues/tpc/sandboxImportTraversalFailureOnPytorch

If you go to the “hello” directory and run:

python3 hello_async_activity_completion.py

you’ll see the error. Be sure to install “torch” using pip, pip3, conda or whatever you use to manage your python packages.

You also mentioned that sandbox validation occurs on every file a workflow is in by default. It appears to be more “aggressive” than just files that contain your workflow - since the following is the entirety of the file containing my activity that generated the error (before I updated the import):

import asyncio
from temporalio import activity

import torch

@activity.defn
def MachineLearningActivity(name: str) -> str:

    return f"Hello, {name}!"

I’m not sure if workflow is imported indirectly/transitively when you import activity - so I’m not sure if this still fits in with your description (i.e. as intended/designed).

Thanks again!

Chad_Retz · March 23, 2023, 12:10pm

You should change the import to:

with workflow.unsafe.imports_passed_through():
    import torch

We import the workflow file in the sandbox very simply. Anything that is not passed through from outside the sandbox is subject to the same import recursively (and the importer is custom written with many caveats to protect workflows). Standard library and temporalio library imports are already passed through by default, but we recommend marking any others as pass through (we don’t pass through all by default because some workflows may span multiple files and we want to be safe by default).
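
To make the difference between a passed-through import and a sandboxed import concrete, here is a toy model using only the standard library. It is not Temporal's importer: `toy_sandbox_import` and its parameters are hypothetical names, and unlike the real importer it does not intercept imports nested inside the loaded module. It only illustrates that a passed-through module is the host's existing module object, while anything else gets its module body executed again in a fresh object.

```python
import colorsys  # the "host" copy, loaded the normal way
import importlib
import importlib.util
from types import ModuleType
from typing import Dict, Set


def toy_sandbox_import(
    name: str, passthrough: Set[str], sandbox_modules: Dict[str, ModuleType]
) -> ModuleType:
    """Toy model: reuse the host module if passed through, else load a fresh copy."""
    if name.split(".")[0] in passthrough:
        # Passed through: hand back the module the host process already has.
        return importlib.import_module(name)
    if name not in sandbox_modules:
        spec = importlib.util.find_spec(name)
        if spec is None or spec.loader is None:
            raise ImportError(f"cannot load {name}")
        module = importlib.util.module_from_spec(spec)
        sandbox_modules[name] = module
        # Not passed through: the module body is executed a second time.
        spec.loader.exec_module(module)
    return sandbox_modules[name]


reloaded = toy_sandbox_import("colorsys", set(), {})
passed = toy_sandbox_import("colorsys", {"colorsys"}, {})

is_fresh_copy = reloaded is not colorsys
is_host_copy = passed is colorsys
print(is_fresh_copy, is_host_copy)
```

A pure-Python module such as `colorsys` tolerates being executed twice; libraries with native extensions may not, which is one reason to pass them through.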

Also, those “hello” files are demonstrations. Two things to note - first, for anything non-trivial, workflows are best as separate files/modules that just reference activities instead of in the same file. The reason is we run the entire workflow file/module in a sandbox, so you might as well isolate it too. We should probably update our samples here. Second, async activity completion is rarely needed. I would strongly suggest not using it unless you are sure you need it. Long running activities are fine, as are activities that start something, complete immediately, then send a signal back to the workflow upon completion of some external thing.

sourcetransformer · March 23, 2023, 12:47pm

Ah - sorry Chad - I misunderstood when you wrote this:

and this:

I took that to mean that a file containing your workflow and a file containing your activity that import the same module would be “treated differently” - so I thought you were saying I could import the third party module without the “passed_through” technique if I separated it out. The “passed_through” method works for me - I had just thought you were suggesting another way I could import the third party module without using the “passed_through” method.

My activity needs to reference the third party modules - since the activity will be kicking off the long running process.

I need to run a few more tests - but I’m now able to pass a generic callback to a synchronous long running python script with a dependency on pytorch now. I’ll write a couple of similar tests for the asynchronous activity later today.

As for async activity completion - I personally like this pattern better - but I was asked to provide both a synchronous and asynchronous worker (technically activity) to ensure I support both styles of underlying python script.

Also - I got a bit sidetracked with the other topics that were raised - but the original reason I created this post was not to find a workaround: I already had a way to avoid the error when importing pytorch, using the “passed_through” technique. What I actually wanted to check is whether the error that the “sandbox validation” reported is valid.

I.E. it seems as if the error I mentioned is a result of the sandbox validation not being able to handle cycles:

  File "/usr/local/lib/python3.9/dist-packages/torch/overrides.py", line 1548, in <module>
    has_torch_function = _add_docstr(
RuntimeError: function '_has_torch_function' already has a docstring

Thanks again!

Chad_Retz · March 23, 2023, 10:32pm

If you do use this pattern, you may want to make sure you heartbeat (as we recommend for all long-running activities).
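
One way to keep heartbeating separate from the long-running script is to pass the script a progress callback, as a rough sketch under assumptions: `run_training` and `on_progress` are hypothetical names, and the loop body is a placeholder for real work. Inside an actual activity the callback would presumably be the Python SDK's `activity.heartbeat`; here a plain list is used so the example runs without Temporal installed.

```python
from typing import Callable, List


def run_training(epochs: int, on_progress: Callable[[int], None]) -> float:
    """Hypothetical long-running job that reports progress after each epoch."""
    loss = 1.0
    for epoch in range(1, epochs + 1):
        loss *= 0.5  # placeholder for one epoch of real training work
        on_progress(epoch)  # in an activity, this is where a heartbeat would go
    return loss


# Stand-in for a heartbeat: record each progress report in a list.
reported: List[int] = []
final_loss = run_training(3, reported.append)
print(reported, final_loss)
```

The script stays free of any Temporal import, so the same function can be driven from either the synchronous or the asynchronous activity.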

It may be valid; I would have to debug into why our sandboxed importer can’t import it (there are some known issues). The importer can handle cycles, but it’s likely one of your libraries doesn’t support being reloaded. This is similar to a numpy bug because numpy didn’t support reloading. I bet pytorch has the same problem (here’s the only issue I found about it).

But we recommend passing through the module anyways like you’re doing so that these modules don’t have to be reloaded.
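
A simplified model of why a reload can produce an "already has a docstring" error, assuming the docstring state lives somewhere that survives the creation of a new Python module object (for example in a native extension). Everything here is a hypothetical stand-in: `add_docstr`, `_native_docstrings` and `load_fresh_module` are not pytorch's or Temporal's code.

```python
import types

# Stand-in for state that outlives any single Python-level module object.
_native_docstrings = {}


def add_docstr(name: str, doc: str) -> None:
    """Hypothetical one-shot docstring setter that refuses a second call."""
    if name in _native_docstrings:
        raise RuntimeError(f"function '{name}' already has a docstring")
    _native_docstrings[name] = doc


# Module body that registers a docstring as a side effect of being executed.
MODULE_SOURCE = "add_docstr('_has_torch_function', 'Check for overrides.')\n"


def load_fresh_module(name: str) -> types.ModuleType:
    """Execute the module body in a brand-new module object, like a fresh import."""
    mod = types.ModuleType(name)
    mod.add_docstr = add_docstr
    exec(MODULE_SOURCE, mod.__dict__)
    return mod


first = load_fresh_module("fake_overrides")  # first execution succeeds
try:
    load_fresh_module("fake_overrides")  # second execution of the same body
    second_error = None
except RuntimeError as err:
    second_error = str(err)

print(second_error)
```

There is no import cycle in this model; the failure comes purely from running the same module body twice against shared state, which is what passing the module through avoids.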
