Cannot schedule workflow with the same ID

Hey, we have deployed a Temporal cluster to Kubernetes using the Temporal Helm Charts + Cassandra, with default namespace params.

Workflow execution is generally working - workflows complete successfully - but when I try to schedule one more workflow with the same workflow ID, the TypeScript SDK just hangs without any errors (I even tried explicitly passing workflowIdReusePolicy to allow duplicates - same result).
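
For reference, this is roughly how I start it from the TypeScript SDK (a trimmed-down sketch; the address is a placeholder and exact import paths/names may differ slightly between SDK versions):

import { Connection, WorkflowClient, WorkflowIdReusePolicy } from '@temporalio/client';

async function run() {
  // Connect to the cluster frontend (address is a placeholder)
  const connection = await Connection.connect({ address: 'temporal-frontend:7233' });
  const client = new WorkflowClient({ connection });

  // Start by workflow type name, with a fixed workflow ID and an explicit reuse policy
  const handle = await client.start('copyFilesFromSftp', {
    taskQueue: 'example-queue',
    workflowId: '8765432128',
    args: ['outgoing'],
    workflowIdReusePolicy: WorkflowIdReusePolicy.WORKFLOW_ID_REUSE_POLICY_ALLOW_DUPLICATE,
  });
  console.log('started', handle.workflowId);
}

run().catch((err) => {
  console.error(err);
  process.exit(1);
});

Then I tried to schedule the same workflow with the same ID using tctl with the env var TEMPORAL_CLI_SHOW_STACKS=1, and I get an error: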

Error: Failed to create workflow.
Error Details: context deadline exceeded
Stack trace:
goroutine 1 [running]:
runtime/debug.Stack()
	runtime/debug/stack.go:24 +0x65
runtime/debug.PrintStack()
	runtime/debug/stack.go:16 +0x19
github.com/temporalio/tctl/cli_curr.printError({0x2b6d288, 0x1a}, {0x2faa240, 0xc00028c648})
	github.com/temporalio/tctl/cli_curr/util.go:392 +0x21e
github.com/temporalio/tctl/cli_curr.ErrorAndExit({0x2b6d288?, 0xa?}, {0x2faa240?, 0xc00028c648?})
	github.com/temporalio/tctl/cli_curr/util.go:403 +0x28
github.com/temporalio/tctl/cli_curr.startWorkflowHelper.func1()
	github.com/temporalio/tctl/cli_curr/workflowCommands.go:225 +0x1a5
github.com/temporalio/tctl/cli_curr.startWorkflowHelper(0xc00003d760, 0x0)
	github.com/temporalio/tctl/cli_curr/workflowCommands.go:262 +0x557
github.com/temporalio/tctl/cli_curr.StartWorkflow(...)
	github.com/temporalio/tctl/cli_curr/workflowCommands.go:173
github.com/temporalio/tctl/cli_curr.newWorkflowCommands.func3(0xc00003d760?)
	github.com/temporalio/tctl/cli_curr/workflow.go:57 +0x1b
github.com/urfave/cli.HandleAction({0x273c0a0?, 0x2c17e30?}, 0x5?)
	github.com/urfave/cli@v1.22.5/app.go:526 +0x50
github.com/urfave/cli.Command.Run({{0x2b3dc2d, 0x5}, {0x0, 0x0}, {0x0, 0x0, 0x0}, {0x2b79fe4, 0x1e}, {0x0, ...}, ...}, ...)
	github.com/urfave/cli@v1.22.5/command.go:173 +0x652
github.com/urfave/cli.(*App).RunAsSubcommand(0xc0003236c0, 0xc00003d4a0)
	github.com/urfave/cli@v1.22.5/app.go:405 +0x91b
github.com/urfave/cli.Command.startApp({{0x2b45303, 0x8}, {0x0, 0x0}, {0xc00007fe40, 0x1, 0x1}, {0x2b6abf9, 0x19}, {0x0, ...}, ...}, ...)
	github.com/urfave/cli@v1.22.5/command.go:372 +0x6e7
github.com/urfave/cli.Command.Run({{0x2b45303, 0x8}, {0x0, 0x0}, {0xc00007fe40, 0x1, 0x1}, {0x2b6abf9, 0x19}, {0x0, ...}, ...}, ...)
	github.com/urfave/cli@v1.22.5/command.go:102 +0x808
github.com/urfave/cli.(*App).Run(0xc000323340, {0xc000134000, 0x16, 0x16})
	github.com/urfave/cli@v1.22.5/app.go:277 +0x8a7
main.main()
	./main.go:45 +0xa6

And I only get this error when the same workflow ID was used by an already completed workflow. If I use a new ID, it processes successfully.

There are also some errors in the history service logs:

{"level":"error","ts":"2022-06-08T15:02:38.737Z","msg":"Persistent store operation Failure","shard-id":216,"address":"172.31.1.248:7234","wf-namespace-id":"32049b68-7872-4094-8e63-d0dd59896a83","wf-id":"temporal-sys-history-scanner","wf-run-id":"bc3ab2dc-e36e-441f-9516-5e735e637b63","store-operation":"create-wf-execution","error":"Encounter shard ownership lost, request range ID: 2801241, actual range ID: 0","logging-call-at":"transaction_impl.go:380","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/service/history/workflow.createWorkflowExecutionWithRetry\n\t/home/builder/temporal/service/history/workflow/transaction_impl.go:380\ngo.temporal.io/server/service/history/workflow.(*ContextImpl).CreateWorkflowExecution\n\t/home/builder/temporal/service/history/workflow/context.go:361\ngo.temporal.io/server/service/history.(*historyEngineImpl).StartWorkflowExecution\n\t/home/builder/temporal/service/history/historyEngine.go:561\ngo.temporal.io/server/service/history.(*Handler).StartWorkflowExecution\n\t/home/builder/temporal/service/history/handler.go:527\ngo.temporal.io/server/api/historyservice/v1._HistoryService_StartWorkflowExecution_Handler.func1\n\t/home/builder/temporal/api/historyservice/v1/service.pb.go:948\ngo.temporal.io/server/common/rpc/interceptor.(*RateLimitInterceptor).Intercept\n\t/home/builder/temporal/common/rpc/interceptor/rate_limit.go:84\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:1116\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).Intercept\n\t/home/builder/temporal/common/rpc/interceptor/telemetry.go:108\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:1119\ngo.temporal.io/server/common/metrics.NewServerMetricsTrailerPropagatorInterceptor.func1\n\t/home/builder/temporal/common/metrics/grpc.go:113\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:1119\ngo.temporal.io/server/common/metrics.NewServerMetricsContextInjectorInterceptor.func1\n\t/home/builder/temporal/common/metrics/grpc.go:66\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:1119\ngo.temporal.io/server/common/rpc.ServiceErrorInterceptor\n\t/home/builder/temporal/common/rpc/grpc.go:131\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:1119\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:1121\ngo.temporal.io/server/api/historyservice/v1._HistoryService_StartWorkflowExecution_Handler\n\t/home/builder/temporal/api/historyservice/v1/service.pb.go:950\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:1282\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:1619\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:921"}
{"level":"error","ts":"2022-06-08T15:02:38.740Z","msg":"Error complete transfer task","shard-id":216,"address":"172.31.1.248:7234","component":"transfer-queue-processor","error":"shard closed","logging-call-at":"transferQueueProcessor.go:259","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/service/history.(*transferQueueProcessorImpl).completeTransferLoop\n\t/home/builder/temporal/service/history/transferQueueProcessor.go:259"}
{"level":"error","ts":"2022-06-08T15:02:38.764Z","msg":"Persistent store operation Failure","shard-id":216,"address":"172.31.1.248:7234","wf-namespace-id":"32049b68-7872-4094-8e63-d0dd59896a83","wf-id":"temporal-sys-history-scanner","wf-run-id":"b4421186-dc25-4cf6-9947-2012d904b7a2","store-operation":"create-wf-execution","error":"Encounter shard ownership lost, request range ID: 2801242, actual range ID: 0","logging-call-at":"transaction_impl.go:380","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/service/history/workflow.createWorkflowExecutionWithRetry\n\t/home/builder/temporal/service/history/workflow/transaction_impl.go:380\ngo.temporal.io/server/service/history/workflow.(*ContextImpl).CreateWorkflowExecution\n\t/home/builder/temporal/service/history/workflow/context.go:361\ngo.temporal.io/server/service/history.(*historyEngineImpl).StartWorkflowExecution\n\t/home/builder/temporal/service/history/historyEngine.go:561\ngo.temporal.io/server/service/history.(*Handler).StartWorkflowExecution\n\t/home/builder/temporal/service/history/handler.go:527\ngo.temporal.io/server/api/historyservice/v1._HistoryService_StartWorkflowExecution_Handler.func1\n\t/home/builder/temporal/api/historyservice/v1/service.pb.go:948\ngo.temporal.io/server/common/rpc/interceptor.(*RateLimitInterceptor).Intercept\n\t/home/builder/temporal/common/rpc/interceptor/rate_limit.go:84\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:1116\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).Intercept\n\t/home/builder/temporal/common/rpc/interceptor/telemetry.go:108\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:1119\ngo.temporal.io/server/common/metrics.NewServerMetricsTrailerPropagatorInterceptor.func1\n\t/home/builder/temporal/common/metrics/grpc.go:113\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:1119\ngo.temporal.io/server/common/metrics.NewServerMetricsContextInjectorInterceptor.func1\n\t/home/builder/temporal/common/metrics/grpc.go:66\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:1119\ngo.temporal.io/server/common/rpc.ServiceErrorInterceptor\n\t/home/builder/temporal/common/rpc/grpc.go:131\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:1119\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:1121\ngo.temporal.io/server/api/historyservice/v1._HistoryService_StartWorkflowExecution_Handler\n\t/home/builder/temporal/api/historyservice/v1/service.pb.go:950\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:1282\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:1619\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:921"}
{"level":"error","ts":"2022-06-08T15:02:38.768Z","msg":"Error complete transfer task","shard-id":216,"address":"172.31.1.248:7234","component":"transfer-queue-processor","error":"shard closed","logging-call-at":"transferQueueProcessor.go:259","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/service/history.(*transferQueueProcessorImpl).completeTransferLoop\n\t/home/builder/temporal/service/history/transferQueueProcessor.go:259"}

and in the matching service logs:

{"level":"error","ts":"2022-06-08T11:48:13.503Z","msg":"Membership upsert failed.","service":"matching","error":"operation UpsertClusterMembership encountered Operation timed out for temporal.cluster_membership - received only 0 responses from 0 CL=LOCAL_QUORUM.","logging-call-at":"rpMonitor.go:284","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/common/membership.(*ringpopMonitor).startHeartbeatUpsertLoop.func1\n\t/home/builder/temporal/common/membership/rpMonitor.go:284"}
{"level":"info","ts":"2022-06-08T12:19:23.848Z","msg":"matching client encountered error","service":"matching","error":"Workflow did not register a handler for __stack_trace","service-error-type":"serviceerror.QueryFailed","logging-call-at":"metricClient.go:268"}
{"level":"info","ts":"2022-06-08T12:19:28.074Z","msg":"matching client encountered error","service":"matching","error":"Workflow did not register a handler for __stack_trace","service-error-type":"serviceerror.QueryFailed","logging-call-at":"metricClient.go:268"}
{"level":"info","ts":"2022-06-08T12:28:51.180Z","msg":"history client encountered error","service":"matching","error":"shard closed","service-error-type":"serviceerror.Unavailable","logging-call-at":"metricClient.go:638"}
{"level":"info","ts":"2022-06-08T12:28:51.205Z","msg":"history client encountered error","service":"matching","error":"shard closed","service-error-type":"serviceerror.Unavailable","logging-call-at":"metricClient.go:638"}
{"level":"info","ts":"2022-06-08T12:28:51.232Z","msg":"history client encountered error","service":"matching","error":"shard closed","service-error-type":"serviceerror.Unavailable","logging-call-at":"metricClient.go:638"}
{"level":"info","ts":"2022-06-08T12:28:51.260Z","msg":"history client encountered error","service":"matching","error":"shard closed","service-error-type":"serviceerror.Unavailable","logging-call-at":"metricClient.go:638"}
{"level":"info","ts":"2022-06-08T12:28:51.279Z","msg":"history client encountered error","service":"matching","error":"shard closed","service-error-type":"serviceerror.Unavailable","logging-call-at":"metricClient.go:638"}
{"level":"info","ts":"2022-06-08T12:28:51.295Z","msg":"history client encountered error","service":"matching","error":"shard closed","service-error-type":"serviceerror.Unavailable","logging-call-at":"metricClient.go:638"}
{"level":"info","ts":"2022-06-08T12:28:51.295Z","msg":"history client encountered error","service":"matching","error":"shard closed","service-error-type":"serviceerror.Unavailable","logging-call-at":"metricClient.go:638"}
{"level":"info","ts":"2022-06-08T12:28:51.350Z","msg":"history client encountered error","service":"matching","error":"shard closed","service-error-type":"serviceerror.Unavailable","logging-call-at":"metricClient.go:638"}
{"level":"info","ts":"2022-06-08T12:28:51.397Z","msg":"history client encountered error","service":"matching","error":"shard closed","service-error-type":"serviceerror.Unavailable","logging-call-at":"metricClient.go:638"}
{"level":"info","ts":"2022-06-08T12:28:51.402Z","msg":"history client encountered error","service":"matching","error":"shard closed","service-error-type":"serviceerror.Unavailable","logging-call-at":"metricClient.go:638"}
{"level":"info","ts":"2022-06-08T12:28:51.420Z","msg":"history client encountered error","service":"matching","error":"Activity task already started.","service-error-type":"serviceerror.TaskAlreadyStarted","logging-call-at":"metricClient.go:638"}
{"level":"info","ts":"2022-06-08T12:28:51.451Z","msg":"history client encountered error","service":"matching","error":"Activity task already started.","service-error-type":"serviceerror.TaskAlreadyStarted","logging-call-at":"metricClient.go:638"}
{"level":"info","ts":"2022-06-08T12:28:51.453Z","msg":"history client encountered error","service":"matching","error":"Activity task already started.","service-error-type":"serviceerror.TaskAlreadyStarted","logging-call-at":"metricClient.go:638"}
{"level":"info","ts":"2022-06-08T12:28:51.476Z","msg":"history client encountered error","service":"matching","error":"Activity task already started.","service-error-type":"serviceerror.TaskAlreadyStarted","logging-call-at":"metricClient.go:638"}
{"level":"info","ts":"2022-06-08T12:28:51.478Z","msg":"history client encountered error","service":"matching","error":"Activity task already started.","service-error-type":"serviceerror.TaskAlreadyStarted","logging-call-at":"metricClient.go:638"}
{"level":"info","ts":"2022-06-08T12:28:51.487Z","msg":"history client encountered error","service":"matching","error":"shard closed","service-error-type":"serviceerror.Unavailable","logging-call-at":"metricClient.go:638"}
{"level":"info","ts":"2022-06-08T12:28:51.510Z","msg":"history client encountered error","service":"matching","error":"Activity task already started.","service-error-type":"serviceerror.TaskAlreadyStarted","logging-call-at":"metricClient.go:638"}
{"level":"info","ts":"2022-06-08T12:28:51.512Z","msg":"history client encountered error","service":"matching","error":"Activity task already started.","service-error-type":"serviceerror.TaskAlreadyStarted","logging-call-at":"metricClient.go:638"}
{"level":"info","ts":"2022-06-08T12:28:51.540Z","msg":"history client encountered error","service":"matching","error":"Activity task already started.","service-error-type":"serviceerror.TaskAlreadyStarted","logging-call-at":"metricClient.go:638"}
{"level":"info","ts":"2022-06-08T12:28:51.594Z","msg":"history client encountered error","service":"matching","error":"Activity task already started.","service-error-type":"serviceerror.TaskAlreadyStarted","logging-call-at":"metricClient.go:638"}

Can you please give some insight into what could be wrong with the cluster setup that prevents scheduling a workflow with the same ID?

This is obviously an issue in the Temporal cluster (or the way it's set up); I'll let @tihomir take that.
As for the TS SDK, by default it retries until it gets a non-retryable error; you can use Connection.withDeadline to limit that.
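
Roughly something like this (just a sketch; the address, workflow type, IDs, and the 5-second timeout are placeholders, check the API docs for your SDK version):

import { Connection, WorkflowClient } from '@temporalio/client';

async function startWithDeadline() {
  const connection = await Connection.connect({ address: 'temporal-frontend:7233' });
  const client = new WorkflowClient({ connection });

  // Give up after 5 seconds instead of retrying indefinitely
  return connection.withDeadline(Date.now() + 5000, () =>
    client.start('yourWorkflow', {
      taskQueue: 'your-task-queue',
      workflowId: 'your-workflow-id',
      args: [],
    })
  );
}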

What server version are you on? Can you make sure you are on at least 1.15.2 and test again, to rule out the shard deadlock issue that was fixed in that release?

If that does not fix it, I would look into Cassandra issues that can happen under high load, if that's the case here. You should "protect" Cassandra via the following dynamic configs so it doesn't get pushed over its limit (a sample dynamic config snippet follows the list):

frontend.persistenceMaxQPS - frontend persistence max qps
history.persistenceMaxQPS - history persistence max qps
matching.persistenceMaxQPS - matching persistence max qps
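
For example, in your dynamic config file (the values below are just illustrative; tune them to what your database can handle):

frontend.persistenceMaxQPS:
  - value: 2000
    constraints: {}
history.persistenceMaxQPS:
  - value: 9000
    constraints: {}
matching.persistenceMaxQPS:
  - value: 3000
    constraints: {}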

Hi @tihomir, we are using 1.16.2; we set it up just a week ago, in a dev env only, so there is no high load present. We have tested it with a few workflows and it's working; the only issue we have faced is scheduling a workflow with the same workflow ID.

That said, the workflow with a cron schedule is working - it starts new workflows with the same ID at each cron interval.

That said, the workflow with a cron schedule is working - it starts new workflows with the same ID at each cron interval.

This is expected with cron: after your first cron execution completes, a new one with the same workflow ID and a different run ID is created, and a firstWorkflowTaskBackoff timer is set according to your defined cron schedule.
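
For reference, a cron workflow is started from the TS SDK roughly like this (a sketch; workflow type, ID, schedule, and task queue are placeholders):

import { Connection, WorkflowClient } from '@temporalio/client';

async function startCron() {
  const client = new WorkflowClient({ connection: await Connection.connect() });
  await client.start('yourWorkflow', {
    taskQueue: 'your-task-queue',
    workflowId: 'your-cron-workflow-id', // reused across cron runs; each run gets a new run ID
    cronSchedule: '*/15 * * * *',        // standard cron syntax
    args: [],
  });
}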

Do you get the same issue with the TypeScript cron sample? If you do, I think we should take a look at your server config and your DB setup.

Not sure you understood me correctly. The cron schedule is working for me, even with my own examples (I mean, starting a workflow on a cron schedule with a new workflow ID, which then reschedules itself with a new run ID - that works). But when I try to start a workflow myself with the same ID (one that is already in completed status), it fails as I described above.

Ok, can you show the code you are using to start these workflows (or the tctl command if you are using the CLI) so we can try to reproduce? The default workflow ID reuse policy should allow you to create a new workflow with the same workflow ID as long as there is no currently running execution with that ID (in the same namespace).
The errors you showed indicate possible Cassandra issues. Can you also share your default and visibility DB configs?

Sorry for the delay. I have tried both, but it's simpler to show the tctl command; here is an example:

TEMPORAL_CLI_SHOW_STACKS=1 tctl --tls_disable_host_verification --address temporal.grpc.dev.int.pay.com:443 workflow start --tq example-queue --wid 8765432128 --wt copyFilesFromSftp -i '"outgoing"'  --wrp AllowDuplicate

Also, in case it's worth mentioning - we are using ScyllaDB.
DB configs:

server:
  config:
    persistence:
      default:
        driver: "cassandra"
        cassandra:
          hosts: ["192.168.248.11", "192.168.249.160", "192.168.250.11"]
          keyspace: "temporal"
          user: "user"
          password: "password"
          datacenter: "AWS_EU_WEST_1"
          replicationFactor: 3
          disableInitialHostLookup: true
      visibility:
        driver: "cassandra"
        cassandra:
          hosts: ["192.168.248.11", "192.168.249.160", "192.168.250.11"]
          keyspace: "temporal_visibility"
          user: "user"
          password: "password"
          datacenter: "AWS_EU_WEST_1"
          replicationFactor: 3
          disableInitialHostLookup: true

Also, if needed I can provide the DB keyspace setup, but it's just the basic schema creation with your script.
And to be sure: the workflow is in Completed status.

Also, once the workflow reaches its retention period, I can start a new one with the same workflow ID.

Hey @tihomir, a friendly reminder about the issue; it's not resolved yet. Do you have any idea what could cause it?

Can you reproduce the problem using Cassandra as a DB?

You are indeed right, the issue seems to be in ScyllaDB. I have tested with Cassandra - no issue. I still don't understand why it's failing, though. All functionality seems to be working; the only issue is scheduling a workflow with the same ID. We could technically continue to use it even with this restriction (by generating unique workflow IDs), but I am afraid there could be other unexpected issues.

The issue can be reproduced with a local setup using docker-compose; just swap the Cassandra image for the ScyllaDB one:

version: "3.5"
services:
  cassandra:
    container_name: temporal-scylla
    image: scylladb/scylla:4.6.3
    networks:
      - temporal-network
    ports:
      - 9042:9042
  temporal:
    container_name: temporal
    depends_on:
      - cassandra
    environment:
      - CASSANDRA_SEEDS=cassandra
      - DYNAMIC_CONFIG_FILE_PATH=config/dynamicconfig/development-cass.yaml
    image: temporalio/auto-setup:${TEMPORAL_VERSION}
    networks:
      - temporal-network
    ports:
      - 7233:7233
    volumes:
      - ./dynamicconfig:/etc/temporal/config/dynamicconfig
  temporal-admin-tools:
    container_name: temporal-admin-tools
    depends_on:
      - temporal
    environment:
      - TEMPORAL_CLI_ADDRESS=temporal:7233
    image: temporalio/admin-tools:${TEMPORAL_VERSION}
    networks:
      - temporal-network
    stdin_open: true
    tty: true
  temporal-web:
    container_name: temporal-web
    depends_on:
      - temporal
    environment:
      - TEMPORAL_GRPC_ENDPOINT=temporal:7233
      - TEMPORAL_PERMIT_WRITE_API=true
    image: temporalio/web:${TEMPORAL_WEB_VERSION}
    networks:
      - temporal-network
    ports:
      - 8088:8088
networks:
  temporal-network:
    driver: bridge
    name: temporal-network
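
To bring it up, pin the image tags via the env vars the compose file references, e.g. (the tags here are just examples; this also assumes the dynamicconfig directory from the temporalio/docker-compose repo sits next to the file):

TEMPORAL_VERSION=1.16.2 TEMPORAL_WEB_VERSION=1.15.0 docker-compose up -d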

Maybe you are using some Cassandra-specific features; it seems ScyllaDB only supports CQL version 3.4.0, according to their docs.

The last time we ran tests with ScyllaDB was in 2020. Back then it had bugs (or differences from Cassandra) that prevented us from supporting it as a drop-in Cassandra replacement for Temporal. It could be that ScyllaDB is fine now, but some misconfiguration prevents it from working in your case.