Cannot terminate workflows and all workflows are stuck

{"level":"error","ts":"2022-12-01T12:13:59.787Z","msg":"Operation failed with internal error.","error":"AppendHistoryNodes: mssql: Could not allocate space for object 'Temporal.history_node'.'PK__history___DE8D8FB47C38DD1B' in database 'xxxx' because the 'PRIMARY' filegroup is full. Create disk space by deleting unneeded files, dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.","metric-scope":7,"logging-call-at":"persistenceMetricClients.go:1424","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/tmp/go/src/workspace/icg-msst-salespipeline-175611/icg-msst-salespipeline-175611-temporal-service.master/common/log/zap_logger.go

We cannot terminate workflows and all workflows are stuck. The client-side Workflow Poller thread reports DEADLINE_EXCEEDED.

Hello @keira

It looks like an error related to the DB. I have found some links that seem related to the problem you are facing.
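
Per the error message itself, the PRIMARY filegroup of the Temporal database is full, so the server cannot append workflow history until the filegroup can grow again. As a rough sketch of what that fix could look like on the SQL Server side (assuming there is disk space available; the host, logical file names, path, and sizes below are placeholders for your environment, and 'xxxx' is the database name from your log):

# Hypothetical example: inspect the database files, then enable autogrowth
# on the existing data file and/or add another file to the PRIMARY filegroup.
# Replace the placeholder host, file names, path, and sizes with your own values.
sqlcmd -S <sql-server-host> -d xxxx -Q "SELECT name, size, growth, max_size FROM sys.database_files;"

sqlcmd -S <sql-server-host> -d master -Q "
  ALTER DATABASE [xxxx] MODIFY FILE (NAME = N'<logical_data_file>', FILEGROWTH = 256MB);
  ALTER DATABASE [xxxx] ADD FILE (
    NAME = N'<logical_data_file>_2',
    FILENAME = N'/var/opt/mssql/data/xxxx_data2.ndf',
    SIZE = 10GB, FILEGROWTH = 256MB
  ) TO FILEGROUP [PRIMARY];
"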

Let me know if it helps,

Thanks. Workflow requests are returning 504 and all workflows are stuck.

Hello @keira

Is this happening after you fixed the database issue?

Hi, I would like to piggyback on this issue.
I'm running into DEADLINE_EXCEEDED when trying to terminate workflows after accidentally spinning up a flood of them. The UI doesn't load the details for those workflows, and the CLI throws a DEADLINE_EXCEEDED error like the following:

tctl --namespace core wf term -w <wf_id>
Error: Terminate workflow failed.
Error Details: context deadline exceeded
Stack trace:
goroutine 1 [running]:
runtime/debug.Stack(0xd, 0x0, 0x0)
/usr/local/go/src/runtime/debug/stack.go:24 +0x9f
runtime/debug.PrintStack()
/usr/local/go/src/runtime/debug/stack.go:16 +0x25
go.temporal.io/server/tools/cli.printError(0x1fb2ec0, 0x1a, 0x22857e0, 0xc00000c258)
/temporal/tools/cli/util.go:394 +0x2be
go.temporal.io/server/tools/cli.ErrorAndExit(0x1fb2ec0, 0x1a, 0x22857e0, 0xc00000c258)
/temporal/tools/cli/util.go:405 +0x49
go.temporal.io/server/tools/cli.TerminateWorkflow(0xc0007c6580)
/temporal/tools/cli/workflowCommands.go:477 +0x278
go.temporal.io/server/tools/cli.newWorkflowCommands.func7(0xc0007c6580)
/temporal/tools/cli/workflow.go:126 +0x2b
github.com/urfave/cli.HandleAction(0x1bdacc0, 0x203a6b0, 0xc0007c6580, 0xc0007c6580, 0x0)
/go/pkg/mod/github.com/urfave/cli@v1.22.5/app.go:526 +0x59
github.com/urfave/cli.Command.Run(0x1f90d21, 0x9, 0x0, 0x0, 0xc00072c9a0, 0x1, 0x1, 0x1fc8bb3, 0x22, 0x0, …)
/go/pkg/mod/github.com/urfave/cli@v1.22.5/command.go:173 +0x579
github.com/urfave/cli.(*App).RunAsSubcommand(0xc000413500, 0xc0007c62c0, 0x0, 0x0)
/go/pkg/mod/github.com/urfave/cli@v1.22.5/app.go:405 +0x914
github.com/urfave/cli.Command.startApp(0x1f8f0e7, 0x8, 0x0, 0x0, 0xc00072cd40, 0x1, 0x1, 0x1fb0813, 0x19, 0x0, …)
/go/pkg/mod/github.com/urfave/cli@v1.22.5/command.go:372 +0x7ff
github.com/urfave/cli.Command.Run(0x1f8f0e7, 0x8, 0x0, 0x0, 0xc00072cd40, 0x1, 0x1, 0x1fb0813, 0x19, 0x0, …)
/go/pkg/mod/github.com/urfave/cli@v1.22.5/command.go:102 +0x9d4
github.com/urfave/cli.(*App).Run(0xc000413180, 0xc00003a070, 0x7, 0x7, 0x0, 0x0)
/go/pkg/mod/github.com/urfave/cli@v1.22.5/app.go:277 +0x808
main.main()
/temporal/cmd/tools/cli/main.go:37 +0x4e

I have reviewed the "Troubleshooting Issues with the TypeScript SDK" page in the legacy Temporal SDK documentation.
Are there any emergency ways to terminate workflows?
Is there anything I can do to prevent this in the future?

Can you see this execution in primary persistence?

tctl wf desc -w <wfid>

context deadline exceeded
Check cluster stability, maybe starting off with health checks:

grpc-health-probe -addr=localhost:7233 -service=temporal.api.workflowservice.v1.WorkflowService
grpc-health-probe -addr=localhost:7235 -service=temporal.api.workflowservice.v1.MatchingService
grpc-health-probe -addr=localhost:7234 -service=temporal.api.workflowservice.v1.HistoryService

(change localhost to the appropriate IPs/hostnames)
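
If you want to hit all three services in one pass, a small wrapper like this works (localhost and the 7233/7234/7235 ports are assumptions based on the defaults above; substitute your own):

#!/bin/sh
# Probe the frontend, matching, and history health endpoints in turn.
for target in \
  "localhost:7233 temporal.api.workflowservice.v1.WorkflowService" \
  "localhost:7235 temporal.api.workflowservice.v1.MatchingService" \
  "localhost:7234 temporal.api.workflowservice.v1.HistoryService"
do
  set -- $target
  grpc-health-probe -addr="$1" -service="$2" || echo "unhealthy: $2 at $1"
done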

Thanks. I checked health only with the tctl cluster health command and reviewed metrics in Grafana; nothing looked unhealthy. I restarted all Temporal pods to see if it would make any difference, but it did not. We ended up dropping the workflows at the DB level, which did the trick, but obviously that shouldn't be the way to do it.