Hi,
After a few experiments with schedules we have managed to break our sandbox cluster. As it is a sandbox we can simply re-create it, so it's not a big deal, but the details may be helpful in case there is a bug.
It all started with us unsetting publicClient (Temporal worker failing to connect to frontend in 1.18.2 post removing publicClient from config - Community Support - Temporal), and that configuration does not seem to handle a TLS-enabled cluster.
We found this out quickly, so we put publicClient back and the worker started fine.
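For context, this is roughly the block we removed and then restored in the server configuration; the address shown here is a placeholder, not our actual frontend endpoint:

publicClient:
  hostPort: "temporal-frontend.temporal.svc:7233"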
We then created a schedule in this configuration, and it didn't seem to be able to fire: the scheduler can't connect to a frontend because it is likely now using the new discovery feature, so the system worker's client SDK connection fails during the TLS handshake. We had seen the same pattern before, with the scheduler taking the host:port from the cluster metadata.
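As far as we can tell, the host:port the scheduler falls back to comes from the clusterInformation section of the config; a rough sketch of that part (cluster name and address are placeholders, not our real values):

clusterMetadata:
  currentClusterName: active
  clusterInformation:
    active:
      enabled: true
      initialFailoverVersion: 1
      rpcName: frontend
      rpcAddress: temporal-frontend.temporal.svc:7233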
We decided to delete the schedule since it wasn't working anyway. After that the schedule ended up only half-deleted: we could still see it in the UI, but we couldn't access any of its data, and it was no longer visible via tctl. Formally the schedule was deleted, but it still left some traces behind, so we dug further.
The worker log was full of messages related to the scheduler (it seems the half-deleted schedule no longer provided a required object, so the workflow was referencing nil):
{"level":"error","ts":"2022-10-21T13:42:04.661Z","msg":"Workflow panic","service":"worker","Namespace":"sandbox","TaskQueue":"temporal-sys-per-ns-tq","WorkerID":"server-worker@1@temporal-worker-8cb96b6b-xbb65@sandbox","WorkflowType":"temporal-sys-scheduler-workflow","WorkflowID":"temporal-sys-scheduler:GreetSchedule","RunID":"d271ec47-5593-44c5-82c8-112ac1970382","Attempt":25,"Error":"assignment to entry in nil map","StackTrace":"coroutine root [panic]:\ngo.temporal.io/server/service/worker/scheduler.(*scheduler).addSearchAttributes(0xc0001dac40, 0x18?, {0xc002b972f0?, 0x2?, 0x0?})\n\t/go/pkg/mod/go.temporal.io/server@v1.18.2/service/worker/scheduler/workflow.go:828 +0x127\ngo.temporal.io/server/service/worker/scheduler.(*scheduler).startWorkflow(0xc0001dac40, 0xc0037fb4a0, 0xc0049ca980)\n\t/go/pkg/mod/go.temporal.io/server@v1.18.2/service/worker/scheduler/workflow.go:796 +0x391\ngo.temporal.io/server/service/worker/scheduler.(*scheduler).processBuffer(0xc0001dac40)\n\t/go/pkg/mod/go.temporal.io/server@v1.18.2/service/worker/scheduler/workflow.go:706 +0x565\ngo.temporal.io/server/service/worker/scheduler.(*scheduler).run(0xc0001dac40)\n\t/go/pkg/mod/go.temporal.io/server@v1.18.2/service/worker/scheduler/workflow.go:204 +0x765\ngo.temporal.io/server/service/worker/scheduler.SchedulerWorkflow({0x26a6cd8?, 0xc0038d6960}, 0xc0007f60a0)\n\t/go/pkg/mod/go.temporal.io/server@v1.18.2/service/worker/scheduler/workflow.go:153 +0x195\nreflect.Value.call({0x1dd4bc0?, 0x227f598?, 0x7fa04ff8f5b8?}, {0x2195e42, 0x4}, {0xc0038d6990, 0x2, 0x203000?})\n\t/usr/local/go/src/reflect/value.go:556 +0x845\nreflect.Value.Call({0x1dd4bc0?, 0x227f598?, 0x404dec?}, {0xc0038d6990, 0x2, 0x2})\n\t/usr/local/go/src/reflect/value.go:339 +0xbf\ngo.temporal.io/sdk/internal.executeFunction({0x1dd4bc0, 0x227f598}, {0xc0007f6120, 0x2, 0xc00031fc00?})\n\t/go/pkg/mod/go.temporal.io/sdk@v1.17.0/internal/internal_worker.go:1651 +0x136\ngo.temporal.io/sdk/internal.(*workflowEnvironmentInterceptor).ExecuteWorkflow(0xc00492cb90, {0x26a6b88?, 0xc0048b83c0}, 0xc003e04b88)\n\t/go/pkg/mod/go.temporal.io/sdk@v1.17.0/internal/workflow.go:504 +0x166\ngo.temporal.io/sdk/internal.(*workflowExecutor).Execute(0xc002afc0c0, {0x26a6b88, 0xc0048b83c0}, 0x25?)\n\t/go/pkg/mod/go.temporal.io/sdk@v1.17.0/internal/internal_worker.go:771 +0x292\ngo.temporal.io/sdk/internal.(*syncWorkflowDefinition).Execute.func1({0x26a6cd8, 0xc0038d67e0})\n\t/go/pkg/mod/go.temporal.io/sdk@v1.17.0/internal/internal_workflow.go:553 +0xcd","logging-call-at":"internal_task_handlers.go:955"}
After that we decided to get rid of the problem entirely: although we could no longer see ‘temporal-sys-scheduler:GreetSchedule’, we could still describe it, so we issued a delete and it was gone.
The first thing we did was delete the visibility record to get rid of the schedule in the UI (roughly along the lines of the sketch after the command below), and that had the desired outcome.
Second, we deleted the workflow itself:
tctl --headers-provider-plugin tctl-authorization-plugin --address "${TEMPORAL_FRONTEND}:443" --tls-ca-path ./ca.pem -n sandbox workflow delete -w temporal-sys-scheduler:GreetSchedule
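For reference, the visibility clean-up in the first step looked roughly like the following; this is only a sketch assuming an Elasticsearch visibility store, and the host and index name here are placeholders rather than our real ones:

curl -X POST "https://es-host:9200/temporal_visibility_v1_dev/_delete_by_query" \
  -H 'Content-Type: application/json' \
  -d '{"query": {"term": {"WorkflowId": "temporal-sys-scheduler:GreetSchedule"}}}'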
After this the history service started to fail and the cluster went cold:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1a132b9]
goroutine 640728 [running]:
go.temporal.io/server/service/history/replication.(*ackMgrImpl).processNewRunReplication(0x1?, {0x26a49a0?, 0xc05a51cff0?}, {0xc0369d3410?, 0x26919e0?}, {0xc0369d3440?, 0x1?}, {0x0?, 0xedae84b4a?}, {0x0, ...}, ...)
/go/pkg/mod/go.temporal.io/server@v1.18.2/service/history/replication/ack_manager.go:610 +0x79
go.temporal.io/server/service/history/replication.(*ackMgrImpl).generateHistoryReplicationTask(0xc0a7a7ee70, {0x26a49a0, 0xc05a51cff0}, 0xc06b99f550)
/go/pkg/mod/go.temporal.io/server@v1.18.2/service/history/replication/ack_manager.go:481 +0x1b3
go.temporal.io/server/service/history/replication.(*ackMgrImpl).toReplicationTask(0xc097119818?, {0x26a49a0?, 0xc05a51cff0?}, {0x26b0630?, 0xc06b99f550?})
/go/pkg/mod/go.temporal.io/server@v1.18.2/service/history/replication/ack_manager.go:353 +0xa5
go.temporal.io/server/service/history/replication.(*ackMgrImpl).getTasks(0xc0a7a7ee70, {0x26a49a0, 0xc05a51cff0}, 0x4f00013, 0x610fffff, 0x19)
/go/pkg/mod/go.temporal.io/server@v1.18.2/service/history/replication/ack_manager.go:293 +0x425
go.temporal.io/server/service/history/replication.(*ackMgrImpl).GetTasks(0xc0a7a7ee70, {0x26a49a0, 0xc05a51cff0}, {0xc03901bc00, 0x1b}, 0x3?)
/go/pkg/mod/go.temporal.io/server@v1.18.2/service/history/replication/ack_manager.go:222 +0x85
go.temporal.io/server/service/history.(*historyEngineImpl).GetReplicationMessages(0xc0a8ce41c0, {0x26a49a0?, 0xc05a51cff0?}, {0xc03901bc00?, 0x1b?}, 0x4f00013?, {0xc05a9556c0?, 0xc05a955860?, 0x0?}, 0x4f00013)
/go/pkg/mod/go.temporal.io/server@v1.18.2/service/history/historyEngine.go:1673 +0x3a4
go.temporal.io/server/service/history.(*Handler).GetReplicationMessages.func1(0xc00c4e9ae0)
/go/pkg/mod/go.temporal.io/server@v1.18.2/service/history/handler.go:1439 +0x589
created by go.temporal.io/server/service/history.(*Handler).GetReplicationMessages
/go/pkg/mod/go.temporal.io/server@v1.18.2/service/history/handler.go:1425 +0x28e
We can keep this cluster running during this week in case anyone would like to collect any data from it, as it is going to be a sandbox rebuild for us anyway.
Thanks,
A