Deleting the schedule broke the cluster in 1.18.2

Hi,

After a few experiments with schedules we managed to break the sandbox cluster. As it is a sandbox we can re-create it, so it's not a big deal, but the details may be helpful in case there is a bug.

It all started with us unsetting publicClient (Temporal worker failing to connect to frontend in 1.18.2 post removing publicClient from config - Community Support - Temporal), which does not seem to handle a TLS-enabled cluster.

We found that out quickly, so we put publicClient back and the worker started fine.

We created a schedule in this configuration, but it didn't seem to be able to fire because the scheduler couldn't connect to a frontend. It is likely now using the new frontend discovery feature, so the client SDK connection of the system worker fails with a TLS handshake error. We had seen the same pattern before, when the scheduler was taking the host:port from the cluster metadata.
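A handshake check along these lines can show whether the certificate matches the name a client is told to expect when it dials the frontend directly; the address, hostname and certificate paths below are placeholders, not our real config:

# placeholder address, hostname and cert paths; verifies the chain and the expected hostname
openssl s_client -connect <frontend-ip>:7233 -servername <frontend-hostname> -verify_hostname <frontend-hostname> -CAfile ./ca.pem -cert client.pem -key client.key </dev/null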

We decided to delete the schedule since it was not working anyway… after that the schedule was deleted half-way: we could still see the schedule in the UI, but we couldn't access any schedule data, and the schedule was not visible via tctl. Formally the schedule was deleted, but it still left some traces, so we went further.

The worker log was full of messages related to the scheduler (it seems the half-deleted schedule didn't present a required object, so it was referencing nil):

{"level":"error","ts":"2022-10-21T13:42:04.661Z","msg":"Workflow panic","service":"worker","Namespace":"sandbox","TaskQueue":"temporal-sys-per-ns-tq","WorkerID":"server-worker@1@temporal-worker-8cb96b6b-xbb65@sandbox","WorkflowType":"temporal-sys-scheduler-workflow","WorkflowID":"temporal-sys-scheduler:GreetSchedule","RunID":"d271ec47-5593-44c5-82c8-112ac1970382","Attempt":25,"Error":"assignment to entry in nil map","StackTrace":"coroutine root [panic]:\ngo.temporal.io/server/service/worker/scheduler.(*scheduler).addSearchAttributes(0xc0001dac40, 0x18?, {0xc002b972f0?, 0x2?, 0x0?})\n\t/go/pkg/mod/go.temporal.io/server@v1.18.2/service/worker/scheduler/workflow.go:828 +0x127\ngo.temporal.io/server/service/worker/scheduler.(*scheduler).startWorkflow(0xc0001dac40, 0xc0037fb4a0, 0xc0049ca980)\n\t/go/pkg/mod/go.temporal.io/server@v1.18.2/service/worker/scheduler/workflow.go:796 +0x391\ngo.temporal.io/server/service/worker/scheduler.(*scheduler).processBuffer(0xc0001dac40)\n\t/go/pkg/mod/go.temporal.io/server@v1.18.2/service/worker/scheduler/workflow.go:706 +0x565\ngo.temporal.io/server/service/worker/scheduler.(*scheduler).run(0xc0001dac40)\n\t/go/pkg/mod/go.temporal.io/server@v1.18.2/service/worker/scheduler/workflow.go:204 +0x765\ngo.temporal.io/server/service/worker/scheduler.SchedulerWorkflow({0x26a6cd8?, 0xc0038d6960}, 0xc0007f60a0)\n\t/go/pkg/mod/go.temporal.io/server@v1.18.2/service/worker/scheduler/workflow.go:153 +0x195\nreflect.Value.call({0x1dd4bc0?, 0x227f598?, 0x7fa04ff8f5b8?}, {0x2195e42, 0x4}, {0xc0038d6990, 0x2, 0x203000?})\n\t/usr/local/go/src/reflect/value.go:556 +0x845\nreflect.Value.Call({0x1dd4bc0?, 0x227f598?, 0x404dec?}, {0xc0038d6990, 0x2, 0x2})\n\t/usr/local/go/src/reflect/value.go:339 +0xbf\ngo.temporal.io/sdk/internal.executeFunction({0x1dd4bc0, 0x227f598}, {0xc0007f6120, 0x2, 0xc00031fc00?})\n\t/go/pkg/mod/go.temporal.io/sdk@v1.17.0/internal/internal_worker.go:1651 +0x136\ngo.temporal.io/sdk/internal.(*workflowEnvironmentInterceptor).ExecuteWorkflow(0xc00492cb90, {0x26a6b88?, 0xc0048b83c0}, 0xc003e04b88)\n\t/go/pkg/mod/go.temporal.io/sdk@v1.17.0/internal/workflow.go:504 +0x166\ngo.temporal.io/sdk/internal.(*workflowExecutor).Execute(0xc002afc0c0, {0x26a6b88, 0xc0048b83c0}, 0x25?)\n\t/go/pkg/mod/go.temporal.io/sdk@v1.17.0/internal/internal_worker.go:771 +0x292\ngo.temporal.io/sdk/internal.(*syncWorkflowDefinition).Execute.func1({0x26a6cd8, 0xc0038d67e0})\n\t/go/pkg/mod/go.temporal.io/sdk@v1.17.0/internal/internal_workflow.go:553 +0xcd","logging-call-at":"internal_task_handlers.go:955"}

After that we decided to get rid of the problem: although we couldn't see 'temporal-sys-scheduler:GreetSchedule' in the listings, we could still describe it, so we issued a delete and it was gone.
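(For reference, the describe call was along these lines, with the same connection flags as the delete further down; the workflow ID is the one from the log above.)

tctl --headers-provider-plugin tctl-authorization-plugin --address "${TEMPORAL_FRONTEND}:443" --tls-ca-path ./ca.pem -n sandbox workflow describe -w temporal-sys-scheduler:GreetSchedule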

First, we deleted the visibility record to get rid of the schedule in the UI, and it had the desired outcome.
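With ES visibility that meant removing the schedule's document from the visibility index directly, roughly like the following; the index name and document id are placeholders for our setup:

curl -X DELETE --cacert ./ca.pem "https://<es-host>:9200/<temporal-visibility-index>/_doc/<schedule-doc-id>"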

Second, we deleted the workflow:

tctl --headers-provider-plugin tctl-authorization-plugin --address "${TEMPORAL_FRONTEND}:443" --tls-ca-path ./ca.pem -n sandbox workflow delete -w temporal-sys-scheduler:GreetSchedule

After this the history service started to fail and the cluster went cold:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1a132b9]

goroutine 640728 [running]:
go.temporal.io/server/service/history/replication.(*ackMgrImpl).processNewRunReplication(0x1?, {0x26a49a0?, 0xc05a51cff0?}, {0xc0369d3410?, 0x26919e0?}, {0xc0369d3440?, 0x1?}, {0x0?, 0xedae84b4a?}, {0x0, ...}, ...)
	/go/pkg/mod/go.temporal.io/server@v1.18.2/service/history/replication/ack_manager.go:610 +0x79
go.temporal.io/server/service/history/replication.(*ackMgrImpl).generateHistoryReplicationTask(0xc0a7a7ee70, {0x26a49a0, 0xc05a51cff0}, 0xc06b99f550)
	/go/pkg/mod/go.temporal.io/server@v1.18.2/service/history/replication/ack_manager.go:481 +0x1b3
go.temporal.io/server/service/history/replication.(*ackMgrImpl).toReplicationTask(0xc097119818?, {0x26a49a0?, 0xc05a51cff0?}, {0x26b0630?, 0xc06b99f550?})
	/go/pkg/mod/go.temporal.io/server@v1.18.2/service/history/replication/ack_manager.go:353 +0xa5
go.temporal.io/server/service/history/replication.(*ackMgrImpl).getTasks(0xc0a7a7ee70, {0x26a49a0, 0xc05a51cff0}, 0x4f00013, 0x610fffff, 0x19)
	/go/pkg/mod/go.temporal.io/server@v1.18.2/service/history/replication/ack_manager.go:293 +0x425
go.temporal.io/server/service/history/replication.(*ackMgrImpl).GetTasks(0xc0a7a7ee70, {0x26a49a0, 0xc05a51cff0}, {0xc03901bc00, 0x1b}, 0x3?)
	/go/pkg/mod/go.temporal.io/server@v1.18.2/service/history/replication/ack_manager.go:222 +0x85
go.temporal.io/server/service/history.(*historyEngineImpl).GetReplicationMessages(0xc0a8ce41c0, {0x26a49a0?, 0xc05a51cff0?}, {0xc03901bc00?, 0x1b?}, 0x4f00013?, {0xc05a9556c0?, 0xc05a955860?, 0x0?}, 0x4f00013)
	/go/pkg/mod/go.temporal.io/server@v1.18.2/service/history/historyEngine.go:1673 +0x3a4
go.temporal.io/server/service/history.(*Handler).GetReplicationMessages.func1(0xc00c4e9ae0)
	/go/pkg/mod/go.temporal.io/server@v1.18.2/service/history/handler.go:1439 +0x589
created by go.temporal.io/server/service/history.(*Handler).GetReplicationMessages
	/go/pkg/mod/go.temporal.io/server@v1.18.2/service/history/handler.go:1425 +0x28e

We can keep this running during the week in case your team would like to collect any data, as it is going to be a sandbox rebuild for us anyway.

Thanks,
A

Thanks for reporting, I've shared this with the server team too.

For publicClient, yes, that's the workaround for now.
Can you check whether your custom authorizer might be rejecting the schedule worker's API calls?
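One way to spot that, if it is the cause, is to grep the system worker and frontend logs around the schedule trigger times for authorization denials; the namespace and deployment names below are only guesses at your setup:

kubectl -n <temporal-namespace> logs deploy/<temporal-worker-deployment> --since=2h | grep -iE "permission ?denied|unauthorized"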

Do you configure standard or advanced (ES) visibility? I was told that if you use advanced visibility it should help with the UI part you described.

Regarding the history service panic, we need to reproduce this; it should not happen.

We have advanced visibility configured. When we deleted the schedule it was not removed from visibility in our case, hence we deleted the doc record.

We will check the authoriser, though it depends on where the call lands, as the system worker uses the internode mTLS, so it should only go through the client cert auth path.

Question about the scheduler workflow panic: are you using our pre-built binaries/docker images or building your own?

Visibility: As you noticed, the schedule corresponds to a workflow, and that workflow is terminated (not deleted) when you delete the schedule. So it does “leave a trace”, but it’s harmless. The workflow isn’t running and there shouldn’t be new workflow tasks. (If you did see new workflow tasks after deleting the schedule that would be a bug.)
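With advanced visibility, a query along these lines (reusing your connection flags; the workflow type is the one from your log) should still list the underlying workflow as Terminated:

tctl --headers-provider-plugin tctl-authorization-plugin --address "${TEMPORAL_FRONTEND}:443" --tls-ca-path ./ca.pem -n sandbox workflow list -q 'WorkflowType="temporal-sys-scheduler-workflow" and ExecutionStatus="Terminated"'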

To confirm, you're saying that after you deleted the schedule, tctl schedule list returned nothing, but the schedule list page in the web UI still showed the schedule?
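i.e. something like this coming back empty while the UI still lists the schedule (same connection flags as in your delete command):

tctl --headers-provider-plugin tctl-authorization-plugin --address "${TEMPORAL_FRONTEND}:443" --tls-ca-path ./ca.pem -n sandbox schedule list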

We’re looking into the authorizer issue.

After we deleted the schedule, the scheduler workflow kept running and, at the schedule's frequency, kept spawning errors in the system worker log like the one in the top message, hence we wanted to get rid of it.

tctl returned nothing, but the workflow was in the UI until we deleted the record from visibility (ES). That didn't stop the workflow, so we deleted the workflow as well.

We are implementing a custom authoriser (nothing else is custom), for which we are building our own image based primarily on the Temporal Docker baseline and the Temporal server module (1.18.2 in the given scenario). Our custom authoriser accepts both JWT and TLS paths: all server components (including the system worker) are wired with mTLS, and client workers connect via JWT. The frontend TLS has a host override configured for the mTLS route and another one for the JWT route, plus our default route is also mTLS, so we have all scenarios handled.

We also have schedules working and passing authorisation in 1.17.4, for which we had to update the cluster metadata to point to the correct frontend host:port. It may be that, with the new service discovery in 1.18.2, the scheduler connections were going to the raw IP:PORT and triggering a host validation issue in the TLS scenario, since the main change was in the way frontend endpoints are discovered.

1.18.3 uses a different client provider (a bug fix), which now works when publicClient is not specified; if the scheduler inherits the same logic it may work as well.

We have tested 1.18.4 and the panic is now fixed; our broken sandbox was able to recover.

Schedules are also working fine with publicClient removed on the mTLS-enabled cluster.

Thanks for fixing things!