Temporal Application is not establishing connection to Temporal Server inside Kubernetes

Hi all,

Note: I found the (trivial) reason myself; see the end of this post. Sorry for the noise. I'd still like to keep the text as is, because someone else might run into similar symptoms.


I'm also getting these error logs with "context deadline exceeded" after ~70s at PollWorkflowTaskQueue and PollActivityTaskQueue on Temporal server tag v1.16.1, and I'm completely lost as to which wait operation causes the 70s timeout.

{"level":"error","ts":"2022-05-02T20:04:30.602+0200","msg":"Unable to call matching.PollWorkflowTaskQueue.","service":"frontend","wf-task-queue-name":"LPOD:22bb3bbe-f6a3-4fe9-8a67-835b2898380c","timeout":"1m9.997997s","error":"context deadline exceeded","logging-call-at":"workflowHandler.go:921", ...

intermixed with these, emitted in bursts and apparently varying only in the "wf-task-queue-name" value:

{"level":"error","ts":"2022-05-02T20:04:30.602+0200","msg":"Unable to call matching.PollActivityTaskQueue.","service":"frontend","wf-task-queue-name":"/_sys/temporal-sys-tq-scanner-taskqueue-0/3","timeout":"1m10s","error":"context deadline exceeded","logging-call-at":"workflowHandler.go:1167", ...

Background

The persistence store was freshly initialized: I added the namespace "default", then successfully started and finished exactly one workflow, after which the Temporal server sat idle for several hours. The errors were logged spontaneously, i.e. with no worker processes running, so I suspect some housekeeping work of the service itself triggers the inspection of the task queues.

The configuration is development.yaml with the logging level changed to "error", and I'm running Temporal with my own SQL Server persistence driver. The stack trace does not refer to my code base directly, but of course a DB operation right before the logged stack trace could be what times out. Which one?
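For reference, the only change I made to the stock development.yaml was the log level; assuming the standard Temporal server config layout, the relevant fragment looks roughly like:

```yaml
log:
  stdout: true
  level: error
```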

So what’s my issue here?

I'd like to debug the error and check whether it's connected to my driver code, so I want to reproduce it without waiting several hours for the housekeeping processes to hit the faulty path. Is there a way to speed up these internal processes via configuration, effectively triggering the error situation more often? Which SQL table is relevant for PollWorkflowTaskQueue, and is it a DB query or a DB mutation that times out here?
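One knob I found that might shorten the cycle: Temporal reads overrides from the dynamic-config YAML file referenced in development.yaml. Assuming the key name matching.longPollExpirationInterval still exists and controls how long the matching service holds a poll open, shortening it should make poll timeouts fire much sooner (this is a hypothetical sketch, not verified against v1.16.1):

```yaml
matching.longPollExpirationInterval:
  - value: 10s
```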

Thanks for your attention. KR, Martin.


The reason was: the Windows dev machine went into sleep mode and was woken up at the exact instant of the error. The poll deadlines simply elapsed across the sleep gap, so there is nothing to worry about. This was written to the system event log on wake-up:

The system has resumed from sleep.
--- then ---
The system time has changed to 2022-05-02T18:04:30.500000000Z from 2022-05-02T15:32:40.564565800Z.

Change Reason: System time synchronized with the hardware clock.
Process: '' (PID 4).