We currently use AWS RDS (PostgreSQL) as the database behind Temporal, and have had no issues with it for the last couple of years. Last week, AWS automatically performed a minor version upgrade of the DB, and afterwards we noticed that none of our workers were picking up any workflows. We eventually restarted all the Temporal pods (Kubernetes, deployed via the Helm chart), which resolved the issue. We are still unsure what caused it, and would like to understand the cause so that we can prevent it in the future. The logs from the different services follow.
History service:
Jul 8, 2024 @ 22:39:58.766 Operation failed with internal error. GetTransferTasks operation failed. Select failed. Error: pq: the database system is starting up
Jul 8, 2024 @ 22:39:58.769 Operation failed with internal error. RangeCompleteTransferTask operation failed. Error: pq: the database system is starting up
Jul 8, 2024 @ 22:39:58.771 Error range completing queue task RangeCompleteTimerTask operation failed. Error: pq: the database system is starting up
Jul 8, 2024 @ 22:39:58.771 Operation failed with internal error. RangeCompleteTimerTask operation failed. Error: pq: the database system is starting up
Jul 8, 2024 @ 22:39:58.773 Queue reader unable to retrieve tasks GetVisibilityTasks operation failed. Select failed. Error: pq: the database system is starting up
Jul 8, 2024 @ 22:39:58.773 Operation failed with internal error. GetVisibilityTasks operation failed. Select failed. Error: pq: the database system is starting up
Jul 8, 2024 @ 22:39:58.778 Operation failed with internal error. GetVisibilityTasks operation failed. Select failed. Error: pq: the database system is starting up
Jul 8, 2024 @ 22:39:58.781 Operation failed with internal error. RangeCompleteTimerTask operation failed. Error: pq: the database system is starting up
Jul 8, 2024 @ 22:39:58.783 Operation failed with internal error. GetTransferTasks operation failed. Select failed. Error: pq: the database system is starting up
Jul 8, 2024 @ 22:39:58.786 Operation failed with internal error. UpdateShard failed. Failed to start transaction. Error: pq: the database system is starting up
Jul 8, 2024 @ 22:39:58.922 Operation failed with internal error. RangeCompleteVisibilityTask operation failed. Error: context deadline exceeded
Jul 8, 2024 @ 22:39:58.923 Error range completing queue task context deadline exceeded
Jul 8, 2024 @ 22:39:58.979 Operation failed with internal error. GetVisibilityTasks operation failed. Select failed. Error: context deadline exceeded
Jul 8, 2024 @ 22:39:58.979 Queue reader unable to retrieve tasks context deadline exceeded
Jul 8, 2024 @ 22:39:58.991 Range updated for shardID -
Jul 8, 2024 @ 22:39:58.991 Acquired shard -
Jul 8, 2024 @ 22:39:58.994 Range updated for shardID -
Jul 8, 2024 @ 22:39:58.994 Acquired shard -
Jul 8, 2024 @ 22:39:59.002 Acquired shard -
Jul 8, 2024 @ 22:39:59.002 Range updated for shardID -
Jul 8, 2024 @ 22:41:26.791 Critical attempts processing workflow task -
Jul 8, 2024 @ 22:43:11.843 Critical attempts processing workflow task -
Jul 8, 2024 @ 22:44:16.455 Critical attempts processing workflow task -
Jul 8, 2024 @ 22:44:44.697 Critical attempts processing workflow task -
Jul 8, 2024 @ 22:45:06.304 Critical attempts processing workflow task -
Jul 8, 2024 @ 22:50:32.557 Critical attempts processing workflow task -
Jul 8, 2024 @ 22:53:02.097 Critical attempts processing workflow task -
Jul 8, 2024 @ 22:53:13.144 Critical attempts processing workflow task -
Jul 8, 2024 @ 22:53:28.679 Blob data size exceeds the warning limit. -
Jul 8, 2024 @ 22:54:07.726 Critical attempts processing workflow task -
Jul 8, 2024 @ 22:54:30.075 Critical attempts processing workflow task -
Jul 8, 2024 @ 22:59:22.038 Critical attempts processing workflow task -
Jul 8, 2024 @ 23:01:00.402 Blob data size exceeds the warning limit. -
The "Critical attempts processing workflow task" message keeps repeating after this.
Worker service:
Jul 8, 2024 @ 22:39:55.826 Operation failed with internal error. GetMetadata operation failed. Error: dial tcp 10.11.120.95:5432: connect: connection refused
Jul 8, 2024 @ 22:39:55.921 Error refreshing namespace cache GetMetadata operation failed. Error: dial tcp 10.11.120.95:5432: connect: connection refused
Jul 8, 2024 @ 22:39:55.921 Operation failed with internal error. GetMetadata operation failed. Error: dial tcp 10.11.120.95:5432: connect: connection refused
Jul 8, 2024 @ 22:39:58.528 Operation failed with internal error. UpsertClusterMembership operation failed. Error: pq: the database system is starting up
Jul 8, 2024 @ 22:39:58.529 Operation failed with internal error. GetMetadata operation failed. Error: pq: the database system is starting up
Jul 8, 2024 @ 22:39:58.582 Operation failed with internal error. UpsertClusterMembership operation failed. Error: pq: the database system is starting up
Jul 8, 2024 @ 22:39:58.584 Operation failed with internal error. GetMetadata operation failed. Error: pq: the database system is starting up
Jul 8, 2024 @ 22:39:58.670 Operation failed with internal error. GetMetadata operation failed. Error: pq: the database system is starting up
Jul 8, 2024 @ 22:39:58.670 Error refreshing namespace cache GetMetadata operation failed. Error: pq: the database system is starting up
Jul 8, 2024 @ 22:39:58.676 Operation failed with internal error. UpsertClusterMembership operation failed. Error: pq: the database system is starting up
Jul 8, 2024 @ 22:39:58.676 Membership upsert failed. UpsertClusterMembership operation failed. Error: pq: the database system is starting up
Jul 8, 2024 @ 22:45:12.236 encounter error when describing the mutable state Namespace 7fc1aceb-73bd-4be1-9ab4-23a17971a524 is not found.
Jul 8, 2024 @ 22:45:12.236 encounter error when describing the mutable state Namespace 7fc1aceb-73bd-4be1-9ab4-23a17971a524 is not found.
Jul 8, 2024 @ 22:45:13.906 encounter error when describing the mutable state Namespace 7fc1aceb-73bd-4be1-9ab4-23a17971a524 is not found.
Jul 8, 2024 @ 22:45:15.046 encounter error when describing the mutable state Namespace 7fc1aceb-73bd-4be1-9ab4-23a17971a524 is not found.
Jul 8, 2024 @ 22:45:21.858 encounter error when describing the mutable state Namespace 7fc1aceb-73bd-4be1-9ab4-23a17971a524 is not found.
Jul 8, 2024 @ 22:45:21.962 deleted history garbage -
Jul 8, 2024 @ 22:45:22.435 deleted history garbage -
Jul 8, 2024 @ 22:45:22.542 deleted history garbage -
Jul 8, 2024 @ 22:45:22.557 deleted history garbage -
Jul 8, 2024 @ 22:45:23.329 encounter error when describing the mutable state Namespace 7fc1aceb-73bd-4be1-9ab4-23a17971a524 is not found.
Jul 8, 2024 @ 22:45:25.893 deleted history garbage -
Jul 8, 2024 @ 22:45:26.060 deleted history garbage -
Jul 8, 2024 @ 22:45:26.485 deleted history garbage -
Jul 8, 2024 @ 22:45:26.519 deleted history garbage -
Jul 8, 2024 @ 22:45:26.588 encounter error when describing the mutable state Namespace 7fc1aceb-73bd-4be1-9ab4-23a17971a524 is not found.
Jul 8, 2024 @ 22:45:27.222 deleted history garbage -
Jul 8, 2024 @ 22:45:27.899 deleted history garbage -
Jul 8, 2024 @ 22:45:28.633 deleted history garbage -
Jul 8, 2024 @ 22:45:31.222 deleted history garbage -
Jul 8, 2024 @ 22:45:32.142 deleted history garbage -
Jul 8, 2024 @ 22:45:32.526 deleted history garbage -
Jul 8, 2024 @ 22:45:32.668 deleted history garbage -
Jul 8, 2024 @ 22:45:33.000 deleted history garbage -
Jul 8, 2024 @ 22:45:33.979 encounter error when describing the mutable state Namespace 7fc1aceb-73bd-4be1-9ab4-23a17971a524 is not found.
Jul 8, 2024 @ 22:45:35.095 deleted history garbage -
Jul 8, 2024 @ 22:45:35.369 deleted history garbage -
Jul 8, 2024 @ 22:45:36.201 encounter error when describing the mutable state Namespace 7fc1aceb-73bd-4be1-9ab4-23a17971a524 is not found.
Jul 8, 2024 @ 22:45:36.909 deleted history garbage -
Jul 8, 2024 @ 22:45:39.312 deleted history garbage -
Jul 8, 2024 @ 22:45:39.441 deleted history garbage -
Jul 8, 2024 @ 22:45:39.454 deleted history garbage -
Jul 8, 2024 @ 22:45:39.478 encounter error when describing the mutable state Namespace 7fc1aceb-73bd-4be1-9ab4-23a17971a524 is not found.
Jul 8, 2024 @ 22:45:39.607 encounter error when describing the mutable state Namespace 7fc1aceb-73bd-4be1-9ab4-23a17971a524 is not found.
Jul 8, 2024 @ 22:45:40.130 encounter error when describing the mutable state Namespace 7fc1aceb-73bd-4be1-9ab4-23a17971a524 is not found.
Jul 8, 2024 @ 22:45:40.288 encounter error when describing the mutable state Namespace 7fc1aceb-73bd-4be1-9ab4-23a17971a524 is not found.
Jul 8, 2024 @ 22:45:41.778 encounter error when describing the mutable state Namespace 7fc1aceb-73bd-4be1-9ab4-23a17971a524 is not found.
Jul 8, 2024 @ 22:45:42.893 deleted history garbage -
Jul 8, 2024 @ 22:45:43.029 deleted history garbage -
Jul 8, 2024 @ 22:45:43.487 encounter error when describing the mutable state Namespace 7fc1aceb-73bd-4be1-9ab4-23a17971a524 is not found.
Jul 8, 2024 @ 22:45:43.827 deleted history garbage -
Jul 8, 2024 @ 22:45:43.984 deleted history garbage -
Jul 8, 2024 @ 22:45:44.821 deleted history garbage -
Jul 8, 2024 @ 22:45:44.939 deleted history garbage -
Jul 8, 2024 @ 22:45:46.245 deleted history garbage -
Jul 8, 2024 @ 22:45:46.322 deleted history garbage -
Jul 8, 2024 @ 22:45:46.398 encounter error when describing the mutable state Namespace 7fc1aceb-73bd-4be1-9ab4-23a17971a524 is not found.
Jul 8, 2024 @ 22:45:48.817 encounter error when describing the mutable state Namespace 7fc1aceb-73bd-4be1-9ab4-23a17971a524 is not found.
After losing the connection to the DB, the Temporal services did not seem to be able to recover on their own. Any help with this would be appreciated.
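In case it helps anyone triaging similar output, here is a minimal sketch of a script that counts the DB connectivity errors per minute in log lines of the format shown above. The function name `tally_db_errors` and the sample list are hypothetical; the three error strings are copied from the log excerpts in this post, and the timestamp pattern assumes the same `Jul 8, 2024 @ 22:39:58.766` layout.

```python
import re
from collections import Counter

# Matches the timestamp prefix used in the excerpts above,
# e.g. "Jul 8, 2024 @ 22:39:58.766 <message>".
LINE_RE = re.compile(r"^(\w{3} \d{1,2}, \d{4}) @ (\d{2}:\d{2}):\d{2}\.\d{3} (.*)$")

# Error fragments taken from the logs in this post.
DB_ERROR_MARKERS = (
    "the database system is starting up",
    "connection refused",
    "context deadline exceeded",
)


def tally_db_errors(lines):
    """Count, per HH:MM minute, the lines that contain a DB connectivity error."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip headings and free text
        _date, minute, message = match.groups()
        if any(marker in message for marker in DB_ERROR_MARKERS):
            counts[minute] += 1
    return counts


# Hypothetical sample: three lines copied from the excerpts above.
sample = [
    "Jul 8, 2024 @ 22:39:55.826 Operation failed with internal error. GetMetadata operation failed. Error: dial tcp 10.11.120.95:5432: connect: connection refused",
    "Jul 8, 2024 @ 22:39:58.766 Operation failed with internal error. GetTransferTasks operation failed. Select failed. Error: pq: the database system is starting up",
    "Jul 8, 2024 @ 22:41:26.791 Critical attempts processing workflow task -",
]

if __name__ == "__main__":
    for minute, count in sorted(tally_db_errors(sample).items()):
        print(minute, count)
```

Run against the excerpts in this post, a tally like this shows the DB errors confined to 22:39, while the "Critical attempts" and "Namespace ... is not found" messages continue afterwards.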