Error with matching service in Cadence

Two days ago, we started experiencing some issues with our Cadence setup.
The first thing we noticed was that open workflows were not disappearing from the list once they completed. For example, this workflow appears as Open in the list:

But when you click on it, you will see that it’s actually completed:


Around the same time, we noticed that several workflows were taking a long time to complete, and several of them would get stuck in the “Schedule” state and never progress from there. After checking the logs, the only error we saw was this:

{"level":"error","ts":"2021-03-06T19:12:04.865Z","msg":"Persistent store operation failure","service":"cadence-matching","component":"matching-engine","wf-task-list-name":"cadence-sys-history-scanner-tasklist-0","wf-task-list-type":1,"store-operation":"create-task","error":"InternalServiceError{Message: CreateTasks operation failed. Error : Request on table cadence.tasks with ttl of 630720000 seconds exceeds maximum supported expiration date of 2038-01-19T03:14:06+00:00. In order to avoid this use a lower TTL, change the expiration date overflow policy or upgrade to a version where this limitation is fixed. See CASSANDRA-14092 for more details.}","wf-task-list-name":"cadence-sys-history-scanner-tasklist-0","wf-task-list-type":1,"number":6300094,"next-number":6300094,"logging-call-at":"taskWriter.go:176","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/service/matching.(*taskWriter).taskWriterLoop\n\t/cadence/service/matching/taskWriter.go:176"}

Does anybody have an idea why this is happening?

Hi! The first issue is just a UI matter. The open workflows list view may be less up to date than the view of an individual workflow.

The last issue is interesting. Systems that store Unix time as a signed 32-bit integer have a 2038 problem, similar to the Y2K issue: the timestamp overflows on 2038-01-19. Cassandra has some issues with handling TTLs on records that would expire after that date. The task in question has a TTL of 630720000 seconds, which is 20 years, so a record written in 2021 would expire in 2041, past the limit.
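To make the arithmetic concrete, here is a minimal Python sketch that checks whether a write time plus a TTL lands past the limit. The TTL, the write time, and the maximum expiration date are taken from the log line above; the function name `expiry_overflows` is just an illustrative choice, not anything from Cadence or Cassandra.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the error message in the log above.
TTL_SECONDS = 630_720_000
MAX_EXPIRATION = datetime(2038, 1, 19, 3, 14, 6, tzinfo=timezone.utc)
WRITE_TIME = datetime(2021, 3, 6, 19, 12, 4, tzinfo=timezone.utc)


def expiry_overflows(write_time, ttl_seconds, limit=MAX_EXPIRATION):
    """Return True if write_time + TTL falls after the supported limit."""
    return write_time + timedelta(seconds=ttl_seconds) > limit


# The TTL is exactly 20 years of 365 days.
ttl_years = TTL_SECONDS / (365 * 24 * 60 * 60)

# Largest TTL (in seconds) that would still fit at this write time.
max_safe_ttl = int((MAX_EXPIRATION - WRITE_TIME).total_seconds())

print("TTL in years:", ttl_years)
print("overflows:", expiry_overflows(WRITE_TIME, TTL_SECONDS))
print("max safe TTL (seconds):", max_safe_ttl)
```

Note that the largest safe TTL shrinks every day as the write time moves closer to 2038, so any fixed 20-year TTL has been failing this check since early 2018.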

The error from Cassandra tells you what you need to do, though (check the Jira issue referenced in the error message, CASSANDRA-14092): there may be a newer version with a fix, or you may need to adjust the expiration date overflow policy.
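As I understand it from the Cassandra release notes for CASSANDRA-14092, the policy is set with the `cassandra.expiration_date_overflow_policy` system property and can be `REJECT` (the default, which is what produces the error above), `CAP`, or `CAP_NOWARN`. Please verify that against the notes for your Cassandra version. The Python below is only a toy model of what those policies mean for a single write, not Cassandra's actual code, and `apply_overflow_policy` is a made-up name.

```python
from datetime import datetime, timedelta, timezone

# Limit quoted in the error message above.
MAX_EXPIRATION = datetime(2038, 1, 19, 3, 14, 6, tzinfo=timezone.utc)


def apply_overflow_policy(write_time, ttl_seconds, policy="REJECT"):
    """Toy model: return the effective expiration time for one write."""
    expires_at = write_time + timedelta(seconds=ttl_seconds)
    if expires_at <= MAX_EXPIRATION:
        return expires_at  # Fits within the limit, nothing to do.
    if policy == "REJECT":
        # Mirrors the behavior in the log: the write fails.
        raise ValueError("TTL exceeds maximum supported expiration date")
    if policy in ("CAP", "CAP_NOWARN"):
        # The record is accepted but expires at the limit instead.
        return MAX_EXPIRATION
    raise ValueError(f"unknown policy: {policy}")


write_time = datetime(2021, 3, 6, 19, 12, 4, tzinfo=timezone.utc)
capped = apply_overflow_policy(write_time, 630_720_000, policy="CAP")
print("expires at (CAP):", capped.isoformat())
```

With capping, the 20-year TTL effectively becomes "until 2038", which is still far longer than any task should sit in a task list, so it is a reasonable workaround if lowering the TTL or upgrading is not an option.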

If you’ve set your own expiration on anything to 20 years, you may need to set it lower, though this looks like it might be a Cadence internal workflow (the task list in the log is cadence-sys-history-scanner-tasklist-0). If so, I believe we’ve fixed that issue in Temporal, but someone else might know better.

Workflows taking a long time to complete may be unrelated to the other issues here. If correcting the Cassandra issue doesn’t fix it, we’d need more info to debug those.