Greetings,
We are self-hosting Temporal (v1.18.5 – yes… old) and Cassandra (v3.11.13) as our execution database. We have 4096 history shards.
We are seeing the following warnings in our database logs:
```
WARN [ReadStage-3] 2024-03-11 14:15:34,305 ReadCommand.java:576 - Read 1 live rows and 5002 tombstone cells for query
SELECT activity_map, activity_map_encoding, buffered_events_list, checksum, checksum_encoding, child_executions_map,
child_executions_map_encoding, db_record_version, execution, execution_encoding, execution_state,
execution_state_encoding, next_event_id, request_cancel_map, request_cancel_map_encoding, signal_map,
signal_map_encoding, signal_requested, timer_map, timer_map_encoding FROM temporal.executions WHERE shard_id = 3603 AND
(type, namespace_id, workflow_id, run_id, visibility_ts, task_id) = (1, 1b0; token -586202983453943418 (see
tombstone_warn_threshold)
```
While the query is truncated in the log, it appears to match the get-workflow-execution query.
During the periods when this warning is being logged (and sometimes when it is not being logged widely, presumably because tombstone reads are staying under the warning threshold), we see elevated Cassandra latency in the `persistence_latency_bucket` metrics from the history service.
Our Cassandra admins have expressed concerns about how Temporal uses Cassandra to store executions and about the tombstone warnings it produces. I had believed we would be in the clear based on this presentation on how Cadence (and, I assume, Temporal) avoids the issue. If I understand it correctly, tombstones will still be produced, but Temporal uses specific query patterns to avoid reading them. The volume of these logs is therefore surprising, on the order of 64 million logged warnings across roughly 30 million workflows.
I'm wondering whether we might be doing something inadvisable in our cluster configuration, our software versions, or how our workflows operate. A few things about our use case (a high-volume Workflow-as-Actor sort of model) that might be outside the norm:
- We had an inadvisable implementation of a polling loop in workflow code using repeated `Workflow.await` calls (sketched below). This is already being revised, as it led to long workflow histories. Could those long histories themselves cause excessive tombstone reads?
- We poll via `getResult` on a workflow stub in the client to periodically check whether a workflow has completed yet (also sketched below). I'm not sure what queries this runs against the underlying database. Might it be the one implicated?
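For concreteness, here is a rough sketch of both polling patterns using the Java SDK. The interface, class, and method names (ActorWorkflow, markDone, isComplete) and the timeout values are made up for illustration; this is the shape of what we do, not our production code.

```java
import io.temporal.client.WorkflowClient;
import io.temporal.client.WorkflowStub;
import io.temporal.workflow.SignalMethod;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;
import java.time.Duration;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical workflow interface for illustration only.
@WorkflowInterface
interface ActorWorkflow {
  @WorkflowMethod
  String run();

  @SignalMethod
  void markDone();
}

class ActorWorkflowImpl implements ActorWorkflow {
  private boolean done;

  @Override
  public String run() {
    // Anti-pattern being revised: a polling loop built from repeated
    // short-timeout Workflow.await calls. Each timed-out await records
    // timer events, so the history grows with every iteration.
    while (!done) {
      Workflow.await(Duration.ofSeconds(5), () -> done);
    }
    return "done";
  }

  @Override
  public void markDone() {
    done = true;
  }
}

class CompletionPoller {
  // Client-side polling: periodically ask whether the workflow has
  // completed by calling getResult with a short timeout on an untyped stub.
  static boolean isComplete(WorkflowClient client, String workflowId) {
    WorkflowStub stub = client.newUntypedWorkflowStub(workflowId);
    try {
      stub.getResult(2, TimeUnit.SECONDS, String.class);
      return true;
    } catch (TimeoutException e) {
      return false; // still running; the caller retries later
    }
  }
}
```

The first pattern is what produced the long histories; the second is called repeatedly from the client while workflows are still running, and I'm unsure what it translates to at the persistence layer.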
Thanks in advance for any suggestions or ideas on what might be causing the issue!