Hi all,
Right after upgrading Temporal from 1.13.1 to 1.18.4 in the following steps:
- 1.13.1 → 1.15.2: update schema from 1.6 → 1.7
- 1.15.2 → 1.18.4: update schema from 1.7 → 1.8
I’m facing the following issues:
1- Frontend errors are generating millions of log records like this one:
{
  "error": "UnhandledCommand",
  "level": "info",
  "logging-call-at": "metric_client.go:92",
  "service": "frontend",
  "service-error-type": "serviceerror.InvalidArgument",
  "ts": "2023-04-08T12:44:22.848Z"
}
2- This SQL query is being executed between 5 million and 20 million times:
UPDATE `executions`
SET
  `db_record_version` = ?,
  `next_event_id` = ?,
  `last_write_version` = ?,
  DATA = ?,
  `data_encoding` = ?,
  `state` = ?,
  `state_encoding` = ?
WHERE
  `shard_id` = ?
  AND `namespace_id` = ?
  AND `workflow_id` = ?
  AND `run_id` = ?
3- Database storage usage jumped from 306 GB to 377 GB in 2 days.
4- Replication lag from the master DB to the read replica has grown from 0 to about 1 day (Seconds_Behind_Master: 87600).
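A quick back-of-the-envelope check of items 3 and 4, assuming "G" means gigabytes and the window is exactly 2 days: the storage grew by about 71 GB, and a Seconds_Behind_Master of 87600 is slightly more than one day. The helper names below are just for illustration.

```python
SECONDS_PER_HOUR = 3_600
SECONDS_PER_DAY = 86_400


def growth_per_day(start_gb: float, end_gb: float, days: float) -> float:
    """Average storage growth per day over the observed window."""
    return (end_gb - start_gb) / days


def lag_in_days(seconds_behind_master: int) -> float:
    """Convert MySQL's Seconds_Behind_Master value to days."""
    return seconds_behind_master / SECONDS_PER_DAY


print(growth_per_day(306, 377, 2))          # 35.5 (GB per day)
print(round(lag_in_days(87600), 2))         # 1.01 (days)
print(round(87600 / SECONDS_PER_HOUR, 1))   # 24.3 (hours)
```

So the database is growing by roughly 35 GB per day, and the replica is a little over 24 hours behind.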
My questions:
- Why is issue 1 happening, and how can I fix it? Is this a known issue?
- Why is the query in issue 2 executed 5 million to 20 million times?
- Why has database storage usage grown so much?
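To put a number on issue 1 before looking for a cause, a minimal sketch like the following could tally log records by service and error. It assumes the frontend logs can be exported as one JSON object per line using the field names shown in the record above; the export format and the `count_errors` helper are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable, Optional, Tuple

Key = Tuple[Optional[str], Optional[str], Optional[str]]


def count_errors(lines: Iterable[str]) -> Counter:
    """Count JSON log records by (service, error, service-error-type)."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(record, dict):
            continue  # skip JSON values that are not objects
        key: Key = (
            record.get("service"),
            record.get("error"),
            record.get("service-error-type"),
        )
        counts[key] += 1
    return counts
```

For example, `count_errors(open("frontend.log")).most_common(10)` would list the ten most frequent (service, error, error type) combinations in a hypothetical `frontend.log` export.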
My assumption:
- The number of times this query is executed is the main reason storage is growing, which in turn leads to the replication lag. But I’m not sure, and of course I need support from the community.
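One way to test this assumption, if a raw statement log (for example MySQL's general query log) can be captured for a short window, is to replace literals with `?` and count the resulting statements. That would show whether the `UPDATE executions` statement really dominates the write traffic. This is a rough sketch: the regexes only handle quoted strings, `0x` hex literals, and plain numbers, and the `normalize` and `top_statements` helpers are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Quoted strings, allowing backslash escapes and doubled quotes ('').
_STRING = re.compile(r"'(?:[^'\\]|\\.|'')*'")
# Hex literals such as 0xDEADBEEF (checked before plain numbers).
_HEX = re.compile(r"\b0x[0-9a-fA-F]+\b")
# Integer and decimal literals that stand alone (not part of identifiers).
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")
_SPACE = re.compile(r"\s+")


def normalize(sql: str) -> str:
    """Replace literal values with '?' and collapse whitespace."""
    sql = _STRING.sub("?", sql)
    sql = _HEX.sub("?", sql)
    sql = _NUMBER.sub("?", sql)
    return _SPACE.sub(" ", sql).strip()


def top_statements(statements: Iterable[str], n: int = 5) -> List[Tuple[str, int]]:
    """Return the n most frequent normalized statements with their counts."""
    return Counter(normalize(s) for s in statements).most_common(n)
```

Statements that differ only in their values (shard, workflow ID, and so on) collapse into one entry, much like the `?` placeholders in the query shown in issue 2.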
I’m using:
- GCP MySQL version: 8.0.26
- GKE: 1.24.10-gke.2300