And I tried tuning the parameters mentioned here, but none of them made much difference.
You can follow the full Workers tuning guide for a step-by-step walkthrough of the parameters and the corresponding metrics to check on the Worker side.
I would fully expect the default Worker config to start bumping into the maxConcurrentWorkflowTaskExecutionSize
limit at 100+ workflows/second if you have only one worker. So I would focus on this parameter and the corresponding worker_task_slots_available
metric value.
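
To get a rough feel for whether one worker's slots are enough, here is a back-of-the-envelope sketch based on Little's law (average tasks in flight = arrival rate × average time per task). It is not part of any Temporal SDK: the function names, the number of workflow tasks per workflow, and the 50 ms average task time are all assumptions for illustration, so plug in your own measurements.

```python
import math


def estimate_busy_slots(workflows_per_second, tasks_per_workflow, avg_task_millis):
    """Estimate how many workflow task slots are busy on average.

    Little's law: in-flight tasks = arrival rate * average time per task.
    All three inputs are assumptions you should replace with measured values.
    """
    task_rate = workflows_per_second * tasks_per_workflow  # workflow tasks per second
    return math.ceil(task_rate * avg_task_millis / 1000)


def is_saturated(slots_available_samples, floor=0):
    """Return True if any sample of worker_task_slots_available is at or below `floor`."""
    return any(sample <= floor for sample in slots_available_samples)


# Hypothetical numbers: 100 workflows/s, 3 workflow tasks each, 50 ms per task.
busy = estimate_busy_slots(100, 3, 50)
print("estimated busy slots:", busy)

# Hypothetical samples of the worker_task_slots_available gauge over time.
print("saturated:", is_saturated([40, 12, 0]))
```

If the estimate is close to your configured maxConcurrentWorkflowTaskExecutionSize, or worker_task_slots_available keeps dropping to zero, the worker is likely the bottleneck, and raising the limit or adding workers is worth trying. This is only an average, so bursts can exhaust the slots well before the estimate suggests.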
On the server side someone else will be able to help better, but it's a good idea to check the DB latency metrics.