Maru: load testing at 10K executions per second, how many shards?

As per this article, 512 shards can support 10 (workflow) executions per second without a growing backlog, i.e. the workflow start rate stays very close to the workflow close rate.
However, 512 shards are not enough for 100 executions per second. I confirmed this by rerunning the same test on 4096 shards, where it runs smoothly.
But for 1000 executions/sec, even 16K shards show a growing backlog.
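
To make "no growing backlog" concrete, here is a minimal sketch of the check I have in mind. The function names and the 5% tolerance are my own assumptions, not something Maru or Temporal exposes, and the sample rates in the usage lines are made up for illustration rather than measured.

```python
def backlog_growth(started_per_sec: float, closed_per_sec: float) -> float:
    """Net number of workflows added to the backlog each second."""
    return started_per_sec - closed_per_sec


def is_keeping_up(started_per_sec: float, closed_per_sec: float,
                  tolerance: float = 0.05) -> bool:
    """True when the close rate is within `tolerance` of the start rate.

    `tolerance` is a hypothetical threshold for "very close"; pick whatever
    slack is acceptable for your test.
    """
    return backlog_growth(started_per_sec, closed_per_sec) <= tolerance * started_per_sec


# Illustrative (made-up) rates, not measurements:
print(is_keeping_up(100, 98))    # close rate within 5% of start rate
print(is_keeping_up(1000, 700))  # backlog grows by 300 workflows per second
```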

Considering that the “workflow” under consideration is strictly maru/basic-test.json (master branch of temporalio/maru on GitHub), where count and ratePerSecond are bumped gradually, is the following hypothesis correct?
10w/s → 512 shards
100w/s → 5120 shards
1000w/s → 51200 shards
10000w/s → 512000 shards
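
The arithmetic behind this hypothesis can be sketched as below. It assumes strictly linear scaling from the 512 shards / 10 executions per second baseline, and it reads "16K" as 16384; both are my assumptions. The helper names are hypothetical. The observations list only restates the runs described above.

```python
# Baseline from the post: 512 shards sustained 10 executions/sec.
BASE_SHARDS = 512
BASE_RATE = 10


def linear_shards(rate_per_sec: int) -> int:
    """Shard count implied by strictly linear scaling from the baseline."""
    return BASE_SHARDS * rate_per_sec // BASE_RATE


def consistent(rate_per_sec: int, shards: int, backlog_grew: bool) -> bool:
    """Does the linear hypothesis agree with one observed run?

    The hypothesis predicts a growing backlog exactly when the run used
    fewer shards than linear_shards(rate_per_sec).
    """
    predicted_backlog = shards < linear_shards(rate_per_sec)
    return predicted_backlog == backlog_grew


# Runs described in the post: (rate, shards, backlog grew?)
# 16384 is an assumed reading of "16K".
observations = [
    (10, 512, False),
    (100, 512, True),
    (100, 4096, False),
    (1000, 16384, True),
]

for rate, shards, grew in observations:
    print(rate, shards, linear_shards(rate), consistent(rate, shards, grew))
```

Under these assumptions, the 100/s run on 4096 shards is the one data point that disagrees with the hypothesis: linear scaling asks for 5120 shards, yet 4096 kept up. So the linear numbers may be an overestimate, at least at that rate.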

I’ve read that you guys have tested up to 128K shards, so 512K shards should be pretty unreasonable (?)

Note: we are indeed looking at this scale (10K TPS). Our workflows would be bigger than the one used in this test, but they’re all very short-lived.