Disclaimer: the last time I really worked with Flink was more than 3 years ago, so I’m not an expert on the current implementation. Please correct me if I’m misrepresenting it.
Flink stateful functions are exactly what the name says: functions that can explicitly load and store state through the provided API. They are closer to Akka and Microsoft Orleans than to Temporal. They give you the ability to send asynchronous events to other functions by their business ID, but they don’t help much with implementing complex business interactions. You are still responsible for programming your business logic fully asynchronously, the way you usually do with normal RPC services and databases.
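To make the programming model concrete, here is a minimal sketch in plain Python. It is not the Flink Stateful Functions API; `Runtime`, `order_function`, and the event names are all hypothetical, and the "state store" is just an in-memory dict. It only illustrates the shape of the code: state is loaded and stored explicitly around each event, and a multi-step interaction turns into a hand-written state machine.

```python
from collections import deque


class Runtime:
    """Toy in-memory runtime (hypothetical): per-ID state plus a mailbox of events."""

    def __init__(self, handler):
        self.handler = handler
        self.state = {}         # business ID -> stored state
        self.mailbox = deque()  # pending (target_id, event) pairs

    def send(self, target_id, event):
        # Asynchronous send addressed by business ID.
        self.mailbox.append((target_id, event))

    def run(self):
        while self.mailbox:
            target_id, event = self.mailbox.popleft()
            current = self.state.get(target_id)               # explicit load
            updated = self.handler(self, target_id, current, event)
            self.state[target_id] = updated                   # explicit store


def order_function(runtime, order_id, state, event):
    # The business flow is spread across event handlers and a status field.
    state = state or {"status": "new"}
    if event == "place" and state["status"] == "new":
        state["status"] = "awaiting_payment"
        # Stand-in for an asynchronous reply coming from another function.
        runtime.send(order_id, "payment_received")
    elif event == "payment_received" and state["status"] == "awaiting_payment":
        state["status"] = "paid"
    return state


runtime = Runtime(order_function)
runtime.send("order-1", "place")
runtime.run()
final_status = runtime.state["order-1"]["status"]
```

Every additional step in the business flow adds another status value and another branch, which is the part the developer has to manage by hand in this style.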
Temporal offers a much higher level of abstraction for developers. There is no explicit persistent state management, as all workflow state, including local variables and the call stack, is always preserved. This allows writing synchronous code that blocks on external operations for an unlimited amount of time.
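The style of code this enables can be sketched with plain `asyncio`. This is not the Temporal SDK, and nothing here is durable: the sketch lives in one process and would lose its state on a crash, which is exactly what Temporal takes care of. The names `order_workflow` and `payment_received` are hypothetical. The point is only that the flow reads top to bottom and its state is an ordinary local variable.

```python
import asyncio


async def order_workflow(payment_received: asyncio.Event) -> str:
    status = "awaiting_payment"    # plain local variable, no explicit state store
    await payment_received.wait()  # blocks until the external event arrives
    status = "paid"
    return status


async def main() -> str:
    payment_received = asyncio.Event()
    workflow = asyncio.create_task(order_workflow(payment_received))
    await asyncio.sleep(0)         # let the workflow start and block on the event
    payment_received.set()         # the external system reports the payment
    return await workflow


result = asyncio.run(main())
```

In this in-memory version the wait can only last as long as the process does; the claim in the text is that a workflow engine preserves the same local state across restarts so the wait can be arbitrarily long.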
I don’t know much about the internals of how stateful functions are implemented, but my uneducated guess is that they don’t use the optimizations that standard Flink stream processing uses to reduce the number of checkpoints by replaying Kafka streams. If so, they don’t offer many advantages over Temporal in the number of IOPS executed on updates. It would be interesting to see how they perform given similar hardware.
At this point, I would advise against using Temporal for such a high rate of events. The architecture allows it, but we have never had the business need or hardware capacity to test at such a high scale. I would also double-check whether the reliability and consistency guarantees that Temporal offers are really needed for a use case that requires 3 million events per second. Usually, such high-rate use cases can tolerate some data loss and inconsistency.
One clarification: Temporal is an awesome fit for event processing when there is a large number of business entities and each entity doesn’t receive a high rate of events. For example, it is OK to have hundreds of millions of workflows, with each individual entity receiving no more than a few requests per second at peak. The use case it doesn’t support (and where Flink works much better at this point) is when a single entity has to aggregate a high rate of events.