Implementation
We are trying to implement a frontend that does real-time monitoring from WebSocket events, using Temporal (Python SDK) to guarantee that events are delivered and that metrics stay consistent. The services (PBX, workers, websocket-server, backend, frontend) run in a K8S cluster. The PBX pods send the generated events (each event represents a state change, such as a call being answered, abandoned, completed, and so on) to Temporal as signals.

Currently I have one long-running workflow per tenant (wf-websocket-tenant1, wf-websocket-tenant2, ..., wf-websocket-tenantN) that receives the signals from the PBX processes. There is a single task queue called “websocket” on which all workers run. Each tenant workflow maintains the metrics state: number of completed calls per agent, number of unanswered calls per agent, talk time, hold time, current state of each extension, and so on. We use continue-as-new in the workflow once a certain number of signals has been processed. Each workflow has around 10 activities.

When a workflow receives a signal, the metrics are recalculated, the new values are stored in Redis (each tenant has its own key, e.g. “metrics-tenant1”, “metrics-tenant2”, ..., “metrics-tenantN”), and then published on a Redis Pub/Sub channel. The clients (websocket services) listen for the new metric values and propagate them to their own clients.
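For concreteness, here is a minimal sketch of that per-tenant workflow shape using the Temporal Python SDK (temporalio). It illustrates the description above rather than our actual code: `apply_event`, `MAX_SIGNALS_BEFORE_CAN`, the metric fields, and the Redis connection details are all assumptions.

```python
import dataclasses
import json
from dataclasses import dataclass, field
from datetime import timedelta
from typing import Optional

from temporalio import activity, workflow

MAX_SIGNALS_BEFORE_CAN = 1_000  # assumed continue-as-new threshold


@dataclass
class TenantMetrics:
    completed_by_agent: dict = field(default_factory=dict)
    unanswered_by_agent: dict = field(default_factory=dict)
    talktime_secs: int = 0
    holdtime_secs: int = 0
    extension_states: dict = field(default_factory=dict)


def apply_event(metrics: TenantMetrics, event: dict) -> None:
    # Pure, deterministic recalculation. Hypothetical minimal body that only
    # tracks extension state; the real logic would update every metric.
    if ext := event.get("extension"):
        metrics.extension_states[ext] = event.get("type")


@activity.defn
async def publish_metrics(tenant: str, metrics: dict) -> None:
    # Store the snapshot under "metrics-<tenant>" and fan it out on a
    # Pub/Sub channel of the same name (redis-py asyncio client assumed;
    # imported inside the activity so the workflow sandbox never loads it).
    import redis.asyncio as redis

    client = redis.Redis()
    payload = json.dumps(metrics)
    await client.set(f"metrics-{tenant}", payload)
    await client.publish(f"metrics-{tenant}", payload)


@workflow.defn
class TenantMetricsWorkflow:
    def __init__(self) -> None:
        self._pending: list = []
        self._processed = 0
        self._metrics = TenantMetrics()

    @workflow.signal
    def pbx_event(self, event: dict) -> None:
        # The handler only buffers; all state mutation happens in run().
        self._pending.append(event)

    @workflow.query
    def current_metrics(self) -> dict:
        return dataclasses.asdict(self._metrics)

    @workflow.run
    async def run(self, tenant: str, metrics: Optional[TenantMetrics] = None) -> None:
        if metrics is not None:
            self._metrics = metrics  # state carried over continue-as-new
        while True:
            await workflow.wait_condition(lambda: len(self._pending) > 0)
            while self._pending:
                apply_event(self._metrics, self._pending.pop(0))
                self._processed += 1
            await workflow.execute_activity(
                publish_metrics,
                args=[tenant, dataclasses.asdict(self._metrics)],
                start_to_close_timeout=timedelta(seconds=10),
            )
            if self._processed >= MAX_SIGNALS_BEFORE_CAN and not self._pending:
                workflow.continue_as_new(args=[tenant, self._metrics])
```

Buffering events in the signal handler and draining them in the main loop keeps every state mutation on a single deterministic path, and makes the continue-as-new cut-over explicit (it only fires once the in-memory buffer has been drained).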
Architecture

(architecture diagram omitted)
Problem
I implemented the same architecture at a much smaller scale (1 company, 10 extensions, 2 simultaneous calls) and noticed some inconsistencies in the execution of the workflow logic: the tracked state of the extensions sometimes diverges from the real state, and the code is becoming painful to maintain. It is hard to change something without breaking something else, even with the classes, functions, activities, and workflow logic well segmented. It seems the logic is affected by the volume of signals received, but that is just an impression.
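One way to make regressions in the recalculation logic visible before they reach real traffic (assuming the workflow shape sketched above, plus pytest-asyncio) is to drive the workflow through Temporal's time-skipping test environment and stub the Redis activity by name. `WorkflowEnvironment`, `Worker`, and the signal/query handles are real temporalio APIs; the payloads and the assertion are made-up examples, and `start_time_skipping()` downloads a local test server binary on first use.

```python
# Assumes the TenantMetricsWorkflow sketch above is importable.
import uuid

import pytest
from temporalio import activity
from temporalio.testing import WorkflowEnvironment
from temporalio.worker import Worker


@activity.defn(name="publish_metrics")
async def publish_metrics_stub(tenant: str, metrics: dict) -> None:
    # Registered under the real activity's name so no live Redis is needed.
    pass


@pytest.mark.asyncio
async def test_extension_state_tracks_events():
    async with await WorkflowEnvironment.start_time_skipping() as env:
        async with Worker(
            env.client,
            task_queue="websocket",
            workflows=[TenantMetricsWorkflow],
            activities=[publish_metrics_stub],
        ):
            handle = await env.client.start_workflow(
                TenantMetricsWorkflow.run,
                args=["tenant1", None],
                id=f"wf-websocket-tenant1-{uuid.uuid4()}",
                task_queue="websocket",
            )
            await handle.signal(
                TenantMetricsWorkflow.pbx_event,
                {"type": "answer", "extension": "101"},
            )
            await handle.signal(
                TenantMetricsWorkflow.pbx_event,
                {"type": "hangup", "extension": "101"},
            )
            snapshot = await handle.query(TenantMetricsWorkflow.current_metrics)
            assert snapshot["extension_states"]["101"] == "hangup"
```

Keeping `apply_event` pure also means the bulk of the metric logic can be unit-tested directly, without Temporal in the loop at all.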
K8S cluster dimensions (maximum possible size per cluster)
- 500 companies
- 100 extensions per company
- Total extensions per cluster: 50,000
- Simultaneous calls: 5,000 (10% of total extensions)
- Estimated events per second: 2,000
- Payload of each event: 500 bytes
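As a back-of-envelope check on these numbers (an estimate, not a measurement): 2,000 events/s × 500 bytes ≈ 1 MB/s of signal payload across the cluster, which averages about 4 signals/s per tenant workflow over 500 tenants. Since all signals for a tenant land on that tenant's single workflow execution, the per-workflow signal rate of the largest tenants, rather than the cluster total, is the figure to watch.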
Questions
- Are we using Temporal the correct way to solve our problem?
- Is the use of signals correct, or is it preferable to start one workflow per event instead of sending signals?
- Is using just one task queue for every kind of event correct?
- Is it correct for all workers to take jobs from any tenant?
- Will the amount of data being passed around workflows and activities pose a problem? What are the alternatives?