Validation of a VoIP architecture using Temporal

Implementation

We are trying to implement a frontend that does real-time monitoring using WebSocket events, with Temporal (Python SDK) ensuring that events are delivered and metrics stay consistent. The services (PBX, workers, websocket-server, backend, frontend) run in a K8S cluster. The PBX pods send the events they generate to Temporal as signals; each event represents a state change such as a call being answered, abandoned, or completed.

Currently I have one long-running workflow per tenant (wf-websocket-tenant1, wf-websocket-tenant2, wf-websocket-tenantN), each of which receives signals from the PBX processes. There is a single task queue called “websocket”, which all workers poll. Each tenant workflow maintains the state of the metrics: number of completed calls per agent, number of unanswered calls per agent, talk time, hold time, current state of extensions, and so on. We use “Continue as new” in the workflow once a certain number of signals has been received. Each workflow has around 10 activities.

Once a workflow receives a signal, the metrics are recalculated, the new values are stored in Redis (each tenant has its own key, like “metrics-tenant1”, “metrics-tenant2”, “metrics-tenantN”), and they are then published on a Redis Pub/Sub channel. The clients (websocket services) listen for the new metric values and propagate them to their own clients.

Architecture

Problem

I implemented the same architecture at a much smaller scale (1 company, 10 extensions, 2 simultaneous calls) and noticed some inconsistencies in the execution of the workflow logic, which sometimes cause the state of extensions to differ from the real state, and the code is becoming painful to maintain. It is hard to change something without breaking something else, even with well-separated classes, functions, activities, workflow logic, etc. It seems that the logic is affected by the volume of signals received, but that is just an impression.

K8S cluster dimensions (maximum size per cluster)

  • 500 companies
  • 100 extensions per company
  • Total extensions per cluster: 50,000
  • Simultaneous calls: 5,000 (10% of total extensions)
  • Number of estimated events per second: 2000
  • Payload of each event: 500 bytes

Questions

  • Are we using Temporal the correct way to solve our problem?
  • Is the use of signals correct, or is it preferable to start one workflow per event instead of sending signals?
  • Is using just one task queue for every kind of event correct?
  • Is it correct for all workers to take jobs from any tenant?
  • Will the amount of data being passed between workflows and activities pose a problem? What are the alternatives?

By “extension”, are you referring to someone’s phone number extension, e.g. “call me at 555-1212 extension 678”?

The state of the workflow will be deterministic based on the signals received, assuming you haven’t used any non-deterministic logic in the workflow (see “Develop code that durably executes” in the Python SDK dev guide, Temporal Documentation).

If you see a workflow that’s in an incorrect state based on the signals it received, you can replay the workflow in a development environment and see where the workflow logic is producing the wrong state.
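One way to make that kind of debugging easier is to keep the state transition itself as a pure function of (state, event), so the same sequence of signals can be replayed in a plain unit test. The sketch below is plain Python, not Temporal API; the event kinds ("answer", "complete", "abandon"), the field names, and the `TenantState` class are hypothetical stand-ins for whatever the PBX actually emits. The Python SDK also provides a `Replayer` class in `temporalio.worker` for replaying recorded workflow histories against the real workflow code.

```python
from dataclasses import dataclass, field


@dataclass
class TenantState:
    # Hypothetical per-tenant state: extension -> status, extension -> count.
    extension_state: dict = field(default_factory=dict)
    completed_calls: dict = field(default_factory=dict)


def apply_event(state: TenantState, event: dict) -> TenantState:
    """Apply one PBX event to the state; no I/O, clocks, or randomness."""
    ext = event["extension"]
    kind = event["kind"]
    if kind == "answer":
        state.extension_state[ext] = "in_call"
    elif kind == "complete":
        state.extension_state[ext] = "idle"
        state.completed_calls[ext] = state.completed_calls.get(ext, 0) + 1
    elif kind == "abandon":
        state.extension_state[ext] = "idle"
    return state


def replay(events: list) -> TenantState:
    """Rebuild the state from scratch by applying events in order."""
    state = TenantState()
    for event in events:
        apply_event(state, event)
    return state


history = [
    {"kind": "answer", "extension": "101"},
    {"kind": "complete", "extension": "101"},
    {"kind": "answer", "extension": "102"},
]
final = replay(history)
print(final.extension_state)  # {'101': 'idle', '102': 'in_call'}
print(final.completed_calls)  # {'101': 1}
```

If a function like this is what the signal handler calls, replaying the same events always yields the same state, and a wrong extension state can be narrowed down to a specific event in the sequence.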

  • Number of estimated events per second: 2000

What is the maximum number of events per tenant (which maps to a single workflow instance in your design)?

  • Are we using Temporal the correct way to solve our problem?

Yes, assuming that the maximum number of events per second per workflow is below 10.
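As a rough check against the cluster dimensions quoted in the question, the arithmetic can be sketched as follows. It assumes events are spread evenly across tenants, which the question does not state; a busy tenant could sit well above the average, so the per-tenant peak is the number that matters.

```python
# Figures quoted from the cluster dimensions in the question.
COMPANIES = 500
EVENTS_PER_SECOND = 2000
PAYLOAD_BYTES = 500

# Average signal rate per tenant workflow (one workflow per company),
# assuming an even spread across tenants.
avg_events_per_workflow = EVENTS_PER_SECOND / COMPANIES

# Aggregate payload volume sent to Temporal across the whole cluster.
bytes_per_second = EVENTS_PER_SECOND * PAYLOAD_BYTES

# Signals one workflow would receive per hour at the average rate,
# useful when picking a Continue-As-New threshold.
signals_per_hour = avg_events_per_workflow * 3600

print(avg_events_per_workflow)  # 4.0
print(bytes_per_second)         # 1000000
print(signals_per_hour)         # 14400.0
```

At the average rate each tenant workflow stays under the limit of 10 events per second, but only with a modest margin.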

  • Is the use of signals correct, or is it preferable to start one workflow per event instead of sending signals?

Yes, using signals this way is a very common pattern.

  • Is using just one task queue for every kind of event correct?

Yes, a single task queue is absolutely fine.

  • Is it correct for all workers to take jobs from any tenant?

Yes, as workflows are cached at a worker.

  • Will the amount of data being passed between workflows and activities pose a problem? What are the alternatives?

500 bytes is a reasonable payload size. If you want to store even less you can store the payload in some external system and only pass around references.
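A minimal sketch of the reference-passing idea, for illustration only: the sender writes the full payload to an external store and signals just the key, and whatever needs the full event (typically an activity) loads it back. The in-memory dict stands in for a real store such as Redis, and `store_payload`, `load_payload`, and the `event:` key prefix are hypothetical names. Key generation uses `uuid.uuid4()`, so it belongs on the sender side or in an activity, not in workflow code.

```python
import json
import uuid

# Stand-in for an external store such as Redis or object storage.
external_store: dict = {}


def store_payload(event: dict) -> str:
    """Save the full event externally and return a short reference key."""
    key = f"event:{uuid.uuid4()}"
    external_store[key] = json.dumps(event)
    return key


def load_payload(key: str) -> dict:
    """Fetch the full event back from the external store by its key."""
    return json.loads(external_store[key])


event = {"kind": "answer", "extension": "101", "tenant": "tenant1"}
ref = store_payload(event)      # only this short key would be signalled
restored = load_payload(ref)    # the consumer resolves the key when needed
print(restored == event)        # True
```

The trade-off is an extra dependency and an extra read per event, which is probably not worth it at 500 bytes but becomes useful as payloads grow.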

noticed some inconsistencies in the execution of the workflow logic, which sometimes cause the state of extensions to differ from the real state, and the code is becoming painful to maintain.

It is hard to comment on this without any specifics. Usually, workflows are fully consistent and easier to maintain than alternatives.