Overview
• What makes a good metric?
• Intro to Cisco Spark
• What metrics are important for media?
• Architecture of a cloud calling system
• Enhanced metrics
• Customer triage examples
Slide 4
Good Metrics
• Get into the mind of the user
• Percentages vs absolute numbers
• Median vs mean vs percentiles
• Scalable, proactive, time invariant
[Figure labels: worst 10%, worst 1%?]
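To make the "median vs mean vs percentiles" and "percentages vs absolute numbers" points concrete, here is a minimal sketch using Python's standard `statistics` module. The per-call packet loss values and the 5% "poor call" threshold are invented for illustration and do not come from the talk.

```python
import statistics

# Hypothetical per-call packet loss percentages (illustrative, not real data).
loss_per_call = [0.0, 0.1, 0.0, 0.2, 0.1, 0.0, 0.3, 0.1, 12.0, 25.0]

# The mean is dragged up by two bad calls; the median hides them entirely.
mean_loss = statistics.mean(loss_per_call)
median_loss = statistics.median(loss_per_call)

# quantiles(n=10) returns 9 cut points; the last one is the 90th percentile.
p90_loss = statistics.quantiles(loss_per_call, n=10, method="inclusive")[-1]

# Absolute count vs. percentage of calls above an assumed 5% loss threshold.
POOR_LOSS_THRESHOLD = 5.0
poor_calls = sum(1 for loss in loss_per_call if loss > POOR_LOSS_THRESHOLD)
poor_share = 100.0 * poor_calls / len(loss_per_call)

print(f"mean={mean_loss:.2f}% median={median_loss:.2f}% p90={p90_loss:.2f}%")
print(f"{poor_calls} poor calls ({poor_share:.0f}% of all calls)")
```

The high percentile is what exposes the worst calls; the percentage stays comparable as call volume grows, while the absolute count does not.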
Slide 6
Quality is key
Slide 7
What metrics are important for media?
I have no idea how to represent network jitter in picture form
Slide 8
Cloud Calling Architecture
[Diagram: Media Flow. Labels: Customer Network, Public Internet, PAAS Third Party Providers, Trunking Providers; four points marked "?"]
Slide 9
Metrics Flow
[Diagram labels: Trunking Providers, ELK Stack]
Slide 10
Great, we’re done, right?
• We’re still not in the mind of the customer.
• Three separate sources reporting media statistics
• Triaging individual calls is not sustainable at scale
• Not everyone is a Kibana ninja
• Need more insightful data
Slide 11
Enhanced Media Metrics
[Diagram labels: Metrics Sources, Media Quality Analyser, ELK Stack]
Media Quality Analyser
• CF Python app
• Stateless
• Plugin architecture
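The slide gives only the three properties of the Media Quality Analyser (Python app, stateless, plugin architecture), not its code. The sketch below shows one way such a design could look. The registry, the plugin names, the record fields (`packets_expected`, `packets_lost`, `jitter_ms`) and the 30 ms jitter limit are all hypothetical.

```python
from typing import Callable, Dict

# A plugin is a pure function: one raw metrics record in, derived fields out.
Plugin = Callable[[dict], dict]
PLUGINS: Dict[str, Plugin] = {}  # hypothetical registry, filled at import time


def plugin(name: str) -> Callable[[Plugin], Plugin]:
    """Decorator that registers a plugin under the given name."""
    def register(func: Plugin) -> Plugin:
        PLUGINS[name] = func
        return func
    return register


@plugin("packet_loss")
def packet_loss(record: dict) -> dict:
    expected = record["packets_expected"]
    lost = record["packets_lost"]
    return {"loss_pct": 100.0 * lost / expected if expected else 0.0}


@plugin("jitter_flag")
def jitter_flag(record: dict) -> dict:
    # 30 ms is an assumed limit, chosen only for the example.
    return {"high_jitter": record["jitter_ms"] > 30.0}


def analyse(record: dict) -> dict:
    """Stateless: the result depends only on this record and the plugins."""
    enriched = dict(record)
    for func in PLUGINS.values():
        enriched.update(func(record))
    return enriched


sample = {"packets_expected": 200, "packets_lost": 10, "jitter_ms": 45.0}
result = analyse(sample)
print(result)
```

Because `analyse` keeps nothing between records, any number of instances can process records in parallel, and a new derived metric is one more decorated function.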
Slide 15
But I’m lazy...
• Monitoring dashboards isn’t fun
• Let’s set up alarming on our enhanced metrics
• Define what we consider ‘a problem’:
  • Overall 90th percentile packet loss increases above threshold
  • Certain percentage of our customers exhibiting poor packet loss
  • Particular segment experiencing poor packet loss
• Page on these and never have to look at the dashboard again
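The three problem definitions above can be sketched as simple checks over enriched call records. This is an illustration, not the production alarming: the record fields (`customer`, `segment`, `loss_pct`), the sample data, and all three thresholds are assumptions.

```python
import statistics
from collections import defaultdict

# Assumed thresholds, for illustration only.
P90_LOSS_LIMIT = 3.0              # overall p90 packet loss, percent
POOR_LOSS_LIMIT = 5.0             # p90 loss that counts as "poor", percent
POOR_CUSTOMER_SHARE_LIMIT = 20.0  # percent of customers allowed to be poor


def p90(values):
    """90th percentile; a single value is returned unchanged."""
    if len(values) < 2:
        return values[0]
    return statistics.quantiles(values, n=10, method="inclusive")[-1]


def group_p90(calls, key):
    """p90 packet loss per distinct value of `key`."""
    groups = defaultdict(list)
    for call in calls:
        groups[call[key]].append(call["loss_pct"])
    return {name: p90(vals) for name, vals in groups.items()}


def evaluate_alarms(calls):
    if not calls:
        return []
    alarms = []
    overall = p90([call["loss_pct"] for call in calls])
    if overall > P90_LOSS_LIMIT:
        alarms.append(f"overall p90 loss {overall:.1f}%")
    per_customer = group_p90(calls, "customer")
    poor = [name for name, v in per_customer.items() if v > POOR_LOSS_LIMIT]
    share = 100.0 * len(poor) / len(per_customer)
    if share > POOR_CUSTOMER_SHARE_LIMIT:
        alarms.append(f"{share:.0f}% of customers have poor loss")
    for segment, value in group_p90(calls, "segment").items():
        if value > POOR_LOSS_LIMIT:
            alarms.append(f"segment {segment} p90 loss {value:.1f}%")
    return alarms


calls = [
    {"customer": "a", "segment": "customer_network", "loss_pct": 0.1},
    {"customer": "a", "segment": "public_internet", "loss_pct": 0.2},
    {"customer": "b", "segment": "customer_network", "loss_pct": 8.0},
    {"customer": "b", "segment": "customer_network", "loss_pct": 9.0},
    {"customer": "c", "segment": "public_internet", "loss_pct": 0.0},
]
alarms = evaluate_alarms(calls)
print(alarms)
```

Each returned string would map to one page; an empty list means nobody needs to open the dashboard.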
Slide 16
Triage examples
• Customer reports poor audio quality
• Check overall dashboard for system wide issues
• Check customer aggregations
• Where is the loss happening for this customer?
• Help them to triage their internal network issues
• Total time taken to isolate problem: under 5 minutes
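The "where is the loss happening for this customer?" step amounts to a per-segment aggregation for one customer. A minimal sketch, assuming each call record carries hypothetical `customer`, `segment` and `loss_pct` fields:

```python
from collections import defaultdict
from statistics import median


def loss_by_segment(calls, customer):
    """Median packet loss per network segment for one customer."""
    per_segment = defaultdict(list)
    for call in calls:
        if call["customer"] == customer:
            per_segment[call["segment"]].append(call["loss_pct"])
    return {seg: median(vals) for seg, vals in per_segment.items()}


def worst_segment(calls, customer):
    """Segment with the highest median loss, or None if there is no data."""
    summary = loss_by_segment(calls, customer)
    if not summary:
        return None
    return max(summary, key=summary.get)


# Illustrative records only; segment names are assumptions.
calls = [
    {"customer": "acme", "segment": "customer_network", "loss_pct": 6.0},
    {"customer": "acme", "segment": "customer_network", "loss_pct": 8.0},
    {"customer": "acme", "segment": "public_internet", "loss_pct": 0.2},
    {"customer": "other", "segment": "public_internet", "loss_pct": 9.0},
]
print(loss_by_segment(calls, "acme"))
print(worst_segment(calls, "acme"))
```

In the deck's setup this kind of query would more likely be a dashboard aggregation than application code; the logic is the same either way.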
Slide 17
Triage examples
• Paged because overall packet loss spikes across the system
• A subset of customers is suddenly seeing issues
• Many other customers are just fine
• Scratch head...
• ...where are the customers located?
• Find out about Comcast issue on the East Coast of the US
• Total time taken to isolate problem: 10-15 minutes
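The head-scratching step, finding what the affected customers have in common, can be sketched as a breakdown of affected customers by an attribute such as ISP or region. The customer names, attribute names and values below are invented for illustration.

```python
from collections import Counter


def affected_share_by(attribute, customers, affected):
    """Percent of customers affected, per distinct value of `attribute`."""
    totals = Counter(info[attribute] for info in customers.values())
    hits = Counter(
        customers[name][attribute] for name in affected if name in customers
    )
    return {value: 100.0 * hits[value] / totals[value] for value in totals}


# Hypothetical customer metadata; not taken from the talk.
customers = {
    "c1": {"isp": "isp_a", "region": "us-east"},
    "c2": {"isp": "isp_a", "region": "us-east"},
    "c3": {"isp": "isp_b", "region": "us-east"},
    "c4": {"isp": "isp_b", "region": "us-west"},
}
affected = ["c1", "c2"]

by_isp = affected_share_by("isp", customers, affected)
by_region = affected_share_by("region", customers, affected)
print(by_isp)
print(by_region)
```

An attribute where one value is at or near 100% and the others are near 0% is the likely common factor, here the ISP and not the region.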
Slide 18
Triage examples
• Large customer’s packet loss suddenly spikes
• See that loss appears to be in customer network
• Reach out to customer
• Customer had failed over to a backup network link for a day
• Customer impressed that we were so on top of things
• Happy customer trusts our cloud system more