Slide 1

Slide 1 text

Enhanced Media Metrics
Shane Tuohy

Slide 2

Slide 2 text

Software Engineer @ @shanetuohy [email protected]

Slide 3

Slide 3 text

Overview
• What makes a good metric?
• Intro to Cisco Spark
• What metrics are important for media?
• Architecture of a cloud calling system
• Enhanced metrics
• Customer triage examples

Slide 4

Slide 4 text

Good Metrics
• Get into the mind of the user
• Percentages vs absolute numbers
• Median vs mean vs percentiles
• Scalable, proactive, time invariant
Worst 10% Worst 1%?
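To make the "median vs mean vs percentiles" point concrete, here is a minimal sketch. The packet loss values are invented for illustration and the nearest-rank percentile is just one common definition, not necessarily the one used in the talk.

```python
import math
import statistics

# Hypothetical per-call packet loss percentages (illustrative values only).
packet_loss = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2, 12.0, 25.0]


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


summary = {
    "mean": statistics.mean(packet_loss),      # pulled up by two bad calls
    "median": statistics.median(packet_loss),  # hides the bad calls entirely
    "p90": percentile(packet_loss, 90),        # the "worst 10%" boundary
    "p99": percentile(packet_loss, 99),        # the "worst 1%" boundary
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

In this sample the median looks healthy while the high percentiles expose the calls that users would actually complain about, which is why a percentile is often the more useful headline number.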

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

Quality is key

Slide 7

Slide 7 text

What metrics are important for media?
I have no idea how to represent network jitter in picture form

Slide 8

Slide 8 text

Cloud Calling Architecture
(Diagram labels: Media Flow; Customer Network; Public Internet; PAAS; Third Party Providers; Trunking Providers; four points marked "?")

Slide 9

Slide 9 text

Metrics Flow
(Diagram labels: Trunking Providers; ELK Stack)

Slide 10

Slide 10 text

Great, we’re done, right?
• We’re still not in the mind of the customer
• Three separate sources reporting media statistics
• Triaging individual calls is not sustainable at scale
• Not everyone is a Kibana ninja
• Need more insightful data

Slide 11

Slide 11 text

Enhanced Media Metrics
(Diagram labels: Metrics Sources; Media Quality Analyser; ELK Stack)
• CF Python app
• Stateless
• Plugin architecture
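One way a stateless, plugin-based analyser could be structured is sketched below. This is not the actual Media Quality Analyser: the field names (`packets_expected`, `packets_lost`, `jitter_ms`), the plugin names, and the 30 ms jitter cut-off are all assumptions made for illustration.

```python
from typing import Callable, List

# A plugin takes one raw metrics event and returns extra fields to attach.
Plugin = Callable[[dict], dict]
PLUGINS: List[Plugin] = []


def plugin(func: Plugin) -> Plugin:
    """Register an analysis plugin (hypothetical registration mechanism)."""
    PLUGINS.append(func)
    return func


@plugin
def loss_percentage(event: dict) -> dict:
    # Assumed field names; real events may look different.
    expected = event.get("packets_expected", 0)
    lost = event.get("packets_lost", 0)
    if expected <= 0:
        return {}
    return {"loss_pct": 100.0 * lost / expected}


@plugin
def jitter_flag(event: dict) -> dict:
    jitter = event.get("jitter_ms")
    if jitter is None:
        return {}
    return {"jitter_poor": jitter > 30}  # 30 ms is an illustrative cut-off


def analyse(event: dict) -> dict:
    """Enrich one event. Nothing is remembered between calls, so any
    instance of the app can process any event."""
    enriched = dict(event)
    for run_plugin in PLUGINS:
        enriched.update(run_plugin(event))
    return enriched


print(analyse({"packets_expected": 200, "packets_lost": 5, "jitter_ms": 42}))
```

Because each event is enriched independently, adding a new derived metric means adding one function, and the enriched events can be shipped on to the existing ELK Stack unchanged.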

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

But I’m lazy...
• Monitoring dashboards isn’t fun
• Let’s set up alarming on our enhanced metrics
• Define what we consider ‘a problem’
• Overall 90th percentile packet loss increases above threshold
• Certain percentage of our customers exhibiting poor packet loss
• Particular segment experiencing poor packet loss
• Page on these and never have to look at the dashboard again
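The three "problem" definitions above can be sketched as simple rules over enhanced call records. All thresholds, field names, and segment labels here are assumptions for illustration. A real deployment would more likely express these as alert queries against the metrics store.

```python
import math

# Hypothetical thresholds; real values would be tuned against production data.
P90_LOSS_THRESHOLD = 3.0    # percent packet loss
POOR_LOSS = 5.0             # loss level that counts a customer as "poor"
POOR_CUSTOMER_SHARE = 0.10  # page if more than 10% of customers are poor


def p90(values):
    """Nearest-rank 90th percentile."""
    ordered = sorted(values)
    return ordered[max(1, math.ceil(90 * len(ordered) / 100)) - 1]


def evaluate_alarms(calls):
    """calls: list of dicts with 'customer', 'segment' and 'loss_pct' keys."""
    alarms = []
    if not calls:
        return alarms

    # Rule 1: overall 90th percentile packet loss above threshold.
    if p90([c["loss_pct"] for c in calls]) > P90_LOSS_THRESHOLD:
        alarms.append("overall_p90")

    # Rule 2: share of customers exhibiting poor packet loss.
    worst = {}
    for c in calls:
        worst[c["customer"]] = max(worst.get(c["customer"], 0.0), c["loss_pct"])
    poor = [name for name, loss in worst.items() if loss > POOR_LOSS]
    if len(poor) / len(worst) > POOR_CUSTOMER_SHARE:
        alarms.append("customer_share")

    # Rule 3: a particular segment experiencing poor packet loss.
    by_segment = {}
    for c in calls:
        by_segment.setdefault(c["segment"], []).append(c["loss_pct"])
    for segment, losses in sorted(by_segment.items()):
        if p90(losses) > P90_LOSS_THRESHOLD:
            alarms.append(f"segment:{segment}")

    return alarms
```

Each returned alarm name would map to a page, so the dashboard only needs to be opened once something has already fired.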

Slide 16

Slide 16 text

Triage examples
• Customer reports poor audio quality
• Check overall dashboard for system wide issues
• Check customer aggregations
• Where is the loss happening for this customer?
• Help them to triage their internal network issues
• Total time taken to isolate problem - <5 minutes
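The "where is the loss happening for this customer?" step amounts to a per-segment aggregation. A minimal sketch follows, assuming each report carries a customer name, a segment label, and a loss percentage. These field names and the segment labels are hypothetical.

```python
from collections import defaultdict
from statistics import median


def loss_by_segment(reports, customer):
    """Median packet loss per network segment for a single customer."""
    grouped = defaultdict(list)
    for report in reports:
        if report["customer"] == customer:
            grouped[report["segment"]].append(report["loss_pct"])
    return {segment: median(losses) for segment, losses in grouped.items()}


def worst_segment(reports, customer):
    """Segment with the highest median loss, or None if there is no data."""
    per_segment = loss_by_segment(reports, customer)
    if not per_segment:
        return None
    return max(per_segment, key=per_segment.get)
```

If the worst segment turns out to be the customer's own network, that is the cue to help them triage internally rather than to investigate the cloud side.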

Slide 17

Slide 17 text

Triage examples
• Paged because overall packet loss spikes in system
• A subset of customers is suddenly seeing issues
• Many other customers are just fine
• Scratch head...
• ...where are the customers located?
• Find out about Comcast issue on the East Coast of the US
• Total time taken to isolate problem - 10-15 minutes
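The "where are the customers located?" question can be asked of the data directly by grouping affected customers by region and ISP. The sketch below assumes each customer record already carries `region` and `isp` fields, for example from a geo-IP lookup. Those fields and the 5% cut-off are assumptions, not something stated in the talk.

```python
from collections import Counter


def affected_share(customers, poor_loss=5.0):
    """Fraction of customers with poor packet loss per (region, isp) pair."""
    total = Counter()
    affected = Counter()
    for customer in customers:
        key = (customer["region"], customer["isp"])
        total[key] += 1
        if customer["loss_pct"] > poor_loss:
            affected[key] += 1
    return {key: affected[key] / total[key] for key in total}
```

A single (region, isp) pair standing out while the rest stay near zero points at a provider-side incident rather than a fault in the calling platform.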

Slide 18

Slide 18 text

Triage examples
• Large customer's packet loss suddenly spikes
• See that loss appears to be in customer network
• Reach out to customer
• Customer had failed over to a backup network link for a day
• Customer impressed that we were so on top of things
• Happy customer trusts our cloud system more

Slide 19

Slide 19 text

Questions?