Cost-Effective, Realtime Operational Insights Into Production Systems Producing Trillions of Events per Day

Jeff Chao
October 21, 2019

At Netflix, we've experienced an unprecedented global increase in membership over the last several years. Not only are we seeing more members globally, but members are also consuming more Netflix. This means that production outages today have far greater impact in much less time than they did in years past.

In order to continue providing great experiences for our members, we have to make sure the sophistication of our systems outpaces the growth and engagement of our members. Concretely, our MTTD and MTTR need to decrease much faster than Netflix membership and consumption increase. Our approach is to have access to highly granular, realtime operational insights into our streaming and studio systems.

However, while having this level of visibility into our production systems is great, it could quickly become cost-prohibitive. It's equally important that these systems don't end up costing more than our actual streaming and studio systems. To meet all of these needs, we've built and open sourced Mantis, a platform that makes it easy for developers to build realtime, cost-effective, operations-focused applications.

Mantis has been live in production for several years and has given us tremendous value in operating Tier-1 critical systems. It processes trillions of events and petabytes of data every day, which lets us derive meaningful operational insights from our streaming and studio systems and ultimately reduce production impact on our members.

With Mantis, we're able to economically ask and answer new questions in realtime about our systems without having to add new instrumentation. We can answer questions like "Which members are seeing playback issues for Stranger Things, season 3, episode 1 on iPhone in Canada?" without incurring heavy costs to our infrastructure bill.

In this talk, we'll cover more technical details about Mantis and go through some examples of how we use Mantis to operate our production systems more effectively.


Transcript

  1. SERIES Agenda

    1) Platform Approach: Building Mantis. In this first episode, Jeff walks through the realities of observability in microservices. Together with Monitorama attendees, they discover a platform called Mantis to help them navigate these realities.
    2) Mantis: Under The Hood. After discovering Mantis, our heroes realize Mantis provides on-demand data fetching, aggressive data reuse, and native autoscaling to help them minimize the costs of observing and operating microservices without compromising on required and opportunistic insights.
    3) Cost-Effective Operational Insight Applications with Mantis. With Mantis putting the power into the hands of Jeff and the Monitorama attendees, they join up with two other groups to build cost-effective operational insight applications on top of Mantis.
  2. SERIES Agenda (repeats slide 1)
  3. {"ts":0,"status":202,"client_id":"5da94519f50db67cc5824bda","asg":0,"guid":"f77238e4-779a-4412-83fb-269f7e5be2b1","is_active":false,"origin":{"type1":"Kim","type2":"Harris"},"country":"Afghanistan","latitude":"-25.555719","longitude":"-171.966253","tags":["sunt","Lorem","culpa","magna","ut"]}
    {"ts":1,"status":201,"client_id":"5da94519bb95c26d42c365d5","asg":1,"guid":"be1e8508-4f24-4369-9989-c094b619153c","is_active":false,"origin":{"type1":"Lakeisha","type2":"Reyes"},"country":"Madagascar","latitude":"-46.429467","longitude":"111.239606","tags":["consectetur","et","veniam","do","esse"]}
    {"ts":2,"status":500,"client_id":"5da9451924e02140e71be56b","asg":2,"guid":"89e1ba54-8ec2-4f83-a43a-3a7d707f74a7","is_active":false,"origin":{"type1":"Marla","type2":"Conrad"},"country":"Mali","latitude":"-71.282397","longitude":"-125.460257","tags":["Lorem","irure","aute","qui","et"]}
    {"ts":3,"status":200,"client_id":"5da94519d9279275b6f36599","asg":3,"guid":"507d9f02-bc8a-41db-b010-a0712ee23f2d","is_active":true,"origin":{"type1":"Estes","type2":"Black"},"country":"Mozambique","latitude":"5.037181","longitude":"-158.935754","tags":["non","nisi","sit","exercitation","ad"]}
    {"ts":4,"status":200,"client_id":"5da945192a8086efef11b88e","asg":4,"guid":"563828c5-6a57-4b66-882b-acab2a04aa13","is_active":false,"origin":{"type1":"Deanna","type2":"Allison"},"country":"Trinidad","latitude":"69.323194","longitude":"124.608163","tags":["consequat","occaecat","cupidatat","magna","tempor"]}
  4. [Diagram: App, publish, 3,000,000 GB]

    • 3,000,000 GB over the network
    • 3,000,000 GB stored on disk
  5. (Same five JSON events as slide 3)
  6. (Same five JSON events as slide 3)
  7. (Same five JSON events as slide 3)
  8. (Same five JSON events as slide 3)
  9. [Diagram: App (3,000,000 GB), subscribe, 1,000 GB]

    SELECT status FROM stream WHERE status = 500
    • 1,000 GB over the network
    • 1,000 GB stored in memory
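
    The difference between slides 4 and 9 is where the query runs: if the filter and projection are applied inside the publishing application, only the matching fields of matching events ever cross the network. The following is a minimal Python sketch of that idea, not the Mantis publish API. The `publish_matching` helper is hypothetical, and the slide's query is hard-coded as a predicate plus a field list; the events are trimmed versions of the ones on slide 3.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Event = Dict[str, object]


def publish_matching(
    events: Iterable[Event],
    predicate: Callable[[Event], bool],
    fields: List[str],
) -> Iterator[Event]:
    """Yield only the requested fields of events that satisfy the predicate.

    Hypothetical helper: it stands in for evaluating a subscription's
    query inside the application, before anything is sent.
    """
    for event in events:
        if predicate(event):
            yield {name: event[name] for name in fields if name in event}


# Trimmed versions of the five events on slide 3.
events = [
    {"ts": 0, "status": 202, "country": "Afghanistan"},
    {"ts": 1, "status": 201, "country": "Madagascar"},
    {"ts": 2, "status": 500, "country": "Mali"},
    {"ts": 3, "status": 200, "country": "Mozambique"},
    {"ts": 4, "status": 200, "country": "Trinidad"},
]

# Stand-in for: SELECT status FROM stream WHERE status = 500
sent = list(publish_matching(events, lambda e: e["status"] == 500, ["status"]))
print(sent)  # [{'status': 500}]
```

    Of the five events, one event with one field is sent; everything else is dropped at the source.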
  10. SERIES Agenda (repeats slide 1)
  11. [Diagram: Apps (external); Source Job and Jobs (Mantis infrastructure)]

    Source Job:
    1. Consume external events
    2. Manage subscriptions
  12. • Aggregates (Count, Avg, etc) • GROUP BY • HAVING • ORDER BY • WINDOW • JOIN • ANOMALY • SELECT • FROM • WHERE • SAMPLE
  13. [Diagram: Apps, Source Job, Jobs]

    SELECT count(device), status FROM stream WHERE status = 500 GROUP BY device WINDOW 10 1 ORDER BY count(device)
  14. [Diagram: Apps, Source Job, Jobs]

    SELECT device, status FROM stream WHERE status = 500
  15. [Diagram: Apps, Source Job, Jobs]

    { "device": "ps4", "status": 500 }
    { "device": "tv", "status": 500 }
    { "device": "ps4", "status": 500 }
  16. [Diagram: Apps, Source Job, Jobs]

    SELECT count(device), status FROM stream WHERE status = 500 GROUP BY device WINDOW 10 1 ORDER BY count(device)
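
    To make the aggregation concrete, here is a small Python sketch of what the grouped query computes over one window. It is an illustration, not MQL's implementation: the `count_errors_by_device` function is hypothetical, the window is treated as an already-collected batch (the exact semantics of `WINDOW 10 1` are not modeled), and the descending order is an assumption chosen to match the result shown on slide 17.

```python
from collections import Counter
from typing import Dict, Iterable, List

Event = Dict[str, object]


def count_errors_by_device(window: Iterable[Event]) -> List[Dict[str, object]]:
    """Count status-500 events per device within a single window."""
    counts = Counter(e["device"] for e in window if e["status"] == 500)
    # most_common() sorts by count, highest first (assumed ordering).
    return [{"device": d, "count": c} for d, c in counts.most_common()]


# The three events shown on slide 15.
window = [
    {"device": "ps4", "status": 500},
    {"device": "tv", "status": 500},
    {"device": "ps4", "status": 500},
]

result = count_errors_by_device(window)
print(result)
# [{'device': 'ps4', 'count': 2}, {'device': 'tv', 'count': 1}]
```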
  17. [Diagram: Apps, Source Job, Jobs]

    { "device": "ps4", "count": 2 }
    { "device": "tv", "count": 1 }
  18. [Diagram: Apps, Source Job, Jobs]

    SELECT device, status FROM stream WHERE status = 500 && device = "ps4"
  19. [Diagram: Apps, Source Job, Jobs]

    { "device": "ps4", "status": 500 }
    { "device": "ps4", "status": 500 }
  20. [Diagram: Apps, Source Job, Jobs, annotated "← On-demand", "← Reuse & Autoscaling", "← Reuse & Autoscaling"]
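
    The on-demand and reuse ideas from slides 13 to 20 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Mantis source jobs are implemented: the `SourceJobSketch` class and its methods are hypothetical, and subscriptions are plain predicates rather than MQL queries. With no subscriptions, events are dropped without any work; with several, each event is read once and fanned out to every subscription it matches.

```python
from typing import Callable, Dict, List

Event = Dict[str, object]
Predicate = Callable[[Event], bool]


class SourceJobSketch:
    """Hypothetical stand-in for a source job that manages subscriptions."""

    def __init__(self) -> None:
        self._subscriptions: Dict[str, Predicate] = {}
        self.delivered: Dict[str, List[Event]] = {}
        self.evaluated = 0  # number of events that were actually processed

    def subscribe(self, sub_id: str, predicate: Predicate) -> None:
        self._subscriptions[sub_id] = predicate
        self.delivered[sub_id] = []

    def unsubscribe(self, sub_id: str) -> None:
        self._subscriptions.pop(sub_id, None)

    def on_event(self, event: Event) -> None:
        if not self._subscriptions:
            return  # on-demand: nobody is asking, so do no work
        self.evaluated += 1
        # reuse: one incoming event serves every matching subscription
        for sub_id, predicate in self._subscriptions.items():
            if predicate(event):
                self.delivered[sub_id].append(event)


stream = [
    {"device": "ps4", "status": 500},
    {"device": "tv", "status": 500},
    {"device": "ps4", "status": 200},
]

source = SourceJobSketch()
for event in stream:
    source.on_event(event)
idle_evaluated = source.evaluated  # still 0: no subscribers yet

source.subscribe("all-500s", lambda e: e["status"] == 500)
source.subscribe("ps4-500s", lambda e: e["status"] == 500 and e["device"] == "ps4")
for event in stream:
    source.on_event(event)

print(idle_evaluated, source.evaluated)  # 0 3
print(source.delivered["ps4-500s"])      # [{'device': 'ps4', 'status': 500}]
```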
  21. [Diagram: Source (one), Processing Stages (zero to many), Sink (one); three groups of Workers, each labeled Autoscaled]
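
    The job shape on slide 21 (one source, zero or more processing stages, one sink) can be sketched as a chain of functions over a stream. This is a simplified Python illustration, not the Mantis job API: `run_job`, `only_errors`, and `device_only` are hypothetical names, everything runs in a single process, and workers and autoscaling are not modeled.

```python
from functools import reduce
from typing import Callable, Iterable, List

Stage = Callable[[Iterable], Iterable]


def run_job(source: Iterable, stages: List[Stage], sink: Callable[[Iterable], None]) -> None:
    """Feed the source through each stage in order, then hand the result to the sink."""
    stream = reduce(lambda upstream, stage: stage(upstream), stages, source)
    sink(stream)


def only_errors(events: Iterable) -> Iterable:
    # Processing stage 1: keep only status-500 events.
    return (e for e in events if e["status"] == 500)


def device_only(events: Iterable) -> Iterable:
    # Processing stage 2: project each event down to its device name.
    return (e["device"] for e in events)


events = [
    {"device": "ps4", "status": 500},
    {"device": "tv", "status": 200},
]

results: list = []
run_job(source=events, stages=[only_errors, device_only], sink=results.extend)
print(results)  # ['ps4']

# "Zero to many" stages: with no stages, the sink sees the source unchanged.
passthrough: list = []
run_job(source=events, stages=[], sink=passthrough.extend)
```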
  22. SERIES Agenda (repeats slide 1)
  23. [Chart: errors over time against a static threshold; time axis marked 0m (actual start of incident), 5m, 15m; annotations: "propagation + aggregation delay", "rolling count 8 of 10", "alert fires!"]

    Alert fires 15 minutes after the actual start of the incident!
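
    To see why a static threshold combined with a rolling count delays detection, here is a toy Python sketch. The `rolling_count_alert` function and the error series are made up for illustration (they are not the data behind the slide), and propagation and aggregation delay are not modeled, so the real delay would be longer than what this prints.

```python
from collections import deque
from typing import Iterable, Optional


def rolling_count_alert(
    values: Iterable[float],
    threshold: float,
    window: int = 10,
    required: int = 8,
) -> Optional[int]:
    """Return the index at which the alert first fires, or None.

    The alert fires once at least `required` of the last `window`
    samples are above the static threshold.
    """
    recent: deque = deque(maxlen=window)
    for index, value in enumerate(values):
        recent.append(value > threshold)
        if sum(recent) >= required:
            return index
    return None


# Illustrative series: one sample per minute, errors ramp up from minute 0.
errors_per_minute = [10 * (minute + 1) for minute in range(20)]

fired_at = rolling_count_alert(errors_per_minute, threshold=50)
print(fired_at)  # 12
```

    In this toy series the incident starts at minute 0, the threshold is first crossed at minute 5, and the 8-of-10 rule only fires at minute 12.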
  24. • Sub-second event data propagation & aggregation • Self-tuning dynamic thresholds • Consolidated timeline view across applications • Auto recovery detection
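
    The slides do not say how the self-tuning thresholds are computed. As one common approach, and purely as an assumption for illustration, a threshold can track an exponentially weighted mean and variance of recent values and flag samples that land far above that baseline. The `DynamicThreshold` class and its parameters (`alpha`, `k`, `warmup`) are hypothetical.

```python
import math


class DynamicThreshold:
    """Flag values more than k standard deviations above a moving baseline.

    Assumed technique (exponentially weighted mean and variance); this is
    not taken from the talk.
    """

    def __init__(self, alpha: float = 0.1, k: float = 3.0, warmup: int = 5) -> None:
        self.alpha = alpha    # how quickly the baseline adapts
        self.k = k            # how many standard deviations count as anomalous
        self.warmup = warmup  # samples to observe before flagging anything
        self.mean = 0.0
        self.var = 0.0
        self.seen = 0

    def update(self, value: float) -> bool:
        """Return True if value is anomalous relative to the baseline so far."""
        if self.seen == 0:
            self.mean = value
            self.seen = 1
            return False
        threshold = self.mean + self.k * math.sqrt(self.var)
        anomalous = self.seen >= self.warmup and value > threshold
        if not anomalous:
            # Only normal samples update the baseline, so an ongoing
            # incident does not drag the threshold upward.
            delta = value - self.mean
            self.mean += self.alpha * delta
            self.var = (1 - self.alpha) * (self.var + self.alpha * delta * delta)
        self.seen += 1
        return anomalous


detector = DynamicThreshold()
samples = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 50]
flags = [detector.update(s) for s in samples]
print(flags[-1])  # True: the jump to 50 is far above the learned baseline
```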
  25. 7:11:33 AM - Zuul-Website (cluster), API Service (origin) started having connectivity issues at 13.08%
    7:12:10 AM - API Service (origin) started failing with timeouts at 1.29%
    7:12:22 AM - Zuul-Website (cluster) started failing with timeouts at 1.78%
    7:14:50 AM - API Service (origin) started throttling retries (e.g. retry storm) at 43.86%
    7:14:52 AM - Zuul-API (cluster), API Service (origin) started throttling retries (e.g. retry storm) at 63.12%
    7:14:52 AM - Zuul-API (cluster) started throttling retries (e.g. retry storm) at 60.24%
  26. • Ad-hoc queries to any event stream at Netflix • On-the-fly transformations • Optionally persist results to long-term storage
  27. SERIES Wrapping Up (recaps the three episodes from slide 1)