Context & Contingency: Patterns for choosing good tools

DevOpsDays Raleigh 2016

At Kickstarter, the Ops Engineering team recently deployed Elasticsearch/Logstash/Kibana to centralize application logs. I’ll use our implementation of the ELK stack as a case study to examine our process for choosing tools, particularly:

- when to build vs. buy
- what features we need in the initial version, and what can wait until later
- when to invest effort in an elegant solution, and when to get by with a hack

This talk will discuss what factors influenced our decisions (context) and what future changes could lead to a different choice (contingency).

By capturing the context and contingency of our choices, we can more easily adapt to changes in our organization and the tech community.

I’ll show how our implementation decisions affected the processes and behaviors of our team, which in turn influenced and reflected our organization’s culture.

https://www.devopsdays.org/events/2016-raleigh/program/aaron-suggs/

Aaron Suggs

October 07, 2016

Transcript

  1. Context & Contingency Aaron Suggs

  2. Thank you Jen, Joe, Ann Marie, Kaete, Chris, Mark, and Katie!
  3. Aaron Suggs ktheory Ops Engineering

  4. helps creative projects come to life

  5. 1. ELK 2. Management

  6. “It’s not a promotion, it’s a career change” - Lindsay Holmwood. MGMT 101
  7. Kyle Burckhard built it @talaris

  8. Business Requirements

  9. “Look at the massive explosion in operational software offerings over the past 5-6 years.” - Charity Majors
  10. Hosted SAAS

  11. Maybe Splunk?

  12. ELK

  13. None
  14. App logs -> lograge -> filebeat -> logstash (+S3); Vendor logs -> S3 Event -> SQS worker -> logstash; both -> AWS Elasticsearch Service (with Kibana)
  15. Proof of concept

  16. !

  17. filebeat with real data!

  18. filebeat with real data! “Do no harm”

  19. Trivia Time!

  20. Trivia Time! Why might a disk be 100% full after you delete several large files?
  21. Answer: A process has open file handles (use lsof)

  22. What to do?

  23. What to do? - Upgrade it! Thx GitHub.

  24. What to do? - Upgrade it! Thx GitHub. - Restart it.

  25. What to do? - Upgrade it! Thx GitHub. - Restart it.
  26. AWS Elasticsearch vs. DIY on EC2

  27. None
  28. AWS Elasticsearch / DIY on EC2

  29. AWS Elasticsearch / DIY on EC2; Less dev attention* / More dev attention

  30. AWS Elasticsearch / DIY on EC2; Less dev attention* / More dev attention ✅

  31. AWS Elasticsearch / DIY on EC2; Less dev attention* / More dev attention; Less flexible / Flexible, adaptable ✅

  32. AWS Elasticsearch / DIY on EC2; Less dev attention* / More dev attention; Less flexible / Flexible, adaptable ✅ ✅

  33. AWS Elasticsearch / DIY on EC2; Less dev attention* / More dev attention; Less flexible / Flexible, adaptable; Hard to debug / Lots of visibility ✅ ✅

  34. AWS Elasticsearch / DIY on EC2; Less dev attention* / More dev attention; Less flexible / Flexible, adaptable; Hard to debug / Lots of visibility ✅ ✅ ✅

  35. AWS Elasticsearch / DIY on EC2; Less dev attention* / More dev attention; Less flexible / Flexible, adaptable; Hard to debug / Lots of visibility; Aligned w/ our use ✅ ✅ ✅

  36. AWS Elasticsearch / DIY on EC2; Less dev attention* / More dev attention; Less flexible / Flexible, adaptable; Hard to debug / Lots of visibility; Aligned w/ our use ✅ ✅ ✅ ✅
  37. Rails log & Logstash

  38. Rails log & Logstash

  39. Lograge github.com/roidrage/lograge

  40. When you show devs Kibana

  41. When you show devs Kibana

  42. MVP = better than alternative

  43. T-Shaped skills

  44. T-Shaped skills Me AWS Ruby

  45. T-Shaped skills Me AWS Ruby Kyle AWS Ruby Docker

  46. T-Shaped skills Me AWS Ruby Kyle AWS Ruby Docker ? Distributed systems Go
  47. Ingesting vendor logs - S3 access logs - CDN access logs - CloudTrail logs
  48. S3 Events -> SQS -> shoryuken ruby worker -> logstash

  49. S3 Events -> SQS -> shoryuken ruby worker -> logstash; re-import old logs
  50. Autoscaling!

  51. Trivia Time!

  52. Trivia Time! What’s the bottleneck after you scale logstash?

  53. Answer: 100% Elasticsearch CPU

  54. Answer: 100% Elasticsearch CPU. Then it crashes.

  55. None
  56. Mitigating Burnout MGMT 102

  57. Solution: Pairing 1. Fosters career growth 2. Cross-training 3. Team bonding
  58. Summary

  59. Summary 1. Mind the skills of your team

  60. Summary 1. Mind the skills of your team 2. Know the next-best alternative

  61. Summary 1. Mind the skills of your team 2. Know the next-best alternative 3. Consider a tool’s community
  62. Thank you! ktheory
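
The trivia answer on slide 21 (a deleted file keeps using disk space while a process still holds it open) can be reproduced in a few lines. This is a minimal sketch, assuming a POSIX system. The function name `unlinked_but_open` and its return keys are made up for illustration and do not come from the talk.

```python
import os
import tempfile


def unlinked_but_open(size_bytes=1024 * 1024):
    """Delete a file's name while it is still open and report what remains.

    Hypothetical helper for illustration; assumes POSIX unlink semantics.
    """
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as handle:
        handle.write(b"\0" * size_bytes)
        handle.flush()
        os.unlink(path)                      # same effect as `rm`: the name is gone
        info = os.fstat(handle.fileno())     # the open descriptor still reaches the data
        return {
            "path_exists": os.path.exists(path),
            "link_count": info.st_nlink,
            "size_bytes": info.st_size,
        }
    # Leaving the `with` block closes the descriptor; only then can the
    # filesystem reclaim the blocks.
```

While the descriptor is open, the path no longer exists but the data is still allocated, which is why restarting the process (slide 24) frees the space. On a live host, `lsof +L1` lists open files whose link count is below one.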
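
Slide 48 describes the vendor-log path as S3 Events -> SQS -> worker -> logstash. The talk's worker was written in Ruby with shoryuken; the Python sketch below is not that implementation. It only illustrates the first step such a worker has to do: pull the bucket and object key out of an S3 event notification delivered directly to SQS. The function name and the bucket and key in the usage example are hypothetical.

```python
import json
from urllib.parse import unquote_plus


def s3_objects_from_sqs_body(body):
    """Return (bucket, key) pairs named in one S3 event notification message.

    Hypothetical helper; assumes S3 publishes straight to SQS with no SNS wrapper.
    """
    event = json.loads(body)
    objects = []
    # S3 test events carry no "Records" list, so default to an empty one.
    for record in event.get("Records", []):
        if record.get("eventSource") != "aws:s3":
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded, with spaces sent as "+".
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


if __name__ == "__main__":
    sample = json.dumps({
        "Records": [{
            "eventSource": "aws:s3",
            "s3": {
                "bucket": {"name": "example-logs"},           # made-up bucket
                "object": {"key": "cdn/2016/10/07/access+log.gz"},
            },
        }]
    })
    print(s3_objects_from_sqs_body(sample))
```

A real worker would then download each object and forward its lines to logstash. Re-importing old logs (slide 49) could reuse the same function by enqueueing messages of the same shape for existing objects.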