Context & Contingency: Patterns for choosing good tools

Aaron Suggs
October 07, 2016

DevOpsDays Raleigh 2016

At Kickstarter, the Ops Engineering team recently deployed Elasticsearch/Logstash/Kibana to centralize application logs. I’ll use our implementation of the ELK stack as a case study to examine our process for choosing tools, particularly:

- when to build vs. buy
- what features we need in the initial version, and what can wait until later
- when to invest effort in an elegant solution, and when to get by with a hack

This talk will discuss what factors influenced our decisions (context) and what future changes could lead to a different choice (contingency).

By capturing the context and contingency of our choices, we can more easily adapt to changes in our organization and the tech community.

I’ll show how our implementation decisions impacted the processes and behaviors of our team, which in turn influenced and reflected our organization’s culture.

https://www.devopsdays.org/events/2016-raleigh/program/aaron-suggs/

Transcript

  1. Context &
    Contingency
    Aaron Suggs

  2. Thank you
    Jen, Joe, Ann Marie,
    Kaete, Chris, Mark,
    and Katie!

  3. Aaron Suggs
    ktheory
    Ops Engineering

  4. Kickstarter helps creative projects
    come to life

  5. 1. ELK
    2. Management

  6. “It’s not a promotion,
    it’s a career change”
    — Lindsay Holmwood
    MGMT 101

  7. Kyle Burckhard
    built it
    @talaris

  8. Business
    Requirements

  9. “Look at the massive
    explosion in operational
    software offerings over the
    past 5-6 years.”
    — Charity Majors

  10. Maybe Splunk?

  11. App logs: lograge -> filebeat -> logstash (+S3)
    Vendor logs: S3 Event -> SQS worker -> logstash
    Both -> AWS Elasticsearch Service (with Kibana)

  12. Proof of concept

  13. filebeat with real data!

  14. filebeat with real data!
    “Do no harm”

  15. Trivia Time!

  16. Trivia Time!
    Why might a disk be 100%
    full after you delete several
    large files?

  17. Answer
    A process has
    open file handles
    (use lsof)
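
The answer above can be reproduced in a few lines. This is a minimal Python sketch, assuming a POSIX system (Linux or macOS); the function name `deleted_but_open_demo` is made up for illustration and is not from the talk. It deletes a temporary file while a handle is still open, then shows that the path is gone but the data, and therefore the disk space, is still held until the handle is closed.

```python
import os
import tempfile


def deleted_but_open_demo(payload: bytes = b"x" * 4096) -> dict:
    """Delete a file while a handle is open and report what the OS still sees."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        os.unlink(path)  # like `rm`: removes the directory entry only
        info = os.fstat(fd)  # the open handle still refers to the inode
        os.lseek(fd, 0, os.SEEK_SET)
        still_readable = os.read(fd, len(payload)) == payload
        return {
            "path_exists": os.path.exists(path),
            "link_count": info.st_nlink,
            "size_bytes": info.st_size,
            "still_readable": still_readable,
        }
    finally:
        os.close(fd)  # only after this can the filesystem free the blocks
```

On a real host the holder is usually a long-running process, which is why restarting it releases the space. `lsof +L1` is one way to list open files whose link count has dropped to zero.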

  18. What to do?
    - Upgrade it! Thx GitHub.

  19. What to do?
    - Upgrade it! Thx GitHub.
    - Restart it.

  21. AWS Elasticsearch
    vs. DIY on EC2

  22. AWS Elasticsearch | DIY on EC2

  23. AWS Elasticsearch | DIY on EC2
    Less dev attention* | More dev attention

  25. AWS Elasticsearch | DIY on EC2
    Less dev attention* | More dev attention
    Less flexible | Flexible, adaptable

  27. AWS Elasticsearch | DIY on EC2
    Less dev attention* | More dev attention
    Less flexible | Flexible, adaptable
    Hard to debug | Lots of visibility

  29. AWS Elasticsearch | DIY on EC2
    Less dev attention* | More dev attention
    Less flexible | Flexible, adaptable
    Hard to debug | Lots of visibility
    Aligned w/ our use

  31. Rails log & Logstash

  33. Lograge
    github.com/roidrage/lograge
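
Lograge replaces Rails' default multi-line request logging with one line per request, by default in a key=value format, which is much easier for a pipeline like Logstash to turn into structured fields. The sketch below is a hypothetical Python stand-in for that parsing step, not the talk's Logstash configuration. `parse_kv_line` and the sample line are illustrative assumptions, and the parser does not handle values that contain spaces.

```python
def _cast(value: str):
    """Return value as int or float when it parses as one, else unchanged."""
    for convert in (int, float):
        try:
            return convert(value)
        except ValueError:
            pass
    return value


def parse_kv_line(line: str) -> dict:
    """Split a 'key=value key=value' log line into a dict of typed fields."""
    event = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if not sep:
            continue  # skip tokens that are not key=value pairs
        event[key] = _cast(value)
    return event


if __name__ == "__main__":
    # Illustrative line in the key=value style; not taken from real logs.
    sample = "method=GET path=/projects/1 format=html status=200 duration=58.33"
    print(parse_kv_line(sample))
```

Once every request is a flat set of fields like `status` and `duration`, filtering and aggregating in Kibana becomes a matter of querying those fields rather than searching free text.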

  34. When you show devs Kibana

  36. MVP = better
    than alternative

  37. T-Shaped skills

  38. T-Shaped skills
    Me: AWS, Ruby

  39. T-Shaped skills
    Me: AWS, Ruby
    Kyle: AWS, Ruby, Docker

  40. T-Shaped skills
    Me: AWS, Ruby
    Kyle: AWS, Ruby, Docker
    ?: Distributed systems, Go

  41. Ingesting vendor logs
    - S3 access logs
    - CDN access logs
    - CloudTrail logs

  42. S3 Events
    -> SQS
    -> shoryuken ruby worker
    -> logstash

  43. S3 Events
    -> SQS
    -> shoryuken ruby worker
    -> logstash
    re-import old logs
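
The worker step in this pipeline mostly amounts to reading an S3 event notification off the queue and working out which object to fetch. The talk used a shoryuken (Ruby) worker; the Python sketch below only illustrates the message handling, under the assumption that SQS delivers the standard S3 notification JSON directly (no SNS wrapper). The function name and the bucket and key values are hypothetical.

```python
import json
from urllib.parse import unquote_plus


def s3_objects_from_message(body: str) -> list:
    """Return (bucket, key) pairs for object-created records in a message body."""
    payload = json.loads(body)
    objects = []
    # S3 test events carry no "Records" key, so default to an empty list.
    for record in payload.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue  # ignore deletes and other event types
        s3 = record["s3"]
        # Object keys arrive URL-encoded in notifications (spaces become "+").
        key = unquote_plus(s3["object"]["key"])
        objects.append((s3["bucket"]["name"], key))
    return objects
```

Because the handler only needs a bucket and a key, one plausible way to re-import old logs is to enqueue messages of the same shape for objects that already exist. Whether the talk's implementation worked this way is not stated in the slides.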

  44. Autoscaling!

  45. Trivia Time!

  46. Trivia Time!
    What’s the bottleneck after
    you scale logstash?

  47. Answer
    100% Elasticsearch CPU

  48. Answer
    100% Elasticsearch CPU
    Then it crashes.

  49. Mitigating Burnout
    MGMT 102

  50. Solution: Pairing
    1. Fosters career growth
    2. Cross-training
    3. Team bonding

  51. Summary
    1. Mind the skills of your team

  52. Summary
    1. Mind the skills of your team
    2. Know the next-best alternative

  53. Summary
    1. Mind the skills of your team
    2. Know the next-best alternative
    3. Consider a tool’s community

  54. Thank you!
    ktheory
