Graph Everything!

Oliver Hankeln
September 20, 2013

This is my talk for Monitorama EU 2013. It covers how we at gutefrage.net store our metrics in OpenTSDB and our experiences with it.
It also includes some thoughts on related topics such as alerting and some creative (or crazy?) metrics.

Transcript

  1. Graph everything!
    Oliver Hankeln / gutefrage.net
    Saturday, 21 September 2013

  2. Who am I?
    Senior Engineer - Data and Infrastructure at gutefrage.net GmbH
    Previously worked in software development
    DevOps advocate

  3. Who is Gutefrage.net?
    Germany’s biggest Q&A platform
    #1 German site (mobile), about 5M unique users
    #3 German site (desktop), about 17M unique users
    > 4 million page impressions (PI) per day
    Part of the Holtzbrinck group
    Running several platforms (Gutefrage.net, Helpster.de, Cosmiq, Comprano, ...)

  4. Flight AB6188

  5. What you will get
    How do we store our metrics?
    Our experiences with that setup
    Why the hell are we doing that?
    Some thoughts on metrics

  6. How we store our metrics

  7. Our requirements
    Creating new metrics has to be simple
    No compaction (bye bye RRDtool)
    System has to scale
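
A minimal sketch of why the "no compaction" requirement matters, assuming that "compaction" here refers to consolidating older data points into lower-resolution averages, as a round-robin archive typically does. The helper `consolidate` and the sample values are hypothetical and only illustrate the effect.

```python
def consolidate(values, factor):
    """Average every `factor` consecutive values into one data point.

    This imitates an averaging consolidation step; it is an illustration,
    not the exact behaviour of any particular tool.
    """
    result = []
    for i in range(0, len(values), factor):
        chunk = values[i:i + factor]
        result.append(sum(chunk) / len(chunk))
    return result


# Hypothetical response times in ms, with one short spike.
raw = [10, 10, 10, 250, 10, 10]

print(max(raw))              # the spike is visible in the raw data
print(consolidate(raw, 3))   # after averaging, the spike is flattened
```

With the raw series kept, the 250 ms spike can still be found later; after averaging in groups of three, the highest remaining value is 90.0.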

  8. openTSDB
    Written at StumbleUpon, but open source
    Uses HBase for storage
    Distributed system (multiple TSDs)
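
To make the data model concrete, here is a minimal sketch of how a single data point can be formatted for a TSD using OpenTSDB's line-based `put` command (`put <metric> <timestamp> <value> <tagk=tagv> ...`). The metric name, tags, and the helper `format_put` are assumptions for illustration; they do not come from the talk.

```python
import time


def format_put(metric, value, tags, timestamp=None):
    """Build one line for OpenTSDB's telnet-style 'put' command."""
    if not tags:
        # OpenTSDB expects at least one tag on every data point.
        raise ValueError("at least one tag is required")
    if timestamp is None:
        timestamp = int(time.time())
    # Sort tags so the output is deterministic.
    tag_part = " ".join(f"{key}={val}" for key, val in sorted(tags.items()))
    return f"put {metric} {int(timestamp)} {value} {tag_part}"


# Hypothetical metric and tags.
line = format_put(
    "web.requests.count", 42, {"host": "web01", "dc": "muc"},
    timestamp=1379750400,
)
print(line)
```

The resulting line, terminated by a newline, would be written to a TCP connection to one of the TSDs (port 4242 by default).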

  9. The ecosystem
    App feeds metrics in via RabbitMQ
    We base Icinga checks on the metrics
    We are evaluating Etsy's Skyline for anomaly detection
    We deploy sensors via Chef
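
As a rough sketch of what an Icinga check based on stored metrics could look like: the code below parses data points in the plain-text form `metric timestamp value tag=value ...` and maps the latest value to the usual plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The input format, the metric name, and the thresholds are assumptions; the talk does not show how the actual checks are implemented.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def parse_ascii(text):
    """Extract the values from 'metric timestamp value tags...' lines."""
    values = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            values.append(float(fields[2]))
    return values


def check(values, warn, crit):
    """Return (exit_code, message) for the most recent value."""
    if not values:
        return UNKNOWN, "UNKNOWN - no data points returned"
    latest = values[-1]
    if latest >= crit:
        return CRITICAL, f"CRITICAL - latest value {latest}"
    if latest >= warn:
        return WARNING, f"WARNING - latest value {latest}"
    return OK, f"OK - latest value {latest}"


# Hypothetical query result for a latency metric.
sample = (
    "web.latency.ms 1379750400 120 host=web01\n"
    "web.latency.ms 1379750460 480 host=web01\n"
)
status, message = check(parse_ascii(sample), warn=300, crit=600)
print(status, message)
```

A real plugin would print the message and terminate with `sys.exit(status)` so that Icinga can pick up the state.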

  10. Our experiences

  11. What works well
    We store about 200M data points in several thousand time series with no issues
    tcollector decouples measurement from storage
    Creating new metrics is really easy
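
The decoupling can be illustrated with a minimal collector sketch. tcollector runs small collector scripts and forwards the lines they print to stdout, each in the form `metric timestamp value [tag=value ...]`, so a collector never talks to the storage layer itself. The metric names and the helper `loadavg_lines` below are hypothetical.

```python
import os
import sys
import time


def loadavg_lines(now, load):
    """Format the three load averages as collector output lines."""
    names = ("1min", "5min", "15min")
    return [f"sys.loadavg.{name} {now} {value:.2f}"
            for name, value in zip(names, load)]


def collect_once(out=sys.stdout):
    """Emit one round of measurements; a real collector would loop and sleep."""
    try:
        load = os.getloadavg()
    except (AttributeError, OSError):
        # Load averages are not available on every platform.
        load = (0.0, 0.0, 0.0)
    for line in loadavg_lines(int(time.time()), load):
        print(line, file=out)
    out.flush()


collect_once()
```

Because the script only prints lines, adding a new metric amounts to adding another print statement or dropping in another small script.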

  12. Challenges
    The UI is seriously lacking
    No annotation support out of the box
    Only 1 s time resolution (and only one value per second per time series)
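
One way to live with the limit of one value per second per time series is to reduce sub-second samples to a single value before sending them. The following is a minimal sketch of that idea; the function name, the sample data, and the choice of reducer are assumptions and not something the talk prescribes.

```python
from collections import defaultdict


def aggregate_per_second(samples, reducer=max):
    """Collapse (timestamp, value) samples to one value per whole second.

    `samples` uses float timestamps in seconds; `reducer` decides how
    values that share a second are combined (max, min, sum, ...).
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        buckets[int(timestamp)].append(value)
    return [(second, reducer(values))
            for second, values in sorted(buckets.items())]


# Hypothetical request durations measured several times per second.
samples = [
    (1379750400.10, 12),
    (1379750400.55, 30),
    (1379750400.90, 18),
    (1379750401.20, 7),
]
print(aggregate_per_second(samples))                # keep the maximum
print(aggregate_per_second(samples, reducer=sum))   # or the sum
```

Which reducer fits depends on the metric: a counter is usually summed, while for latencies the maximum or an average is more informative.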

  13. Salvation is coming
    OpenTSDB 2 is around the corner
    Millisecond precision
    Annotations and metadata
    Decent API

  14. Why the hell are we doing this?

  15. Communication
    Replace gut feeling with real data
    Helps to avoid the blame game
    Brains prefer graphs to numbers

  16. Getting insights
    We are moving towards Continuous Deployment
    Complex systems show emergent behaviour
    Graphs are the correct flight level

  17. Lean Startup
    Build - Measure - Learn cycle
    You have to define measurable goals
    No. It’s measure, not guess

  18. Perspectives
    Operations (Server load, traffic, disk space,...)
    Developers (DB queries per page view, JS errors, ...)
    Product Owners (content creation, content quality, ...)
    ...

  19. Some random thoughts

  20. Public display
    Helps everyone feel involved
    n+1 eyes see more than n eyes
    Needs a culture of trust

  21. Alerting
    Fixed values for alerts are not good enough
    Drawing Attention vs. Alerting
    False positives are bugs
    Don’t call the on-call guy for nothing
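
One common alternative to a fixed alert value is a threshold derived from the recent history of the metric itself. The sketch below flags a value only if it lies more than `k` standard deviations away from the recent mean. This is not the method used in the talk, just an illustration of the idea; the function name, the sample data, and `k=3.0` are assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history, current, k=3.0):
    """Return True if `current` deviates strongly from recent history."""
    if len(history) < 2:
        # Not enough data to estimate the spread; do not alert.
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change is a deviation.
        return current != mu
    return abs(current - mu) > k * sigma


# Hypothetical requests-per-second readings from the last few minutes.
history = [100, 102, 98, 101, 99]

print(is_anomalous(history, 103))  # within the usual spread
print(is_anomalous(history, 120))  # far outside the usual spread
```

A result like this could first be used to draw attention (for example by highlighting a graph) and only page someone once it persists, which helps keep false positives away from the on-call person.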

  22. Metrics != boring
    You can (and should) get creative with what you measure.
    Have some brainstorming sessions
    Insights may come from surprising places

  23. Track team happiness
    There is no fixed scale
    It forces you to communicate
    If you listen you can find problems in the team

  24. Track ops confidence
    Create a platform where you can buy or sell your on-call shifts.
    The price for a shift tells you how confident the team is.
    This has not been tested - yet.

  25. Track recruiting efforts
    Helps to get a feel for the job market
    Reminds everyone to keep looking for new colleagues
    BTW: we are hiring ;-)

  26. Questions?
    Please contact me:
    [email protected]
    @mydalon
    I’ll upload the slides and tweet about it

  27. One more thing

  28. Please give feedback!
    [email protected]
    @mydalon

  29. Image Sources:
    Plane: Felix Gottwald - www.felixgottwald.net (Creative Commons Attribution Share Alike 3.0 Germany)
    Talking men: Deutsche Fotothek - Peter, Richard sen.
    Money: Wikimedia contributor Avij
    Other images: Oliver Hankeln
    This presentation is licensed under Creative Commons Attribution Share Alike 3.0