Berlin 2013 - Session - Jeff Weinstein

Monitorama
September 20, 2013

Transcript

  1. How monitoring can improve the rest of the company

    Monitorama EU 2013
    @jeff_weinstein

  2. I
    real-time
    and batch
    data analytics

  3. Monitoring can wildly improve the whole company by
    sharing data and sharing techniques.

  4. Monitoring Folks
    Developers
    Business Analysts
    Executives & Product
    Data Scientists
    Data

  5. Apps & Services & Systems
    Users
    Data
    Code & Config
    Monitoring

  6. Some problems…

  7. Data Processing
    Apps
    Systems
    Logs / Events
    Metrics
    Graphs & Alerts
    Apps
    3rd Party
    Reports & Queries
    ETL
    Analytic Systems
    Monitoring: Streaming
    BI: Batch

  8. Data Needs
    Logs   Metrics   Logs   Metrics
    Streaming   Batch
    Data
    Monitoring
    BI

  9. Data Tools Stack
    Monitoring
    • Ad hoc
      – sed, grep, awk
      – ES, LogStash, Splunk, …
    • Storage
      – Hosts, Ganglia, OTSDB
      – Central syslog server
    • Visualization/Reporting
      – Graphite, RRDTool, 3rd party
      – Homegrown
    • Alerting/Escalation
      – Nagios, Sensu, PagerDuty, …
    Rest of company
    • Ad hoc
      – Excel, SQL, Hive
      – MapReduce, …
    • Storage
      – Lots o’ databases, Excel
      – Hadoop, RDBMS…
    • Visualization/Reporting
      – Excel, R, Tableau ...
      – Dinosaur apps, …
    • Alerting/Escalation
      – nada

  10. Metrics

  11. Views
    Unintelligible generated views
    Too granular for long term trends
    Lack of historical
    Intolerant to anomalies

  12. Team and incentives
    • What team?
    • Change vs. reliability
    • Planning
    • Budget
    • Churn

  13. Good or bad?
    • Specific Tools
    • Decentralized
    • Focus
    • Ownership
    • Lost context
    • Siloed work
    • Data dark
    • Misunderstanding

  14. Some fixes

  15. End to End Data Pipeline
    ✓ Structured logs
    ✓ (Config)
    ✓ Measure once
    ✓ Automatic metrics
    ✓ API
    ✓ Graph tools
    ✓ Glossary
    ✓ Annotations and tags
    ✓ Pipeline

  16. Structured events
    • JSON (or whatever)
    • (optional) config
    • Tags per key
      – Type
      – Tag: latency, funnel,…
      – Description
      – Storage
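
The slide lists the pieces (a JSON event, an optional config, and per-key type, tag, description, and storage) without showing a concrete format. The sketch below is one possible shape, not the speaker's actual schema: the `EVENT_CONFIG` layout, the key names, the storage labels, and the `make_event` helper are all assumptions made for illustration.

```python
import json

# Hypothetical per-key config. The four attributes mirror the slide
# (type, tag, description, storage); the concrete values are invented.
EVENT_CONFIG = {
    "checkout_latency_ms": {
        "type": "float",
        "tag": "latency",
        "description": "Time to complete checkout, in milliseconds",
        "storage": "long_term",
    },
    "signup_step": {
        "type": "str",
        "tag": "funnel",
        "description": "Name of the signup funnel step reached",
        "storage": "short_term",
    },
}

# Maps the config's type names to the Python types accepted for them.
_TYPES = {"float": (int, float), "str": (str,)}


def make_event(key, value, config=EVENT_CONFIG):
    """Return one event as a JSON string, tagged from the per-key config.

    Raises KeyError for keys missing from the config and TypeError when the
    value does not match the configured type.
    """
    meta = config[key]
    if isinstance(value, bool) or not isinstance(value, _TYPES[meta["type"]]):
        raise TypeError(f"{key} expects {meta['type']}, got {type(value).__name__}")
    return json.dumps({"key": key, "value": value, "tag": meta["tag"]}, sort_keys=True)
```

Keeping the description and storage hint in the config rather than in each event keeps the events small while still letting downstream tools look up what a key means.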

  17. Auto: Graphs, Glossary, & Storage
    • Graphs and dashboards
    • * templates
    • Views and stats
    • Glossary
    • Batch analytics
    • Long term storage
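
If each event key carries a type, tag, and description, a glossary and a per-tag dashboard grouping can be generated rather than written by hand. The slide does not say how this was done, so the following is only a sketch under that assumption. The config layout, the sample keys, and both function names are hypothetical.

```python
# Hypothetical per-key metadata; layout and values are illustrative only.
SAMPLE_CONFIG = {
    "checkout_latency_ms": {
        "type": "float",
        "tag": "latency",
        "description": "Time to complete checkout, in milliseconds",
    },
    "search_latency_ms": {
        "type": "float",
        "tag": "latency",
        "description": "Time to return search results, in milliseconds",
    },
    "signup_step": {
        "type": "str",
        "tag": "funnel",
        "description": "Name of the signup funnel step reached",
    },
}


def build_glossary(config):
    """Return one glossary line per key, sorted by key name."""
    return [
        f"{key} [{meta['tag']}, {meta['type']}]: {meta['description']}"
        for key, meta in sorted(config.items())
    ]


def keys_by_tag(config):
    """Group keys by tag, e.g. to seed one dashboard per tag."""
    grouped = {}
    for key in sorted(config):
        grouped.setdefault(config[key]["tag"], []).append(key)
    return grouped
```

With the sample config, `keys_by_tag` groups the two latency keys together and leaves `signup_step` under `funnel`, which is the kind of grouping a dashboard template could consume.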

  18. build  
    learn  
    communicate  
    inspire  

  19. Developers
    • Logging toolkit
    • Data pipeline
    • Pain points
    • Outage causes
    • Deployment practices
    • Escalation playbook
    • Measurement as TDD
    • Monitor staging env

  20. Business Analysts
    • Structured logs
    • Config for ETL
    • Metrics definitions
    • Slices and visualizations
    • Data size and cardinality
    • Outages and delays
    • Flexibility
    • Visualization and tools

  21. Data Scientists
    • Access to (meta)data
    • Query monitoring
    • Statistics and models
    • New data streams
    • Context of data issues
    • What’s in the logs
    • Validate algorithms
    • Teach stats and models!

  22. Product & Executives
    • Curated dashboards
    • Graph/alert tools
    • Learn the business
    • Prioritize alerts by $
    • Incident post mortems
    • Metrics granularity
    • Data driven decisions
    • Recognize and celebrate

  23. Monitoring can become the data
    platform and improve all teams
    with its techniques.

  24. Icons from The Noun Project: Dmitry Baranovskiy, Benjamin Orlovski, Luis Prado, MikaDo Nguyen, Yarden Gilboa, Javier Cabezas, Icons Pusher, Jeremy Bristol, Blake Thomas, Ritika Khasgiwale,
    Mayene de Leon, Yorlmar Campos, Sergey Shmid
    @jeff_weinstein
    Thanks! hiring ;)
