
Searching and Alerting at Naver

Elastic Co
December 16, 2015




  1. Searching and Alerting for application logs with Elasticsearch at Naver
     2015/12/16
     Jaeik Lee | Seungjin Lee
  2. Agenda
     • Introduction to our system
     • How we use Elasticsearch
     • Real-time alert using percolator
  3. In-house Log System
     By Jaeik Lee

  4. Application/Crash Logs in Naver
     Various platforms and formats
     Common requirements for log handling
  5. Limitation of Previous System NELO
     [Diagram labels: Applications; Log collect & aggregate; Log Search & Management; Merge Bottleneck; Slow Search & Limited Query]
  6. What we need
     • Full Text Search
        • Unstructured query
     • Real-time
        • Fast search and alerting so developers can handle system faults quickly
     • Scalable
        • As the number of logs increases
     • Schema Free
        • Handle various types of logs
  7. NELO2 with Elasticsearch

  8. Scale
     • 8 clusters (7 in production, 1 in stage)
     • 229 nodes (152 data nodes)
     • 1.5 billion incoming logs per day (size: 2 TB)
     • Total documents: 105 billion (size: 160 TB)
  9. How we use Elasticsearch
     By Jaeik Lee

  10. Index Model
     • 1 index per day -> index lifecycle management based on day
     • One type per project -> mapping variance per project
     • Various retention times according to the instances (1M, 3M, 2Y, 5Y)
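The day-based index model above could be sketched as follows. This is only an illustration: the index name prefix `logs`, the date format, and the mapping of the retention labels to day counts are assumptions, not details given in the talk.

```python
from datetime import date

# Approximate day counts for the retention labels on the slide (assumption).
RETENTION_DAYS = {"1M": 30, "3M": 90, "2Y": 730, "5Y": 1825}


def daily_index_name(day: date, prefix: str = "logs") -> str:
    """Build the name of the index that holds one day of logs (hypothetical naming)."""
    return f"{prefix}-{day:%Y.%m.%d}"


def is_expired(index_day: date, today: date, retention: str) -> bool:
    """Return True when a daily index is older than the chosen retention period."""
    return (today - index_day).days > RETENTION_DAYS[retention]
```

With one index per day, expiring old data amounts to dropping whole indices whose day falls outside the retention period, rather than deleting individual documents.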
  11. Indexing with River (Previous)
     • Elasticsearch Kafka River plugin
        • Consume Kafka topics and index to Elasticsearch
     • Problems
        • Performance
        • Unstable
        • Difficult to debug
        • Deployment dependency
  12. Indexing with Storm (Current)
     • Guaranteed log processing (at-least-once, exactly-once semantics)
     • Easy to scale out according to the amount of logs
  13. Routing Basics
     • Shard = hash(routing) % number of primary shards
     • Routing
        • Default routing: document id
        • Routing parameter: user decides the routing value
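The routing formula on this slide can be illustrated in a few lines. This is a minimal sketch, not Elasticsearch's implementation: `zlib.crc32` stands in for the internal hash function, and `shard_for` is a hypothetical helper name.

```python
import zlib


def shard_for(routing: str, num_primary_shards: int) -> int:
    """Map a routing value to a shard number: hash(routing) % number of primary shards."""
    if num_primary_shards <= 0:
        raise ValueError("num_primary_shards must be positive")
    # crc32 is only a stand-in for Elasticsearch's internal hash (assumption).
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards
```

Because the result depends only on the routing value and the shard count, the same routing value always lands on the same shard, which is also why the number of primary shards cannot change without reindexing.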
  14. Custom Routing
     • Use custom routing both in index & search
     • Small project: store only in one shard (custom routing: project name)
     • Big project: distribute logs over all shards (default routing)
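A rough sketch of the small/big project split described above. The shard count, the `BIG_PROJECTS` set, and both helper names are hypothetical, and `zlib.crc32` again stands in for Elasticsearch's internal hash.

```python
import zlib

NUM_PRIMARY_SHARDS = 8          # hypothetical shard count
BIG_PROJECTS = {"big-project"}  # hypothetical set of high-volume projects


def routing_value(project: str, doc_id: str) -> str:
    """Small projects route by project name; big projects fall back to the document id."""
    return doc_id if project in BIG_PROJECTS else project


def shard_for(routing: str) -> int:
    # crc32 is a stand-in for Elasticsearch's internal hash (assumption).
    return zlib.crc32(routing.encode("utf-8")) % NUM_PRIMARY_SHARDS


# Every log of a small project shares one routing value, so it lands on one shard.
small_shards = {shard_for(routing_value("small-project", f"log-{i}")) for i in range(100)}
# Logs of a big project are routed by document id and spread across shards.
big_shards = {shard_for(routing_value("big-project", f"log-{i}")) for i in range(100)}
```

Using the same routing value at search time lets a query for a small project touch a single shard instead of fanning out to all of them.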
  15. Topology of a Cluster
     • Master Nodes (node.master: true): membership management, metadata
     • Data Nodes (node.data: true): data store & processing
     • Client Nodes (node.master: false, node.data: false): load balancer
     [Diagram labels: Search, Search, Index, Index]
  16. Layering for cold & hot data
     • Recent 1 week of data on SSD
     • Node attribute based
        • box_type: SSD|HDD
     [Diagram labels: Search, Hot Data, Index, Warm Data]
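One way to express the SSD/HDD layering is through index-level allocation settings keyed on the `box_type` node attribute. The sketch below only builds the settings body. The setting name follows Elasticsearch's shard allocation filtering convention (`index.routing.allocation.require.<attribute>`); whether NELO2 uses `require` or another filter type is not stated in the slides, and the helper name is hypothetical.

```python
HOT_DAYS = 7  # "recent 1 week" from the slide


def allocation_settings(index_age_days: int) -> dict:
    """Return index settings that pin an index to SSD or HDD nodes by its age in days."""
    box_type = "SSD" if index_age_days < HOT_DAYS else "HDD"
    # Nodes are assumed to be tagged with a matching box_type attribute.
    return {"index.routing.allocation.require.box_type": box_type}
```

Applying the HDD settings to an index once it is a week old would cause its shards to be relocated off the SSD nodes.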
  17. What we are improving
     • Index Structure
        • Balancing shard distribution
        • Isolating small projects from big projects
     • Mapping
        • Multi-fields: remove complexity of analyzed/not analyzed fields
        • Supporting numeric types
     • Monitoring Dashboard
        • Watching key metrics for clusters in one place
  18. Real-time Alert
     By Seungjin Lee

  19. Realtime notification in NELO2
     • About what?
        • User-specific condition including Elasticsearch query, threshold and interval
        • If logs matching ${query} come ${threshold} times within ${interval}, notify me!
     • When?
        • Immediately after a condition matches, within a second
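The condition "logs matching ${query} come ${threshold} times within ${interval}" can be sketched as a sliding window over match timestamps. This is an in-memory illustration only: the talk's stack keeps its state in Redis, the class and field names are hypothetical, and timestamps are assumed to arrive in order.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class AlertRule:
    query: str         # Elasticsearch query string the rule watches for
    threshold: int     # number of matches that triggers a notification
    interval_s: float  # window length in seconds
    _hits: deque = field(default_factory=deque, repr=False)

    def record_match(self, ts: float) -> bool:
        """Record a matching log at time ts; return True when the rule should fire."""
        self._hits.append(ts)
        # Drop matches that have fallen out of the window ending at ts.
        while self._hits and self._hits[0] <= ts - self.interval_s:
            self._hits.popleft()
        return len(self._hits) >= self.threshold
```

For example, a rule with `threshold=3` and `interval_s=60` fires on the third match if all three arrive within the same 60 seconds, and stays quiet if the earlier matches have already aged out.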
  20. Realtime notification in NELO2
     • Architecture stack
        • Elasticsearch, especially the Percolator
        • Apache Storm, Redis, and Apache Kafka
     • Data load
        • 1.5 billion logs per day against 2,000+ user-defined rules
  21. Why realtime notification is important   [Diagram label: Application]
  22. Why realtime notification is important   [Diagram label: Application]
  23. Why realtime notification is important   [Diagram label: Application]
  24. Why realtime notification is important   [Diagram label: Application]
  25. Why realtime notification is important   [Diagram label: Application]

  26. Initial idea we had for a new notifier server

  27. Percolator  API,  what  is  it?   Search


  29. Percolator  API,  what  is  it?   "id"

  30. Elasticsearch query as an alert query
     • A variety of alert queries are supported
        • projectName:"Elasticon" AND (body:/.*securityfailexception|.*sessionfailexception/)
        • (project:"Elasticon" OR project:"Logstash") AND body:"exception" NOT source:"session-request"
     • Consistent query syntax in both search and alerts
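To make the idea of "a search query stored as an alert rule" concrete, the sketch below builds the two request shapes used by the percolator in the Elasticsearch 1.x/2.x line that was current at the time of this talk: registering a query under the `.percolator` type, and percolating a document against the registered queries. No request is sent. The index name, type name, rule id, and helper names are hypothetical.

```python
def register_alert_query(index: str, rule_id: str, query_string: str):
    """Build the request that stores an alert query in the percolator."""
    path = f"/{index}/.percolator/{rule_id}"
    body = {"query": {"query_string": {"query": query_string}}}
    return "PUT", path, body


def percolate_log(index: str, doc_type: str, log: dict):
    """Build the request that asks which stored queries match a single log."""
    path = f"/{index}/{doc_type}/_percolate"
    return "GET", path, {"doc": log}


rule = register_alert_query(
    "alerts",  # hypothetical index name
    "rule-1",  # hypothetical rule id
    '(project:"Elasticon" OR project:"Logstash") AND body:"exception" NOT source:"session-request"',
)
check = percolate_log("alerts", "log", {"project": "Elasticon", "body": "exception in handler"})
```

Since the stored rule is an ordinary query string, the same text a developer uses in log search can be reused as the alert condition.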
  31. Optimizing percolator performance
     • Routing
     [Diagram labels: Shard ×6, percolate]
  32. Optimizing percolator performance
     • Load balancing
     [Diagram labels: Shard ×6, percolate]
  33. Optimizing percolator performance
     • Filtering
     [Diagram labels: Shard, Shard]
  34. Optimizing percolator performance
     • Filtering
     [Diagram labels: Shard, Shard]

  35. Realtime cache invalidation
     • API server sends a message to Kafka when an alert rule is updated
     • Notification server, which listens to the corresponding Kafka topic, invalidates the outdated cache immediately
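The invalidation flow above can be sketched without Kafka by treating the update message as a plain method call. This is an assumption-laden illustration: `RuleCache`, `load_rule`, and `on_rule_updated` are hypothetical names, and in the described system the callback would be driven by a consumer on the Kafka topic.

```python
from typing import Callable, Dict


class RuleCache:
    """In-memory cache of alert rules that drops an entry when an update message arrives."""

    def __init__(self, load_rule: Callable[[str], dict]):
        self._load_rule = load_rule       # fetches the current rule from the source of truth
        self._rules: Dict[str, dict] = {}

    def get(self, rule_id: str) -> dict:
        # Load lazily on first access, then serve from memory.
        if rule_id not in self._rules:
            self._rules[rule_id] = self._load_rule(rule_id)
        return self._rules[rule_id]

    def on_rule_updated(self, rule_id: str) -> None:
        # Called for each "rule updated" message; the next get() reloads the rule.
        self._rules.pop(rule_id, None)
```

Evicting the entry rather than pushing the new rule through the message keeps the message small and leaves a single place where rules are loaded.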
  36. Rolling window aggregation   [Diagram label: log]
  37. Rolling window aggregation   [Diagram label: log]
  38. Rolling window aggregation   [Diagram label: 10:30:00.000]

  39. Visualization of the data flow
     [Diagram labels: Elasticsearch, Redis, Log collector, API server, Kafka]
  40. Q&A  

  41. Note  

  42. Note  

  43. Note  

  44. Note  

  45. Note  

  46. Note  

  47. www.elastic.co