Searching and Alerting for Application Logs with Elasticsearch at Naver

Elastic Co
December 16, 2015

Naver, the biggest Internet content service provider in South Korea, develops and distributes mobile and server applications such as the Naver search portal and the Line and Band messengers. Naver developers need to check application logs that are stored on hundreds of servers and sometimes only on customers' devices. NELO2 is Naver's in-house logging system, built with Elasticsearch, indexing and percolating 1.5 billion logs daily with 126 TB of logs stored in seven Elasticsearch clusters.

Jae Ik Lee and Seung Jin Lee | Elastic{ON} Tour 2015 | Tokyo, Japan

Transcript

  1. Searching and Alerting for application logs with Elasticsearch at Naver
     2015/12/16, Jaeik Lee | Seungjin Lee

     Agenda
     • Introduction to our system
     • How we use Elasticsearch
     • Real-time alert using percolator
  2. In-house Log System
     By Jaeik Lee

     Application/Crash Logs in Naver
     Various Platform & Format
     Common Requirement for log handling
  3. Limitation of Previous System NELO
     Applications
     Log collect & aggregate
     Log Search & Management
     Merge Bottleneck
     Slow Search & Limited Query

     What we need
     • Full Text Search
       • Unstructured query
     • Real-time
       • Fast search and alerting so developers can handle system faults quickly
     • Scalable
       • As the number of logs increases
     • Schema Free
       • Handle various types of logs
  4. NELO2 with Elasticsearch

     Scale
     • 8 Clusters (7 in production, 1 in stage)
     • 229 Nodes (152 data nodes)
     • 1.5 billion incoming logs per day (size: 2 TB)
     • Total Documents: 105 billion (size: 160 TB)
  5. How we use Elasticsearch
     By Jaeik Lee

     Index Model
     • 1 Index per day -> index lifecycle management based on day
     • Type for project -> mapping variance per project
     • Various retention time according to the instances (1M, 3M, 2Y, 5Y)
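The one-index-per-day model makes retention a matter of dropping whole indices. A minimal sketch of that idea follows, assuming a hypothetical `prefix-YYYY.MM.DD` naming scheme and approximate day counts for the 1M / 3M / 2Y / 5Y tiers; neither the naming scheme nor the day counts are given in the deck.

```python
from datetime import date, datetime, timedelta

# Assumed day counts for the retention tiers named on the slide.
RETENTION_DAYS = {"1M": 30, "3M": 90, "2Y": 730, "5Y": 1825}


def daily_index_name(prefix: str, day: date) -> str:
    """Build the name of the index that holds logs for one day (hypothetical scheme)."""
    return f"{prefix}-{day:%Y.%m.%d}"


def expired_indices(prefix: str, retention: str, today: date, existing: list) -> list:
    """Return the daily indices under `prefix` that are older than the retention tier."""
    cutoff = today - timedelta(days=RETENTION_DAYS[retention])
    expired = []
    for name in existing:
        if not name.startswith(prefix + "-"):
            continue  # index belongs to another prefix
        day = datetime.strptime(name[len(prefix) + 1:], "%Y.%m.%d").date()
        if day < cutoff:
            expired.append(name)
    return expired
```

Because each day is its own index, expiring data means deleting the returned indices rather than deleting individual documents.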
  6. Indexing with River (Previous)
     • Elasticsearch Kafka River plugin
       • Consume Kafka topics and index to Elasticsearch
     • Problems
       • Performance
       • Unstable
       • Difficult to debug
       • Deployment dependency

     Indexing with Storm (Current)
     • Guarantees that logs are processed (at-least-once, exactly-once semantics)
     • Easy to scale out according to the amount of logs
  7. Routing Basics
     • Shard = hash(routing) % number of primary shards
     • Routing
       • Default routing: document id
       • Routing parameter: the user decides the routing value

     Custom Routing
     • Use custom routing both in index & search
     • Small project: store only in one shard (custom routing: project name)
     • Big project: distribute logs over all shards (default routing)
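The routing rule above can be sketched in a few lines. This is an illustration only: `zlib.crc32` stands in for Elasticsearch's internal hash function, and `routing_value` and the `big_projects` set are hypothetical names for the small-project / big-project decision described on the slide.

```python
import zlib


def shard_for(routing: str, num_primary_shards: int) -> int:
    """Shard = hash(routing) % number of primary shards (crc32 is a stand-in hash)."""
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


def routing_value(project: str, doc_id: str, big_projects: set) -> str:
    """Small projects route by project name; big projects keep the default (document id)."""
    return doc_id if project in big_projects else project


# Every log of a small project carries the same routing value, so it lands on one shard.
small_shards = {
    shard_for(routing_value("small-app", f"doc-{i}", {"big-app"}), 5)
    for i in range(100)
}
```

Since the same routing value is used at search time, a query for a small project only needs to visit that single shard instead of every shard in the index.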
  8. Topology of a Cluster
     • Master Nodes (node.master: true): Membership management, Metadata
     • Data Nodes (node.data: true): Data store & processing
     • Client Nodes (node.master: false, node.data: false): load balancer
     (Diagram labels: Search, Index)

     Layering for cold & hot data
     • Recent 1 Week Data in SSD
     • Node Attribute based
       • box_type: SSD|HDD
     (Diagram labels: Search, Hot Data, Index, Warm Data)
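The attribute-based layering can be expressed as a per-index allocation setting. The sketch below assumes data nodes are tagged with a custom `box_type` node attribute, as on the slide, and uses Elasticsearch's `index.routing.allocation.require.<attribute>` index setting; the helper name and the exact 7-day boundary are assumptions.

```python
from datetime import date

HOT_DAYS = 7  # "Recent 1 Week Data in SSD"


def allocation_settings(index_day: date, today: date) -> dict:
    """Return index settings that pin a daily index to SSD or HDD nodes by age."""
    age_in_days = (today - index_day).days
    box_type = "SSD" if age_in_days < HOT_DAYS else "HDD"
    # Shards of the index are only allocated to nodes whose box_type attribute matches.
    return {"index.routing.allocation.require.box_type": box_type}
```

Applying the returned settings to an index once it is older than a week would let Elasticsearch relocate its shards from SSD nodes to HDD nodes.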
  9. What we are improving
     • Index Structure
       • Balancing shard distribution
       • Isolating small project from big project
     • Mapping
       • Multi-fields: remove complexity of analyzed/not analyzed fields
       • Supporting numeric types
     • Monitoring Dashboard
       • Watching key metrics for clusters in one place

     Real-time Alert
     By Seungjin Lee
  10. Realtime notification in NELO2
     • About what?
       • User-specific condition including Elasticsearch query, threshold and interval
       • If logs matching ${query} come ${threshold} times within ${interval}, notify me!
     • When?
       • Immediately after a condition matches, within a second

     Realtime notification in NELO2
     • Architecture stack
       • Elasticsearch, especially the Percolator
       • Apache Storm, Redis, and Apache Kafka
     • Data load
       • For 1.5 billion logs per day against 2,000+ user-defined rules
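The "${threshold} times within ${interval}" condition can be illustrated with a small sliding-window counter. This is a sketch under assumptions, not the NELO2 implementation: the class name is hypothetical, state is kept in memory (the deck lists Redis in the stack but does not say how counts are stored), timestamps are assumed to arrive in non-decreasing order, and matching a log against `${query}` is assumed to have happened upstream in the percolator.

```python
from collections import deque


class ThresholdRule:
    """Fires when `threshold` matching logs arrive within `interval_seconds`."""

    def __init__(self, threshold: int, interval_seconds: float):
        self.threshold = threshold
        self.interval = interval_seconds
        self._hits = deque()  # timestamps of recent matches, oldest first

    def record_match(self, timestamp: float) -> bool:
        """Record one matching log; return True if the rule should notify now."""
        self._hits.append(timestamp)
        # Drop matches that fell out of the rolling window.
        while self._hits and self._hits[0] <= timestamp - self.interval:
            self._hits.popleft()
        if len(self._hits) >= self.threshold:
            self._hits.clear()  # start counting afresh after a notification
            return True
        return False
```

Clearing the window after a notification is one possible design choice; it avoids sending a new alert for every further log in the same burst.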
  11. Optimizing percolator performance
     • Routing
     (Diagram labels: six boxes labeled Shard, percolate)
  12. Realtime cache invalidation
     • API server sends message to Kafka when an alert rule is updated
     • Notification server which is listening to the corresponding topic in Kafka
       invalidates the outdated cache immediately

     Rolling window aggregation
     log
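The cache invalidation flow on this slide can be sketched as a rule cache with an update handler. Everything named here is hypothetical: `RuleCache`, the `load_rule` callback and the `rule_id` message field are illustrative, and the Kafka consumer that would call `on_rule_updated` for each message on the topic is left out.

```python
class RuleCache:
    """In-memory cache of alert rules that is invalidated by update messages."""

    def __init__(self, load_rule):
        self._load_rule = load_rule  # hypothetical loader, e.g. a call to the API server
        self._rules = {}

    def get(self, rule_id: str) -> dict:
        """Return the cached rule, loading it on first use or after invalidation."""
        if rule_id not in self._rules:
            self._rules[rule_id] = self._load_rule(rule_id)
        return self._rules[rule_id]

    def on_rule_updated(self, message: dict) -> None:
        """Handle one 'rule updated' message by dropping the outdated entry."""
        self._rules.pop(message["rule_id"], None)


# Usage: the store dict plays the role of the API server's rule storage.
store = {"r1": {"threshold": 3}}
cache = RuleCache(lambda rule_id: dict(store[rule_id]))
before = cache.get("r1")["threshold"]
store["r1"] = {"threshold": 5}
stale = cache.get("r1")["threshold"]      # still the cached value
cache.on_rule_updated({"rule_id": "r1"})  # message received from the topic
after = cache.get("r1")["threshold"]      # reloaded after invalidation
```

Dropping the entry instead of updating it in place keeps the handler simple: the next lookup reloads the current rule from its source.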