Massive Data Aggregation, Processing, Monitoring and Visualization with Apache Flume, ElasticSearch and D3.js

Introduces Apache Flume and shows how it can aggregate data from hundreds or thousands of servers into a centralized data store such as ElasticSearch, Apache Solr or HDFS.

It then introduces visualization concepts using the data retrieved from the centralized data store.

Israel Ekpo

May 18, 2013

Transcript

  1. Massive Data Aggregation, Processing, Monitoring and Visualization with Apache Flume, ElasticSearch and D3.js
     Part I
     Israel Ekpo
  2. About the Presenter
     (650) 318-1195 * BigData@aicer.org
     • Father, Husband, Son and Brother
     • Computer Scientist
     • Big Data Enthusiast
     • Data Science Practitioner
     • Contributor to Open Source Projects
     • Loves to learn
  3. About the Tools
     • Apache Flume (NG) – Data Aggregation
     • ElasticSearch – Full text Search
     • HDFS – Distributed File System
     • D3.js – Data Visualization
  4. Sources of Data
     • Application-Generated Data
     • Network Traffic
     • Social Media: Twitter, Google+, Facebook
     • Email Sources: Mailing List Subscriptions
  5. Summary of Flume Architecture
  6. Key Concepts
     • Event – Basic unit of data
     • Source – Receives events into Flume
     • Channel – Buffers events for pickup later
     • Sink – Picks up events from channel
     • Source Interceptors
     • Channel Selectors (Replicating/Multiplexing)
  7. Anatomy of an Event
     An event is the basic unit of data within Flume
     {
       "headers" : {
         "nameOfHeader1" : "valueOfHeader1",
         "nameOfHeader2" : "valueOfHeader2",
         "nameOfHeader3" : "valueOfHeader3",
         "nameOfHeader4" : "valueOfHeader4",
         "nameOfHeader5" : "valueOfHeader5"
       },
       "body" : "This is the body of the event"
     }
  8. Source Interceptors
     Used to modify/drop events in flight.
     • Timestamp Interceptor
     • Host Interceptor
     • Static Interceptor
     • Regex Filtering Interceptor
     • Regex Extractor Interceptor
     • Custom Interceptors (of course)
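The idea of modifying or dropping events in flight can be sketched as a chain of functions. Flume's interceptors are Java classes; the Python below is only a simplified model, and the dict-based event, the function names and the `datacenter` header are all hypothetical.

```python
import re
import time
from typing import Any, Callable, Dict, List, Optional

# Hypothetical stand-in for a Flume event: {"headers": {...}, "body": "..."}.
Event = Dict[str, Any]
Interceptor = Callable[[Event], Optional[Event]]


def timestamp_interceptor(event: Event) -> Optional[Event]:
    """Add a millisecond "timestamp" header if the event has none."""
    event["headers"].setdefault("timestamp", str(int(time.time() * 1000)))
    return event


def static_interceptor(key: str, value: str) -> Interceptor:
    """Build an interceptor that adds a fixed header to every event."""
    def intercept(event: Event) -> Optional[Event]:
        event["headers"].setdefault(key, value)
        return event
    return intercept


def regex_filter_interceptor(pattern: str) -> Interceptor:
    """Build an interceptor that drops events whose body does not match."""
    compiled = re.compile(pattern)

    def intercept(event: Event) -> Optional[Event]:
        return event if compiled.search(event["body"]) else None
    return intercept


def run_chain(events: List[Event], interceptors: List[Interceptor]) -> List[Event]:
    """Pass each event through every interceptor; None means it was dropped."""
    kept = []
    for event in events:
        for intercept in interceptors:
            event = intercept(event)
            if event is None:
                break  # dropped in flight, skip the remaining interceptors
        else:
            kept.append(event)
    return kept
```

A chain such as `[timestamp_interceptor, static_interceptor("datacenter", "east"), regex_filter_interceptor(r"\b500\b")]` would stamp every event and keep only those whose body mentions 500.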
  9. Custom Interceptors
     org.apache.flume.interceptor.Interceptor
       void initialize();
       Event intercept(Event event);
       List<Event> intercept(List<Event> events);
       void close();
  10. Channel Selectors
     • Replicating Selector – duplicates a single event to one or more channels.
     • Multiplexing Selector – contextually selects which channels to route an event to depending on values in the event header.
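The difference between the two selectors can be sketched in a few lines. This is a simplified Python model, not Flume's Java implementation; the channel names, the `httpResponseCode` header and the function names are assumptions made for illustration.

```python
from typing import Any, Dict, List

# Hypothetical stand-in for a Flume event: {"headers": {...}, "body": "..."}.
Event = Dict[str, Any]


def replicating_selector(event: Event, channels: List[str]) -> List[str]:
    """Return every channel: the event is copied to all of them."""
    return list(channels)


def multiplexing_selector(
    event: Event,
    header: str,
    mapping: Dict[str, List[str]],
    default: List[str],
) -> List[str]:
    """Pick channels from the value of one header, else use the default."""
    return mapping.get(event["headers"].get(header), default)
```

With `mapping = {"500": ["errors"]}` and `default = ["access"]`, an event whose `httpResponseCode` header is `"500"` would be routed to `errors`, and any other event to `access`.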
  11. Data Ingestion: Flume Sources
     • HTTP Source
     • Avro Source
     • Spooling Directory Source
     • Exec Source
     • NetCat Source
     • Syslog (TCP and UDP)
     • Thrift Source
     • Scribe Source
  12. Data Ingestion: Flume Sources
     Custom Sources
     If the built-in sources that ship with Flume cannot satisfy your needs, you can easily create a custom Flume source that takes in data in the format you want and forwards it to the next phase (channels).
  13. Channels: Buffering/Storage
     • Memory Channel – Volatile, Faster
     • File Channel – Persistent, Reliable
     Stores events received from sources until they are ready to be drained by a sink.
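The buffering role of a channel can be modelled with a bounded in-memory queue. This is a minimal Python sketch of the concept only; the class name and its methods are hypothetical and do not mirror Flume's channel API. Being in memory, its contents are lost if the process stops, which is the trade-off the Memory Channel bullet describes.

```python
from collections import deque
from typing import Any, Deque, Dict, List

Event = Dict[str, Any]


class MemoryChannelSketch:
    """Bounded FIFO buffer sitting between a source and a sink."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._queue: Deque[Event] = deque()

    def put(self, event: Event) -> None:
        """Called by a source; refuses new events once the buffer is full."""
        if len(self._queue) >= self.capacity:
            raise OverflowError("channel is full")
        self._queue.append(event)

    def take(self, max_events: int) -> List[Event]:
        """Called by a sink; returns up to max_events in arrival order."""
        batch = []
        while self._queue and len(batch) < max_events:
            batch.append(self._queue.popleft())
        return batch

    def __len__(self) -> int:
        return len(self._queue)
```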
  14. Channels: Buffering/Storage
     Custom Channels
     If you have needs that cannot be met with the built-in channels that ship with Flume, you can also create your own custom channel.
  15. Sinks: Storage
     Drains the events from the channels to a centralized data store:
     • ElasticSearch Sink
     • HDFS Sink
     • Avro Sink
     • IRC Sink
     • HBase Sink
     • AsyncHBase Sink
  16. Sinks: Storage
     Custom Sinks
     If you have different needs, you can also roll out your own custom sinks to storage endpoints like:
     • Apache Solr
     • CouchBase
     • MongoDB
     • Neo4j
  17. ElasticSearch Sink
     • Retrieves events from the channel.
     • Serializes event into an ElasticSearch doc.
     • Documents are buffered and sent as a batch.
     • Sends docs in bulk to the ElasticSearch server.
     • Commits the channel transaction if successful.
     • Repeats retrieval of new events from channel
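The sink cycle listed above can be sketched as one batch-and-commit step. This Python model is an assumption-laden illustration, not the sink's actual Java code: the channel is a plain `deque`, `send` is a hypothetical callable that reports whether the server accepted the batch, and the body uses the newline-delimited layout of ElasticSearch's REST bulk endpoint with the index and mapping names that appear elsewhere in this deck.

```python
import json
from collections import deque
from typing import Any, Callable, Deque, Dict, List

Document = Dict[str, Any]


def to_bulk_body(docs: List[Document], index: str, doc_type: str) -> str:
    """Build a newline-delimited bulk body: an action line, then the doc."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"


def drain_once(
    channel: Deque[Document],
    batch_size: int,
    send: Callable[[str], bool],
) -> int:
    """Take up to batch_size events, send them in bulk, commit or roll back."""
    batch = []
    while channel and len(batch) < batch_size:
        batch.append(channel.popleft())
    if not batch:
        return 0
    body = to_bulk_body(batch, "index20130518", "mappingName")
    if send(body):
        return len(batch)  # commit: the events stay removed from the channel
    channel.extendleft(reversed(batch))  # rollback: restore original order
    return 0
```

Calling `drain_once` in a loop reproduces the last bullet: each successful call removes one batch, and a failed call leaves the channel unchanged so the same events are retried.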
  18. Event to ElasticSearch Document
     POST http://localhost:9200/indexName/mappingName
     {
       "httpResponseCode": "500",
       "url": "/company-products/Q3D7F6AD5",
       "ipAddress": "192.168.0.250",
       "browser": "Google Chrome",
       "timestamp": "2013-05-18T13:55:48",
       "body": "Internal Server error while trying to …"
     }
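One way to picture the mapping from an event to the flat document on this slide is to merge the headers and the body into a single object. This is a hedged Python sketch; the function names are hypothetical, and the sink's real serializer may lay the fields out differently.

```python
import json
from typing import Any, Dict

# Hypothetical stand-in for a Flume event: {"headers": {...}, "body": "..."}.
Event = Dict[str, Any]


def event_to_document(event: Event) -> Dict[str, Any]:
    """Flatten headers and body into one JSON-ready document."""
    doc = dict(event["headers"])  # copy so the event itself is not modified
    doc["body"] = event["body"]
    return doc


def index_url(index: str, mapping: str, base: str = "http://localhost:9200") -> str:
    """Build the POST target in the indexName/mappingName form."""
    return f"{base}/{index}/{mapping}"


event = {
    "headers": {"httpResponseCode": "500", "ipAddress": "192.168.0.250"},
    "body": "Internal Server error",
}
payload = json.dumps(event_to_document(event))
```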
  19. Retrieving Events from ES
     Searching on All Indices
     POST http://localhost:9200/_search?pretty=true
     {
       "query": {
         "match_all" : { }
       }
     }
  20. Retrieving Events from ES
     Searching on Multiple Indices
     POST http://localhost:9200/index1,index2/_search?pretty=true
     {
       "query" : {
         "range" : {
           "postDate" : { "from" : "2013-05-15T13:00:00", "to" : "2013-05-18T14:00:00" }
         }
       }
     }
  21. Retrieving Events from ES
     Searching on Specific Indices
     POST http://localhost:9200/index20130518/_search?pretty=true
     {
       "query": {
         "term" : {
           "httpResponseCode" : 500
         }
       }
     }
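The three search requests differ only in the index list in the URL and in the query body, so they can be assembled by one small helper. This is a Python sketch under that assumption; the helper names are hypothetical, and the code only builds the URL and JSON body without sending anything.

```python
import json
from typing import Any, Dict, Sequence, Tuple

BASE_URL = "http://localhost:9200"  # host and port as used on the slides


def term_query(field: str, value: Any) -> Dict[str, Any]:
    """Exact-match query on one field."""
    return {"term": {field: value}}


def range_query(field: str, start: str, end: str) -> Dict[str, Any]:
    """Range query using the from/to keys shown on the slide."""
    return {"range": {field: {"from": start, "to": end}}}


def search_request(
    query: Dict[str, Any], indices: Sequence[str] = ()
) -> Tuple[str, str]:
    """Return (url, body); an empty index list searches all indices."""
    path = ",".join(indices)
    prefix = f"{BASE_URL}/{path}" if path else BASE_URL
    return f"{prefix}/_search?pretty=true", json.dumps({"query": query})
```

For example, `search_request(term_query("httpResponseCode", 500), ["index20130518"])` yields the URL and body of the single-index search, and passing `["index1", "index2"]` with a `range_query` yields the multi-index one.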
  22. Analyzing Query Response from ES
     {
       "hits" : {
         "total" : 1573,
         "hits" : [
           { document 1 },
           { document 2 },
           { document 3 },
           { document n }
         ]
       }
     }
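Analyzing such a response usually means walking the inner `hits` array and tallying a field. The Python sketch below assumes each hit carries its stored document under `_source`, as ElasticSearch search responses do; the function name and the sample data are made up for illustration.

```python
from collections import Counter
from typing import Any, Dict


def count_by_field(response: Dict[str, Any], field: str) -> Counter:
    """Tally the values of one field across the returned documents."""
    hits = response["hits"]["hits"]
    return Counter(hit["_source"][field] for hit in hits)


# Made-up sample shaped like the response outline on the slide.
sample_response = {
    "hits": {
        "total": 3,
        "hits": [
            {"_source": {"httpResponseCode": "500", "url": "/a"}},
            {"_source": {"httpResponseCode": "200", "url": "/b"}},
            {"_source": {"httpResponseCode": "500", "url": "/c"}},
        ],
    }
}
counts = count_by_field(sample_response, "httpResponseCode")
```

Counts of this kind, one number per response code, are the sort of data a frequency chart needs.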
  23. A Picture is Worth 1024 words
     This is where D3.js comes in
  24. A Picture is Worth 1024 words
     This is where D3.js comes in
     [Bar chart: HTTP Response Codes, Frequency for 2013-05-17. Vertical axis from 0 to 100000. Response codes: 200, 302, 400, 401, 403, 500, 504. Frequencies, in the order listed: 88000, 15000, 3500, 3200, 4800, 6400, 5500.]
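A bar chart like this one rests on a linear scale that maps a frequency to a bar height in pixels, which is the job D3's linear scales perform in the browser. The Python sketch below only illustrates that arithmetic; it assumes the codes and frequencies on the slide pair up in the order listed, and the 300-pixel chart height is an arbitrary choice.

```python
from typing import Callable, List, Tuple


def linear_scale(
    domain: Tuple[float, float], output_range: Tuple[float, float]
) -> Callable[[float], float]:
    """Return a function mapping the domain linearly onto the output range."""
    d0, d1 = domain
    r0, r1 = output_range

    def scale(value: float) -> float:
        return r0 + (value - d0) * (r1 - r0) / (d1 - d0)

    return scale


# Values from the slide; pairing them by position is an assumption.
codes = [200, 302, 400, 401, 403, 500, 504]
frequencies = [88000, 15000, 3500, 3200, 4800, 6400, 5500]

# Map the 0..100000 axis onto a hypothetical 300-pixel-tall chart.
bar_height = linear_scale((0, 100000), (0, 300))
bars: List[Tuple[int, float]] = [
    (code, bar_height(freq)) for code, freq in zip(codes, frequencies)
]
```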
  25. Where to get help
     Questions about how to use Flume
     user-subscribe@flume.apache.org
     Questions about Flume API and code
     dev-subscribe@flume.apache.org
     User and Developer Guides
     http://flume.apache.org/documentation.html
  26. Where to get help
     Elastic Search Google Group
     https://groups.google.com/forum/?fromgroups#!forum/elasticsearch
     Elastic Search IRC Channel
     irc://irc.freenode.net/elasticsearch
     User Guide
     http://www.elasticsearch.org/guide/
  27. Where to get help
     Official Website for D3
     D3js.org
     Tutorials
     https://github.com/mbostock/d3/wiki/Tutorials
     IRC Channel
     irc://irc.freenode.net/d3.js
  28. Where to get help
     • Amazon.com
     • SafariBooksOnline.com
     • StackOverflow.com
     • Google.com
     • Bing.com
     • (650) 318-1195
     • BigData@aicer.org
  29. Questions?