Massive Data Aggregation Processing, Monitoring and Visualization with Apache Flume, ElasticSearch and D3.js

Massive Data Aggregation, Processing, Monitoring and Visualization with Apache Flume, ElasticSearch and D3.js
Part I
Israel Ekpo

About the Presenter
(650) 318-1195 * [email protected]
• Father, Husband, Son and Brother
• Computer Scientist
• Big Data Enthusiast
• Data Science Practitioner
• Contributor to Open Source Projects
• Loves to learn

About the Tools
• Apache Flume (NG) – Data Aggregation
• ElasticSearch – Full-Text Search
• HDFS – Distributed File System
• D3.js – Data Visualization

Sources of Data
• Application-Generated Data
• Network Traffic
• Social Media: Twitter, Google+, Facebook
• Email Sources: Mailing List Subscriptions

Summary of Flume Architecture

Key Concepts
• Event – Basic unit of data
• Source – Receives events into Flume
• Channel – Buffers events for pickup later
• Sink – Picks up events from channel
• Source Interceptors
• Channel Selectors (Replicating/Multiplexing)

Anatomy of an Event
An event is the basic unit of data within Flume: a map of headers plus a body.
{
  headers: {
    "nameOfHeader1": "valueOfHeader1",
    "nameOfHeader2": "valueOfHeader2",
    "nameOfHeader3": "valueOfHeader3",
    "nameOfHeader4": "valueOfHeader4",
    "nameOfHeader5": "valueOfHeader5"
  },
  body: "This is the body of the event"
}
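
A minimal Python sketch of the event structure above, assuming the event is handed to Flume's HTTP Source with its default JSON handler, which accepts a JSON array of events whose headers are a string-to-string map. The helper name make_event and the header names are illustrative, not part of Flume.

```python
import json
import time


def make_event(body, **headers):
    """Build one event: a map of string headers plus a body (hypothetical helper)."""
    return {"headers": {k: str(v) for k, v in headers.items()}, "body": body}


# Header names here are illustrative; Flume does not require them.
event = make_event(
    "This is the body of the event",
    host="web01.example.com",
    timestamp=int(time.time() * 1000),  # epoch milliseconds
)

# The JSON handler expects a list of events, even when sending only one.
payload = json.dumps([event])
print(payload)
```

The payload is only built here, not sent; posting it to a running HTTP Source is left out so the sketch runs without a Flume agent.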

Source Interceptors
Used to modify or drop events in flight.
• Timestamp Interceptor
• Host Interceptor
• Static Interceptor
• Regex Filtering Interceptor
• Regex Extractor Interceptor
• Custom Interceptors (of course)

Custom Interceptors
org.apache.flume.interceptor.Interceptor
  void initialize();
  Event intercept(Event event);
  List<Event> intercept(List<Event> events);
  void close();
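
The four-method contract above can be modeled in a few lines. This is a Python analogue for illustration only, not Flume's Java API: the class name, the dict-based event, and the method name intercept_batch (standing in for the List<Event> overload) are assumptions. It drops events whose body does not match a pattern and stamps the survivors with a timestamp header.

```python
import re
import time


class RegexFilteringInterceptor:
    """Python analogue of the interceptor contract (not Flume's Java API)."""

    def __init__(self, pattern):
        self.pattern = pattern
        self.regex = None

    def initialize(self):
        # One-time setup before any event is processed.
        self.regex = re.compile(self.pattern)

    def intercept(self, event):
        # Return None to drop the event; otherwise return the (possibly modified) event.
        if self.regex.search(event["body"]) is None:
            return None
        event["headers"].setdefault("timestamp", str(int(time.time() * 1000)))
        return event

    def intercept_batch(self, events):
        # Batch form: apply intercept() to each event and keep the survivors.
        kept = (self.intercept(e) for e in events)
        return [e for e in kept if e is not None]

    def close(self):
        # Release resources; this sketch only clears the compiled pattern.
        self.regex = None


interceptor = RegexFilteringInterceptor(r"ERROR|WARN")
interceptor.initialize()
batch = [
    {"headers": {}, "body": "ERROR disk full"},
    {"headers": {}, "body": "INFO heartbeat"},
]
result = interceptor.intercept_batch(batch)
interceptor.close()
print([e["body"] for e in result])
```

A real custom interceptor would be written in Java against the interface listed on the slide.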

Channel Selectors
• Replicating Selector – duplicates a single event to one or more channels.
• Multiplexing Selector – selects which channels to route an event to, depending on values in the event header.
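
The difference between the two selectors can be sketched as two small routing functions. This is a conceptual Python model, not Flume code; the function names, channel names, and the header used for routing are assumptions chosen for illustration.

```python
def replicating_selector(event, channels):
    """Every configured channel receives a copy of the event."""
    return list(channels)


def multiplexing_selector(event, header, mapping, default):
    """Pick channels by the value of one header; fall back to a default list."""
    return mapping.get(event["headers"].get(header), default)


channels = ["memChannel", "fileChannel"]  # hypothetical channel names
mapping = {"500": ["fileChannel"], "200": ["memChannel"]}

event = {"headers": {"httpResponseCode": "500"}, "body": "Internal Server Error"}

print(replicating_selector(event, channels))
print(multiplexing_selector(event, "httpResponseCode", mapping, ["memChannel"]))
```

Here the server error is routed only to the persistent channel, while a replicating selector would have copied it to both.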

Data Ingestion: Flume Sources
• HTTP Source
• Avro Source
• Spooling Directory Source
• Exec Source
• NetCat Source
• Syslog Source (TCP and UDP)
• Thrift Source
• Scribe Source

Data Ingestion: Flume Sources
Custom Sources
If the built-in sources that ship with Flume cannot satisfy your needs, you can create a custom Flume source that takes in data in the format you want and forwards it to the next phase (channels).

Channels: Buffering/Storage
• Memory Channel – Volatile, Faster
• File Channel – Persistent, Reliable
A channel stores events received from sources until they are ready to be drained by a sink.

Channels: Buffering/Storage
Custom Channels
If you have needs that cannot be met by the built-in channels that ship with Flume, you can also create your own custom channel.

Sinks: Storage
A sink drains events from the channels to a centralized data store:
• ElasticSearch Sink
• HDFS Sink
• Avro Sink
• IRC Sink
• HBase Sink
• AsyncHBase Sink

Sinks: Storage
Custom Sinks
If you have different needs, you can also roll out your own custom sinks to storage endpoints like:
• Apache Solr
• CouchBase
• MongoDB
• Neo4j

ElasticSearch Sink
• Retrieves events from the channel.
• Serializes each event into an ElasticSearch document.
• Documents are buffered and sent as a batch.
• Sends documents in bulk to the ElasticSearch server.
• Commits the channel transaction if successful.
• Repeats retrieval of new events from the channel.
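
The sink cycle listed above can be sketched as follows. This is a conceptual Python model, not the Flume sink implementation: the channel is a plain deque, the commit and rollback are simulated by removing or restoring events, and the function names are hypothetical. The body is serialized in the newline-delimited layout of the Elasticsearch bulk API; the `_type` field corresponds to the mapping name used on these slides and only applies to Elasticsearch versions that support mapping types.

```python
import json
from collections import deque


def to_bulk_body(events, index, doc_type):
    """Serialize events into the newline-delimited format of the bulk API."""
    lines = []
    for event in events:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        doc = dict(event["headers"], body=event["body"])
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


def drain_once(channel, send, batch_size, index, doc_type):
    """Take up to batch_size events, send them, and restore them on failure."""
    batch = [channel.popleft() for _ in range(min(batch_size, len(channel)))]
    if not batch:
        return 0
    try:
        send(to_bulk_body(batch, index, doc_type))
    except Exception:
        channel.extendleft(reversed(batch))  # "rollback": events stay in the channel
        raise
    return len(batch)  # "commit": events are gone from the channel


channel = deque(
    {"headers": {"httpResponseCode": "500"}, "body": f"error {i}"} for i in range(5)
)
sent = []  # stands in for the HTTP call to the server
drained = drain_once(channel, sent.append, 3, "index20130518", "mappingName")
print(drained, len(channel))
```

Passing the sender in as a function keeps the sketch runnable without a server and makes the failure path easy to exercise.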

Event to ElasticSearch Document
POST http://localhost:9200/indexName/mappingName
{
  "httpResponseCode": "500",
  "url": "/company-products/Q3D7F6AD5",
  "ipAddress": "192.168.0.250",
  "browser": "Google Chrome",
  "timestamp": "2013-05-18T13:55:48",
  "body": "Internal Server error while trying to …"
}

Retrieving Events from ES
Searching on All Indices
POST http://localhost:9200/_search?pretty=true
{ "query": { "match_all": { } } }

Retrieving Events from ES
Searching on Multiple Indices
POST http://localhost:9200/index1,index2/_search?pretty=true
{
  "query": {
    "range": {
      "postDate": {
        "from": "2013-05-15T13:00:00",
        "to": "2013-05-18T14:00:00"
      }
    }
  }
}

Retrieving Events from ES
Searching on Specific Indices
POST http://localhost:9200/index20130518/_search?pretty=true
{ "query": { "term": { "httpResponseCode": 500 } } }
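
The three search variants differ only in the index portion of the URL and in the query body. A minimal Python sketch, using only the standard library, that builds (but does not send) each request; the helper name search_request is hypothetical, and the host and index names are the ones used on the slides.

```python
import json
from urllib.request import Request


def search_request(query, indices=(), host="http://localhost:9200"):
    """Build (but do not send) a search request for zero or more indices."""
    path = ",".join(indices)
    url = f"{host}/{path}/_search?pretty=true" if path else f"{host}/_search?pretty=true"
    return Request(
        url,
        data=json.dumps({"query": query}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# match_all is the documented snake_case name of the match-all query.
all_indices = search_request({"match_all": {}})
one_index = search_request({"term": {"httpResponseCode": 500}}, ["index20130518"])
two_indices = search_request(
    {"range": {"postDate": {"from": "2013-05-15T13:00:00", "to": "2013-05-18T14:00:00"}}},
    ["index1", "index2"],
)
print(one_index.full_url)
# To send a request against a running server: urllib.request.urlopen(one_index).read()
```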

Analyzing Query Response from ES
{
  hits: {
    total: 1573,
    hits: [ { document 1 }, { document 2 }, { document 3 }, { document n } ]
  }
}
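
One way to turn such a response into chart-ready data is to count the returned documents by field value. A minimal Python sketch, assuming each hit carries its document under `_source`; the sample response, the helper name count_by_field, and the output record shape are illustrative. Note that `hits.total` counts all matches, while `hits.hits` holds only the documents returned in this page of results.

```python
import json
from collections import Counter

# Hypothetical response, trimmed to the fields used here.
response = {
    "hits": {
        "total": 3,
        "hits": [
            {"_source": {"httpResponseCode": "500"}},
            {"_source": {"httpResponseCode": "200"}},
            {"_source": {"httpResponseCode": "500"}},
        ],
    }
}


def count_by_field(response, field):
    """Count how often each value of `field` appears among the returned hits."""
    return Counter(hit["_source"][field] for hit in response["hits"]["hits"])


counts = count_by_field(response, "httpResponseCode")
# A list of {code, frequency} records is a convenient shape for a bar chart.
chart_data = [{"code": code, "frequency": n} for code, n in sorted(counts.items())]
print(json.dumps(chart_data))
```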

A Picture is Worth 1024 Words
This is where D3.js comes in.

[Bar chart: HTTP Response Codes Frequency for 2013-05-17. Bars in order for codes 200, 302, 400, 401, 403, 500, 504 with frequencies 88000, 15000, 3500, 3200, 4800, 6400, 5500; vertical axis from 0 to 100000.]

Where to get help
Questions about how to use Flume
user-subscribe@flume.apache.org
Questions about Flume API and code
dev-subscribe@flume.apache.org
User and Developer Guides
http://flume.apache.org/documentation.html

Where to get help
Elastic Search Google Group
https://groups.google.com/forum/?fromgroups#!forum/elasticsearch
Elastic Search IRC Channel
irc://irc.freenode.net/elasticsearch
User Guide
http://www.elasticsearch.org/guide/

Where to get help
Official Website for D3
D3js.org
Tutorials
https://github.com/mbostock/d3/wiki/Tutorials
IRC Channel
irc://irc.freenode.net/d3.js

Where to get help
• Amazon.com
• SafariBooksOnline.com
• StackOverflow.com
• Google.com
• Bing.com
• (650) 318-1195
• [email protected]

Questions?
