Big Data Journey - Speaker Deck

Slide 1

Slide 1 text

Slide 2

Slide 2 text

YARN

Slide 3

Slide 3 text

Slide 4

Slide 4 text

YARN

Slide 5

Slide 5 text

WHY?

Slide 6

Slide 6 text

https://www.domo.com/

Slide 7

Slide 7 text

Building new applications

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

Can I use my existing tools?

Slide 10

Slide 10 text

(Big) Data Platform (Big) Data Project

Slide 11

Slide 11 text

Ingest Store Process Consume

Slide 12

Slide 12 text

Ingest Data

Slide 13

Slide 13 text

Copy files in HDFS

hadoop fs -put dailylogs-log.zip /logs/2015/09/10/
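The date-partitioned target directory in the command above (/logs/2015/09/10/) can be derived from the ingest date. A minimal sketch, assuming the `hadoop` client is on the PATH; the helper name `hdfs_put_command` and the `root` parameter are illustrative and not from the slides:

```python
from datetime import date


def hdfs_put_command(local_file, day, root="/logs"):
    """Build the `hadoop fs -put` argument list for a date-partitioned directory.

    `hdfs_put_command` and `root` are hypothetical names used for illustration.
    """
    target_dir = f"{root}/{day:%Y/%m/%d}/"
    return ["hadoop", "fs", "-put", local_file, target_dir]


cmd = hdfs_put_command("dailylogs-log.zip", date(2015, 9, 10))
print(" ".join(cmd))

# To actually run the copy (requires a Hadoop client):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.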

Slide 14

Slide 14 text

Import RDBMS data

sqoop import --connect jdbc:mysql://db.foo.com/somedb --table customers \
    --target-dir /incremental_dataset --append

Files HBase Hive

Slide 15

Slide 15 text

Import RDBMS data

input {
  jdbc {
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/mydb"
    jdbc_user => "postgres"
    jdbc_driver_library => "/path/to/postgresql-9.4-1201.jdbc41.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    statement => "SELECT * from contacts"
  }
}

Slide 16

Slide 16 text

What’s “wrong”? Batch????

Slide 17

Slide 17 text

Streaming Flume, Kafka, Logstash to the rescue

Slide 18

Slide 18 text

Log App Events Twitter Sensors … HDFS MapR-FS Alerts Elasticsearch … DB

Slide 19

Slide 19 text

Log App Events Twitter Sensors … HDFS MapR-FS Alerts Elasticsearch … DB Broker Producers Consumers
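The producer / broker / consumer arrangement on this slide can be illustrated with a toy in-memory broker. This is only a sketch of the idea (an append-only log per topic, with each consumer tracking its own read position); the `Broker` class and the topic and consumer names are hypothetical, and this is not the Kafka API:

```python
from collections import defaultdict


class Broker:
    """Toy in-memory broker: one append-only log per topic, one offset per consumer."""

    def __init__(self):
        self.logs = defaultdict(list)    # topic -> list of messages
        self.offsets = defaultdict(int)  # (topic, consumer) -> next index to read

    def publish(self, topic, message):
        """Producers append messages to the end of the topic log."""
        self.logs[topic].append(message)

    def poll(self, topic, consumer):
        """Return messages this consumer has not seen yet and advance its offset."""
        start = self.offsets[(topic, consumer)]
        batch = self.logs[topic][start:]
        self.offsets[(topic, consumer)] = len(self.logs[topic])
        return batch


broker = Broker()
broker.publish("events", {"source": "sensor", "value": 42})
broker.publish("events", {"source": "app", "value": 7})

# Two independent consumers each receive the full stream.
hdfs_batch = broker.poll("events", "hdfs-writer")
alert_batch = broker.poll("events", "alerts")

# A later message is delivered only once to a consumer that is already caught up.
broker.publish("events", {"source": "log", "value": 1})
hdfs_next = broker.poll("events", "hdfs-writer")
```

Because the broker keeps the log and the consumers only keep offsets, new sinks can be added without touching the producers.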

Slide 20

Slide 20 text

Stream data into Hadoop using Flume Server Files HBase Hive Server Server Server

Slide 21

Slide 21 text

Streams using Kafka Files HBase Hive Producer Producer Producer Consumer Consumer Consumer Alert

Slide 22

Slide 22 text

Stream data using Logstash

Slide 23

Slide 23 text

Data Storage Data Format

Slide 24

Slide 24 text

How to store your data?
• Files in a distributed file system
• Rows in NoSQL Table
• Index in Search Engine
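The three storage options listed above can be compared by storing the same record three ways. A minimal sketch using plain Python structures as stand-ins for a file system, a NoSQL table, and a search index; the record and its field names are invented for illustration:

```python
import json
from collections import defaultdict

# Hypothetical record used for illustration.
record = {"id": "u42", "name": "Ada Lovelace", "city": "London"}

# 1. File in a distributed file system: one serialized record per line.
file_line = json.dumps(record, sort_keys=True)

# 2. Row in a NoSQL table: row key -> columns, fast lookup by key.
table = {record["id"]: {"name": record["name"], "city": record["city"]}}

# 3. Index in a search engine: term -> ids of the documents containing it.
inverted_index = defaultdict(set)
for field in ("name", "city"):
    for term in record[field].lower().split():
        inverted_index[term].add(record["id"])
```

The file form suits bulk scans, the keyed row suits point lookups by id, and the inverted index suits searching by content.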

Slide 25

Slide 25 text

Process Data

Slide 26

Slide 26 text

Data Processing
• Transform the data
• Enrich the data
• Examples:
  • Store data in multiple formats
  • Aggregate data
  • Build Recommendations
  • …
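The transform, enrich, and aggregate steps above can be sketched on a handful of log lines. The log format, the `user_country` lookup table, and the function names are assumptions made for this example:

```python
from collections import Counter

# Hypothetical raw log lines: day,user,page
raw_events = [
    "2015-09-10,u1,/home",
    "2015-09-10,u2,/cart",
    "2015-09-10,u1,/cart",
]
user_country = {"u1": "FR", "u2": "US"}  # hypothetical reference data


def transform(line):
    """Parse a raw CSV line into a structured event."""
    day, user, page = line.split(",")
    return {"day": day, "user": user, "page": page}


def enrich(event):
    """Add the user's country from the reference data."""
    return {**event, "country": user_country.get(event["user"], "unknown")}


events = [enrich(transform(line)) for line in raw_events]

# Aggregate: page views per country.
views_per_country = Counter(event["country"] for event in events)
```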

Slide 27

Slide 27 text

MapReduce Processing Model
• Define mappers
• Shuffling is automatic
• Define reducers
• For complex work, chain jobs together
  – Use a higher-level language or DSL that does this for you
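The model above can be sketched in a single process with a word count. This is an illustration of the three phases only, not Hadoop code: in a real cluster the framework performs the shuffle, which is written out here as a hypothetical `shuffle` function:

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle phase (automatic in Hadoop): group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce phase: collapse the values for one key into a single result."""
    return key, sum(values)


def map_reduce(lines):
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())


counts = map_reduce(["big data journey", "big data platform"])
```

Only `mapper` and `reducer` carry application logic, which is why the slide lists them as the parts to define.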

Slide 28

Slide 28 text

Apache Spark: Fast Big Data
  – Rich APIs in Java, Scala, Python
  – Interactive shell
• Fast to Run
  – General execution graphs
  – In-memory storage

Slide 29

Slide 29 text

Spark: Unified Platform
Spark SQL
Spark Streaming (Streaming)
MLlib (Machine learning)
Spark (General execution engine)
GraphX (Graph computation)
Mesos
Distributed File System (HDFS, MapR-FS, S3, …)
Hadoop YARN

Slide 30

Slide 30 text

Elasticsearch / Watcher

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

Query the data

Slide 33

Slide 33 text

Files HBase Hive Index Discovery/Analytics

Slide 34

Slide 34 text

SQL strikes back!

Slide 35

Slide 35 text

Files HBase Hive

SQL on Hadoop
• SQL Shell
• JDBC / ODBC
• BI Tools
• Reporting
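From the client's point of view, SQL access looks the same whatever engine sits underneath: open a connection, send a query, read rows. A minimal sketch of that pattern, using Python's built-in `sqlite3` with an in-memory database purely as a stand-in for a SQL-on-Hadoop engine reached over JDBC or ODBC; the `ratings` table and its contents are invented for illustration:

```python
import sqlite3

# In-memory SQLite stands in for the SQL engine; a BI tool would instead
# connect to the cluster through a JDBC or ODBC driver.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (user_id TEXT, movie TEXT, rating INTEGER)")
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [("u1", "Alien", 5), ("u2", "Alien", 4), ("u1", "Brazil", 3)],
)

# A typical reporting query: average rating and number of ratings per movie.
rows = conn.execute(
    "SELECT movie, AVG(rating) AS avg_rating, COUNT(*) AS n "
    "FROM ratings GROUP BY movie ORDER BY movie"
).fetchall()
conn.close()
```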

Slide 36

Slide 36 text

Elasticsearch

Slide 37

Slide 37 text

Kibana as a frontend

Slide 38

Slide 38 text

Example: Recommendation Platform

Slide 39

Slide 39 text

Machine Learning MapR Cluster HBase MapR DB MapR-FS Add recommendations to movies Capture Ratings Movies & Recommendations Movie Database

Slide 40

Slide 40 text

Conclusion
• If possible, use streams: Kafka, Logstash
• Advanced data processing and machine learning: Spark
• Expose your data using SQL for your “BI folks”: Drill
• Aggregation and full-text search: Elasticsearch
• Data visualisation: Kibana