Big Data Journey

Big Data Journey with Hadoop & MapR
Tug Grall [email protected] @tgrall
© 2015 MapR Technologies

Big Data Journey
Tug Grall [email protected] @tgrall
David Pilato [email protected] @dadoonet

https://www.domo.com/

Building new applications

Can I use my existing tools?

(Big) Data Platform (Big) Data Project

Ingest Store Process Consume

Ingest Data

Copy files in HDFS
    hadoop fs -put dailylogs-log.zip /logs/2015/09/10/
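The target directory in the command above encodes the date of the logs. As a minimal sketch, a dated path and the matching command line could be assembled like this. The helper names `dated_log_dir` and `build_put_command` are hypothetical, and the code only builds the argument list; it does not call `hadoop`.

```python
from datetime import date
from typing import List


def dated_log_dir(day: date, root: str = "/logs") -> str:
    """Return a year/month/day directory such as /logs/2015/09/10/."""
    return f"{root}/{day:%Y/%m/%d}/"


def build_put_command(local_file: str, day: date) -> List[str]:
    """Build (but do not run) the `hadoop fs -put` argument list."""
    return ["hadoop", "fs", "-put", local_file, dated_log_dir(day)]


command = build_put_command("dailylogs-log.zip", date(2015, 9, 10))
print(" ".join(command))
```

On a machine with a Hadoop client installed, the resulting list could be handed to `subprocess.run`; a scheduled job doing this once a day is exactly the batch style of ingestion.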

Import RDBMS data
    sqoop import --connect jdbc:mysql://db.foo.com/somedb --table customers \
        --target-dir /incremental_dataset --append
Targets: Files, HBase, Hive

Import RDBMS data
    input {
      jdbc {
        jdbc_connection_string => "jdbc:postgresql://localhost:5432/mydb"
        jdbc_user => "postgres"
        jdbc_driver_library => "/path/to/postgresql-9.4-1201.jdbc41.jar"
        jdbc_driver_class => "org.postgresql.Driver"
        statement => "SELECT * from contacts"
      }
    }

What’s “wrong”? Batch????

Streaming
Flume, Kafka, Logstash to the rescue

[Diagram: Log App Events Twitter Sensors … → HDFS MapR-FS Alerts Elasticsearch … DB]

[Diagram: Producers (Log App Events Twitter Sensors …) → Broker → Consumers (HDFS MapR-FS Alerts Elasticsearch … DB)]

Stream data into Hadoop using Flume
[Diagram: Server, Server, Server, Server → Files, HBase, Hive]

Streams using Kafka
[Diagram: Producer, Producer, Producer → Consumer, Consumer, Consumer → Files, HBase, Hive, Alert]
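The producer/consumer picture can be sketched with a toy in-memory broker. This is not the Kafka API: the class `InMemoryBroker`, its methods, and the topic and consumer names are all hypothetical. It only illustrates the idea that messages are appended to a log and that each consumer keeps its own read position, so one consumer can write to storage while another raises alerts.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class InMemoryBroker:
    """Toy append-only log per topic; NOT the Kafka API, just the idea."""

    def __init__(self) -> None:
        self._topics: Dict[str, List[str]] = defaultdict(list)
        # Each (consumer, topic) pair tracks its own read position.
        self._offsets: Dict[Tuple[str, str], int] = defaultdict(int)

    def publish(self, topic: str, message: str) -> None:
        """Producers append; messages stay in the log for every consumer."""
        self._topics[topic].append(message)

    def poll(self, consumer: str, topic: str) -> List[str]:
        """Return messages this consumer has not seen and advance its offset."""
        key = (consumer, topic)
        log = self._topics[topic]
        unread = log[self._offsets[key]:]
        self._offsets[key] = len(log)
        return unread


broker = InMemoryBroker()
broker.publish("logs", "GET /index.html 200")
broker.publish("logs", "GET /missing 404")

# Two independent consumers read the same stream for different purposes.
to_storage = broker.poll("hdfs-writer", "logs")
alerts = [m for m in broker.poll("alerter", "logs") if m.endswith("404")]
print(to_storage)
print(alerts)
```

Because the offsets are per consumer, adding a new consumer later does not require any change on the producer side.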

Stream data using Logstash

Data Storage Data Format

How to store your data?
• Files in a distributed file system
• Rows in NoSQL Table
• Index in Search Engine
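To make the three storage options concrete, here is a minimal sketch that keeps one record in each shape using plain Python structures. The record itself and the variable names are invented for illustration; real systems (a distributed file system, a NoSQL table, a search engine) add distribution and persistence on top of these shapes.

```python
import json
from collections import defaultdict

# Hypothetical record, invented for this sketch.
record = {"id": "m1", "title": "Big Data Journey", "year": 2015}

# 1. File: one JSON document per line, as it could be appended to a file.
file_line = json.dumps(record, sort_keys=True)

# 2. NoSQL table: the row key gives direct access to the row's columns.
table = {record["id"]: {"title": record["title"], "year": record["year"]}}

# 3. Search index: an inverted index maps each word to the matching ids.
index = defaultdict(set)
for word in record["title"].lower().split():
    index[word].add(record["id"])

print(file_line)
print(table["m1"]["year"])
print(sorted(index["data"]))
```

The file form suits full scans, the keyed row suits lookups by id, and the inverted index suits queries by content.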

Process Data

Data Processing
• Transform the data
• Enrich the data
• Examples:
  • Store data in multiple formats
  • Aggregate data
  • Build Recommendations
  • ….
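The transform, enrich, and aggregate steps listed above can be sketched in a few lines. The events, the lookup table, and the function names are made up for this example; a real pipeline would read from the ingested data instead.

```python
from collections import Counter

# Hypothetical raw events and lookup table, invented for this sketch.
raw_events = ["u1,FR,12.50", "u2,US,3.00", "u1,FR,7.50"]
country_names = {"FR": "France", "US": "United States"}


def transform(line):
    """Parse a CSV line into a typed record."""
    user, country_code, amount = line.split(",")
    return {"user": user, "country_code": country_code, "amount": float(amount)}


def enrich(event):
    """Add the country name from the lookup table."""
    name = country_names.get(event["country_code"], "unknown")
    return {**event, "country": name}


events = [enrich(transform(line)) for line in raw_events]

# Aggregate: total amount per country.
totals = Counter()
for event in events:
    totals[event["country"]] += event["amount"]

print(dict(totals))
```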

MapReduce Processing Model
• Define mappers
• Shuffling is automatic
• Define reducers
• For complex work, chain jobs together
  – Use a higher level language or DSL that does this for you
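The model can be illustrated with a word count written in plain Python. This is a single-process sketch, not Hadoop code: the `shuffle` function stands in for the grouping step that the framework performs automatically, and all function names are chosen for this example.

```python
from collections import defaultdict


def mapper(line):
    """Emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key; a framework does this step automatically."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce over the input lines."""
    mapped = (pair for line in lines for pair in mapper(line))
    return dict(reducer(k, v) for k, v in shuffle(mapped).items())


counts = run_job(["big data journey", "big data platform"])
print(counts)
```

Chaining jobs means feeding the output of one such job into the mapper of the next, which is the plumbing that higher-level languages and DSLs generate for you.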

Apache Spark: Fast Big Data
  – Rich APIs in Java, Scala, Python
  – Interactive shell
• Fast to Run
  – General execution graphs
  – In-memory storage

Spark: Unified Platform
• Libraries: Spark SQL, Spark Streaming (Streaming), MLlib (Machine learning), GraphX (Graph computation)
• Spark (General execution engine)
• Cluster managers: Mesos, Hadoop YARN
• Distributed File System (HDFS, MapR-FS, S3, …)

Elasticsearch / Watcher

Query the data

[Diagram: Files, HBase, Hive, Index → Discovery/Analytics]

SQL strikes back!

SQL on Hadoop
Data sources: Files, HBase, Hive
• SQL Shell
• JDBC / ODBC
• BI Tools
• Reporting
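As an illustration of the kind of query a BI or reporting tool would send, here is a small aggregation in SQL. SQLite (via Python's standard `sqlite3` module) is used purely as a stand-in so the example runs anywhere; it is not a SQL-on-Hadoop engine, and the `ratings` table and its rows are invented for this sketch.

```python
import sqlite3

# SQLite stands in for a SQL-on-Hadoop engine; table and rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (movie TEXT, rating INTEGER)")
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?)",
    [("Movie A", 5), ("Movie A", 3), ("Movie B", 4)],
)

# A typical reporting query: average rating and number of votes per movie.
rows = conn.execute(
    "SELECT movie, AVG(rating) AS avg_rating, COUNT(*) AS votes "
    "FROM ratings GROUP BY movie ORDER BY movie"
).fetchall()
conn.close()
print(rows)
```

The point of SQL on Hadoop is that the same statement could be issued from a SQL shell or through JDBC/ODBC against data that lives in files, HBase, or Hive.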

Elasticsearch

Kibana as a frontend

Example: Recommendation Platform

[Diagram labels: Movie Database; Capture Ratings; MapR Cluster (HBase / MapR DB, MapR-FS); Machine Learning; Add recommendations to movies; Movies & Recommendations]

Conclusion
• If possible, use streams: Kafka, Logstash
• Advanced data processing and machine learning: Spark
• Expose your data using SQL for your “BI folks”: Drill
• Aggregation and full-text search: Elasticsearch
• Data visualisation: Kibana
