
Big Data for the Rest of us


OSBC 2014: slides from the session

Marcus Ross

May 06, 2014
Transcript

  1. Big Data for the Rest of Us: Understanding the Emerging Hadoop Ecosystem. Marcus Ross + Peter Dickten
  2. Who are we? The crazy Germans from OSBC 2012 are back ☺
     – Peter Dickten (@pe_d)
       • CEO of a development company (DCS)
       • Lots of development for market research
     – Marcus Ross (@zahlenhelfer)
       • Trainer + consultant for database systems / BI
  3. Today's journey
     • What is Big Data?
     • What is Hadoop and how does it work?
     • Ecosystem
       – Hadoop add-ons / tools
       – Services and infrastructure
     • Summary
  4. "Big  Data"?   •  2012-­‐?  are  the  years  of  

    Big  Data   •  Thanks  to  markeXng   almost  everything  is   now  a  big  data  soluXon  
  5. What's the Data in Big Data?
     • All kinds of structured information, e.g.
       – Web server log files
       – Flight data
       – Purchase data
     • Most of the time: simply structured
     • But: an insane amount of records ("big")
  6. What do you mean by "big"? Much more information than a single system can efficiently store/process.
     • "Efficiently" (time) depends on the use case (e.g. real-time analysis for fraud detection)
     • "Efficiently" (cost): a database system for petabytes of data could get extremely expensive
  7. Basic idea: use thousands of cheap computers with cheap storage instead of one/few expensive supercomputers.
     ...and yes, that sounds like Google:
     • Map/Reduce (US Patent 7,650,331)
     • GFS (closed source)
  8. Downsides of 1,000 cheap computers
     • MTBF: from ~50 years down to ~21 days => store data redundantly (e.g. in 3 different places)
     • Splitting a computation into 1,000 computations can be difficult
     • Only works if every machine only needs a small subset of the data ("data locality") to do its job
     MTBF (mean time between failures) = average lifetime, e.g. 500K hours for a hard disk. With 1,000 machines, the expected time until some disk fails drops to 500,000 h / 1,000 = 500 h ≈ 21 days.
  9. Worst problem of all: transport times

     Activity                               Time in ns
     Reading from L1 cache                         0.5
     Reading from memory                           100
     Reading 4K randomly from SSD (*)          150,000
     Round trip within same datacenter         500,000
     Send packet CA -> Netherlands -> CA   150,000,000

     Source: http://norvig.com/21-days.html#answers
     (*) assuming 1 GB/s reads
     => Data retrieval from other machines will KILL performance
  10. The  "magic"  of  Hadoop  (simplified)   •  Hadoop  will  store

     data  mulXple   Xmes  (HDFS)  and  keep  track  of   failing  machines   •  Hadoop  will  split  up  work  to  many   machines  (map/reduce)   •  Hadoop  delegates  work  to   machines  which  already  have  the   data  needed  for  the  job  
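     To make the HDFS side concrete (not from the slides): a minimal Java sketch of a client copying a file into the cluster and requesting threefold replication. The file names and replication factor are invented; the calls are the standard org.apache.hadoop.fs API.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class HdfsExample {
            public static void main(String[] args) throws Exception {
                // Picks up fs.defaultFS etc. from core-site.xml on the classpath
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);

                // Copy a local file into the cluster; HDFS splits it into blocks
                Path src = new Path("server.log");        // hypothetical local file
                Path dst = new Path("/logs/server.log");  // hypothetical HDFS path
                fs.copyFromLocalFile(src, dst);

                // Ask for 3 copies of each block (the redundancy from slide 8)
                fs.setReplication(dst, (short) 3);
                fs.close();
            }
        }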
  11. What Hadoop is not
     • A drop-in replacement for SQL
     • A chart/report generator
     • An Excel add-in
     • A plug'n'play BI suite
     • An SAP data warehouse
  12. Example: splitting up work
     • Question: what is the most used word in "Hamlet" by Shakespeare? (Hint: it's "the", at 3.73%)
     • Hadoop distributes a different portion (e.g. page/sentence) of the text to each machine (distribute)
     • Every machine counts the words in its portion (map)
     • Hadoop combines the results of the machines and picks the highest count (reduce)
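     A sketch of this word count as actual map/reduce code in Java, using the standard org.apache.hadoop.mapreduce API (class names are ours; tokenization is deliberately simplified):

        import java.io.IOException;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;

        // Map: emit (word, 1) for every word in this machine's portion of the text
        public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                for (String token : line.toString().toLowerCase().split("\\W+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce: sum the per-machine counts for each word
        class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) sum += c.get();
                context.write(word, new IntWritable(sum));
            }
        }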
  13. Real-world use case: energy discovery. Chevron uses Hadoop to sort and process data from ships that troll the ocean collecting seismic data that might signify the presence of oil reserves.
  14. Real-world use case: IT security. ipTrust uses Hadoop to assign reputation scores to IP addresses, which lets other security products decide whether to accept traffic from those IP addresses or not.
  15. This looks tedious... is there an app for that? Lots of them... prepare for the ride:
     • Pig
     • Hive
     • Flume
     • Mahout
     • ZooKeeper
     • Sqoop
     • HBase
     • Ambari
  16. Apache Pig – the Data Omnivore
     • Developed by Yahoo, now an Apache project
     • The "Pig Latin" language is translated to map/reduce
     • Basic idea: "data flow". The data is transformed step by step using built-in/self-written steps (e.g. filter, group-by, join, foreach...). The output of each step can be used as input for the next step (using variables)
     • Similar to interactive shells / read-eval-print-loop tools
  17. Apache Pig – example

     log  = LOAD 'server.log' AS (user, time, query);
     grpd = GROUP log BY user;
     cntd = FOREACH grpd GENERATE group, COUNT(log);
     STORE cntd INTO 'output.txt';

     Computation only starts once the result is requested by STORE (or other commands like DUMP).
  18. Hive – select * from Hadoop
     • Developed by Facebook, now Apache
     • An SQL-like query language (HiveQL) for Hadoop
     • Can be extended with map/reduce code written in Java
     • Similar to SQL (but not intended for interactive ad-hoc queries)
  19. Hive – example

     LOAD DATA LOCAL INPATH "cities.txt" OVERWRITE INTO TABLE cite;
     SELECT * FROM cite LIMIT 10;
     INSERT OVERWRITE TABLE cite_count SELECT cited, COUNT(citing) FROM cite GROUP BY cited;
     SELECT * FROM cite_count WHERE count > 10;
  20. Apache Mahout: a collection of machine learning algorithms, (mostly) based on Hadoop, focused on
     • Collaborative filtering (recommendation)
     • Clustering (grouping of entities based on similar characteristics)
     • Classification (sorting entities into pre-existing groups)
     https://twitter.com/JulianHi/status/457668218753392642/photo/1
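     For illustration only: a minimal user-based recommender with Mahout's Taste API. The file ratings.csv and the user ID are hypothetical; each line would hold userID,itemID,rating.

        import java.io.File;
        import java.util.List;
        import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
        import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
        import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
        import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
        import org.apache.mahout.cf.taste.model.DataModel;
        import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
        import org.apache.mahout.cf.taste.recommender.RecommendedItem;
        import org.apache.mahout.cf.taste.recommender.Recommender;
        import org.apache.mahout.cf.taste.similarity.UserSimilarity;

        public class RecommenderExample {
            public static void main(String[] args) throws Exception {
                // ratings.csv: one "userID,itemID,rating" line per preference (hypothetical file)
                DataModel model = new FileDataModel(new File("ratings.csv"));
                UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
                UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
                Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

                // Top 3 recommendations for (made-up) user 42
                List<RecommendedItem> items = recommender.recommend(42L, 3);
                for (RecommendedItem item : items) {
                    System.out.println(item.getItemID() + " " + item.getValue());
                }
            }
        }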
  21. Apache Flume
     • A distributed, reliable, and highly available service for efficiently collecting, aggregating, and moving large amounts of log data (some kind of ETL)
     • It uses three parts:
       – Agent (receives data from an application/log)
       – Processor (intermediate processing)
       – Collector (writes data to permanent storage)
     • Use it as a framework for imports instead of developing an importer for each source
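     A hedged sketch of the application side: handing one log line to a Flume agent through Flume's Java RPC client. Host name and port are placeholders; the agent is assumed to expose an Avro source there.

        import java.nio.charset.Charset;
        import org.apache.flume.Event;
        import org.apache.flume.api.RpcClient;
        import org.apache.flume.api.RpcClientFactory;
        import org.apache.flume.event.EventBuilder;

        public class FlumeClientExample {
            public static void main(String[] args) throws Exception {
                // Connect to a Flume agent's source (hypothetical host/port)
                RpcClient client = RpcClientFactory.getDefaultInstance("flume-agent.local", 41414);
                try {
                    // One log line becomes one Flume event
                    Event event = EventBuilder.withBody("GET /index.html 200", Charset.forName("UTF-8"));
                    client.append(event);  // the agent forwards it towards permanent storage
                } finally {
                    client.close();
                }
            }
        }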
  22. Apache ZooKeeper
     • Provides an open source distributed
       – Configuration service
       – Synchronization service
       – Naming registry for large distributed systems
     • Its architecture supports high availability through redundant services
     • ZooKeeper is used by companies like Rackspace, Yahoo! and eBay
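     As a minimal illustration of the configuration-service idea (ensemble address, znode path, and value are all made up): publishing and reading a value with ZooKeeper's Java client.

        import org.apache.zookeeper.CreateMode;
        import org.apache.zookeeper.ZooDefs;
        import org.apache.zookeeper.ZooKeeper;

        public class ZkExample {
            public static void main(String[] args) throws Exception {
                // Connect to a ZooKeeper ensemble (hypothetical address), 3s session timeout
                ZooKeeper zk = new ZooKeeper("zk1.local:2181", 3000, event -> {});

                // Publish a piece of configuration as a znode...
                zk.create("/demo-config", "jdbc:mysql://db1/prod".getBytes(),
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

                // ...which every machine in the cluster can read (and watch for changes)
                byte[] data = zk.getData("/demo-config", false, null);
                System.out.println(new String(data));
                zk.close();
            }
        }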
  23. Apache Sqoop
     • Efficiently transfers bulk data between Hadoop and structured datastores (relational databases)
     • For example, to import a table and store it as CSV files in an HDFS directory:

     sqoop import \
       --connect <JDBC connection string> \
       --table <tablename> \
       --username <username> \
       --password <password>
  24. HBase
     • HBase is for random, realtime read/write access to Big Data
     • The project's goal is to host very large tables
     • Use it for billions of rows x millions of columns
     • Hosted on clusters of commodity hardware
     • It's a distributed, versioned, non-relational database modeled after Google's Bigtable
     • It provides Bigtable-like capabilities on top of Hadoop/HDFS
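     What "random, realtime read/write" looks like from Java, sketched with the older HTable client API that was current around 2014 (table name, column family, and row key are invented):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.Get;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.util.Bytes;

        public class HBaseExample {
            public static void main(String[] args) throws Exception {
                Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
                HTable table = new HTable(conf, "access_log");     // hypothetical table

                // Write one cell: row key, column family:qualifier, value
                Put put = new Put(Bytes.toBytes("row-2014-05-06"));
                put.add(Bytes.toBytes("d"), Bytes.toBytes("hits"), Bytes.toBytes("42"));
                table.put(put);

                // Random read of the same row
                Result result = table.get(new Get(Bytes.toBytes("row-2014-05-06")));
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("d"), Bytes.toBytes("hits"))));
                table.close();
            }
        }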
  25. Ambari
     • Aimed at making Hadoop management simpler
     • Use it for Hadoop clusters to
       – Provision
       – Manage
       – Monitor
     • Easy-to-use management web UI
     • Plus RESTful APIs
  26. Oozie
     • Oozie is a workflow scheduler system to manage Apache Hadoop jobs
     • Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions
     • Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability
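     A hedged sketch of submitting such a workflow with Oozie's Java client. The server URL, HDFS paths, and property values are placeholders, and the workflow.xml defining the action DAG is assumed to already sit in HDFS.

        import java.util.Properties;
        import org.apache.oozie.client.OozieClient;

        public class OozieExample {
            public static void main(String[] args) throws Exception {
                OozieClient oozie = new OozieClient("http://oozie.local:11000/oozie");

                // Points at an HDFS directory containing workflow.xml (the action DAG)
                Properties conf = oozie.createConfiguration();
                conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode/user/demo/wordcount-wf");
                conf.setProperty("nameNode", "hdfs://namenode:8020");  // hypothetical parameters
                conf.setProperty("jobTracker", "jobtracker:8021");

                String jobId = oozie.run(conf);  // submit and start the workflow
                System.out.println("Workflow job ID: " + jobId);
            }
        }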
  27. Distributions
     • You get an out-of-the-box system
     • Most Hadoop tools are already installed
     • A complete Linux system + enhancements
  28. Suites
     • No more coding in Java needed
     • But mostly only "one purpose" apps
     • Software packages with Big Data inside:
       – Splunk
       – Talend
       – Pentaho
       – Teradata
       – Microsoft HDInsight
  29. Not only on premise
     • Hadoop can run in the cloud
       – Amazon
       – Microsoft
     • No infrastructure needed
     • Scale your processing time to your needs
     http://www.chg-computer.de/uploads/mediapool/cloud.jpg
  30. Management Summary
     • Hadoop can help you with your data!
     • Hadoop and RDBMS will coexist
     • The ecosystem reduces development costs/time
     • Multiple flavors of Hadoop:
       – Cloud / hosted / on premise
       – Ready-to-use distributions