
Big leap into Big Data

An introduction to the big data landscape, presented at Hibu

Santosh Sahoo

April 07, 2016

Transcript

  1. Big leap into Big Data: Hadoop and Spark. Presented by Santosh Sahoo,
     CTO, CircleHD & VestaLabs; formerly SAP, Hibu, Microsoft, IBM.
  2. Objective: discuss big data, its problems, and the available solutions.
     Full course at: https://mentorbits.com/courses/bigdata-for-developers-1
  3. Hadoop ecosystem: MapReduce, Hive (SQL), Pig (ETL), Mahout (ML),
     HDFS (storage), HBase (columnar storage), YARN (distributed task scheduler),
     Oozie (workflow), HUE.
  4. MR vs HiveQL: the classic word count in four Hive statements.
     CREATE TABLE docs (line STRING);
     LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;
     CREATE TABLE word_counts AS
     SELECT word, count(1) AS count
     FROM (SELECT explode(split(line, '\\s')) AS word FROM docs) w
     GROUP BY word ORDER BY word;
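     For contrast, the same word count hand-written against the MapReduce API
     needs a mapper, a reducer, and a driver. A sketch in Scala over the standard
     org.apache.hadoop.mapreduce classes (class names here are illustrative):

       import org.apache.hadoop.conf.Configuration
       import org.apache.hadoop.fs.Path
       import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
       import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
       import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
       import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

       // Mapper: emit (word, 1) for every whitespace-separated token in a line
       class TokenizerMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
         private val one  = new IntWritable(1)
         private val word = new Text()
         override def map(key: LongWritable, value: Text,
             context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
           for (token <- value.toString.split("\\s+") if token.nonEmpty) {
             word.set(token)
             context.write(word, one)
           }
         }
       }

       // Reducer: sum the 1s emitted per word
       class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
         override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
             context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
           var sum = 0
           val it = values.iterator()
           while (it.hasNext) sum += it.next().get()
           context.write(key, new IntWritable(sum))
         }
       }

       // Driver: wire the mapper, reducer, and I/O paths into a Job
       object WordCount {
         def main(args: Array[String]): Unit = {
           val job = Job.getInstance(new Configuration(), "word count")
           job.setJarByClass(classOf[TokenizerMapper])
           job.setMapperClass(classOf[TokenizerMapper])
           job.setReducerClass(classOf[SumReducer])
           job.setOutputKeyClass(classOf[Text])
           job.setOutputValueClass(classOf[IntWritable])
           FileInputFormat.addInputPath(job, new Path(args(0)))
           FileOutputFormat.setOutputPath(job, new Path(args(1)))
           System.exit(if (job.waitForCompletion(true)) 0 else 1)
         }
       }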
  5. MR vs Hive vs Pig.
     MapReduce. Strengths: works on both structured and unstructured data; good
     for writing complex business logic. Weaknesses: long development time; joins
     are hard to implement.
     Hive. Strengths: less development time; suitable for ad-hoc analysis; joins
     are easy. Weaknesses: not easy for complex business logic; handles only
     structured data.
     Pig. Strengths: structured and unstructured data; joins are easily written.
     Weaknesses: a new language to learn; scripts are converted into MapReduce
     anyway.
  6. Apache Spark: a fast, general-purpose framework for big data processing,
     with built-in modules for streaming, SQL, machine learning, and graph
     processing. The stack: Spark Core; Spark SQL (structured data); Spark
     Streaming (real-time); MLlib (machine learning); GraphX (graph data).
     Positioned against the Hadoop stack of MapReduce, Hive, Pig, and Mahout
     on HDFS.
  7. How is Spark faster? RDD (Resilient Distributed Dataset) is the basic
     abstraction in Spark: an immutable, partitioned collection of elements that
     can be operated on in parallel. Caching plus the DAG execution model is
     enough to run jobs efficiently, and combining libraries into one program is
     much faster than chaining separate tools.
     Abstraction lineage: Dataset < DataFrames < schema-RDD < RDD.
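     To make the RDD model concrete, here is the slide-4 word count as a chain
     of RDD transformations: a minimal sketch, assuming a live SparkContext sc
     and an input path "docs" (the path is illustrative):

       // Transformations only build the DAG; the action at the end triggers execution
       val counts = sc.textFile("docs")
         .flatMap(line => line.split("\\s+"))  // tokenize on whitespace
         .map(word => (word, 1))               // pair each word with a count of 1
         .reduceByKey(_ + _)                   // single shuffle: sum counts per word
         .cache()                              // keep the result in memory for reuse
       counts.take(10).foreach(println)        // action: run the DAG, print a sample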
  8. Spark SQL (Python):
     # HiveContext exposes Hive tables to SQL; results can be used like an RDD
     context = HiveContext(sc)
     results = context.sql("SELECT * FROM people")
     names = results.map(lambda p: p.name)
  9. MPP (Massively Parallel Processing): Teradata, Netezza, Vertica, Greenplum*,
     HANA, Redshift**.
     Pros: 1. Scalable 2. Fault-tolerant 3. Distributed.
     Cons: 1. Cost 2. Complexity 3. Speed.
  10. Streaming use cases: • Stock market • Clickstream analysis • Fraud
      detection • Real-time bidding • Trend analysis • Real-time data
      warehousing • ...
  11. Streaming data pipeline: Source → Flow Manager → Streaming Processor →
      Storage → Dashboard.
      Sources: applications, mobile devices, IoT sensors, database CDC, log
      scraping.
      Flow managers: async actors (Akka), message queues (Kafka, Flume, Azure
      Event Hub, AWS Kinesis), HDFS.
      Streaming processors: Storm, Spark Streaming, Azure Stream Analytics,
      Samza, Flink, Heron.
      Storage: RDBMS, NoSQL, HDFS, DW/Redshift.
      Dashboards: custom apps, D3, Tableau, Cognos, Excel.
  12. Spark Streaming: a data processing framework for building streaming
      applications. Why? 1. Scalable 2. Fault-tolerant 3. Simpler 4. Modular
      5. Code reuse.
  13. But Spark vs Storm? • Storm is a stream processing framework that also
      does micro-batching (Trident). • Spark is a batch processing framework
      that also does micro-batching (Spark Streaming). Also read:
      https://www.quora.com/What-are-the-differences-between-Apache-Spark-and-Apache-Flink/answer/Santosh-Sahoo
  14. Stream.scala:
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.kafka.KafkaUtils
      import kafka.serializer.StringDecoder

      val conf = new SparkConf().setAppName("demoapp").setMaster("local[1]")
      val sc = new SparkContext(conf)
      val ssc = new StreamingContext(sc, Seconds(2))  // 2-second micro-batches
      val kafkaConfig = Map("metadata.broker.list" -> "localhost:9092")
      val topics = Set("topic1")
      // Receiver-less "direct" stream; type parameters are key/value types and decoders
      val wordstream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaConfig, topics)
      wordstream.print()   // dump each micro-batch to stdout
      ssc.start()
      ssc.awaitTermination()
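      To feed the stream in a local test, one option (assuming a broker on
      localhost:9092 and the stock Kafka console tools on the PATH) is the
      console producer; each line typed becomes a message that shows up in the
      next 2-second batch:

        kafka-console-producer.sh --broker-list localhost:9092 --topic topic1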
  15. Running the application:
      # --master can also be local[*] to run on a single machine
      spark-submit \
        --class AppMain \
        --master spark://192.168.10.21:7077 \
        --executor-memory 20G \
        --total-executor-cores 100 \
        /path/to/code.jar \
        1000
  16. Composite example (illustrative pseudocode combining SQL, MLlib, and
      streaming in one program):
      # Load data using SQL
      points = ctx.sql("select latitude, longitude from hive_tweets")
      # Train a machine learning model
      model = KMeans.train(points, 10)
      # Apply it to a stream
      sc.twitterStream(...)
        .map(lambda t: (model.predict(t.location), 1))
        .reduceByWindow("5s", lambda a, b: a + b)
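      A minimal runnable version of just the training and prediction steps,
      sketched in Scala with MLlib (the coordinates are made-up sample data;
      assumes a live SparkContext sc):

        import org.apache.spark.mllib.clustering.KMeans
        import org.apache.spark.mllib.linalg.Vectors

        // Cluster (latitude, longitude) points into k = 2 groups
        val points = sc.parallelize(Seq(
          Vectors.dense(47.61, -122.33), Vectors.dense(47.67, -122.38),
          Vectors.dense(40.71, -74.00),  Vectors.dense(40.73, -73.99)))
        val model = KMeans.train(points, 2, 20)  // k clusters, max 20 iterations
        // Predict the cluster id for a new point
        println(model.predict(Vectors.dense(40.72, -74.01)))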
  17. Apache Kafka: a no-nonsense logging platform. • ~100K msgs/s throughput
      vs ~20K for RabbitMQ • Log compaction • Durable persistence • Partition
      tolerance • Replication • Best-in-class integration with Spark:
      http://spark.apache.org/docs/latest/streaming-kafka-integration.html
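      Partition count and replication factor are set per topic at creation
      time; with the classic ZooKeeper-based tooling of this era (assuming
      ZooKeeper on localhost:2181), for example:

        kafka-topics.sh --create --zookeeper localhost:2181 \
          --replication-factor 2 --partitions 4 --topic topic1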
  18. [Architecture diagram, "Next Gen": OLTP and reporting (Cognos, Tableau);
      imports over FTP/HTTP/SMTP in Protobuf/JSON; broker (Kafka); stream
      processor (Spark); storage (HDFS, Tachyon); OLAP via Hive/Spark SQL and
      HANA, with replication for load balancing and failover; a service bus
      handling normalization, extract, compensate, and data {quality,
      correction, analytics}; migration via API/SQL; expense and travel sources
      through the Concur TTX API.]