Big leap into Big Data

An introduction to the big data landscape, presented at Hibu

Santosh Sahoo

April 07, 2016

  1. Santosh Sahoo: CTO, CircleHD & VestaLabs. Ex-SAP, Hibu, Microsoft, IBM
  2. Objective

    Discuss big data: the problems and the solutions. Full course at: https://mentorbits.com/courses/bigdata-for-developers-1
  3. Hadoop

    • MapReduce
    • Hive (SQL)
    • Pig (ETL)
    • Mahout (ML)
    • HDFS (storage)
    • HBase (columnar storage)
    • YARN (distributed task scheduler)
    • Oozie (workflow)
    • HUE
  4. MR vs HiveQL: word count in HiveQL

        CREATE TABLE docs (line STRING);
        LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;
        CREATE TABLE word_counts AS
        SELECT word, count(1) AS count
        FROM (SELECT explode(split(line, '\s')) AS word FROM docs) w
        GROUP BY word
        ORDER BY word;
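For comparison with the HiveQL query, the same word count can be written in the map, shuffle, reduce style. This is a minimal pure-Python sketch that simulates the three phases inside one process; it does not use the Hadoop API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical.

```python
import re
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every whitespace-separated token."""
    return [(word, 1) for word in re.split(r"\s+", line.strip()) if word]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reducer: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce in sequence and return sorted (word, count) pairs."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    # Sorting by word mirrors the ORDER BY word clause of the HiveQL query.
    return sorted(reduce_phase(word, counts) for word, counts in grouped.items())
```

Calling `word_count(["a b a", "b c"])` returns `[("a", 2), ("b", 2), ("c", 1)]`. Even in this toy form the developer writes three functions plus the wiring, where HiveQL needs a single `SELECT`.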
  5. MR vs Hive vs Pig

    MapReduce
    • Strengths: works on both structured and unstructured data; good for writing complex business logic.
    • Weaknesses: long development time; joins are hard to implement.

    Hive
    • Strengths: less development time; suitable for ad hoc analysis; easy joins.
    • Weaknesses: not well suited to complex business logic; handles only structured data.

    Pig
    • Strengths: handles structured and unstructured data; joins are easy to write.
    • Weaknesses: a new language to learn; scripts are converted into MapReduce jobs.
  6. Spark

    A fast and general-purpose framework for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing.

    • Core
    • SQL: structured data
    • Streaming: real-time
    • MLlib: machine learning
    • GraphX: graph data

    Diagram labels: Apache Spark, MapReduce, Hive, Pig, Mahout, HDFS
  7. How is Spark faster?

    RDD: a Resilient Distributed Dataset, the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel.

    • Caching plus the DAG execution model is enough to run these workloads efficiently.
    • Combining libraries into one program is much faster.
    • Dataset < DataFrame < SchemaRDD < RDD
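The RDD idea (immutable partitions, transformations recorded lazily as a lineage, optional caching of computed results) can be illustrated with a toy model. This is a sketch under simplifying assumptions, not Spark code: `ToyRDD` and its methods are hypothetical, everything runs in one process, and the lineage is a simple chain where Spark builds a general DAG.

```python
class ToyRDD:
    """Toy model of an RDD: immutable partitions plus a recorded lineage."""

    def __init__(self, partitions, lineage=()):
        self._partitions = tuple(tuple(p) for p in partitions)  # never mutated
        self._lineage = tuple(lineage)  # transformations, applied only on demand
        self._use_cache = False
        self._cached = None

    def map(self, fn):
        # Transformations return a new ToyRDD; no data is touched yet.
        return ToyRDD(self._partitions, self._lineage + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._partitions, self._lineage + (("filter", pred),))

    def cache(self):
        # Ask for computed partitions to be kept after the first action.
        self._use_cache = True
        return self

    def _compute_partition(self, partition):
        # Each partition is computed independently, so this could run in parallel.
        items = list(partition)
        for kind, fn in self._lineage:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

    def collect(self):
        # Action: replay the lineage (or reuse the cache) and gather the results.
        if self._cached is not None:
            return [x for part in self._cached for x in part]
        computed = [self._compute_partition(p) for p in self._partitions]
        if self._use_cache:
            self._cached = computed
        return [x for part in computed for x in part]
```

For example, `ToyRDD([[1, 2], [3, 4]]).map(lambda x: x * x).filter(lambda x: x % 2 == 0).cache()` does no work until `collect()` is called, which returns `[4, 16]`; a second `collect()` is served from the cache. Because the lineage is kept, a lost partition could be recomputed from its source data, which is where the "resilient" in the name comes from.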
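  8. Spark SQL

        from pyspark.sql import HiveContext

        context = HiveContext(sc)  # sc is an existing SparkContext
        results = context.sql("SELECT * FROM people")
        names = results.rdd.map(lambda p: p.name)  # map over the DataFrame's underlying RDD of rows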
  9. MPP: Massively Parallel Processing

    Teradata, Netezza, Vertica, Greenplum*, HANA, Redshift**

    Pros
    1. Scalable
    2. Fault-tolerant
    3. Distributed

    Cons
    1. Cost
    2. Complexity
    3. Speed
  10. Streaming use cases

    • Stock market
    • Clickstream analysis
    • Fraud detection
    • Real-time bidding
    • Trend analysis
    • Real-time data warehousing
    • ...
  11. Streaming Data Pipeline

    • Source: applications, mobile devices, sensors (IoT), database CDC, log scraping
    • Flow Manager: async actors (Akka), message queues, Kafka, Flume, Azure Event Hubs, AWS Kinesis, HDFS
    • Streaming Processor: Storm, Spark Streaming, Azure Stream Analytics, Samza, Flink, Heron
    • Storage: RDBMS, NoSQL, HDFS, DW/Redshift
    • Dashboard: custom app, D3, Tableau, Cognos, Excel
  12. Spark Streaming

    A data processing framework for building streaming applications.

    Why?
    1. Scalable
    2. Fault-tolerant
    3. Simpler
    4. Modular
    5. Code reuse
  13. But what about Spark vs Storm?

    • Storm is a stream processing framework that also does micro-batching (Trident).
    • Spark is a batch processing framework that also does micro-batching (Spark Streaming).

    Also read: https://www.quora.com/What-are-the-differences-between-Apache-Spark-and-Apache-Flink/answer/Santosh-Sahoo
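Micro-batching means cutting a continuous stream into small, fixed-interval batches and then handling each batch with ordinary batch logic. The sketch below shows only that grouping step in pure Python; `micro_batches` is a hypothetical helper, events are assumed to be `(timestamp_seconds, value)` pairs, and intervals with no events are skipped for brevity.

```python
from collections import defaultdict


def micro_batches(events, interval):
    """Group (timestamp, value) events into consecutive batches of `interval` seconds."""
    batches = defaultdict(list)
    for timestamp, value in events:
        # Events whose timestamps fall in the same interval share a batch index.
        batches[int(timestamp // interval)].append(value)
    # Return the batches in time order; each one can be processed like a small batch job.
    return [batches[index] for index in sorted(batches)]
```

With a 2-second interval, the events `[(0.1, "a"), (1.5, "b"), (2.0, "c"), (5.9, "d")]` are grouped into `[["a", "b"], ["c"], ["d"]]`. The batch interval sets the lower bound on latency, which is the main trade-off against a record-at-a-time engine.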
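  14. Stream.scala

        import kafka.serializer.StringDecoder
        import org.apache.spark.{SparkConf, SparkContext}
        import org.apache.spark.streaming.{Seconds, StreamingContext}
        import org.apache.spark.streaming.kafka.KafkaUtils

        val conf = new SparkConf().setAppName("demoapp").setMaster("local[1]")
        val sc = new SparkContext(conf)
        val ssc = new StreamingContext(sc, Seconds(2)) // 2-second batch interval
        val kafkaConfig = Map("metadata.broker.list" -> "localhost:9092")
        val topics = Set("topic1")
        // Key/value types and their decoders must be given explicitly.
        val wordstream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaConfig, topics)
        wordstream.print()
        ssc.start()
        ssc.awaitTermination()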
  15. Running Application

        # For a local run, use --master local[*] instead.
        spark-submit \
          --class AppMain \
          --master spark:// \
          --executor-memory 20G \
          --total-executor-cores 100 \
          /path/to/code.jar \
          1000
  16. Composite Example (pseudocode)

        # Load data using SQL
        points = ctx.sql("select latitude, longitude from hive_tweets")
        # Train a machine learning model
        model = KMeans.train(points, 10)
        # Apply it to a stream
        sc.twitterStream(...) \
          .map(lambda t: (model.predict(t.location), 1)) \
          .reduceByWindow("5s", lambda a, b: a + b)
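The windowed reduce at the end of the composite example can be sketched in pure Python. Spark Streaming expresses window and slide as durations that are multiples of the batch interval; in this simplified sketch they are counted in batches instead, and `reduce_by_window` is a hypothetical helper rather than the Spark API.

```python
from functools import reduce


def reduce_by_window(batches, window, slide, fn):
    """Reduce the values in each sliding window of `window` batches, moving by `slide` batches."""
    results = []
    for end in range(window, len(batches) + 1, slide):
        # Flatten the batches covered by the current window.
        items = [x for batch in batches[end - window:end] for x in batch]
        # An empty window has nothing to reduce, so it is recorded as None.
        results.append(reduce(fn, items) if items else None)
    return results
```

For per-batch values `[[1, 1], [1], [], [1, 1, 1]]`, a window of 2 batches sliding by 1 with `lambda a, b: a + b` yields `[3, 1, 3]`.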
  17. Apache Kafka

    No-nonsense logging platform

    • 100K/s throughput vs 20K/s for RabbitMQ
    • Log compaction
    • Durable persistence
    • Partition tolerance
    • Replication
    • Best-in-class integration with Spark
      ◦ http://spark.apache.org/docs/latest/streaming-kafka-integration.html
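Log compaction keeps at least the latest record for every key, so a compacted topic can still rebuild the current state per key without retaining full history. The following is a simplified pure-Python sketch of that idea, not Kafka code: `compact` is a hypothetical helper, records are assumed to be `(offset, key, value)` tuples, and a `None` value is treated as a delete marker (tombstone) that is dropped immediately, whereas Kafka retains tombstones for a configurable period first.

```python
def compact(log):
    """Keep only the most recent record per key, preserving offset order."""
    latest = {}
    for offset, key, value in log:
        # A later record for the same key supersedes any earlier one.
        latest[key] = (offset, key, value)
    # Drop keys whose latest record is a tombstone, then restore offset order.
    return sorted(record for record in latest.values() if record[2] is not None)
```

Given the log `[(0, "k1", "a"), (1, "k2", "b"), (2, "k1", "c"), (3, "k2", None)]`, compaction leaves `[(2, "k1", "c")]`: the older value of `k1` is superseded and `k2` has been deleted.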
  18. [Architecture diagram. Labels: OLTP; Expense, Travel, TTX; API/SQL; Import (FTP, HTTP, SMTP); Protobuf, JSON; Broker (Kafka); Stream Processor (Spark); HDFS; Tachyon; Hive/Spark SQL; OLAP (HANA); replication, load balancing, failover; service bus; normalization, extract, compensate; Data {Quality, Correction, Analytics}; migrate method; Concur Next Gen; Reporting (Cognos, Tableau)]