
Big leap into Big Data

An introduction to the big data landscape, presented at Hibu

Santosh Sahoo

April 07, 2016

Transcript

  1. Big leap into Big Data: Hadoop and Spark. Presented by Santosh Sahoo,
     CTO, CircleHD & VestaLabs; formerly SAP, Hibu, Microsoft, IBM.
  2. Objective: discuss big data, its problems, and the available solutions.
     Full course at: https://mentorbits.com/courses/bigdata-for-developers-1
  3. Hadoop ecosystem: MapReduce, Hive (SQL), Pig (ETL), Mahout (ML),
     HDFS (storage), HBase (columnar storage), YARN (distributed task scheduler),
     Oozie (workflow), HUE.
  4. MR vs HiveQL: the classic word count in four Hive statements.
     CREATE TABLE docs (line STRING);
     LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;
     CREATE TABLE word_counts AS
     SELECT word, count(1) AS count
     FROM (SELECT explode(split(line, '\\s')) AS word FROM docs) w
     GROUP BY word ORDER BY word;
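     For contrast, the same word count hand-written against the MapReduce API
     needs a mapper, a reducer, and a driver. A sketch in Scala over the standard
     org.apache.hadoop.mapreduce classes (class names here are illustrative):

       import org.apache.hadoop.conf.Configuration
       import org.apache.hadoop.fs.Path
       import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
       import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
       import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
       import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

       // Mapper: emit (word, 1) for every whitespace-separated token in a line
       class TokenizerMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
         private val one  = new IntWritable(1)
         private val word = new Text()
         override def map(key: LongWritable, value: Text,
             context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
           for (token <- value.toString.split("\\s+") if token.nonEmpty) {
             word.set(token)
             context.write(word, one)
           }
         }
       }

       // Reducer: sum the 1s emitted per word
       class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
         override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
             context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
           var sum = 0
           val it = values.iterator()
           while (it.hasNext) sum += it.next().get()
           context.write(key, new IntWritable(sum))
         }
       }

       // Driver: wire the mapper, reducer, and I/O paths into a Job
       object WordCount {
         def main(args: Array[String]): Unit = {
           val job = Job.getInstance(new Configuration(), "word count")
           job.setJarByClass(classOf[TokenizerMapper])
           job.setMapperClass(classOf[TokenizerMapper])
           job.setReducerClass(classOf[SumReducer])
           job.setOutputKeyClass(classOf[Text])
           job.setOutputValueClass(classOf[IntWritable])
           FileInputFormat.addInputPath(job, new Path(args(0)))
           FileOutputFormat.setOutputPath(job, new Path(args(1)))
           System.exit(if (job.waitForCompletion(true)) 0 else 1)
         }
       }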
  5. MR vs Hive vs Pig.
     MapReduce. Strengths: works on both structured and unstructured data; good
     for writing complex business logic. Weaknesses: long development time; joins
     are hard to implement.
     Hive. Strengths: less development time; suitable for ad-hoc analysis; joins
     are easy. Weaknesses: not easy for complex business logic; handles only
     structured data.
     Pig. Strengths: structured and unstructured data; joins are easily written.
     Weaknesses: a new language to learn; scripts are converted into MapReduce
     anyway.
  6. Apache Spark: a fast, general-purpose framework for big data processing,
     with built-in modules for streaming, SQL, machine learning, and graph
     processing. The stack: Spark Core; Spark SQL (structured data); Spark
     Streaming (real-time); MLlib (machine learning); GraphX (graph data).
     Positioned against the Hadoop stack of MapReduce, Hive, Pig, and Mahout
     on HDFS.
  7. How is Spark faster? RDD (Resilient Distributed Dataset) is the basic
     abstraction in Spark: an immutable, partitioned collection of elements that
     can be operated on in parallel. Caching plus the DAG execution model is
     enough to run jobs efficiently, and combining libraries into one program is
     much faster than chaining separate tools.
     Abstraction lineage: Dataset < DataFrames < schema-RDD < RDD.
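     To make the RDD model concrete, here is the slide-4 word count as a chain
     of RDD transformations: a minimal sketch, assuming a live SparkContext sc
     and an input path "docs" (the path is illustrative):

       // Transformations only build the DAG; the action at the end triggers execution
       val counts = sc.textFile("docs")
         .flatMap(line => line.split("\\s+"))  // tokenize on whitespace
         .map(word => (word, 1))               // pair each word with a count of 1
         .reduceByKey(_ + _)                   // single shuffle: sum counts per word
         .cache()                              // keep the result in memory for reuse
       counts.take(10).foreach(println)        // action: run the DAG, print a sample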
  8. Spark SQL (Python):
     # HiveContext exposes Hive tables to SQL; results can be used like an RDD
     context = HiveContext(sc)
     results = context.sql("SELECT * FROM people")
     names = results.map(lambda p: p.name)
  9. MPP (Massively Parallel Processing): Teradata, Netezza, Vertica, Greenplum*,
     HANA, Redshift**.
     Pros: 1. Scalable 2. Fault-tolerant 3. Distributed.
     Cons: 1. Cost 2. Complexity 3. Speed.
  10. Streaming use cases: • Stock market • Clickstream analysis • Fraud
      detection • Real-time bidding • Trend analysis • Real-time data
      warehousing • ...
  11. Streaming data pipeline: Source → Flow Manager → Streaming Processor →
      Storage → Dashboard.
      Sources: applications, mobile devices, IoT sensors, database CDC, log
      scraping.
      Flow managers: async actors (Akka), message queues (Kafka, Flume, Azure
      Event Hub, AWS Kinesis), HDFS.
      Streaming processors: Storm, Spark Streaming, Azure Stream Analytics,
      Samza, Flink, Heron.
      Storage: RDBMS, NoSQL, HDFS, DW/Redshift.
      Dashboards: custom apps, D3, Tableau, Cognos, Excel.
  12. Spark Streaming: a data processing framework for building streaming
      applications. Why? 1. Scalable 2. Fault-tolerant 3. Simpler 4. Modular
      5. Code reuse.
  13. But Spark vs Storm? • Storm is a stream processing framework that also
      does micro-batching (Trident). • Spark is a batch processing framework
      that also does micro-batching (Spark Streaming). Also read:
      https://www.quora.com/What-are-the-differences-between-Apache-Spark-and-Apache-Flink/answer/Santosh-Sahoo
  14. Stream.scala:
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.kafka.KafkaUtils
      import kafka.serializer.StringDecoder

      val conf = new SparkConf().setAppName("demoapp").setMaster("local[1]")
      val sc = new SparkContext(conf)
      val ssc = new StreamingContext(sc, Seconds(2))  // 2-second micro-batches
      val kafkaConfig = Map("metadata.broker.list" -> "localhost:9092")
      val topics = Set("topic1")
      // Receiver-less "direct" stream; type parameters are key/value types and decoders
      val wordstream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaConfig, topics)
      wordstream.print()   // dump each micro-batch to stdout
      ssc.start()
      ssc.awaitTermination()
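      To feed the stream in a local test, one option (assuming a broker on
      localhost:9092 and the stock Kafka console tools on the PATH) is the
      console producer; each line typed becomes a message that shows up in the
      next 2-second batch:

        kafka-console-producer.sh --broker-list localhost:9092 --topic topic1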
  15. Running the application:
      # --master can also be local[*] to run on a single machine
      spark-submit \
        --class AppMain \
        --master spark://192.168.10.21:7077 \
        --executor-memory 20G \
        --total-executor-cores 100 \
        /path/to/code.jar \
        1000
  16. Composite example (illustrative pseudocode combining SQL, MLlib, and
      streaming in one program):
      # Load data using SQL
      points = ctx.sql("select latitude, longitude from hive_tweets")
      # Train a machine learning model
      model = KMeans.train(points, 10)
      # Apply it to a stream
      sc.twitterStream(...)
        .map(lambda t: (model.predict(t.location), 1))
        .reduceByWindow("5s", lambda a, b: a + b)
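      A minimal runnable version of just the training and prediction steps,
      sketched in Scala with MLlib (the coordinates are made-up sample data;
      assumes a live SparkContext sc):

        import org.apache.spark.mllib.clustering.KMeans
        import org.apache.spark.mllib.linalg.Vectors

        // Cluster (latitude, longitude) points into k = 2 groups
        val points = sc.parallelize(Seq(
          Vectors.dense(47.61, -122.33), Vectors.dense(47.67, -122.38),
          Vectors.dense(40.71, -74.00),  Vectors.dense(40.73, -73.99)))
        val model = KMeans.train(points, 2, 20)  // k clusters, max 20 iterations
        // Predict the cluster id for a new point
        println(model.predict(Vectors.dense(40.72, -74.01)))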
  17. Apache Kafka: a no-nonsense logging platform. • ~100K msgs/s throughput
      vs ~20K for RabbitMQ • Log compaction • Durable persistence • Partition
      tolerance • Replication • Best-in-class integration with Spark:
      http://spark.apache.org/docs/latest/streaming-kafka-integration.html
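      Partition count and replication factor are set per topic at creation
      time; with the classic ZooKeeper-based tooling of this era (assuming
      ZooKeeper on localhost:2181), for example:

        kafka-topics.sh --create --zookeeper localhost:2181 \
          --replication-factor 2 --partitions 4 --topic topic1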
  18. [Architecture diagram, "Next Gen": OLTP and reporting (Cognos, Tableau);
      imports over FTP/HTTP/SMTP in Protobuf/JSON; broker (Kafka); stream
      processor (Spark); storage (HDFS, Tachyon); OLAP via Hive/Spark SQL and
      HANA, with replication for load balancing and failover; a service bus
      handling normalization, extract, compensate, and data {quality,
      correction, analytics}; migration via API/SQL; expense and travel sources
      through the Concur TTX API.]