
Spark: A Coding Joyride | QCon SF 2015


In this presentation from QCon SF, Doug Bateman walks through some of the engines in Spark and demonstrates its ability to rapidly process Big Data. He'll cover:

+ Extracting information with RDDs
+ Querying data using DataFrames
+ Visualizing and plotting data
+ Creating a machine-learning pipeline with Spark-ML and MLlib

He'll also discuss the internals which make Spark 10-100 times
faster than Hadoop MapReduce and Hive.

NewCircle Training

November 25, 2015

Transcript

  1. Objectives
     • Show Spark's ability to rapidly process Big Data
     • Extracting information with RDDs
     • Querying data using DataFrames
     • Visualizing and plotting data
     • Create a machine-learning pipeline with Spark-ML and MLlib
     • We'll also discuss the internals which make Spark 10-100 times faster than Hadoop MapReduce and Hive
  2. About Me
     Manage the development and delivery of software development trainings.
     • Java since 1995 (Java 1.0)
     • 15+ years developing software, consulting, and training development teams
     Engineer, Architect & Instructor
     Director of Training, NewCircle
  3. Who are you?
     0) I am new to Spark.
     1) I have used Spark hands-on before…
     2) I have more than 1 year of hands-on experience with Spark.
  4. Diagram labels: {JSON}, Data Sources, Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX, RDD API, DataFrames API, Environments, Workloads, YARN
  5. Spark – 100% open source and mature
     Used in production by over 500 organizations, from Fortune 100 to small innovators.
  6. Large-Scale Usage
     Largest cluster: 8000 nodes
     Largest single job: 1 petabyte
     Top streaming intake: 1 TB/hour
     2014 on-disk 100 TB sort record
  7. On-Disk Sort Record: Time to sort 100TB
     2013 Record: Hadoop, 2100 machines, 72 minutes
     2014 Record: Spark, 207 machines, 23 minutes
     Source: Daytona GraySort benchmark, sortbenchmark.org
  8. Spark Physical Cluster
     Spark Driver (JVM)
     Executor (JVM): Slot, Slot
     Executor (JVM): Slot, Slot
     Executor (JVM): Slot, Slot
     Executor (JVM): Slot, Slot
  9. Spark Physical Cluster
     Spark Driver (JVM)
     Executor (JVM): Task, Task
     Executor (JVM): Task, Slot
     Executor (JVM): Slot, Slot
     Executor (JVM): Task, Task
  10. Use Case: predict power output given a set of readings from various sensors in a gas-fired power generation plant
      Schema Definition:
      AT = Atmospheric Temperature in C
      V = Exhaust Vacuum Speed
      AP = Atmospheric Pressure
      RH = Relative Humidity
      PE = Power Output (value we are trying to predict)
  11. About Databricks
      Cloud-based integrated workspace for Spark
      • Contributed more than 75% of the code added to Spark in the last year
      • Company spun from the original Spark team at UC Berkeley
  12. About NewCircle
      Software Development Training for the Enterprise
      Android, Big Data, Java, JavaScript MV*, Python, and more…
      • Courses tailored for your team
      • Global delivery at scale
      • Custom training programs & courseware development
      https://newcircle.com
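
The sort-record figures on slide 7 can be compared with a little arithmetic. The sketch below uses only the machine counts and times quoted on that slide. The "machine-minutes" figure is an illustrative normalization introduced here, not a metric from the talk or from the benchmark, and it ignores any hardware differences between the two clusters.

```python
# Figures as quoted on slide 7 (Daytona GraySort, 100 TB on-disk sort).
HADOOP_2013 = {"machines": 2100, "minutes": 72}
SPARK_2014 = {"machines": 207, "minutes": 23}


def machine_minutes(record):
    """Rough cost of a run: machines multiplied by wall-clock minutes.

    This is a hypothetical normalization for illustration only.
    """
    return record["machines"] * record["minutes"]


def compare(baseline, contender):
    """Return simple ratios of baseline to contender."""
    return {
        "speedup": baseline["minutes"] / contender["minutes"],
        "machine_ratio": baseline["machines"] / contender["machines"],
        "machine_minutes_ratio": machine_minutes(baseline)
        / machine_minutes(contender),
    }


summary = compare(HADOOP_2013, SPARK_2014)
print(f"Wall-clock speedup:     {summary['speedup']:.2f}x")
print(f"Fewer machines:         {summary['machine_ratio']:.2f}x")
print(f"Fewer machine-minutes:  {summary['machine_minutes_ratio']:.1f}x")
```

On the slide's numbers, the Spark run finished about 3 times sooner on roughly a tenth of the machines.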