
Spark: A Coding Joyride | QCon SF 2015

In this presentation from QCon SF, Doug Bateman walks through some of the engines in Spark and demonstrates its ability to rapidly process Big Data. He'll cover:

+ Extracting information with RDDs
+ Querying data using DataFrames
+ Visualizing and plotting data
+ Creating a machine-learning pipeline with Spark ML and MLlib.

He'll also discuss the internals which make Spark 10-100 times
faster than Hadoop MapReduce and Hive.


NewCircle Training

November 25, 2015

Transcript

  1. Spark: A Coding Joyride Doug Bateman Director of Training, NewCircle

  2. Objectives • Show Spark's ability to rapidly process Big Data • Extracting information with RDDs • Querying data using DataFrames • Visualizing and plotting data • Creating a machine-learning pipeline with Spark ML and MLlib • We'll also discuss the internals which make Spark 10-100 times faster than Hadoop MapReduce and Hive.
  3. About Me Director of Training, NewCircle. Engineer, Architect & Instructor. Manages the development and delivery of software development trainings. • Java since 1995 (Java 1.0) • 15+ years developing software, consulting, and training development teams.
  4. About Me, For Fun • Sailing • Rock climbing • Snowboarding • Chess
  5. Who are you? 0) I am new to Spark. 1) I have used Spark hands-on before. 2) I have more than 1 year of hands-on experience with Spark.
  6. Goal: a unified engine across data sources, workloads, and environments.
  7. [Stack diagram] Spark Core (RDD API) with Spark SQL (DataFrames API), Spark Streaming, MLlib, and GraphX layered on top; environments such as YARN; data sources including {JSON}.
  8. Spark – 100% open source and mature. Used in production by over 500 organizations, from Fortune 100 companies to small innovators.
  9. Apache Spark: large user community. [Chart: commits in the past year]
  10. Large-Scale Usage • Largest cluster: 8,000 nodes • Largest single job: 1 petabyte • Top streaming intake: 1 TB/hour • 2014 on-disk 100 TB sort record
  11. On-Disk Sort Record: time to sort 100 TB. 2013 record: Hadoop, 2,100 machines, 72 minutes. 2014 record: Spark, 207 machines, 23 minutes. Source: Daytona GraySort benchmark, sortbenchmark.org.
  12. Spark Physical Cluster: the Driver JVM coordinates several Executor JVMs, each with a fixed number of task slots.
  13. Spark Physical Cluster: tasks run in executor slots; at any moment some slots hold running tasks while others sit free.
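The slot/task picture on these slides maps directly onto configuration: each executor JVM gets a fixed number of cores (slots), and the driver schedules one task per slot. A hedged config sketch follows; the property names are standard Spark settings, but the values are arbitrary examples, not recommendations from the talk.

```python
# Config sketch: sizing executors and slots when building a SparkSession.
# Values are arbitrary examples.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cluster-sizing")
         .config("spark.executor.instances", "4")  # 4 executor JVMs
         .config("spark.executor.cores", "2")      # 2 task slots per executor
         .config("spark.executor.memory", "4g")    # heap per executor JVM
         .getOrCreate())
# Maximum concurrent tasks = instances x cores = 8 slots.
```

Note that `spark.executor.instances` only takes effect on a cluster manager such as YARN; in local mode the `local[N]` master string sets the slot count instead.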
  14. Power Plant Demo
  15. Use Case: predict power output given a set of readings from various sensors in a gas-fired power generation plant. Schema Definition: • AT = Atmospheric Temperature (C) • V = Exhaust Vacuum Speed • AP = Atmospheric Pressure • RH = Relative Humidity • PE = Power Output (the value we are trying to predict)
  16. Steps: 1. ETL 2. Explore + Visualize Data 3. Apply Machine Learning
  17. About Databricks • Cloud-based integrated workspace for Spark • Contributed more than 75% of the code added to Spark in the last year • Company spun out of the original Spark team at UC Berkeley
  18. About NewCircle • Software Development Training for the Enterprise: Android, Big Data, Java, JavaScript MV*, Python, and more • Courses tailored for your team • Global delivery at scale • Custom training programs & courseware development • https://newcircle.com
  19. Thank you.