Spark: A Coding Joyride | QCon SF 2015

Slide 1

Slide 1 text

Spark: A Coding Joyride
Doug Bateman
Director of Training, NewCircle

Slide 2

Slide 2 text

Objectives
• Show Spark's ability to rapidly process Big Data
• Extracting information with RDDs
• Querying data using DataFrames
• Visualizing and plotting data
• Create a machine-learning pipeline with Spark ML and MLlib
• We'll also discuss the internals that make Spark 10-100 times faster than Hadoop MapReduce and Hive

Slide 3

Slide 3 text

About Me
Engineer, Architect & Instructor
Director of Training, NewCircle
Manage the development and delivery of software development training.
• Java since 1995 (Java 1.0)
• 15+ years developing software, consulting, and training development teams

Slide 4

Slide 4 text

About Me
For Fun
• Sailing
• Rock climbing
• Snowboarding
• Chess

Slide 5

Slide 5 text

Who are you?
0) I am new to Spark.
1) I have used Spark hands-on before…
2) I have more than 1 year of hands-on experience with Spark.

Slide 6

Slide 6 text

Goal: unified engine across data sources, workloads, and environments
[Diagram labels: Data Sources, Workloads, Environments]

Slide 7

Slide 7 text

[Diagram labels]
Data Sources: {JSON}
Spark Core, with the RDD API and DataFrames API
Workloads: Spark Streaming, Spark SQL, MLlib, GraphX
Environments: YARN

Slide 8

Slide 8 text

Spark – 100% open source and mature
Used in production by over 500 organizations, from Fortune 100 companies to small innovators.

Slide 9

Slide 9 text

Apache Spark: Large user community
[Bar chart: commits in the past year, axis from 0 to 4000]

Slide 10

Slide 10 text

Large-Scale Usage
• Largest cluster: 8000 nodes
• Largest single job: 1 petabyte
• Top streaming intake: 1 TB/hour
• 2014 on-disk 100 TB sort record

Slide 11

Slide 11 text

On-Disk Sort Record: Time to sort 100 TB
2013 Record: Hadoop, 2100 machines, 72 minutes
2014 Record: Spark, 207 machines, 23 minutes
Source: Daytona GraySort benchmark, sortbenchmark.org
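
To put the two records side by side, the figures on this slide can be reduced to a few ratios. The following is a minimal sketch in plain Python that uses only the machine counts and times quoted above. The `machine_minutes` measure is an illustrative assumption for comparing total resource use, not something the benchmark itself reports.

```python
# Figures quoted on the slide (Daytona GraySort, 100 TB).
hadoop = {"machines": 2100, "minutes": 72}   # 2013 record
spark = {"machines": 207, "minutes": 23}     # 2014 record


def machine_minutes(record):
    """Illustrative measure: machines multiplied by wall-clock minutes."""
    return record["machines"] * record["minutes"]


speedup = hadoop["minutes"] / spark["minutes"]
machine_ratio = hadoop["machines"] / spark["machines"]
resource_ratio = machine_minutes(hadoop) / machine_minutes(spark)

print(f"Wall-clock speedup: {speedup:.1f}x")
print(f"Machines used: {machine_ratio:.1f}x fewer")
print(f"Machine-minutes: {resource_ratio:.1f}x fewer")
```

On these numbers the Spark run is roughly 3 times faster on roughly a tenth of the machines, which works out to about a thirtieth of the machine-minutes.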

Slide 12

Slide 12 text

Spark Physical Cluster
[Diagram: a Spark Driver and four Executors, each in its own JVM; each Executor has two Slots]

Slide 13

Slide 13 text

Spark Physical Cluster
[Diagram: a Spark Driver and four Executors, each in its own JVM; two Executors have both Slots running Tasks, one has one Task and one free Slot, and one has both Slots free]
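
The relationship between executors, slots, and tasks in these two diagrams can be sketched as a toy model. This is plain Python, not Spark's scheduler: the `Executor` class, the `assign_tasks` function, and the first-fit placement rule are all hypothetical names and simplifications chosen for illustration. The only things taken from the slides are four executors with two slots each, and the idea that a task occupies a slot while it runs.

```python
from dataclasses import dataclass, field


@dataclass
class Executor:
    """Toy stand-in for an executor JVM with a fixed number of slots."""
    name: str
    slots: int = 2
    running: list = field(default_factory=list)

    def free_slots(self):
        return self.slots - len(self.running)


def assign_tasks(executors, tasks):
    """Place each task in the first executor that has a free slot.

    Returns the tasks that could not be placed because every slot is busy.
    """
    pending = []
    for task in tasks:
        target = next((e for e in executors if e.free_slots() > 0), None)
        if target is None:
            pending.append(task)
        else:
            target.running.append(task)
    return pending


# Four executors with two slots each, as in the diagram.
executors = [Executor(f"executor-{i}") for i in range(4)]
pending = assign_tasks(executors, [f"task-{i}" for i in range(10)])

for e in executors:
    print(e.name, e.running, "free slots:", e.free_slots())
print("waiting for a slot:", pending)
```

With eight slots in total and ten tasks submitted, two tasks wait until a slot frees up. In this model the number of slots caps how many tasks run at once.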

Slide 14

Slide 14 text

Power Plant Demo

Slide 15

Slide 15 text

Use Case: predict power output given a set of readings from various sensors in a gas-fired power generation plant
Schema Definition:
AT = Atmospheric Temperature in C
V = Exhaust Vacuum Speed
AP = Atmospheric Pressure
RH = Relative Humidity
PE = Power Output (value we are trying to predict)
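
The schema above can be written down as a typed record. The following is a minimal sketch in plain Python rather than Spark. The `Reading` type, the `parse_readings` helper, the comma-separated layout with a header row, and the sample values are all assumptions made for illustration; only the five column names and their meanings come from the slide.

```python
import csv
import io
from typing import NamedTuple


class Reading(NamedTuple):
    AT: float  # Atmospheric Temperature in C
    V: float   # Exhaust Vacuum Speed
    AP: float  # Atmospheric Pressure
    RH: float  # Relative Humidity
    PE: float  # Power Output (the value to predict)


def parse_readings(text):
    """Parse comma-separated text with a header row into Reading records."""
    rows = csv.DictReader(io.StringIO(text))
    return [Reading(**{name: float(row[name]) for name in Reading._fields})
            for row in rows]


# Made-up sample values, for illustration only.
sample = """AT,V,AP,RH,PE
15.0,40.0,1020.0,70.0,460.0
25.0,60.0,1010.0,60.0,440.0
"""

readings = parse_readings(sample)
features = [r[:4] for r in readings]   # AT, V, AP, RH
labels = [r.PE for r in readings]      # PE
print(features)
print(labels)
```

Separating the four sensor readings from `PE` mirrors the prediction task: the first four columns are inputs and the last is the target.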

Slide 16

Slide 16 text

Steps:
1. ETL
2. Explore + Visualize Data
3. Apply Machine Learning
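
The three steps can be sketched end to end on a tiny example. This is a plain-Python sketch, not the demo's actual Spark code. It assumes a single input column (AT), made-up rows, and a one-variable least-squares fit standing in for the machine-learning step. The names `raw_rows`, `parse`, and `predict` are hypothetical.

```python
from statistics import mean

# Made-up (AT, PE) rows for illustration; one row is deliberately malformed.
raw_rows = ["10.0,480.0", "20.0,460.0", "n/a,455.0", "30.0,440.0"]


# 1. ETL: parse each row and drop the ones that fail to convert.
def parse(row):
    try:
        at, pe = (float(x) for x in row.split(","))
        return at, pe
    except ValueError:
        return None


data = [p for p in map(parse, raw_rows) if p is not None]

# 2. Explore: summary statistics for each column.
at_values = [at for at, _ in data]
pe_values = [pe for _, pe in data]
print("mean AT:", mean(at_values), "mean PE:", mean(pe_values))

# 3. Machine learning: fit PE = intercept + slope * AT by least squares.
at_mean, pe_mean = mean(at_values), mean(pe_values)
slope = (sum((a - at_mean) * (p - pe_mean) for a, p in data)
         / sum((a - at_mean) ** 2 for a in at_values))
intercept = pe_mean - slope * at_mean


def predict(at):
    return intercept + slope * at


print("slope:", slope, "intercept:", intercept)
print("predicted PE at AT=25:", predict(25.0))
```

The malformed row is filtered out during ETL, so the statistics and the fit only see the three clean rows.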

Slide 17

Slide 17 text

About Databricks
Cloud-based integrated workspace for Spark
• Contributed more than 75% of the code added to Spark in the last year
• Company spun from the original Spark team at UC Berkeley

Slide 18

Slide 18 text

About NewCircle
Software Development Training for the Enterprise
Android, Big Data, Java, JavaScript MV*, Python, and more…
• Courses tailored for your team
• Global delivery at scale
• Custom training programs & courseware development
https://newcircle.com

Slide 19

Slide 19 text

Thank you.