
Spark: A Coding Joyride | QCon SF 2015

In this presentation from QCon SF, Doug Bateman walks through some of the engines in Spark and demonstrates its ability to rapidly process Big Data. He'll cover:

+ Extracting information with RDDs
+ Querying data using DataFrames
+ Visualizing and plotting data
+ Creating a machine-learning pipeline with Spark ML and MLlib.

He'll also discuss the internals which make Spark 10-100 times
faster than Hadoop MapReduce and Hive.


NewCircle Training

November 25, 2015

Transcript

  1. Spark: A Coding Joyride Doug Bateman Director of Training, NewCircle

  2. Objectives • Show Spark's ability to rapidly process Big Data • Extracting information with RDDs • Querying data using DataFrames • Visualizing and plotting data • Creating a machine-learning pipeline with Spark ML and MLlib • We'll also discuss the internals which make Spark 10-100 times faster than Hadoop MapReduce and Hive.
  3. About Me Director of Training, NewCircle. Engineer, Architect & Instructor. Manages the development and delivery of software development trainings. • Java since 1995 (Java 1.0) • 15+ years developing software, consulting, and training development teams.
  4. About Me, For Fun • Sailing • Rock climbing • Snowboarding • Chess
  5. Who are you? 0) I am new to Spark. 1) I have used Spark hands-on before. 2) I have more than 1 year of hands-on experience with Spark.
  6. Goal: a unified engine across data sources, workloads, and environments.
  7. [Stack diagram] Spark Core (RDD API) with Spark SQL (DataFrames API), Spark Streaming, MLlib, and GraphX layered on top; environments such as YARN; data sources including {JSON}.
  8. Spark – 100% open source and mature. Used in production by over 500 organizations, from Fortune 100 companies to small innovators.
  9. Apache Spark: large user community. [Chart: commits in the past year]
  10. Large-Scale Usage • Largest cluster: 8,000 nodes • Largest single job: 1 petabyte • Top streaming intake: 1 TB/hour • 2014 on-disk 100 TB sort record
  11. On-Disk Sort Record: time to sort 100 TB. 2013 record: Hadoop, 2,100 machines, 72 minutes. 2014 record: Spark, 207 machines, 23 minutes. Source: Daytona GraySort benchmark, sortbenchmark.org.
  12. Spark Physical Cluster: the Driver JVM coordinates several Executor JVMs, each with a fixed number of task slots.
  13. Spark Physical Cluster: tasks run in executor slots; at any moment some slots hold running tasks while others sit free.
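The slot/task picture on these slides maps directly onto configuration: each executor JVM gets a fixed number of cores (slots), and the driver schedules one task per slot. A hedged config sketch follows; the property names are standard Spark settings, but the values are arbitrary examples, not recommendations from the talk.

```python
# Config sketch: sizing executors and slots when building a SparkSession.
# Values are arbitrary examples.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cluster-sizing")
         .config("spark.executor.instances", "4")  # 4 executor JVMs
         .config("spark.executor.cores", "2")      # 2 task slots per executor
         .config("spark.executor.memory", "4g")    # heap per executor JVM
         .getOrCreate())
# Maximum concurrent tasks = instances x cores = 8 slots.
```

Note that `spark.executor.instances` only takes effect on a cluster manager such as YARN; in local mode the `local[N]` master string sets the slot count instead.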
  14. Power Plant Demo
  15. Use Case: predict power output given a set of readings from various sensors in a gas-fired power generation plant. Schema Definition: • AT = Atmospheric Temperature (C) • V = Exhaust Vacuum Speed • AP = Atmospheric Pressure • RH = Relative Humidity • PE = Power Output (the value we are trying to predict)
  16. Steps: 1. ETL 2. Explore + Visualize Data 3. Apply Machine Learning
  17. About Databricks • Cloud-based integrated workspace for Spark • Contributed more than 75% of the code added to Spark in the last year • Company spun out of the original Spark team at UC Berkeley
  18. About NewCircle • Software Development Training for the Enterprise: Android, Big Data, Java, JavaScript MV*, Python, and more • Courses tailored for your team • Global delivery at scale • Custom training programs & courseware development • https://newcircle.com
  19. Thank you.