Spark Committer Night meetup @ NYC

Spark Committer Night October 15, 2014

Panelists: Matei Zaharia, Michael Armbrust, Joseph Bradley, Paco Nathan, Patrick Wendell, Tathagata Das, Reynold Xin

Overview
• Introductions (Matei)
• Short talks (Reynold, Patrick, Joseph)
• Q&A

What is Apache Spark?
• Fast and general engine for big data processing
• Generalizes the MapReduce model to support low-latency and complex analytics
• Most active open source project in big data

About Databricks
• Founded by the creators of Spark in 2013
• Drives open source Spark development, and offers a cloud service (Databricks Cloud)
• Largest organization contributing to Spark
  > Over 1000 patches in the past year

Community Growth
[Chart: contributors per month to Spark, 2010 to 2014; vertical axis 0 to 100]
2-3x more activity than: Hadoop, Storm, MongoDB, NumPy, D3, Julia, …

What’s New in Spark?
• Petabyte sort record
• Application integration
  > Tableau, Trifacta, Talend, ElasticSearch, …
• Ongoing development for Spark 1.2
  > Python streaming, new MLlib API, YARN scaling, …
• Spark certification

Short Talks
• Petabyte sort (Reynold Xin)
• Spark 1.2 development (Patrick Wendell)
• Machine learning pipelines (Joseph Bradley)

Petabyte Sort in Spark
Reynold Xin

Common Misconception About Spark
• “Spark is in-memory. It doesn’t work with BIG DATA.”
• “It is too expensive to buy a cluster with enough memory to fit our data.”

Spark Project Goals
• Works well with GBs, TBs, or PBs of data
• Works well with different storage media (RAM, HDDs, SSDs)
• Works well from low-latency streaming jobs to long batch jobs

Sorting 100 TB and 1 PB
• Participated in this year’s Sort Benchmark
• Sorted 100 TB following benchmark rules
  > Data only on disk
• Sorted 1 PB on our own to push scalability (no official benchmark exists)
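
Sorting data that lives on disk and does not fit in memory is commonly done with an external merge sort: sort memory-sized chunks, spill each sorted run to disk, then stream-merge the runs. The following is a minimal single-machine sketch of that general technique, not Spark's implementation; the function name `external_sort` and the `chunk_size` parameter are illustrative assumptions.

```python
import heapq
import os
import tempfile


def external_sort(values, chunk_size):
    """Sort integers while holding at most chunk_size input items in memory.

    Illustrative sketch only: sorted runs are spilled to temporary files and
    then merged. The merged result is returned as a list for easy inspection;
    a real system would stream it to its destination instead.
    """
    run_paths = []

    def spill(items):
        # Write one sorted run to disk, one value per line.
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w") as f:
            for item in sorted(items):
                f.write(f"{item}\n")
        run_paths.append(path)

    def read_run(path):
        # Lazily read a run back, one value at a time.
        with open(path) as f:
            for line in f:
                yield int(line)

    chunk = []
    for value in values:
        chunk.append(value)
        if len(chunk) >= chunk_size:
            spill(chunk)
            chunk = []
    if chunk:
        spill(chunk)

    try:
        # heapq.merge keeps only one pending item per run in memory.
        return list(heapq.merge(*(read_run(p) for p in run_paths)))
    finally:
        for path in run_paths:
            os.remove(path)
```

For example, `external_sort([5, 3, 9, 1, 7, 2], chunk_size=2)` writes three two-element runs and merges them into one sorted sequence.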

Spark sorted the same amount of data 3X faster using 10X fewer machines than the previous record, which was set with Hadoop MapReduce.

What made this possible?
• Sort-based shuffle (SPARK-2045)
• Netty native network transport (SPARK-2468)
• External shuffle service (SPARK-3796)
• Power of the Cloud (Amazon EC2)
• …
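
To make the first item concrete: the idea behind a sort-based shuffle is that each map task orders its output by target reduce partition and writes a single output plus an index of offsets, rather than one output per reduce partition. The sketch below models that idea with in-memory lists; it is an assumption-laden illustration, not Spark's code, and the names `partition_of`, `write_sorted_map_output`, and `read_partition` are hypothetical.

```python
def partition_of(key, num_partitions):
    # Hash partitioning: records with the same key go to the same reducer.
    return hash(key) % num_partitions


def write_sorted_map_output(records, num_partitions):
    """Return (data, index) for one map task.

    data  : every (key, value) record, ordered by target partition id
    index : data[index[p]:index[p + 1]] is the slice for partition p
    """
    # sorted() is stable, so records keep their arrival order within a partition.
    data = sorted(records, key=lambda kv: partition_of(kv[0], num_partitions))

    # Count records per partition, then turn the counts into start offsets.
    index = [0] * (num_partitions + 1)
    for key, _ in data:
        index[partition_of(key, num_partitions) + 1] += 1
    for p in range(num_partitions):
        index[p + 1] += index[p]
    return data, index


def read_partition(data, index, p):
    # A reducer fetches only its contiguous slice of the map output.
    return data[index[p]:index[p + 1]]
```

With integer keys and two partitions, `write_sorted_map_output([(3, "c"), (0, "a"), (4, "d"), (1, "b"), (2, "e")], 2)` places the even keys first and the odd keys second, and the index records where each slice begins and ends.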

http://tinyurl.com/spark-sort

Spark 1.1 and 1.2
Patrick Wendell

Spark Releases
• Release every 3 months: 1.0, 1.1, 1.2
• Patch releases when necessary: 1.0.2
• Spark 1.1: Released September 11
• Spark 1.2: Release early December

Spark Components
• Spark Core
• Spark SQL (structured)
• Spark Streaming (real-time)
• MLlib (machine learning)
• GraphX (graph)

Roadmap
• Spark 1.1 and 1.2 have similar themes
• Spark core: Usability, stability, and performance
• MLlib/SQL/Streaming: Expanded feature set and performance
• About 40% of mailing list traffic is about these libraries.

Spark Core
1.1
• “Sort based” shuffle implementation
• Optimized broadcasts
• Disk spilling improvements
1.2
• Re-written network module
• Scala 2.11 support
• Better debugging tools

Spark SQL
1.1
• JDBC server for multi-tenant access and BI tools
• Native JSON support
• Native Parquet support and optimizations
1.2
• Support for Hive 0.13
• External data sources API (Avro, Cassandra, etc.)
• Native ORC support

Spark Streaming
1.1
• Amazon Kinesis support
• Support for polling Flume streams
• Streaming + ML: Streaming linear regression
1.2
• Python streaming API
• Write-ahead log for full H/A operation

Python API for Streaming

    import sys
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    if __name__ == "__main__":
        sc = SparkContext(appName="StreamingWordCount")
        ssc = StreamingContext(sc, 1)  # 1-second batches
        # Read text lines from the host and port given on the command line.
        lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
        counts = lines.flatMap(lambda line: line.split(" ")) \
                      .map(lambda word: (word, 1)) \
                      .reduceByKey(lambda a, b: a + b)
        counts.pprint()  # the released PySpark API names this pprint(), not print()
        ssc.start()
        ssc.awaitTermination()  # keep the driver alive while batches are processed

github.com/apache/spark/blob/master/examples/src/main/python/streaming/network_wordcount.py
Ken Takagiwa (github/giwa), Davies Liu (github/davies)
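
To see what the word-count chain computes for a single batch without running a cluster, the three transformations can be modeled in plain Python. This is only a sketch of their semantics on one batch of lines; the helpers `flat_map`, `reduce_by_key`, and `count_words_in_batch` are hypothetical stand-ins, not PySpark functions.

```python
def flat_map(func, items):
    # Apply func to each item and flatten the resulting lists into one list.
    return [out for item in items for out in func(item)]


def reduce_by_key(func, pairs):
    # Combine all values that share a key using the binary function func.
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


def count_words_in_batch(lines):
    """Mirror flatMap -> map -> reduceByKey for the lines of one batch."""
    words = flat_map(lambda line: line.split(" "), lines)
    pairs = [(word, 1) for word in words]
    return reduce_by_key(lambda a, b: a + b, pairs)
```

Calling `count_words_in_batch(["a b a", "b c"])` counts words within that batch only. In the streaming job the same computation is applied to each batch independently, so counts are per batch rather than cumulative.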

MLlib and GraphX
1.1
• Algorithms (SVD, multiclass decision tree, …)
• Feature extraction utilities (word2vec, tf-idf)
• Statistics library
1.2
• Pipeline-based interface to all algorithms
• Many new algorithms including Random Forests
• Stable API for GraphX

Updating the MLlib API: Pipelines & Datasets
Joseph K. Bradley

ML Pipelines
Typical ML workflow
[Diagram: Training Data, Feature Extraction, Model Training, Model Testing, Test Data]

ML Pipelines
Typical ML workflow is complex.
[Diagram: Training Data, Feature Extraction, Model Training, Model Testing, Test Data, Feature Extraction, Other Training Data, Other Model Training]

ML Pipelines
Typical ML workflow is complex.
Pipelines under development
• Easy workflow construction
• Standardized interface for model tuning
• Testing & failing early
Uniform API for all pipeline components

Datasets
Further Integration with Spark SQL
ML pipelines require Datasets
• Handle many data types (features)
• Keep metadata about features
• Select subsets of features for different parts of pipeline
• Join groups of features
ML Dataset = SchemaRDD
Under development

ML Pipeline Example

    training = sqlContext.sql("SELECT ... FROM ... ").cache()
    interactor = Interactor()
    fvAssembler = FeatureVectorAssembler()
    treeClassifier = DecisionTreeClassifier()
    paramMap = (ParamMap()
        .put(interactor.features, {"genderMatch": ["userGender", "targetGender"]})
        .put(fvAssembler.features, {"features": ["genderMatch", "userCountryIndex", ...]})
        .put(treeClassifier.maxDepth, 4))
    pipeline = Pipeline.create(interactor, fvAssembler, treeClassifier)
    model = pipeline.fit(training, paramMap)
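
One way to read the "uniform API" idea is that every stage exposes the same two operations: `fit` learns something from the data and returns an object that can `transform` it, and a pipeline chains the stages in order. The plain-Python sketch below illustrates that pattern under stated assumptions; every class here (`AddInteraction`, `MajorityClassifier`, `MajorityModel`, `Pipeline`, `PipelineModel`) is hypothetical and is not the MLlib API, and rows are modeled as simple dictionaries.

```python
from collections import Counter


class AddInteraction:
    """Feature stage: adds a column that is True when two columns match."""

    def __init__(self, output, left, right):
        self.output, self.left, self.right = output, left, right

    def fit(self, rows):
        return self  # nothing to learn, so the stage is its own fitted form

    def transform(self, rows):
        return [dict(r, **{self.output: r[self.left] == r[self.right]}) for r in rows]


class MajorityModel:
    """Fitted model: predicts the same label for every row."""

    def __init__(self, prediction):
        self.prediction = prediction

    def transform(self, rows):
        return [dict(r, prediction=self.prediction) for r in rows]


class MajorityClassifier:
    """Toy estimator standing in for a real learner: learns the most common label."""

    def __init__(self, label):
        self.label = label

    def fit(self, rows):
        counts = Counter(r[self.label] for r in rows)
        return MajorityModel(counts.most_common(1)[0][0])


class PipelineModel:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    def __init__(self, *stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows)
            rows = model.transform(rows)  # later stages see earlier outputs
            fitted.append(model)
        return PipelineModel(fitted)
```

Because the fitted pipeline replays the same stages on new rows, the feature extraction applied at training time is applied identically at test time, which is part of what makes workflow construction less error-prone.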

Spark Certification
Apache Spark developer certificate program
• http://www.oreilly.com/go/sparkcert
• Defined by Spark experts @Databricks
• Assessed by O’Reilly Media

www.spark-summit.org

Reynold Xin
