Slide 1

Slide 1 text

“Simple things simple, complex things possible.” - Alan Kay

Slide 2

Slide 2 text

Apache Spark: Unified Platform for Big Data Reynold Xin, Databricks

Slide 3

Slide 3 text

Apache Spark: Unified Platform for Big Data Reynold Xin, Databricks [Image: the tower at Berkeley campus]

Slide 4

Slide 4 text

Founded last year by creators of Apache Spark. Building a cloud platform to drastically simplify Big Data.

Slide 5

Slide 5 text

“Simple things simple, complex things possible.” - Alan Kay

Slide 6

Slide 6 text

Big Data Space Today

Slide 7

Slide 7 text

Big Data Space Today
Zoo of tools to learn, deploy, connect, and maintain.
“Simple things complex, complex things impossible!”

Slide 8

Slide 8 text

“Simple things simple, complex things possible.” – Alan Kay

Slide 9

Slide 9 text

Looking for the “lingua franca” platform for Big Data

Slide 10

Slide 10 text

Why a Platform Matters
Good for developers: one system to learn
Good for users: take apps anywhere
Good for distributors: more applications

Slide 11

Slide 11 text

Apache Spark “The Lingua franca for Big Data” — Eric Baldeschwieler

Slide 12

Slide 12 text

Apache Spark “The Lingua franca for Big Data” — Eric Baldeschwieler
1. Unified processing platform
2. Rich standard libraries
3. Availability
4. Scalability

Slide 13

Slide 13 text

1. Unified Processing Platform for Big Data
Workloads: batch, interactive, streaming
Runtimes and storage: Hadoop, Cassandra, Mesos, cloud providers, …
Uniform API for diverse workloads over diverse storage systems and runtimes
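The point of the slide is that one API runs unchanged over different workloads and sources. As a loose plain-Python analogy (not actual Spark code), the same pipeline function can consume an in-memory list (a "batch" source) or a lazily produced generator (a "stream" source), because the logic is written against the abstraction rather than the backend:

```python
def pipeline(records):
    # identical logic regardless of where the records come from:
    # drop empty records, then sum the lengths of the rest
    return sum(len(r) for r in records if r)

batch = ["a", "bb", "", "ccc"]         # in-memory "batch" source
stream = (s for s in ["dd", "e"])      # lazily yielded "stream" source

print(pipeline(batch))   # 6
print(pipeline(stream))  # 3
```

Spark's RDD API works the same way: the transformations a program expresses are decoupled from whether the data lives in HDFS, Cassandra, or arrives as a stream.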

Slide 14

Slide 14 text

2. Standard Library for Big Data
Big Data apps lack libraries of common algorithms. Spark’s generality and support for multiple languages make it suitable to offer this.
Components: Core, SQL, ML, graph, …
Languages: Python, Scala, Java

Slide 15

Slide 15 text

Spark MLlib
One of the most actively developed libraries.
• classification: logistic regression, linear support vector machine (SVM), naive Bayes, classification tree
• regression: generalized linear models (GLMs), regression tree
• collaborative filtering: alternating least squares (ALS)
• clustering: k-means
• decomposition: singular value decomposition (SVD), principal component analysis (PCA)
• statistics: summary statistics
• evaluation: binary classification
• optimization: gradient descent, L-BFGS
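To make one of the listed algorithms concrete: MLlib implements k-means as a distributed computation, but the core iteration (Lloyd's algorithm) is simple. Below is a minimal single-machine sketch on 1-D points, purely illustrative and not MLlib's implementation:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm for k-means on 1-D points (single machine)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # assignment step: attach each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # update step: move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
print(kmeans(data, 2))  # two centers, near 1.0 and 9.0
```

MLlib distributes the assignment step across partitions and aggregates the per-cluster sums and counts to update the centers, but the algorithm is the same.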

Slide 16

Slide 16 text

Spark MLlib 1.1 (Aug/Sep 2014)
Improvements:
• standardized interfaces: dataset, algorithm, params, and model
• multi-model training
• multiclass support for classification tree
• Java/Python APIs for decision trees
• SVD via Lanczos
• standardized text format for training data
New:
• statistics: stratified sampling, linear/rank correlation, hypothesis testing
• non-negative matrix factorization (NMF)
• preprocessing: tf-idf
• evaluation: multiclass metrics
• online model updates with streaming
• and your contribution!

Slide 17

Slide 17 text

3. Available Everywhere
All major Hadoop distributions include Spark, and Spark also runs beyond Hadoop.

Slide 18

Slide 18 text

4. Fault-tolerance & Scalability
Cluster size: runs in production clusters with 1000+ nodes.
Data size: processes data many times the size of memory.
- petabyte sort on lots of machines
- or handle tons of data on a few machines (sorting 5 TB compressed per node; the file system broke before Spark did)

Slide 19

Slide 19 text

A Simple Demo
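The transcript does not record what was demoed. The canonical introductory Spark demo is a word count; as a hedged plain-Python sketch of the same shape (Spark's `flatMap` to split lines into words, then `reduceByKey` to count), not the actual demo or actual Spark code:

```python
from collections import Counter
from itertools import chain

lines = ["simple things simple", "complex things possible"]

# flatMap equivalent: split each line into words
words = chain.from_iterable(line.split() for line in lines)
# map + reduceByKey equivalent: count occurrences of each word
counts = Counter(words)

print(counts["simple"])  # 2
print(counts["things"])  # 2
```

In Spark the same program is a few lines of `textFile(...).flatMap(...).map(...).reduceByKey(...)`, and runs unchanged from a laptop to a large cluster.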

Slide 20

Slide 20 text

Spark is the most active open-source Big Data project by:
- active contributors (300+)
- commits
- lines of code changed

Slide 21

Slide 21 text

Let’s work towards “Simple things simple, complex things possible.” – Alan Kay

Slide 22

Slide 22 text

And of course, we are hiring! [email protected]