
ScalaDaysBerlin - Machine learning on scala 2016


An analysis of the different types of programmers who may be coming to Scala/Fast Data and what their strengths are. The Scala community will have to adapt to the newcomers, facing problems similar to those Europe faces in the migration crisis.


Jose Quesada

June 16, 2016

Transcript

  1. Machine Learning with Scala on Spark. Jose Quesada, David Anderson, Data Science Retreat. @quesada, @alpinegizmo, @datascienceret
  2. Machine learning is a subfield of computer science that deals with systems that can learn from data, rather than follow explicitly programmed instructions.
  3. Time vs. VC pitch: 90s: "… it's like X, but on the web"; 00s: "… it's like X, but social"; 10s; 2016
  4. Time vs. VC pitch: 90s: "… it's like X, but on the web"; 00s: "… it's like X, but social"; 10s: "… it's like X, but on mobile"; 2016
  5. Time vs. VC pitch: 90s: "… it's like X, but on the web"; 00s: "… it's like X, but social"; 10s: "… it's like X, but on mobile"; 2016: "… it's like X, but with machine learning"
  6. • Mentors are world-class: CTOs, library authors, inventors, founders of fast-growing companies, etc. • DSR accepts fewer than 5% of applications • Strong focus on commercial awareness • 5 years of working experience on average • 30+ partner companies in Europe
  7. Why is DSR talking about Scala/Spark? IBM is behind this; they hired us to make training materials.
  8. "Scala succeeds at coping with large and fast data where older languages fail." Andy Petrella and Dean Wampler
  9. "R is really important, to the point that it's hard to overvalue it. It allows statisticians to do very intricate and complicated analyses without knowing the blood and guts of computing systems." Daryl Pregibon, a research scientist at Google
  10. "Spark will inevitably become the de-facto Big Data framework for Machine Learning and Data Science." Dean Wampler, Lightbend
  11. Spark is growing the Scala community faster than anything else: 50% of the newcomers are here because of Spark.
  12. • 25 videos; there will be 75 by July 15th • Assumes familiarity with machine learning • Explains the 'how' but not the 'why'
  13. Teaching Scala to people who come from another language: 1. Hire a person 2. Have them learn Scala in a sink-or-swim situation
  14. Mindset changes that are very 'uncomfortable' for engineers • In machine learning, there's no spec • In ML, when something doesn't work, it's very hard to know whether it ever will
  15. Doing machine learning in Scala/Spark vs. other frameworks: ease of use, productivity, feature set, and performance
  16. How Spark killed Hadoop MapReduce • Far easier to program • More cost-effective, since less hardware can perform the same tasks much faster • Can do real-time processing as well as batch processing • Can do ML and graphs
  17. Machine learning with Spark • Spark was designed for ML workloads • Caching (reuse data) • Accumulators (keep state across iterations) • Functional, lazy, fault-tolerant • Many popular algorithms are supported out of the box • Simple to productionize models • MLlib is RDD-based (the past); spark.ml is DataFrame-based (the future)
  18. Spark is an ecosystem of ML frameworks • Spark was designed by people who understood the needs of ML practitioners (unlike Hadoop) • MLlib • Spark.ml • System.ml (IBM) • Keystone.ml
  19. Spark.ML: the basics
      • DataFrame: ML requires DFs holding vectors
      • Transformer: transforms one DF into another
      • Estimator: fit on a DF; produces a Transformer
      • Pipeline: chain of transformers and estimators
      • Parameter: there is a unified API for specifying parameters
      • Evaluator: scores a model's predictions against a metric
      • CrossValidator: model selection via grid search
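The concepts above can be sketched as a small spark.ml program (a hedged sketch against the Spark 2.x API; the column names and the `trainingDF`/`testDF` DataFrames are hypothetical, not from the talk):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

val indexer = new StringIndexer()              // Estimator: fit() learns the index
  .setInputCol("color")
  .setOutputCol("color_index")

val assembler = new VectorAssembler()          // Transformer: gathers features
  .setInputCols(Array("color_index"))
  .setOutputCol("features")

val lr = new LogisticRegression()              // Estimator
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxIter(10)                              // Parameter, via the unified API

// A Pipeline chains the stages; fitting it on a DataFrame yields a
// PipelineModel, which is itself a Transformer.
val pipeline = new Pipeline().setStages(Array(indexer, assembler, lr))
val model = pipeline.fit(trainingDF)
val predictions = model.transform(testDF)
```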
  20. Q: Hardest scaling problem in data science? A: Adding people
      • Spark.ml has a clean architecture and APIs that should encourage code sharing and reuse
      • Good first step: can you refactor some ETL code as a Transformer?
      • Don't see much sharing of components happening yet
      • Entire libraries, yes; components, not so much
      • Perhaps because Spark has been evolving so quickly
      • E.g., a pull request implementing non-linear SVMs has been stuck for a year
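Refactoring ETL code as a Transformer, as the slide suggests, might look like this minimal sketch (a hypothetical stage that trims a string column; real stages declare a `Param[String]` for the column name, omitted here for brevity):

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.trim
import org.apache.spark.sql.types.StructType

// Wraps a small ETL step so it can be reused inside any Pipeline.
class ColumnTrimmer(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("columnTrimmer"))

  var inputCol: String = "text"  // simplified; idiomatic stages use Param[String]

  override def transform(df: Dataset[_]): DataFrame =
    df.withColumn(inputCol, trim(df(inputCol)))

  // Trimming does not change the schema.
  override def transformSchema(schema: StructType): StructType = schema

  override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
}
```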
  21. Structured types in Spark

                          SQL        DataFrames     DataSets (Java/Scala only)
      Syntax errors       Runtime    Compile time   Compile time
      Analysis errors     Runtime    Runtime        Compile time
  22. Indexing categorical features • You are responsible for identifying and indexing categorical features

      val rfcd_indexer = new StringIndexer()
        .setInputCol("color")
        .setOutputCol("color_index")
        .fit(dataset)

      val seo_indexer = new StringIndexer()
        .setInputCol("status")
        .setOutputCol("status_index")
        .fit(dataset)
  23. Assembling features • You must gather all of your features into one Vector, using a VectorAssembler

      val assembler = new VectorAssembler()
        .setInputCols(Array("color_index", "status_index", ...))
        .setOutputCol("features")
  24. Spark.ml vs. scikit-learn: Pipelines (good news!)
      • Spark ML and scikit-learn: same approach
      • Chain together Estimators and Transformers
      • Support non-linear pipelines (must be a DAG)
      • Unify parameter passing
      • Support for cross-validation and grid search
      • Can write your own custom pipeline stages
      Spark.ml, just like scikit-learn
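The cross-validation and grid search mentioned above can be sketched as follows (a hedged sketch assuming an `lr` LogisticRegression stage and a `pipeline` built as on the earlier slides; the grid values are illustrative):

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// 3 x 3 = 9 candidate parameter settings for the lr stage.
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)          // the whole pipeline is the estimator
  .setEvaluator(new BinaryClassificationEvaluator())  // default metric: areaUnderROC
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)                  // 9 candidates x 3 folds = 27 fits

val cvModel = cv.fit(trainingDF)   // trainingDF: a labeled DataFrame
val best = cvModel.bestModel       // the best PipelineModel found
```

Because the estimator is the whole pipeline, the grid search re-fits the feature stages per fold, avoiding leakage between folds; this mirrors scikit-learn's GridSearchCV over a Pipeline.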
  25. Spark.ml vs. scikit-learn: NLP tasks (thumbs up)

      Transformer          Description                                       scikit-learn
      Binarizer            Threshold numerical feature to binary             Binarizer
      Bucketizer           Bucket numerical features into ranges
      ElementwiseProduct   Scale each feature/column separately
      HashingTF            Hash text/data to vector; scale by term frequency FeatureHasher
      IDF                  Scale features by inverse document frequency      TfidfTransformer
      Normalizer           Scale each row to unit norm                       Normalizer
      OneHotEncoder        Encode k-category feature as binary features      OneHotEncoder
      PolynomialExpansion  Create higher-order features                      PolynomialFeatures
      RegexTokenizer       Tokenize text using regular expressions           (part of text methods)
      StandardScaler       Scale features to 0 mean and/or unit variance     StandardScaler
      StringIndexer        Convert String feature to 0-based indices         LabelEncoder
      Tokenizer            Tokenize text on whitespace                       (part of text methods)
      VectorAssembler      Concatenate feature vectors                       FeatureUnion
      VectorIndexer        Identify categorical features, and index
      Word2Vec             Learn vector representation of words
  26. Graph stuff (GraphX, GraphFrames: not great) • Extremely easy to run monster algorithms in a cluster • GraphFrames are cool, and should provide a better interface to the graph tools • In practice, it didn't work too well
  27. Things we liked in Spark ML
      • Architecture encourages building reusable pieces
      • Type safety, plus types are driving optimizations
      • Model fitting returns an object that transforms the data
      • Uniform way of passing parameters
      • It's interesting to use the same platform for ETL and model fitting
      • Very easy to parallelize ETL and grid search, or work with huge models
  28. Disappointments using Spark ML
      • Feature indexing and assembly can become tedious
      • Surprised by the maximum depth limit for trees: 30
      • Data exploration and visualization aren't easy in Scala
      • Wish list: non-linear SVMs, deep learning (but see Deeplearning4j)
  29. Reminder: 25 videos explaining ML on Spark • For people who already know ML • http://datascienceretreat.com/videos/data-science-with-scala-and-spark