
ScalaDaysBerlin - Machine learning on scala 2016


An analysis of the different types of programmers who may be coming to Scala/Fast Data and what their strengths are. The Scala community will have to adapt to the newcomers, facing problems similar to those Europe faces in the migration crisis.


Jose Quesada

June 16, 2016

Transcript

  1. Machine Learning with Scala on Spark. Jose Quesada, David Anderson, Data Science Retreat. @quesada, @alpinegizmo, @datascienceret
  2. Machine learning is a subfield of computer science that deals with systems that can learn from data, rather than follow explicitly programmed instructions.
  3. Time vs. VC pitch: 90s: "… it's like X, but on the web"; 00s: "… it's like X, but social"; 10s; 2016
  4. Time vs. VC pitch: 90s: "… it's like X, but on the web"; 00s: "… it's like X, but social"; 10s: "… it's like X, but on mobile"; 2016
  5. Time vs. VC pitch: 90s: "… it's like X, but on the web"; 00s: "… it's like X, but social"; 10s: "… it's like X, but on mobile"; 2016: "… it's like X, but with machine learning"
  6. • Mentors are world-class: CTOs, library authors, inventors, founders of fast-growing companies, etc. • DSR accepts fewer than 5% of applications • Strong focus on commercial awareness • 5 years of working experience on average • 30+ partner companies in Europe
  7. Why is DSR talking about Scala/Spark? IBM is behind this; they hired us to make training materials.
  8. "Scala succeeds at coping with large and fast data where older languages fail." Andy Petrella and Dean Wampler
  9. "R is really important, to the point that it's hard to overvalue it. It allows statisticians to do very intricate and complicated analyses without knowing the blood and guts of computing systems." Daryl Pregibon, a research scientist at Google
  10. "Spark will inevitably become the de-facto Big Data framework for Machine Learning and Data Science." Dean Wampler, Lightbend
  11. Spark is growing the Scala community faster than anything else: 50% of the newcomers are here because of Spark.
  12. • 25 videos; there will be 75 by July 15th • Assumes familiarity with machine learning • Explains the 'how' but not the 'why'
  13. Teaching Scala to people who come from another language: 1. Hire a person 2. Have them learn Scala in a sink-or-swim situation
  14. Mindset changes that are very 'uncomfortable' for engineers • In machine learning, there's no spec • In ML, when something doesn't work, it's very hard to know whether it ever will
  15. Doing machine learning in Scala/Spark vs. other frameworks: ease of use, productivity, feature set, and performance
  16. How Spark killed Hadoop MapReduce • Far easier to program • More cost-effective, since less hardware can perform the same tasks much faster • Can do real-time processing as well as batch processing • Can do ML and graphs
  17. Machine learning with Spark • Spark was designed for ML workloads • Caching (reuse data) • Accumulators (keep state across iterations) • Functional, lazy, fault-tolerant • Many popular algorithms are supported out of the box • Simple to productionize models • MLlib is RDD-based (the past); spark.ml is DataFrame-based (the future)
  18. Spark is an ecosystem of ML frameworks • Spark was designed by people who understood the needs of ML practitioners (unlike Hadoop) • MLlib • Spark.ml • System.ml (IBM) • Keystone.ml
  19. Spark.ML: the basics
      • DataFrame: ML requires DFs holding vectors
      • Transformer: transforms one DF into another
      • Estimator: fit on a DF; produces a Transformer
      • Pipeline: chain of transformers and estimators
      • Parameter: there is a unified API for specifying parameters
      • Evaluator: scores a model's predictions against a metric
      • CrossValidator: model selection via grid search
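The concepts above can be sketched as a small spark.ml program (a hedged sketch against the Spark 2.x API; the column names and the `trainingDF`/`testDF` DataFrames are hypothetical, not from the talk):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

val indexer = new StringIndexer()              // Estimator: fit() learns the index
  .setInputCol("color")
  .setOutputCol("color_index")

val assembler = new VectorAssembler()          // Transformer: gathers features
  .setInputCols(Array("color_index"))
  .setOutputCol("features")

val lr = new LogisticRegression()              // Estimator
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxIter(10)                              // Parameter, via the unified API

// A Pipeline chains the stages; fitting it on a DataFrame yields a
// PipelineModel, which is itself a Transformer.
val pipeline = new Pipeline().setStages(Array(indexer, assembler, lr))
val model = pipeline.fit(trainingDF)
val predictions = model.transform(testDF)
```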
  20. Q: Hardest scaling problem in data science? A: Adding people
      • Spark.ml has a clean architecture and APIs that should encourage code sharing and reuse
      • Good first step: can you refactor some ETL code as a Transformer?
      • Don't see much sharing of components happening yet
      • Entire libraries, yes; components, not so much
      • Perhaps because Spark has been evolving so quickly
      • E.g., a pull request implementing non-linear SVMs has been stuck for a year
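Refactoring ETL code as a Transformer, as the slide suggests, might look like this minimal sketch (a hypothetical stage that trims a string column; real stages declare a `Param[String]` for the column name, omitted here for brevity):

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.trim
import org.apache.spark.sql.types.StructType

// Wraps a small ETL step so it can be reused inside any Pipeline.
class ColumnTrimmer(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("columnTrimmer"))

  var inputCol: String = "text"  // simplified; idiomatic stages use Param[String]

  override def transform(df: Dataset[_]): DataFrame =
    df.withColumn(inputCol, trim(df(inputCol)))

  // Trimming does not change the schema.
  override def transformSchema(schema: StructType): StructType = schema

  override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
}
```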
  21. Structured types in Spark

                          SQL        DataFrames     DataSets (Java/Scala only)
      Syntax errors       Runtime    Compile time   Compile time
      Analysis errors     Runtime    Runtime        Compile time
  22. Indexing categorical features • You are responsible for identifying and indexing categorical features

      val rfcd_indexer = new StringIndexer()
        .setInputCol("color")
        .setOutputCol("color_index")
        .fit(dataset)

      val seo_indexer = new StringIndexer()
        .setInputCol("status")
        .setOutputCol("status_index")
        .fit(dataset)
  23. Assembling features • You must gather all of your features into one Vector, using a VectorAssembler

      val assembler = new VectorAssembler()
        .setInputCols(Array("color_index", "status_index", ...))
        .setOutputCol("features")
  24. Spark.ml vs. scikit-learn: Pipelines (good news!)
      • Spark ML and scikit-learn: same approach
      • Chain together Estimators and Transformers
      • Support non-linear pipelines (must be a DAG)
      • Unify parameter passing
      • Support for cross-validation and grid search
      • Can write your own custom pipeline stages
      Spark.ml, just like scikit-learn
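The cross-validation and grid search mentioned above can be sketched as follows (a hedged sketch assuming an `lr` LogisticRegression stage and a `pipeline` built as on the earlier slides; the grid values are illustrative):

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// 3 x 3 = 9 candidate parameter settings for the lr stage.
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)          // the whole pipeline is the estimator
  .setEvaluator(new BinaryClassificationEvaluator())  // default metric: areaUnderROC
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)                  // 9 candidates x 3 folds = 27 fits

val cvModel = cv.fit(trainingDF)   // trainingDF: a labeled DataFrame
val best = cvModel.bestModel       // the best PipelineModel found
```

Because the estimator is the whole pipeline, the grid search re-fits the feature stages per fold, avoiding leakage between folds; this mirrors scikit-learn's GridSearchCV over a Pipeline.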
  25. Spark.ml vs. scikit-learn: NLP tasks (thumbs up)

      Transformer          Description                                       scikit-learn
      Binarizer            Threshold numerical feature to binary             Binarizer
      Bucketizer           Bucket numerical features into ranges
      ElementwiseProduct   Scale each feature/column separately
      HashingTF            Hash text/data to vector; scale by term frequency FeatureHasher
      IDF                  Scale features by inverse document frequency      TfidfTransformer
      Normalizer           Scale each row to unit norm                       Normalizer
      OneHotEncoder        Encode k-category feature as binary features      OneHotEncoder
      PolynomialExpansion  Create higher-order features                      PolynomialFeatures
      RegexTokenizer       Tokenize text using regular expressions           (part of text methods)
      StandardScaler       Scale features to 0 mean and/or unit variance     StandardScaler
      StringIndexer        Convert String feature to 0-based indices         LabelEncoder
      Tokenizer            Tokenize text on whitespace                       (part of text methods)
      VectorAssembler      Concatenate feature vectors                       FeatureUnion
      VectorIndexer        Identify categorical features, and index
      Word2Vec             Learn vector representation of words
  26. Graph stuff (GraphX, GraphFrames: not great) • Extremely easy to run monster algorithms in a cluster • GraphFrames are cool, and should provide a better interface to the graph tools • In practice, it didn't work too well
  27. Things we liked in Spark ML
      • Architecture encourages building reusable pieces
      • Type safety, plus types are driving optimizations
      • Model fitting returns an object that transforms the data
      • Uniform way of passing parameters
      • It's interesting to use the same platform for ETL and model fitting
      • Very easy to parallelize ETL and grid search, or work with huge models
  28. Disappointments using Spark ML
      • Feature indexing and assembly can become tedious
      • Surprised by the maximum depth limit for trees: 30
      • Data exploration and visualization aren't easy in Scala
      • Wish list: non-linear SVMs, deep learning (but see Deeplearning4j)
  29. Reminder: 25 videos explaining ML on Spark • For people who already know ML • http://datascienceretreat.com/videos/data-science-with-scala-and-spark