
WordPress Blog Posts Recommender in Spark, Scala and the SparkNotebook

Gianmario
November 16, 2015


At the Advanced Data Analytics team at Barclays, we solved the Kaggle competition of recommending WordPress blog posts that users may like, as a proof of concept of how to use Spark, Scala and the Spark Notebook to solve a typical machine learning problem end-to-end.


Transcript

  1. WordPress Blog Posts Recommender in Spark, Scala and the SparkNotebook
     Gianmario Spacagna @gm_spacagna https://github.com/gm-spacagna/wordpress-posts-recommender/
  2. Goal
     —  At the Advanced Data Analytics team at Barclays we solved the Kaggle competition as a proof of concept of how to use Spark, Scala and the Spark Notebook to solve a typical machine learning problem end-to-end.
     —  Topics covered:
        —  DataFrame/RDD conversions and I/O
        —  Exploratory Data Analysis (EDA)
        —  Scalable Feature Engineering
        —  Modelling (MLlib and ML)
        —  End-to-end Evaluation
        —  Agile Methodology
  3. Case Study
     —  Recommending a sequence of WordPress blog posts that the users may like, based on their historical likes and blog/post/author characteristics
     —  https://www.kaggle.com/c/predict-wordpress-likes
  4. Why Scala?
     —  Functional
     —  Native to Spark
     —  Compiled / Type Safe
     —  JVM means reliability and portability
     —  10x Productivity
     —  Great development environments like IntelliJ
     —  REPL shell
  5. Why Spark?
     —  Fast in-memory computation
     —  Interactive
     —  Supports functional programming
     —  Spark shell, although it lacks a visualization tool and embedded story points
  6. What’s next?
     —  Methodology/tools for:
        —  Interactively investigating the data; and
        —  Writing quality code in a productive environment; and
        —  Embedding the developed functions into executable entry points; and
        —  Presenting the results in a clean and visual way; and
        —  Meeting the required acceptance criteria.
     —  AKA: Delivering a Data Science MVP quickly in a fully Agile way!
  7. DataFrame vs RDDs
     —  DataFrame:
        —  Optimized execution plans
        —  Great I/O interfaces supporting many formats (JSON, JDBC, Parquet)
        —  Schema inference from the raw source
        —  Columns defined at run time, no type tags
        —  Can fail at runtime if columns are nonexistent
        —  UDFs must be registered
        —  Ad-hoc Domain Specific Language
        —  Looks more like Python/R
     —  RDD:
        —  Type-safe API
        —  Functional, yep!
        —  Is Scala code!
        —  Can use case classes
        —  Can use anonymous functions without registering them
        —  You have to optimize your execution plan yourself
        —  Lower-level Spark API
        —  Requires a little more engineering
  8. DataFrame to Case Class RDD
     —  org.apache.spark.sql.Row objects can be cast into sequences of Scala classes:

        val anyFieldList: List[Any] = row.toSeq.toList
        val stringByIndex: String = row.getAs[String](3)
        val longByColumnName: Long = row.getAs[Long]("post_id")

     —  Careful with nulls:
        —  df.na.drop() removes rows containing null columns
        —  nested structures must be handled explicitly
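     A minimal sketch of the Row-to-case-class conversion described above; the BlogPost class and its fields are hypothetical, for illustration only:

        import org.apache.spark.sql.Row

        // Hypothetical domain class
        case class BlogPost(postId: Long, title: String)

        def toBlogPost(row: Row): BlogPost = BlogPost(
          postId = row.getAs[Long]("post_id"),
          title = Option(row.getAs[String]("title")).getOrElse("")  // guard against null
        )

        // Drop rows with null columns first, then map each Row into the case class:
        // val posts = df.na.drop().rdd.map(toBlogPost)
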
  9. Raw Data and Domain Specific Data Types
     —  Raw data:
        —  StatsBlog: 6 months of aggregate statistics about each blog (blog_id, num_posts, num_likes).
        —  StatsUser: 6 months of aggregate statistics about each user's like behaviour (user_id, num_likes).
        —  TrainPosts: each line corresponds to one blog post from the training set, together with the array of user_ids who liked it.
     —  Define case classes the way you want to model your application domain data, decoupled from the raw format:
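     A sketch of what such domain case classes could look like, using only the fields named above (the actual classes in the repo may differ):

        case class StatsBlog(blogId: Long, numPosts: Long, numLikes: Long)
        case class StatsUser(userId: Long, numLikes: Long)
        case class TrainPost(blogId: Long, postId: Long, likeUserIds: Seq[Long])
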
  10. Distribution of user num_likes
     —  The histogram Spark API will create uniform bins between min and max
     —  Better to provide your own custom bins
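     A small sketch of both histogram variants on a toy RDD (the bin edges here are illustrative, not the ones used in the deck):

        import org.apache.spark.{SparkConf, SparkContext}

        val sc = new SparkContext(new SparkConf().setAppName("eda").setMaster("local[*]"))
        val numLikes = sc.parallelize(Seq(0.0, 1.0, 3.0, 12.0, 250.0))  // toy per-user like counts

        // histogram(n) builds n uniform bins between min and max
        val (uniformEdges, uniformCounts) = numLikes.histogram(5)

        // For a long-tailed distribution, custom bin edges are more informative
        val edges = Array(0.0, 1.0, 5.0, 20.0, 100.0, 1000.0)
        val counts: Array[Long] = numLikes.histogram(edges)
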
  11. Categories Frequency Distribution
     —  45% of posts in the training set are Uncategorized; maybe treat them separately?
  12. Populate Missing Features (logic)
     —  Monoidal aggregations using the Scalaz |+| operator
     —  Hack for solving serialization issues on MapLike objects
     —  Default value for missing keys
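     A minimal sketch of the |+| merge; the serialization workaround mentioned in the comment is one common approach, not necessarily the exact hack used in the deck:

        import scalaz._, Scalaz._

        // |+| merges the maps monoidally: values under the same key are summed
        val likesA = Map("scala" -> 2, "spark" -> 1)
        val likesB = Map("spark" -> 3, "ml" -> 1)
        val merged = likesA |+| likesB  // Map(scala -> 2, spark -> 4, ml -> 1)

        // withDefaultValue avoids key-existence checks on lookup
        val withDefault = merged.withDefaultValue(0)

        // Note: MapLike wrappers (e.g. from mapValues or withDefaultValue) can break
        // Spark serialization; forcing a plain Map with .map(identity) is one common fix
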
  13. Users Likelihood Maps (complex logic)
     —  RDDs can be iterated in a for-comprehension
     —  Word counts
     —  Take only the top 100 terms
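     A sketch of the kind of for-comprehension word count described above; the Like class and all names are assumptions:

        import org.apache.spark.rdd.RDD

        // Hypothetical shape: each "like" event carries the tags of the liked post
        case class Like(userId: Long, tags: Seq[String])

        // RDDs implement map/flatMap/filter, so they compose in a for-comprehension
        def tagCounts(likes: RDD[Like]): RDD[((Long, String), Int)] = {
          val pairs = for {
            like <- likes
            tag  <- like.tags
          } yield ((like.userId, tag), 1)
          pairs.reduceByKey(_ + _)  // word count per (user, tag)
        }

        // Then keep only the 100 highest-count terms per user
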
  14. Users Likelihood Maps (entry points and simple logic)
     —  Bound the HashMap size to fix the in-memory allocation of the broadcast variable
     —  Anonymous type; maybe better to have a case class
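     A sketch of how the broadcast map size could be bounded; the top-100 cap and all names are assumptions:

        // Cap each user's map to its top-N entries before broadcasting, bounding the
        // in-memory footprint of the broadcast variable
        def cap[K](counts: Map[K, Int], n: Int = 100): Map[K, Int] =
          counts.toSeq.sortBy(-_._2).take(n).toMap

        // val bounded = userTagCounts.mapValues(cap(_)).map(identity)  // identity: see slide 12
        // val likelihoodMapsBc = sc.broadcast(bounded)
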
  15. Features
     —  categoriesLikelihood: probability of the user liking the post, based only on the categories term frequency of likes.
     —  tagsLikelihood: probability of the user liking the post, based only on the tags term frequency of likes.
     —  languageLikelihood: probability of the user liking the post, based only on the language frequency of likes.
     —  authorLikelihood: probability of the user liking the post, based only on the author frequency of likes.
     —  titleLengthMeanError: absolute difference between the title length and the average title length of liked posts.
     —  blogLikelihood: probability of the user liking the post, based only on the blog frequency of likes.
     —  averageLikesPerPost: average number of likes per post of that blog.
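     One way to model a row of this engineered feature set; the field names follow the list above, the class name itself is assumed:

        case class FeatureVector(
          categoriesLikelihood: Double,
          tagsLikelihood: Double,
          languageLikelihood: Double,
          authorLikelihood: Double,
          titleLengthMeanError: Double,
          blogLikelihood: Double,
          averageLikesPerPost: Double,
          label: Boolean  // did the user actually like the post?
        )
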
  16. Like users distribution
     —  We want to balance the number of false records according to the same distribution as the true records
  17. Feature Extraction of false class points (logic)
     —  For each blog post, generate a sample of user ids drawn from the users who have not liked that post:
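     A minimal sketch of such negative sampling; the sample size k and the exact sampling scheme are assumptions, not taken from the deck:

        import scala.util.Random

        // For each post, draw k users uniformly from those who did NOT like it
        def sampleNonLikers(allUserIds: Set[Long], likers: Set[Long], k: Int): Seq[Long] =
          Random.shuffle((allUserIds -- likers).toSeq).take(k)
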
  18. Feature Extraction of false class points (entry points)
     —  Training user id set from which to sample
     —  As we expect, many of them have tags/categories likelihoods close to 0
     —  Likelihood broadcast maps, the same as for the true class points
  19. Test Features
     —  Grab the set of user ids to test and the blog posts to recommend
     —  For each blog post, pre-filter only the users that have liked at least one of the post's tags at least once
     —  Get the final test features just like the training ones
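     A sketch of that per-post pre-filter; the names and map shape are assumed for illustration:

        // Keep, per post, only candidate users who have liked at least one of its tags
        def candidateUsers(postTags: Set[String],
                           userTagLikes: Map[Long, Map[String, Int]]): Set[Long] =
          userTagLikes.collect {
            case (userId, tagCounts) if postTags.exists(tagCounts.contains) => userId
          }.toSet
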
  20. Final Feature Vectors
     —  Map the case class feature dataset into a pair of label and vector: RDD[(Double, Vector)]
     —  Concatenate the two feature RDDs (true ++ false)
     —  Persist to disk, or Tachyon ;)
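     A sketch of that mapping, reusing the hypothetical FeatureVector from slide 15:

        import org.apache.spark.mllib.linalg.{Vector, Vectors}

        def toLabeledPair(f: FeatureVector): (Double, Vector) =
          (if (f.label) 1.0 else 0.0,
           Vectors.dense(f.categoriesLikelihood, f.tagsLikelihood, f.languageLikelihood,
                         f.authorLikelihood, f.titleLengthMeanError, f.blogLikelihood,
                         f.averageLikesPerPost))

        // val all = trueFeatures.map(toLabeledPair) ++ falseFeatures.map(toLabeledPair)
        // all.saveAsObjectFile("feature-vectors")  // persist to disk
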
  21. Reading back the Feature Vectors into DataFrame (Training)
     —  Filter out invalid feature values; they are not going to work in our model
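     One plausible definition of "invalid" is NaN or infinite values (the deck does not spell out the exact filter):

        import org.apache.spark.mllib.linalg.Vector

        def isValid(labelAndVector: (Double, Vector)): Boolean =
          labelAndVector._2.toArray.forall(v => !v.isNaN && !v.isInfinite)

        // val clean = all.filter(isValid)
        // val trainingDF = sqlContext.createDataFrame(clean).toDF("label", "features")
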
  22. TagLikelihood Recommender
     —  Return an RDD of (userId, postId, rawLikeScore), where the rawLikeScore is anything between 0 and 1
     —  Return the binary classification metrics from MLlib, where the score is index 1 of the feature vector, ergo the Tag Likelihood
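     A sketch of evaluating that single-feature baseline with MLlib's BinaryClassificationMetrics; function and variable names are assumptions:

        import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
        import org.apache.spark.mllib.linalg.Vector
        import org.apache.spark.rdd.RDD

        // Score every point by one single feature (index 1 = tagsLikelihood here)
        // and evaluate that score as a baseline ranker
        def tagLikelihoodMetrics(data: RDD[(Double, Vector)]): BinaryClassificationMetrics = {
          val scoreAndLabels = data.map { case (label, features) => (features(1), label) }
          new BinaryClassificationMetrics(scoreAndLabels)
        }

        // tagLikelihoodMetrics(clean).areaUnderROC()
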
  23. VizUtils Pimps
     —  We created some custom Pimps to help with the built-in visualization of the Notebook
     —  Given an Array[(Double, Double)] representing (X, Y) points:
        —  interpolateLinear(n: Int) = return an array of n points with the x values sampled uniformly
        —  interpolatePercentiles(n: Int) = return an array of n points with the x values corresponding to the percentiles
        —  roundX(nDigits: Int) = round the X values to at most n digits
     —  https://github.com/gm-spacagna/wordpress-posts-recommender/blob/master/src/main/scala/wordpressworkshop/Pimps.scala
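     The "Pimp my library" pattern in a nutshell: one possible shape for roundX (the actual implementation lives in Pimps.scala in the repo and may differ):

        implicit class PointsOps(val points: Array[(Double, Double)]) extends AnyVal {
          def roundX(nDigits: Int): Array[(Double, Double)] = points.map { case (x, y) =>
            val factor = math.pow(10, nDigits)
            (math.round(x * factor) / factor, y)
          }
        }

        // Array((0.12345, 1.0)).roundX(2)  // Array((0.12, 1.0))
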
  24. Limitations
     —  We ran our experiments in Spark local mode, since the size of the data was small enough, but the implementation is such that it can scale to any size and any cluster.
     —  We did not leverage the DataFrame optimizations for the engineering part; we preferred the flexibility and functionality of classic RDDs.
     —  Feature independence was not verified; for example, we expect the tagsLikelihood and categoriesLikelihood features to be highly correlated. PCA or similar techniques could have been adopted to make the solution more reliable.
     —  Whilst the training set was balanced, the test set contained many more false records.
     —  The whole analysis/modelling was performed statically, without considering timestamps or sequences of events.
     —  We have not compared how Zeppelin compares to the SparkNotebook in terms of visualizations.
  25. Final Goal
     —  The goal was not proving the correctness of the solution but showing how easily you can implement an end-to-end, scalable, production-quality solution for a typical data science problem by leveraging Spark, Scala and the SparkNotebook.
  26. Lessons Learnt (Spark, DataFrame, RDDs)
     —  DataFrame is great for I/O and schema inference from the sources, and when you have flat schemas. Operations start to get more complicated with nested and array fields.
     —  RDD gives you the flexibility of doing your ETL using the richness of the Scala framework; on the other hand, you must be careful about optimizing your execution plans. Functional programming allowed us to express complex logic with simple, clear code, free of side effects.
     —  Map joins with broadcast maps are very efficient, but we need to make sure to minimize their size before broadcasting, e.g. by filtering out the unmatched keys before the join, or by capping the size of each value in the case of variable-size structures (e.g. hash maps).
  27. Lessons Learnt (ML, MLlib)
     —  ETL and feature engineering are the most time-consuming part; once you have obtained the data you want in vector format, you can convert back to DataFrame and use the ML APIs.
     —  ML unfortunately does not wrap everything available in MLlib; sometimes you have to convert back to RDD[LabeledPoint] or RDD[(Double, Vector)] in order to use the MLlib features (e.g. evaluation metrics).
     —  The ML pipeline API (Transformer, Estimator, Evaluator) seems cool, but for an MVP it is a premature abstraction.
  28. Lessons Learnt (Modelling)
     —  Do not underestimate simple solutions. In the worst case they serve as a baseline for benchmarking.
     —  Even though the Logistic Regression was better at classifying true or false, the simple model outperformed it when running the end-to-end ranking evaluation.
     —  Focus on solving problems rather than on models or algorithms. Many data science problems can be solved with counts and divisions, e.g. Naïve Bayes.
     —  Logistic Regression “raw scores” are NOT probabilities; treat them carefully!
  29. Lessons Learnt (Notebook)
     —  SparkNotebook is good for EDA and as an entry point for calling APIs and presenting results.
     —  Developing in the notebook is not very productive: the more code you write, the harder it becomes to track and refactor previously developed code.
     —  Better to write code in IntelliJ and then either pack it into a fat jar and import it from the notebook, or copy and paste it each time into a dedicated notebook cell.
     —  In order to keep normal notebook cells clean, they should not contain more than 4-5 lines of code or complex logic; ideally they should just contain queries in the form of functional processing, and entry points into a logic API.
  30. Lessons Learnt (Visualization)
     —  Plotting in the notebook with the built-in visualization is handy but very rudimentary: it can only visualize 25 points, so we created a Pimp to take any Array[(Double, Double)] and interpolate its values down to 25 points.
     —  Tip: when you visualize a Scala Map with Double keys in the range 0.0 to 1.0, the take(25) method will already return uniform samples in that range, and since the x-axis is numerical, the built-in visualization will automatically sort it for you.
     —  We should probably have investigated advanced libraries like Bokeh or D3 that are already supported in the Notebook.
  31. Follow-up Links
     —  GitHub page and source code: https://github.com/gm-spacagna/wordpress-posts-recommender/
     —  Manifesto for Agile Data Science: www.datasciencemanifesto.org
     —  The complete 18 steps to start a new Agile Data Science project: https://datasciencevademecum.wordpress.com/2015/11/12/the-complete-18-steps-to-start-a-new-agile-data-science-project/