Viadeos Segmentation Platform with Spark on Mesos

Eugen Cepoi
September 10, 2014

Transcript

  1. Two words on Spark
     - A general-purpose distributed computing framework
     - Fast, testable, easy to code and deploy
     - A strong ecosystem is being built around it
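
     As a rough illustration of the "fast, easy to code" point (not from the deck), a minimal
     Spark job counting members per country might look like the sketch below; the input path,
     CSV layout and object name are assumptions.

       import org.apache.spark.{SparkConf, SparkContext}
       import org.apache.spark.SparkContext._   // pair-RDD implicits on pre-1.3 Spark

       object CountByCountry {
         def main(args: Array[String]): Unit = {
           val sc = new SparkContext(new SparkConf().setAppName("count-by-country"))
           // Hypothetical CSV export: memberId,country,...
           val counts = sc.textFile("hdfs:///sqoop/Member/20140101/*")
             .map(_.split(","))
             .map(cols => (cols(1), 1L))
             .reduceByKey(_ + _)
           counts.collect().foreach { case (country, n) => println(s"$country -> $n") }
           sc.stop()
         }
       }
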
  2. Two words on Mesos
     - A cluster manager responsible for sharing resources (CPU & RAM) between applications
     - You can write your own framework for Mesos, to run new kinds of applications
  3. Spark and Mesos at Viadeo
     - Started using them together in mid-2013
     - We adopted Spark mainly for its usability and because it was a good match for our data set sizes, letting us take full advantage of its speed
     - Mesos was the most logical way to run Spark and Hadoop jobs (and other kinds of software) on the same nodes with dynamic resource sharing
  4. Test, build, package, deploy and run: deploying Spark jobs on Mesos
     - Diagram: driver nodes carrying the Mesos lib, the Spark shell, the Spark driver and the Debian-packaged job code
     - A driver node is where we deploy the code and from where we launch the jobs on the cluster
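
     A hedged sketch of what pointing a Spark driver at a Mesos master can look like from such a
     driver node; the master URL, executor URI and memory values below are placeholders, not
     Viadeo's actual settings.

       import org.apache.spark.{SparkConf, SparkContext}

       // Run the driver on the driver node and the executors on Mesos
       val conf = new SparkConf()
         .setMaster("mesos://mesos-master.example.com:5050")          // Mesos master instead of a standalone Spark master
         .setAppName("segmentation-job")
         .set("spark.executor.uri", "hdfs:///dist/spark-1.0.2.tgz")   // Spark tarball the Mesos executors download
         .set("spark.executor.memory", "4g")

       val sc = new SparkContext(conf)
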
  5. Customer segmentation
     - Divide a population into subsets of customers that share common characteristics
     - Examples of segments: gender, industry, working in a big company…
     - Used to achieve fine-grained targeting (ads, new products, understanding the customer)
     - Example: send IBM ads to all male customers older than 40 working in IT
  6. The problem
     - The business always needs new segments or wants to combine them
     - The segmentation was computed through SQL queries on demand
     - The raw data needs to be preprocessed (cleaned and computed)
     - Many segments involve computations on the attributes they use, which are too expensive on MySQL
     - No way for non-IT employees to get the segmentation themselves
  7. The goal
     - Provide a solution that can compute even segments over complex data in reasonable time
     - Make them available to software components and humans (e.g. ad targeting, A/B testing, the BI team)
     - Expose a service that can answer segmentation queries in real time:
       • Get live counters and all the members belonging to a combination of segments
       • A front-end allowing non-IT people to combine segments and query them live
  8. The pragmatic solution
     - We don't have the time to build a real data standardization layer
     - The business team doing ad-hoc segmentation has "cleaning rules" and knows the MySQL tables
     - We can't spend time changing segment definitions
     - We have conventions
     The idea: delegate
     - Implement the segment definitions as an SQL-like DSL that does joins implicitly and takes advantage of the conventions
     - Let the business team create the segment definitions and define the cleaning rules inside the segment definition
  9. The big picture
     - Daily MySQL exports (Member, Skills, Company) are sqooped to HDFS and stored with conventions:
       /sqoop/Member/20140101/*
       /sqoop/Skills/201401/02/*
       /sqoop/Company/20140101/*
       ...
     - The segmentation job reads the segment definitions and these sources and produces the members and their segments
     - A real-time service (inverted index + query app) exposes the result
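
     A tiny sketch of how those path conventions can be exploited to locate a given day's export
     for each source; the helper itself is hypothetical, only the layout comes from the slide.

       // /sqoop/<Source>/<date>/* holds the daily MySQL export of <Source>
       def sqoopPath(source: String, date: String): String =
         s"/sqoop/$source/$date/*"

       val memberPath  = sqoopPath("Member",  "20140101")   // /sqoop/Member/20140101/*
       val companyPath = sqoopPath("Company", "20140101")   // /sqoop/Company/20140101/*
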
  10. The segment definition
      - A DSL focused on expressing constraints on the data
      - Doesn't require the user to write JOINs; we infer them from the primary keys and the column names (remember, conventions)
      - A segment example:
        • Define some variables (e.g. patterns you want to see in the data):
          executiveKeywords: ["Dir.", "Resp.", Directeur, Directrice, Director, Dirigeant, Manager, Responsable, Chef, Chief, Head of]
        • The segment itself (reusing the available variables):
          Member.Headline = {executiveKeywords}
          or Position.StillInPosition=1 and Position.PositionTitle = {executiveKeywords}
          ...
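
      A minimal sketch (not the actual Viadeo DSL) of how constraints like the one above could be
      parsed into a small AST with Scala parser combinators; the grammar, the AST case classes and
      SegmentParser are illustrative names.

        import scala.util.parsing.combinator.JavaTokenParsers

        sealed trait Expr
        case class Constraint(table: String, column: String, variable: String) extends Expr
        case class And(left: Expr, right: Expr) extends Expr
        case class Or(left: Expr, right: Expr) extends Expr

        object SegmentParser extends JavaTokenParsers {
          // Member.Headline
          def attr: Parser[(String, String)] = ident ~ ("." ~> ident) ^^ { case t ~ c => (t, c) }
          // Member.Headline = {executiveKeywords}
          def constraint: Parser[Expr] =
            attr ~ ("=" ~> "{" ~> ident <~ "}") ^^ { case (t, c) ~ v => Constraint(t, c, v) }
          def term: Parser[Expr] = constraint ~ rep("and" ~> constraint) ^^ {
            case first ~ rest => rest.foldLeft(first)(And)
          }
          def expr: Parser[Expr] = term ~ rep("or" ~> term) ^^ {
            case first ~ rest => rest.foldLeft(first)(Or)
          }
          def parse(in: String) = parseAll(expr, in)
        }

        // SegmentParser.parse("Member.Headline = {executiveKeywords} or Position.PositionTitle = {executiveKeywords}")
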
  11. Segments definition and segmentation job (diagram)
      - Segment definition example: now() - Member.BirthDate > 30y
      - HDFS sqoops: /sqoop/Member/20140101/*, /sqoop/Skills/201401/02/*, ...
      - Segmentation job steps:
        • Parse the segment definition using Scala combinators, validate it and broadcast it to all Spark workers
        • Infer the input sources and keep only the attributes we need (e.g. "Id: 1, Name: Lucas, BirthDate: 1986" becomes "Id: 1, BirthDate: 1986")
        • Prune the data (rows) that won't change the result of the expressions (reducing the shuffled data size)
        • Join and evaluate each expression (segment)
      - Scale: ~30M members, 80+ segments, 10+ data sources, ~10 GB of input (~1 GB/source); each member segmentation run takes ~2 min to complete
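
      A rough sketch of the "keep only needed attributes", join and evaluate steps, assuming each
      source has already been parsed into (memberId, attributes) pairs; the names, the two
      hard-coded sources and the inlined "executive" predicate are illustrative, not the real job.

        import org.apache.spark.SparkContext._   // pair-RDD implicits on pre-1.3 Spark
        import org.apache.spark.rdd.RDD

        object SegmentationJobSketch {
          type Row = Map[String, String]

          def segmentMembers(member: RDD[(Long, Row)],
                             position: RDD[(Long, Row)]): RDD[(Long, Boolean)] = {
            // Keep only the attributes the segment expressions actually reference
            val m = member.mapValues(_.filterKeys(Set("Headline")).toMap)
            val p = position.mapValues(_.filterKeys(Set("StillInPosition", "PositionTitle")).toMap)

            val executive = Set("Director", "Manager", "Head of")

            // Join the pruned sources on the member id and evaluate the segment expression
            m.leftOuterJoin(p).mapValues { case (mRow, pRow) =>
              mRow.get("Headline").exists(h => executive.exists(h.contains)) ||
                pRow.exists(r => r.get("StillInPosition").exists(_ == "1") &&
                                 r.get("PositionTitle").exists(t => executive.exists(t.contains)))
            }
          }
        }
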
  12. Querying & in memory index Option 1 : Index in

    Elasticsearch and build an app to query it - The most natural one, but at that moment we wanted to test hypotheses and didn’t want to pollute our production Elasticsearch
  13. Querying & in memory index Option 2 : Use a

    long running spark job as a service (popular in the Spark community) • In a Spray app launch a spark job and load the data in memory • Submit HTTP requests to the Spray app that will query the in memory RDD - Increases possibility of problems/failures as the service would run 24/24 and would have its data spread across N nodes - Experienced blocking of offered resources by Mesos when running 2+ passive spark shells
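
      A minimal sketch of option 2 under stated assumptions (the segmentation stored as a
      tab-separated memberId followed by segment ids; path, port and route names hypothetical): a
      Spray app owning a long-running SparkContext and querying a cached RDD on each request.

        import akka.actor.ActorSystem
        import org.apache.spark.{SparkConf, SparkContext}
        import spray.routing.SimpleRoutingApp

        object SegmentationRddService extends App with SimpleRoutingApp {
          implicit val system = ActorSystem("segmentation-rdd-service")

          val sc = new SparkContext(new SparkConf().setAppName("segmentation-rdd-service"))
          // (memberId, segment ids), kept cached on the executors for the lifetime of the service
          val segmentation = sc.textFile("hdfs:///segmentation/20140101/*")
            .map(_.split('\t'))
            .map(cols => (cols(0).toLong, cols.drop(1).toSet))
            .cache()

          startServer(interface = "0.0.0.0", port = 8080) {
            path("count" / Segment) { segmentId =>
              get {
                // Every HTTP request triggers a distributed count over the cached RDD
                complete(segmentation.filter(_._2.contains(segmentId)).count().toString)
              }
            }
          }
        }
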
  14. Querying & in memory index Option 3 : A service

    using an in memory inverted index + a short spark job • At startup, launch a job that will build the index • Collect it on the driver node & stop the job • Submit HTTP requests to the Spray app that will query the in memory index - The quickest solution (for us) to get something running and collect feedback
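
      A sketch of option 3's startup sequence under the same storage assumptions: a short job
      builds the index, the driver collects it, the SparkContext is stopped, and the Spray app
      then serves queries from plain local memory (the index shape here is a simple map, not the
      bitset index of the next slide).

        import org.apache.spark.SparkContext._   // pair-RDD implicits on pre-1.3 Spark
        import org.apache.spark.{SparkConf, SparkContext}

        object IndexBootstrap {
          // segment id -> member ids, small enough (~200 MB in production) to keep on the driver
          def buildIndex(): Map[String, Set[Long]] = {
            val sc = new SparkContext(new SparkConf().setAppName("index-construction"))
            try {
              sc.textFile("hdfs:///segmentation/20140101/*")
                .map(_.split('\t'))
                .flatMap(cols => cols.drop(1).map(segment => (segment, cols(0).toLong)))
                .groupByKey()
                .mapValues(_.toSet)
                .collect()
                .toMap
            } finally {
              sc.stop()   // release the cluster resources once the index lives on the driver
            }
          }
        }
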
  15. Inverted in-memory index
      - Member segmentation:
          Member | Segments
          1      | s1, s3
          2      | s1, s3
          3      | s1, s2
          4      | s2
      - Group by existing segmentation patterns:
          Pattern | Members
          s1,s3   | 1, 2
          s1,s2   | 3
          s2      | 4
      - The number of segments is fixed at runtime, so map each segment to a position in a Bitset to compute fast set intersections:
          Bitset | Members
          101    | 1, 2
          011    | 3
          010    | 4
      - Raw query: how many members in s1 and s2? Query bitset: 011
      - Compute the intersections of the query bitset and the index bitsets -> 1 matching member
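
      A toy, self-contained version of this bitset index in plain Scala, using the segments and
      members from the slide; the names are illustrative.

        import scala.collection.immutable.BitSet

        object BitsetIndex {
          // Fixed number of segments at runtime: each segment gets a bit position
          val segmentBit = Map("s1" -> 0, "s2" -> 1, "s3" -> 2)

          def toBitset(segments: Set[String]): BitSet =
            BitSet(segments.map(segmentBit).toSeq: _*)

          val memberSegments: Map[Long, Set[String]] =
            Map(1L -> Set("s1", "s3"), 2L -> Set("s1", "s3"), 3L -> Set("s1", "s2"), 4L -> Set("s2"))

          // Inverted index: one entry per distinct segmentation pattern, not per member
          val index: Map[BitSet, Seq[Long]] =
            memberSegments.groupBy { case (_, segs) => toBitset(segs) }
              .mapValues(_.keys.toSeq.sorted).toMap

          // "How many members in s1 and s2?" -> patterns whose bitset contains all query bits
          def count(query: Set[String]): Long = {
            val q = toBitset(query)
            index.collect { case (pattern, members) if (q & pattern) == q => members.size.toLong }.sum
          }
        }

        // BitsetIndex.count(Set("s1", "s2")) == 1   (only member 3)
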
  16. Segmentation App (diagram)
      - A Spray service holding the in-memory index (~200 MB)
      - An index construction job reading from HDFS (/segmentation/20140101/...)
      - Queries are answered in < 200 ms
  17. Today...
      - The segmentation job and the app have been in production for 6 and 3 months respectively, running every day without any trouble or manual intervention
      - The computed segmentation is used to display targeted ads in email campaigns
      - The Segmentation App runs 24/7 and is mainly used by the sales team