Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jachnik at Big Data Spain 2015

Real-time User Profiling based on Spark Streaming and HBase
Arkadiusz Jachnik
BigData Spain Conference, October 15, 2015

2 Arkadiusz Jachnik
•  Data Scientist, AGORA S.A.
•  PhD Student, Poznan University of Technology
•  User Profiling, Text Classification, Big Data, Machine Learning, Multi-class classification, Multi-label classification, Recommendation

3 Polish Media Company
Press, Magazines, Internet, Cinemas, Advertising, Radio, TV, Books

4 Agenda
1.  What is user profiling?
2.  User profiling system using big data technologies (Spark, HBase)
–  recipe for profiling system
–  algorithm
–  technical issues and solutions
3.  Enrichment of user profiles by machine learning methods

5 What is a user profile?
Single Customer View
Example (Tom):
•  male
•  has 2 children
•  politics, volleyball, cars
•  most active on Monday morning
•  City: Cracow
•  Device: iPhone
•  14 articles read last week

6 Application domains
•  Classification issues:
–  propensity to buy
–  propensity to churn
–  propensity to default (credit scores)
–  anomaly detection (e.g., fraud detection)
•  User grouping (segmentation)
•  Personalised advertising and marketing messaging
•  Content personalisation
•  Recommendations

Our system: Introduction

8 Our case
•  Input data:
–  online data: page views, events
–  meta-data of items (articles, blogs, posts, …)
•  The problem of data sparsity
•  User content engagement in a certain period of time
•  Specific user behaviour should be a necessary condition to assign the user to a specific segment (feature) in real-time
Example:
•  User behaviour: The user has been reading forum threads about care for young children since last week
•  Segment/feature: Parent of 0 to 3-year-old child

9 Workflow
•  Building daily profiles
•  Daily profiles aggregation and sharing

Our system: Stage 1
Building user daily profiles

11 STEP 1: Tracking and queuing ...

12 User identification
•  Main issue: most of our users are not logged in.
•  Requirement: storing only non-PII data.
We rely on cookies only.

13 STEP 1: Tracking and queuing
•  JavaScript tracks:
–  page views
–  events
•  Tracking application generates Global User ID (GUID) and session ID
•  Data queuing using Apache Kafka:
–  Open-source message broker project
–  Unified, high-throughput, low-latency platform
•  We keep data on Kafka for 3 days.
Diagram: page views and events (with cookies), Tracking application, page views stream, events stream
tomcat.apache.org, kafka.apache.org

14 STEP 2: Spark Streaming ...

15 What is Spark Streaming?
•  Apache Spark: open source cluster computing framework.
•  Spark Streaming: library for streaming computation as a series of small and deterministic batch jobs:
–  Splits stream into batches of X seconds
–  Each batch is treated as RDD and is processed by RDD operations
–  Processed results are returned in batches
Diagram: live data stream, batches of X seconds, Spark RDD Engine, processed results of RDD operations
spark.apache.org/streaming, https://databricks-training.s3.amazonaws.com/slides/Spark%20Summit%202014%20-%20Spark%20Streaming.pdf
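The batching idea on this slide can be illustrated without Spark. The sketch below is plain Python, not Spark's API: `split_into_batches` and the `(timestamp, payload)` record layout are assumptions made only for this example.

```python
from collections import defaultdict


def split_into_batches(records, batch_seconds):
    """Group (timestamp_seconds, payload) records into fixed-length windows.

    Plain-Python stand-in for micro-batching; this is not Spark's API.
    """
    batches = defaultdict(list)
    for ts, payload in records:
        # Every record falls into the window that starts at a multiple of X.
        window_start = int(ts // batch_seconds) * batch_seconds
        batches[window_start].append(payload)
    # Return the batches ordered by window start time.
    return [batches[start] for start in sorted(batches)]


# Hypothetical stream of page views (pv) and events (ev).
stream = [(0.5, "pv1"), (4.9, "pv2"), (5.0, "ev1"), (12.3, "pv3")]
batches = split_into_batches(stream, batch_seconds=5)
```

Each inner list plays the role of one batch that would then be processed as a unit.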

16 STEP 2: Spark Streaming
•  We have 2 streams (each with 6 partitions)
–  page views
–  events
•  Streaming (batch) duration: 5 seconds
•  Page views (and events) are converted and parsed to obtain: business ID, domain, parts of URL and referer, geolocation, User-Agent, GUID, Visit ID, etc.
Diagram: page views, events, batches of union of input streams, foreachRDD, flatMap, Engine call(page_view/event): a single page view or event processing ...
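A minimal sketch of the parsing step, assuming a raw page view arrives as a dictionary. The field names, the `parse_page_view` helper and the URL path are hypothetical; only a subset of the fields listed on the slide is covered.

```python
from urllib.parse import urlsplit


def parse_page_view(raw):
    """Extract domain, URL parts and referer domain from a raw page view."""
    url = urlsplit(raw["url"])
    referer = urlsplit(raw.get("referer", ""))
    return {
        "guid": raw["guid"],
        "domain": url.hostname,
        # Non-empty path segments, i.e. the "parts of URL".
        "path_parts": [part for part in url.path.split("/") if part],
        "referer_domain": referer.hostname,  # None when there is no referer
        "user_agent": raw.get("user_agent"),
    }


# Hypothetical input; the path and query string are made up for illustration.
parsed = parse_page_view({
    "guid": "123ABC",
    "url": "http://www.wyborcza.pl/sport/article.html",
    "referer": "http://www.google.pl/search?q=news",
    "user_agent": "Mozilla/5.0",
})
```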

17 STEP 3: Fact Extraction ...

18 Fact definition
Example: page view:
•  GUID: 123ABC
•  time: 2015-10-15, 15:23:15
•  URL: http://www.wyborcza.pl/...
•  Referrer: http://www.google.pl/...
•  Geo city: Madrid
•  Article tags: news, sport
Example of facts, written as <GUID, type of fact, value of fact, time>:
<123ABC, page view on domain, wyborcza.pl, 2015-10-15 15:23:15>
<123ABC, referer domain, google.pl, 2015-10-15 15:23:15>
<123ABC, geolocation city, Madrid, 2015-10-15 15:23:15>
<123ABC, article tag, news, 2015-10-15 15:23:15>
<123ABC, article tag, sport, 2015-10-15 15:23:15>

19 Fact definition – more formally
Fact – the smallest piece of information describing a relation between a user (GUID) and some feature/element of a page view or event.
The programmer and the data steward decide which types of facts should be extracted.
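Fact extraction can be sketched as a function that emits one tuple per feature of a page view. The `Fact` type, the `extract_facts` helper and the input dictionary keys are assumptions for illustration; the fact types and values mirror the example from the previous slide.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    guid: str
    fact_type: str
    value: str
    time: str


def extract_facts(page_view):
    """Emit one Fact per feature/element of a parsed page view."""
    guid, time = page_view["guid"], page_view["time"]
    facts = [
        Fact(guid, "page view on domain", page_view["domain"], time),
        Fact(guid, "referer domain", page_view["referer_domain"], time),
        Fact(guid, "geolocation city", page_view["geo_city"], time),
    ]
    # A multi-valued feature yields one fact per value.
    facts += [Fact(guid, "article tag", tag, time) for tag in page_view["tags"]]
    return facts


facts = extract_facts({
    "guid": "123ABC",
    "time": "2015-10-15 15:23:15",
    "domain": "wyborcza.pl",
    "referer_domain": "google.pl",
    "geo_city": "Madrid",
    "tags": ["news", "sport"],
})
```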

20 STEP 4: Profiling algorithm ...

21 Profiling rules
IF referer contains ‘google.pl’ THEN update feature ‘Search’ by ‘1’
where:
•  referer: type of fact to check
•  ‘google.pl’: value to check in fact
•  contains: symbol
•  ‘Search’ by ‘1’: which segment and how to update if rule is fulfilled
Rules can be stored in DB rows:
•  type of fact,
•  value to check,
•  symbol,
•  ID of segment to update,
•  value to update in segment

22 STEP 4: Profiling algorithm
Facts to check:
<123ABC, page view on domain, wyborcza.pl, 15:23>
<123ABC, referer domain, google.pl, 15:23>
<123ABC, geolocation city, Madrid, 15:23>
<123ABC, article tag, news, 15:23>
<123ABC, article tag, sport, 15:23>
Profiling rules:
•  IF article tag == ‘news’ THEN update segment ‘News’ by 1
•  IF referer domain == ‘google.pl’ THEN update segment ‘Search’ by ‘Google’
Segments to be updated for GUID 123ABC:
•  ‘News’ by value 1
•  ‘Search’ by value ‘Google’
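The matching of facts against rules can be sketched as a nested loop over rule rows. The row layout follows the fields listed on the "Profiling rules" slide; the `SYMBOLS` table, the `apply_rules` helper and the tuple ordering are assumptions, not the system's actual code.

```python
import operator

# Comparison symbols a rule may use; this particular set is an assumption.
SYMBOLS = {
    "==": operator.eq,
    "contains": lambda fact_value, needle: needle in fact_value,
}

# Rule rows: (type of fact, symbol, value to check, segment, value to update).
RULES = [
    ("article tag", "==", "news", "News", 1),
    ("referer domain", "==", "google.pl", "Search", "Google"),
]


def apply_rules(facts, rules):
    """Return (segment, update value) pairs for every rule a fact fulfils."""
    updates = []
    for fact_type, value in facts:
        for rule_type, symbol, expected, segment, update in rules:
            if fact_type == rule_type and SYMBOLS[symbol](value, expected):
                updates.append((segment, update))
    return updates


facts = [
    ("page view on domain", "wyborcza.pl"),
    ("referer domain", "google.pl"),
    ("geolocation city", "Madrid"),
    ("article tag", "news"),
    ("article tag", "sport"),
]
updates = apply_rules(facts, RULES)
```

Updates come out in the order of the facts, so the ‘Search’ update precedes the ‘News’ update here.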

25 STEP 5: Storing profiles in HBase
•  Data are stored in HBase by bulk operations after a Spark batch has been processed properly
•  Statistics are stored in Redis
HBase:
•  open source, non-relational, distributed database
•  provides BigTable-like capabilities for Hadoop
•  fault-tolerant way of storing large quantities of sparse data
Diagram: foreachRDD ... Engine call(page_view/event): Parser, Fact Extraction Modules, Profiling (takes facts, returns segments to be updated), Database Manager
hbase.apache.org

26 Resources in Spark – tips and tricks
•  Resource managers as singletons.
•  First call() method on a worker initializes:
–  singleton with resources (for example database connections),
–  shutdown hook which will close all resources on application exit or fault.
•  Each worker manages and keeps its own resources independently.
•  Each resource on each worker is initialized only once.
Diagram: SparkContext (Driver), Cluster Manager (for example Yarn), Worker Node 1 … Worker Node N; on each node an Executor runs its tasks (Task 1, Task 2) against a single HBase Singleton holding the HBase Connection
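The singleton pattern described above can be sketched in plain Python. `ResourceManager`, `FakeConnection` and `call` are hypothetical names, and `atexit` is used as a rough analogue of a JVM shutdown hook (it runs on normal interpreter exit only).

```python
import atexit
import threading


class ResourceManager:
    """Process-wide holder for an expensive resource such as a DB connection."""

    _instance = None
    _lock = threading.Lock()

    def __init__(self, connect):
        self.connection = connect()
        # Close the resource when the worker process exits.
        atexit.register(self.close)

    @classmethod
    def get(cls, connect):
        # Double-checked locking: only the first call pays the setup cost.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = cls(connect)
        return cls._instance

    def close(self):
        if self.connection is not None:
            self.connection.close()
            self.connection = None


class FakeConnection:
    """Hypothetical stand-in for a real HBase connection."""

    opened = 0

    def __init__(self):
        FakeConnection.opened += 1

    def close(self):
        pass


def call(record):
    """Per-record function, as a task would run it on a worker."""
    manager = ResourceManager.get(FakeConnection)
    return manager.connection, record


results = [call(record) for record in ("pv1", "pv2", "pv3")]
```

All three calls share one connection, which is the point of initializing each resource only once per worker.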

28 HBase – row key design
•  Rows are sorted lexicographically by row key
•  Hash keys if you want to distribute rows across the regions (on servers of cluster)
•  For efficient scanning use some suffixes (separated by dashes):
–  For time series data use a timestamp or [year]-[month]-[day]-[hour]-[minute] structure.
Our HBase row key format: [GUID]-[year]-[month]-[day]
http://hbase.apache.org/0.94/book/rowkey.design.html
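A small sketch of building keys in the [GUID]-[year]-[month]-[day] format. Zero-padding the month and day is an assumption; it makes lexicographic order match chronological order, which is what a prefix scan over one user's days relies on. `daily_row_key` and `scan_prefix` are hypothetical helpers.

```python
from datetime import date


def daily_row_key(guid, day):
    """Build a [GUID]-[year]-[month]-[day] key with zero-padded date parts."""
    return f"{guid}-{day:%Y-%m-%d}"


def scan_prefix(guid):
    """Prefix shared by all daily rows of one user."""
    return f"{guid}-"


key = daily_row_key("123ABC", date(2015, 10, 5))
keys = sorted(
    daily_row_key("123ABC", day)
    for day in (date(2015, 10, 14), date(2015, 9, 12), date(2015, 10, 13))
)
```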

Our system: Stage 2
Aggregation of daily user profiles and sharing

30 Architecture
•  Final user profiles are shared by REST web service
•  Spring as a web application framework
•  Spring Hadoop library for HBase connection management
•  Statistics are stored in MySQL database
•  We take permission issues into account:
–  aggregated data are divided by business IDs
Diagram: External system / Client sends a REST query with GUID and receives the Profile (JSON); User Profile Web Service (Spring Framework, Spring Hadoop Library, Config in JSON) performs daily profiles aggregation over Daily Profiles (HBase); Statistics (MySQL)
https://spring.io, http://docs.spring.io/spring-hadoop/docs/2.3.0.M3/reference/html/springandhadoop-hbase.html

31 Profile aggregation algorithm
•  For a specified input GUID we aggregate each existing segment (feature)
•  Each segment is aggregated for a specific period of time
•  There are many aggregation methods corresponding to different output formats
Example (GUID 123ABC), aggregation settings per segment:
•  Sport: 7 days, output: true if value>5
•  City: 14 days, output: mode
•  Number of articles: 3 days, output: sum
Daily profiles (Sport, City): 2015-09-12: 3, Poznan; 2015-10-13: Gdansk; 2015-10-14: 7, Poznan; Today: 2, Poznan
Aggregated (Sport, City): true, Poznan
Number of articles values in the figure: 3, 1, 3, 2
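A minimal sketch of per-segment aggregation with a time window and a pluggable method. Several details are assumptions: the window includes today, "true if value>5" is read as a threshold on the summed value, "Today" is taken to be 2015-10-15, and the daily article counts are made up for illustration.

```python
from collections import Counter
from datetime import date, timedelta


def aggregate(daily, today, days, method):
    """Aggregate one segment's daily values over the last `days` days."""
    start = today - timedelta(days=days - 1)
    window = [v for d, v in daily.items() if start <= d <= today and v is not None]
    return method(window)


def over_threshold(threshold):
    """Aggregation method: true if the summed value exceeds the threshold."""
    return lambda values: sum(values) > threshold


def mode(values):
    """Aggregation method: most frequent value, or None for an empty window."""
    return Counter(values).most_common(1)[0][0] if values else None


today = date(2015, 10, 15)  # assumed date for "Today"
sport = {date(2015, 9, 12): 3, date(2015, 10, 14): 7, today: 2}
city = {
    date(2015, 9, 12): "Poznan",
    date(2015, 10, 13): "Gdansk",
    date(2015, 10, 14): "Poznan",
    today: "Poznan",
}
articles = {date(2015, 10, 13): 1, today: 2}  # hypothetical daily counts

profile = {
    "Sport": aggregate(sport, today, 7, over_threshold(5)),
    "City": aggregate(city, today, 14, mode),
    "Number of articles": aggregate(articles, today, 3, sum),
}
```

Adding a new output format only requires a new method function, which matches the idea of many aggregation methods for different outputs.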

32 Solved issues
•  Kafka integration with Spark Streaming
•  Parallelism of data streams (stream division)
•  Resources management in Spark
•  Processing time
•  Security of Spark: Kerberos integration

Enrichment of user profiles by machine learning methods
How to classify users into segments?

34 Matching segments to users
•  We want to classify a user (of a specific profile) into other segments
–  user vector consists of user’s segments
•  All segments are treated as classes (labels)
•  Online classification:
–  model is learnt in real time
–  model can be used for real-time prediction
Diagram: user 123ABC and segments Football, Motorbikes, has children, Handball, Swimming, Politics, Economy, Toys, Child car seats, Mobile, Cars, Animals; segments still to be predicted are marked ‘?’

35 Multi-label classification
•  Learning algorithm: Binary Relevance
–  independent binary classifier for each label (segment)
–  each classifier is learnt from existing user profiles (belongs to the segment or not)
•  Each model returns boolean or probability
•  Prediction algorithm returns results of binary models for a given user profile vector
Diagram: Profile 1, Profile 2, …, Profile N train Binary classifier Segment 1, Binary classifier Segment 2, …, Binary classifier Segment K; for Profile n the prediction algorithm (select ‘1’s or labels with probability>0.5) returns the list of recommended segments for Profile n
Tsoumakas, Grigorios; Katakis, Ioannis (2007). "Multi-label classification: an overview". International Journal of Data Warehousing & Mining 3 (3): 1–13.
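Binary Relevance can be sketched in plain Python with one small logistic-regression model per label. The base classifier, the toy features and labels, and all function names are assumptions for illustration; the slides do not say which binary classifier the system uses.

```python
import math


def predict_proba(w, x):
    """Probability that profile vector x belongs to the label modelled by w."""
    z = sum(wi * xi for wi, xi in zip(w, x + [1.0]))  # last weight is the bias
    return 1.0 / (1.0 + math.exp(-z))


def train_binary(X, y, epochs=200, lr=0.5):
    """Fit one logistic-regression classifier with plain SGD."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for x, target in zip(X, y):
            p = predict_proba(w, x)
            for i, xi in enumerate(x + [1.0]):
                w[i] += lr * (target - p) * xi
    return w


def train_binary_relevance(X, Y):
    """Train one independent binary model per label (column of Y)."""
    return [train_binary(X, [row[k] for row in Y]) for k in range(len(Y[0]))]


def recommend(models, x, labels, threshold=0.5):
    """Select the labels whose model returns a probability above the threshold."""
    return [lab for w, lab in zip(models, labels) if predict_proba(w, x) > threshold]


# Toy data: features are [has children, Football, Politics]; labels are made up.
LABELS = ["Toys", "Handball", "Economy"]
X = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 1], [0, 1, 1], [1, 0, 1]]
Y = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 1], [0, 1, 1], [1, 0, 1]]

models = train_binary_relevance(X, Y)
recommended = recommend(models, [1, 1, 0], LABELS)
```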

36 Online learning by Spark
MLlib: Spark’s machine learning (ML) library
•  scalable ML algorithms
•  classification
•  regression
•  clustering
•  collaborative filtering
•  dimensionality reduction
•  lower-level optimization primitives
•  higher-level pipeline APIs
Streaming linear regression in MLlib
•  allows regression models to be fitted online
•  model parameter fitting is similar to that performed offline
•  fitting occurs on each batch of data
spark.apache.org/mllib
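The idea of fitting on each batch can be illustrated with a plain-Python gradient step per batch. This is not MLlib's API or its exact update rule; `update_on_batch`, the learning rate and the synthetic stream are assumptions.

```python
def update_on_batch(weights, batch, lr=0.5):
    """Take one gradient step on the mean squared error of a single batch."""
    grad = [0.0] * len(weights)
    for x, y in batch:
        features = x + [1.0]  # append the bias term
        error = sum(w * f for w, f in zip(weights, features)) - y
        for i, f in enumerate(features):
            grad[i] += error * f / len(batch)
    return [w - lr * g for w, g in zip(weights, grad)]


# Hypothetical stream: every batch holds four points from y = 2*x + 1.
weights = [0.0, 0.0]  # [slope, intercept]
for _ in range(200):
    batch = [([x], 2.0 * x + 1.0) for x in (0.0, 0.5, 1.0, 1.5)]
    weights = update_on_batch(weights, batch)
```

The model is usable for prediction after any batch, and later batches keep refining the same weights.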

Thank you! Questions?
