Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jachnik at Big Data Spain 2015

Agora owns dozens of themed, classified, entertainment and social services. There are news and sports portals, forums, advertising services, blogs and many other thematic websites. Together, these sites generate over 400 page views per second (under normal conditions) and considerably more events (such as focus, click and scroll events). This raises one question: how can user profiles be built in real time in such a dynamic and changing environment?

Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/thu/slot-16.html

Big Data Spain

October 22, 2015


Transcript

  1. Real-time User Profiling based on Spark Streaming and HBase

    Arkadiusz Jachnik
    BigData Spain Conference, October 15, 2015
  2. Arkadiusz Jachnik

    •  Data Scientist, AGORA S.A.
    •  PhD Student, Poznan University of Technology
    •  Topics: User Profiling, Text Classification, Big Data, Machine Learning, Multi-class classification, Multi-label classification, Recommendation
  3. Polish Media Company

    Press, Magazines, Internet, Cinemas, Advertising, Radio, TV, Books
  4. Agenda

    1.  What is user profiling?
    2.  User profiling system using big data technologies (Spark, HBase)
        –  recipe for profiling system
        –  algorithm
        –  technical issues and solutions
    3.  Enrichment of user profiles by machine learning methods
  5. What is a user profile?

    Single Customer View, example user Tom:
    •  male
    •  has 2 children
    •  politics, volleyball, cars
    •  most active on Monday morning
    •  City: Cracow
    •  Device: iPhone
    •  14 articles read last week
  6. Application domains

    •  Classification issues:
        –  propensity to buy
        –  propensity to churn
        –  propensity to default (credit scores)
        –  anomaly detection (e.g., fraud detection)
    •  User grouping (segmentation)
    •  Personalised advertising and marketing messaging
    •  Content personalisation
    •  Recommendations
  7. Our case

    •  Input data:
        –  online data: page views, events
        –  meta-data of items (articles, blogs, posts, …)
    •  The problem of data sparsity
    •  User content engagement in a certain period of time
    •  Specific user behaviour should be a necessary condition to assign the user to a specific segment (feature) in real time

    Example:
    •  User behaviour: The user has been reading forum threads about care for young children since last week
    •  Segment/feature: Parent of 0 to 3-year-old child
  8. Workflow

    •  Building daily profiles
    •  Daily profiles aggregation and sharing
  9. User identification

    •  Main issue: most of our users are not logged in.
    •  Requirement: storing only non-PII data.

    We rely on cookies only.
  10. STEP 1: Tracking and queuing

    •  JavaScript tracks:
        –  page views
        –  events
    •  Tracking application generates Global User ID (GUID) and session ID
    •  Data queuing using Apache Kafka:
        –  Open-source message broker project
        –  Unified, high-throughput, low-latency platform
    •  We keep data on Kafka for 3 days.

    Diagram labels: page views, events; cookies; Tracking application; page views stream; events stream
    Links: tomcat.apache.org, kafka.apache.org
  11. What is Spark Streaming?

    •  Apache Spark: open source cluster computing framework.
    •  Spark Streaming: library for streaming computation as a series of small and deterministic batch jobs:
        –  Splits the stream into batches of X seconds
        –  Each batch is treated as an RDD and is processed by RDD operations
        –  Processed results are returned in batches

    Diagram labels: live data stream; batches of X seconds; Spark RDD Engine; processed results of RDD operations
    Links: spark.apache.org/streaming, https://databricks-training.s3.amazonaws.com/slides/Spark%20Summit%202014%20-%20Spark%20Streaming.pdf
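
The micro-batch idea above can be illustrated without Spark. The following is a minimal plain-Python sketch, not Spark code: `split_into_batches` and `process_batch` are hypothetical helper names, the timestamps are made-up values in seconds, and a simple count per record type stands in for the RDD operations.

```python
from itertools import groupby


def split_into_batches(records, batch_seconds):
    """Group (timestamp_seconds, payload) pairs into fixed-width time batches."""
    ordered = sorted(records, key=lambda r: r[0])
    batches = []
    for index, group in groupby(ordered, key=lambda r: int(r[0] // batch_seconds)):
        batches.append((index, [payload for _, payload in group]))
    return batches


def process_batch(payloads):
    """Stand-in for the RDD operations: count the records of each type."""
    counts = {}
    for payload in payloads:
        counts[payload] = counts.get(payload, 0) + 1
    return counts


# Made-up stream: (arrival time in seconds, record type).
stream = [
    (0.5, "page_view"), (1.2, "event"), (4.9, "page_view"),
    (5.1, "event"), (9.0, "page_view"), (12.3, "page_view"),
]

# One result per 5-second batch, returned batch by batch.
results = [(index, process_batch(batch)) for index, batch in split_into_batches(stream, 5)]
for index, counts in results:
    print(index, counts)
```

Unlike this sketch, which skips intervals without data, Spark Streaming produces a (possibly empty) RDD for every batch interval.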
  12. STEP 2: Spark Streaming

    •  We have 2 streams (each with 6 partitions):
        –  page views
        –  events
    •  Streaming batch duration: 5 seconds
    •  Page views (and events) are converted and parsed to obtain: business ID, domain, parts of URL and referer, geolocation, User-Agent, GUID, Visit ID, etc.

    Diagram labels: page views, events; batches of union of input streams; foreachRDD; flatMap; Engine: call(page_view/event), a single page view or event processing
  13. Fact definition

    Example page view:
    •  GUID: 123ABC
    •  time: 2015-10-15, 15:23:15
    •  URL: http://www.wyborcza.pl/...
    •  Referrer: http://www.google.pl/...
    •  Geo city: Madrid
    •  Article tags: news, sport

    Example of facts, each written as < GUID, type of fact, value of fact, time >:
    < 123ABC, page view on domain, wyborcza.pl, 2015-10-15 15:23:15 >
    < 123ABC, referer domain, google.pl, 2015-10-15 15:23:15 >
    < 123ABC, geolocation city, Madrid, 2015-10-15 15:23:15 >
    < 123ABC, article tag, news, 2015-10-15 15:23:15 >
    < 123ABC, article tag, sport, 2015-10-15 15:23:15 >
  14. Fact definition – more formally

    Fact – the smallest piece of information describing a relation between a user (GUID) and some feature/element of a page view or event.
    The programmer and the data steward decide which types of facts should be extracted.
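
Fact extraction as defined above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Fact` tuple, the `extract_facts` and `domain_of` helpers and the dictionary keys of the parsed page view are hypothetical names, and the input reuses the example page view from the slides.

```python
from collections import namedtuple
from urllib.parse import urlparse

# A fact: < GUID, type of fact, value of fact, time >
Fact = namedtuple("Fact", ["guid", "fact_type", "value", "time"])


def domain_of(url):
    """Return the host name of a URL without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def extract_facts(page_view):
    """Turn one parsed page view into a list of facts (hypothetical field names)."""
    guid, time = page_view["guid"], page_view["time"]
    facts = [
        Fact(guid, "page view on domain", domain_of(page_view["url"]), time),
        Fact(guid, "referer domain", domain_of(page_view["referrer"]), time),
        Fact(guid, "geolocation city", page_view["geo_city"], time),
    ]
    # One fact per article tag.
    facts.extend(Fact(guid, "article tag", tag, time) for tag in page_view["article_tags"])
    return facts


page_view = {
    "guid": "123ABC",
    "time": "2015-10-15 15:23:15",
    "url": "http://www.wyborcza.pl/...",
    "referrer": "http://www.google.pl/...",
    "geo_city": "Madrid",
    "article_tags": ["news", "sport"],
}

facts = extract_facts(page_view)
for fact in facts:
    print(fact)
```

Which fact types to emit is a design decision; here the list simply mirrors the five example facts shown on the slide.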
  15. Profiling rules

    IF referer contains ‘google.pl’ THEN update feature ‘Search’ by ‘1’

    where:
    •  referer: type of fact to check
    •  contains: symbol
    •  ‘google.pl’: value to check in fact
    •  update feature ‘Search’ by ‘1’: which segment and how to update if rule is fulfilled

    Rules can be stored in DB rows:
    •  type of fact,
    •  value to check,
    •  symbol,
    •  ID of segment to update,
    •  value to update in segment
  18. STEP 4: Profiling algorithm

    Facts to check:
    < 123ABC, page view on domain, wyborcza.pl, 15:23 >
    < 123ABC, referer domain, google.pl, 15:23 >
    < 123ABC, geolocation city, Madrid, 15:23 >
    < 123ABC, article tag, news, 15:23 >
    < 123ABC, article tag, sport, 15:23 >

    Profiling rules:
    IF article tag == ‘news’ THEN update segment ‘News’ by 1
    IF referer domain == ‘google.pl’ THEN update segment ‘Search’ by ‘Google’

    Segments to be updated for GUID 123ABC:
    •  ‘News’ by value 1
    •  ‘Search’ by value ‘Google’
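
The matching step can be sketched as a loop over rules and facts. This is a minimal illustration rather than the system's actual implementation: the `Rule` tuple mirrors the DB row layout from the slides, while `SYMBOLS`, `segments_to_update` and the two supported symbols (`==` and `contains`) are assumptions made for the example.

```python
from collections import namedtuple

Fact = namedtuple("Fact", ["guid", "fact_type", "value", "time"])
# One rule per DB row: type of fact, symbol, value to check, segment, value to update.
Rule = namedtuple("Rule", ["fact_type", "symbol", "value", "segment", "update"])

# Comparison symbols supported by this sketch (an assumption).
SYMBOLS = {
    "==": lambda fact_value, rule_value: fact_value == rule_value,
    "contains": lambda fact_value, rule_value: rule_value in fact_value,
}


def segments_to_update(facts, rules):
    """Return (GUID, segment, update value) for every rule fulfilled by some fact."""
    updates = []
    for rule in rules:
        for fact in facts:
            same_type = rule.fact_type == fact.fact_type
            if same_type and SYMBOLS[rule.symbol](fact.value, rule.value):
                updates.append((fact.guid, rule.segment, rule.update))
    return updates


facts = [
    Fact("123ABC", "page view on domain", "wyborcza.pl", "15:23"),
    Fact("123ABC", "referer domain", "google.pl", "15:23"),
    Fact("123ABC", "geolocation city", "Madrid", "15:23"),
    Fact("123ABC", "article tag", "news", "15:23"),
    Fact("123ABC", "article tag", "sport", "15:23"),
]
rules = [
    Rule("article tag", "==", "news", "News", 1),
    Rule("referer domain", "==", "google.pl", "Search", "Google"),
]

updates = segments_to_update(facts, rules)
print(updates)
```

Keeping rules as plain rows means new segments can be added by inserting data rather than changing code, which fits the statement that rules can be stored in a database.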
  19. STEP 5: Storing profiles in HBase

    •  Data are stored in HBase by bulk operations after each Spark batch has been processed successfully
    •  Statistics are stored in Redis

    HBase:
    •  open source, non-relational, distributed database
    •  provides BigTable-like capabilities for Hadoop
    •  fault-tolerant way of storing large quantities of sparse data

    Diagram labels: foreachRDD ...; Engine: call(page_view/event); Parser; Fact Extraction Modules; Profiling: facts, returns segments to be updated; Database Manager
    Link: hbase.apache.org
  20. Resources in Spark – tips and tricks

    •  Resource managers as singletons.
    •  The first call() invocation on a worker initializes:
        –  a singleton with resources (for example database connections),
        –  a shutdown hook which will close all resources on application exit or fault.
    •  Each worker manages and keeps its own resources independently.
    •  Each resource on each worker is initialized only once.

    Diagram labels: Driver (SparkContext); Cluster Manager (for example Yarn); Worker Node 1: Executor, Task 1, HBase Singleton, HBase Connection; Worker Node N: Executor …
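
The singleton pattern described above might look as follows in Python. This is a sketch under assumptions: the talk refers to a JVM-style call() method and shutdown hook, so `atexit` stands in for the shutdown hook here, and `FakeConnection`, `ResourceManager` and `call` are hypothetical names with no real database behind them.

```python
import atexit
import threading


class FakeConnection:
    """Hypothetical stand-in for a database connection."""

    opened = 0  # counts how many connections were ever created

    def __init__(self):
        FakeConnection.opened += 1
        self.closed = False

    def put(self, row_key, value):
        return (row_key, value)

    def close(self):
        self.closed = True


class ResourceManager:
    """Per-process singleton that owns the worker's resources."""

    _instance = None
    _lock = threading.Lock()

    def __init__(self):
        self.connection = FakeConnection()
        # Stand-in for a shutdown hook: release resources when the process exits.
        atexit.register(self.close)

    @classmethod
    def get(cls):
        # Double-checked locking: only the first call creates the instance.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = cls()
        return cls._instance

    def close(self):
        self.connection.close()


def call(record):
    """Process one record; resources are initialized lazily on the first call."""
    manager = ResourceManager.get()
    return manager.connection.put(record["row_key"], record["value"])


records = [
    {"row_key": "123ABC-2015-10-15", "value": "News=1"},
    {"row_key": "123ABC-2015-10-15", "value": "Search=Google"},
]
results = [call(record) for record in records]
print(results, FakeConnection.opened)
```

Because the instance lives in process memory, every worker process builds its own copy on first use, which matches the point that each worker keeps its resources independently.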
  22. HBase – row key design

    •  Rows are sorted in alphanumeric order by row key
    •  Hash the keys if you want to distribute rows across the regions (on the servers of the cluster)
    •  For efficient scanning, use suffixes (separated by dashes):
        –  For time series data use a timestamp or a [year]-[month]-[day]-[hour]-[minute] structure.

    Our HBase row key format: [GUID]-[year]-[month]-[day]

    Link: http://hbase.apache.org/0.94/book/rowkey.design.html
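
The effect of the [GUID]-[year]-[month]-[day] format on scanning can be shown with sorted strings. This plain-Python sketch does not talk to HBase: `row_key` and `scan_prefix` are hypothetical helpers, a sorted list stands in for HBase's ordered rows, and zero-padded date parts are an assumption needed for lexicographic order to match chronological order.

```python
from datetime import date


def row_key(guid, day):
    """Build a [GUID]-[year]-[month]-[day] key with zero-padded date parts."""
    return "{}-{:04d}-{:02d}-{:02d}".format(guid, day.year, day.month, day.day)


def scan_prefix(sorted_keys, prefix):
    """Stand-in for a prefix scan over keys kept in sorted order."""
    return [key for key in sorted_keys if key.startswith(prefix)]


# Keys are stored sorted, as HBase sorts rows by row key.
keys = sorted([
    row_key("123ABC", date(2015, 10, 14)),
    row_key("999XYZ", date(2015, 10, 13)),
    row_key("123ABC", date(2015, 9, 12)),
    row_key("123ABC", date(2015, 10, 13)),
])

print(keys)
print(scan_prefix(keys, "123ABC-2015-10"))
```

With this layout all daily profiles of one GUID are adjacent and ordered by date, so reading a user's recent days becomes a short contiguous scan.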
  23. Our system: Stage 2

    Aggregation of daily user profiles and sharing
  24. Architecture

    •  Final user profiles are shared by a REST web service
    •  Spring as a web application framework
    •  Spring Hadoop library for HBase connection management
    •  Statistics are stored in a MySQL database
    •  We take permission issues into account:
        –  aggregated data are divided by business IDs

    Diagram labels: External system / Client; REST query with GUID; Profile (JSON); User Profile Web Service: Spring Framework, Spring Hadoop Library, Daily profiles aggregation, Config (JSON); Daily Profiles (HBase); Statistics (MySQL)
    Links: https://spring.io, http://docs.spring.io/spring-hadoop/docs/2.3.0.M3/reference/html/springandhadoop-hbase.html
  25. Profile aggregation algorithm

    •  For a specified input GUID we aggregate each existing segment (feature)
    •  Each segment is aggregated for a specific period of time
    •  There are many aggregation methods corresponding to different output formats

    Example segment definitions:
    •  Sport: 7 days, output: true if value>5
    •  City: 14 days, output: mode
    •  Number of articles: 3 days, output: sum

    Example daily profiles for GUID 123ABC:
    •  2015-09-12: Sport 3, City Poznan
    •  2015-10-13: City Gdansk
    •  2015-10-14: Sport 7, City Poznan
    •  Today: Sport 2, City Poznan
    •  AGGREGATED: Sport true, City Poznan
    •  Number of articles values, as extracted: 3, 1, 3, 2
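
The three aggregation methods named above (threshold, mode, sum) can be sketched as follows. This is an illustration under stated assumptions: `SEGMENTS` and `aggregate_profile` are hypothetical names, "true if value>5" is interpreted as a threshold on the sum over the window, and the daily profiles are hypothetical data loosely modelled on the slide (the dates and article counts are not taken from the source).

```python
from collections import Counter
from datetime import date, timedelta

# Segment name -> (window in days, aggregation method). Hypothetical configuration.
SEGMENTS = {
    "Sport": (7, lambda values: sum(values) > 5),
    "City": (14, lambda values: Counter(values).most_common(1)[0][0]),
    "Number of articles": (3, sum),
}


def aggregate_profile(daily_profiles, today):
    """Aggregate each segment over its own time window ending today."""
    result = {}
    for segment, (days, method) in SEGMENTS.items():
        start = today - timedelta(days=days - 1)
        values = [
            profile[segment]
            for day, profile in daily_profiles.items()
            if start <= day <= today and segment in profile
        ]
        if values:  # segments with no data in the window are left out
            result[segment] = method(values)
    return result


today = date(2015, 10, 15)
# Hypothetical daily profiles for one GUID, keyed by day.
daily_profiles = {
    date(2015, 10, 12): {"Sport": 3, "City": "Poznan", "Number of articles": 3},
    date(2015, 10, 13): {"City": "Gdansk", "Number of articles": 1},
    date(2015, 10, 14): {"Sport": 7, "City": "Poznan", "Number of articles": 3},
    today: {"Sport": 2, "City": "Poznan", "Number of articles": 2},
}

aggregated = aggregate_profile(daily_profiles, today)
print(aggregated)
```

Pairing each segment with its own window and method keeps the aggregation loop generic, so a new output format only needs a new entry in the configuration.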
  26. Solved issues

    •  Kafka integration with Spark Streaming
    •  Parallelism of data streams (stream division)
    •  Resources management in Spark
    •  Processing time
    •  Security of Spark: Kerberos integration
  27. Enrichment of user profiles by machine learning methods

    How to classify users into segments?
  28. Matching segments to users

    •  We want to classify a user (with a specific profile) into other segments
        –  the user vector consists of the user’s segments
    •  All segments are treated as classes (labels)
    •  Online classification:
        –  the model is learnt in real time
        –  the model can be used for real-time prediction

    Diagram labels: user 123ABC; segments: Football, Motorbikes, has children, Handball, Swimming, Politics, Economy, Toys, Child car seats, Mobile, Cars, Animals; unknown segments marked with ?
  29. Multi-label classification

    •  Learning algorithm: Binary Relevance
        –  an independent binary classifier for each label (segment)
        –  each classifier is trained on existing user profiles (whether the user belongs to the segment or not)
    •  Each model returns a boolean or a probability
    •  The prediction algorithm returns the results of the binary models for a given user profile vector

    Diagram labels: Profile 1, Profile 2, …, Profile N; Binary classifier Segment 1, Binary classifier Segment 2, …, Binary classifier Segment K; Profile n; Prediction algorithm (select ‘1’s or labels with probability>0.5); List of recommended segments for Profile n

    Reference: Tsoumakas, Grigorios; Katakis, Ioannis (2007). "Multi-label classification: an overview". International Journal of Data Warehousing & Mining 3 (3): 1–13.
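
Binary Relevance can be sketched with one small classifier per label. The slides do not say which base classifier is used, so the logistic regression below is an assumption, as are the class names, the hyperparameters and the synthetic training data, in which each target segment simply co-occurs with one known segment.

```python
import math
from itertools import product


class LogisticRegression:
    """Tiny logistic regression trained by stochastic gradient descent."""

    def __init__(self, n_features, learning_rate=0.5, epochs=200):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate
        self.epochs = epochs

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def fit(self, X, y):
        for _ in range(self.epochs):
            for x, target in zip(X, y):
                error = self.predict_proba(x) - target
                self.weights = [
                    w - self.learning_rate * error * xi
                    for w, xi in zip(self.weights, x)
                ]
                self.bias -= self.learning_rate * error
        return self


class BinaryRelevance:
    """One independent binary classifier per label (segment)."""

    def __init__(self, labels, n_features):
        self.models = {label: LogisticRegression(n_features) for label in labels}

    def fit(self, X, Y):
        for label, model in self.models.items():
            # Binary target: does the profile belong to this segment or not?
            model.fit(X, [1 if label in labels else 0 for labels in Y])
        return self

    def predict(self, x, threshold=0.5):
        return [
            label for label, model in self.models.items()
            if model.predict_proba(x) > threshold
        ]


# Synthetic profiles: known segments as a 0/1 vector, target segments as a set.
FEATURES = ["Football", "has children", "Politics"]
RELATED = {"Football": "Handball", "has children": "Toys", "Politics": "Economy"}
X = [list(bits) for bits in product([0, 1], repeat=len(FEATURES))]
Y = [{RELATED[name] for name, bit in zip(FEATURES, x) if bit} for x in X]

model = BinaryRelevance(["Handball", "Toys", "Economy"], len(FEATURES)).fit(X, Y)
print(model.predict([1, 1, 0]))
```

Because the classifiers are independent, they can be trained and updated separately, at the cost of ignoring correlations between labels.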
  30. Online learning by Spark

    MLlib: Spark’s machine learning (ML) library
    •  scalable ML algorithms
    •  classification
    •  regression
    •  clustering
    •  collaborative filtering
    •  dimensionality reduction
    •  lower-level optimization primitives
    •  higher-level pipeline APIs

    Streaming linear regression in MLlib:
    •  allows regression models to be fitted online
    •  model parameter fitting is similar to that performed offline
    •  fitting occurs on each batch of data

    Link: spark.apache.org/mllib
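
The idea of fitting on each batch can be shown without Spark. The sketch below is plain Python, not the MLlib API: `OnlineLinearRegression`, its step size and iteration count are assumptions, and the batches are synthetic, noise-free samples of y = 2·x1 + 3·x2. The weights carry over from batch to batch, so each new batch refines the previous model.

```python
class OnlineLinearRegression:
    """Linear model (no intercept) refined by gradient descent on each batch."""

    def __init__(self, n_features, step_size=0.1, iterations=50):
        self.weights = [0.0] * n_features
        self.step_size = step_size
        self.iterations = iterations

    def predict(self, x):
        return sum(w * xi for w, xi in zip(self.weights, x))

    def train_on_batch(self, batch):
        """Run a few gradient steps on one batch, starting from the current weights."""
        for _ in range(self.iterations):
            gradient = [0.0] * len(self.weights)
            for x, y in batch:
                error = self.predict(x) - y
                for i, xi in enumerate(x):
                    gradient[i] += error * xi
            self.weights = [
                w - self.step_size * g / len(batch)
                for w, g in zip(self.weights, gradient)
            ]


def mean_squared_error(model, samples):
    return sum((model.predict(x) - y) ** 2 for x, y in samples) / len(samples)


# Synthetic stream of batches generated from y = 2*x1 + 3*x2.
batches = [
    [([1, 0], 2), ([0, 1], 3), ([1, 1], 5)],
    [([2, 1], 7), ([1, 2], 8)],
    [([3, 1], 9), ([1, 3], 11)],
]
holdout = [([2, 2], 10), ([4, 1], 11)]

model = OnlineLinearRegression(n_features=2)
errors = []
for batch in batches:
    model.train_on_batch(batch)  # fitting occurs on each batch of data
    errors.append(mean_squared_error(model, holdout))

print(model.weights, errors)
```

In MLlib itself, the corresponding class is `StreamingLinearRegressionWithSGD`, which is trained on a stream of labelled points rather than on in-memory lists.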