Mining Interest Topics from Plurk

Ken Lee

May 23, 2013

Transcript

  1. Mining Interest Topics from Plurk Ken Yi-Chien Lee 2012/11/27

  2. Outline • Introduction – What we do in this thesis, and why • The SNSD system – Community detection – Interest hierarchy • Implementation – Preprocessing – Celery task queue • Experiments • Conclusions and future works
  3. INTRODUCTION I want to make friends with you.

  4. Scenario

  5. Scenario (cont.)

  6. Plurk Timeline

  7. Private Status

  8. Traffic Statistics of Plurk in Taiwan

  9. The Go!Plurk Project • Issues: 1. Unable to analyze private users 2. The pie chart is too simple and shows no detailed interest information
  10. THE SNSD SYSTEM Find out what the plurker is interested in.
  11. Social Networking Service Discovery • Discover users’ interest topics via 1. Posted contents (plurks) from users 2. Aggregated interest information from communities, for the private users • Have to prepare – Relationships – Plurks
  12. Work-flow of SNSD System

  13. Aggregation and Derivation

  14. Aggregation and Derivation

  15. Aggregation and Derivation

  16. Aggregation and Derivation

  17. Aggregation and Derivation

  18. Aggregation and Derivation

  19. Community Detection • Snowball sampling • Louvain algorithm • Filtering – Karma – Gender – Privacy
  20. Snowball Sampling
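
Snowball sampling grows a sample outward from a few seed users by following friend links wave by wave. The slide itself carries no details, so the following is only a minimal Python sketch: `get_friends` is a hypothetical callable standing in for whatever relationship lookup the crawler provides, and the `max_waves` cutoff is an assumption.

```python
from collections import deque

def snowball_sample(seeds, get_friends, max_waves=2):
    """Breadth-first snowball sample.

    seeds       -- iterable of starting user ids
    get_friends -- hypothetical callable: user id -> iterable of friend ids
    max_waves   -- assumed cutoff: how many hops from the seeds to expand
    Returns (sampled user ids, undirected edges as frozensets).
    """
    seen = set(seeds)
    edges = set()
    queue = deque((s, 0) for s in seen)
    while queue:
        user, wave = queue.popleft()
        if wave == max_waves:
            continue  # frontier reached: keep the user, do not expand further
        for friend in get_friends(user):
            edges.add(frozenset((user, friend)))
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, wave + 1))
    return seen, edges

# Toy friendship data, made up for illustration.
toy = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b", "e"], "e": ["d"]}
users, links = snowball_sample(["a"], lambda u: toy.get(u, []), max_waves=2)
print(sorted(users))
```

The sampled users and edges are what a community detection step would then take as its input graph.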

  21. Modularity • $Q$ = (number of edges within communities) − (expected number of edges within communities) • Idea: – dense internal connections between the nodes within modules – sparse connections between different modules • Works as a measure of the quality of partitions and as an objective function to optimize.
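  22. Definition of Modularity $Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$ – $A_{ij}$ = the weight of the edge between $i$ and $j$ – $k_i$ = degree of vertex $i$ – $m = \frac{1}{2} \sum_{i,j} A_{ij}$, number of edges of the graph – $\delta(u, v) = 1$ if $u = v$, $0$ otherwise – $c_i$ is the community of vertex $i$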
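
The definition can be checked numerically. The sketch below evaluates $Q$ directly from the double sum for an unweighted graph. The 9-vertex edge list is an assumption: it is chosen to be consistent with the degrees and the 14 edges used in the deck's Louvain example, not copied from a slide.

```python
from itertools import product

def modularity(edges, community):
    """Q = 1/(2m) * sum_ij [A_ij - k_i*k_j/(2m)] * delta(c_i, c_j)."""
    nodes = sorted(community)
    adj = {(i, j): 0 for i, j in product(nodes, repeat=2)}
    for u, v in edges:
        adj[u, v] += 1
        adj[v, u] += 1
    degree = {i: sum(adj[i, j] for j in nodes) for i in nodes}
    two_m = sum(degree.values())  # every edge is counted from both ends
    total = 0.0
    for i, j in product(nodes, repeat=2):
        if community[i] == community[j]:  # the delta(c_i, c_j) term
            total += adj[i, j] - degree[i] * degree[j] / two_m
    return total / two_m

# Assumed toy graph (14 edges, 9 vertices).
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]

two_groups = {v: "A" if v <= 4 else "B" for v in range(1, 10)}
three_groups = {1: "A", 2: "A", 3: "A", 4: "A", 5: "B", 6: "B", 8: "B", 7: "C", 9: "C"}
one_group = {v: "A" for v in range(1, 10)}

q_two = modularity(EDGES, two_groups)
q_three = modularity(EDGES, three_groups)
q_one = modularity(EDGES, one_group)
print(round(q_two, 4), round(q_three, 4), round(q_one, 4))
```

Putting every vertex in a single community gives $Q = 0$, which is why modularity rewards partitions only when they contain more internal edges than expected by chance.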
  23. Expected Number of Edges Between Two Nodes • $E(i \to j) = k_i \times P(\to j) = k_i \times \frac{k_j}{2m}$ (Lei Tang, Huan Liu, Community Detection and Mining in Social Media, 2010)
  24. Louvain Algorithm • The Louvain algorithm is a greedy heuristic method based on modularity optimization • Each pass of the algorithm consists of two phases 1. Look for small communities by optimizing modularity locally 2. Aggregate vertices in the same community and build a new network whose vertices are the communities • Repeat the passes until a maximum of modularity is attained
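  25. Example [Figure: graph with vertices 1 to 9] • $B_{ij} = A_{ij} - \frac{k_i k_j}{2m}$ • $\Delta Q$: the modularity gain from merging • $c^*_i = \arg\max_j \{ \Delta Q_{ij} \mid j \in n \}$ • $B_{11} = A_{11} - \frac{k_1 k_1}{2m} = 0 - \frac{3 \times 3}{2 \times 14} = -0.32$ • $B_{12} = A_{12} - \frac{k_1 k_2}{2m} = 1 - \frac{3 \times 2}{2 \times 14} = 0.79$ • $B_{13} = 1 - \frac{3 \times 3}{2 \times 14} = 0.68$ • $B_{14} = 1 - \frac{3 \times 4}{2 \times 14} = 0.57$ • $B_{15} = 0 - \frac{3 \times 4}{2 \times 14} = -0.43$ • $B_{16} = 0 - \frac{3 \times 4}{2 \times 14} = -0.43$ • $B_{17} = 0 - \frac{3 \times 4}{2 \times 14} = -0.43$ • $B_{18} = 0 - \frac{3 \times 3}{2 \times 14} = -0.32$ • $B_{19} = 0 - \frac{3 \times 1}{2 \times 14} = -0.11$ • $c^*_1 = 2$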
  26. [Figure: 1st pass, phase 1 on the 9-vertex graph. Vertices merge step by step into {1,2}, {1,2,3}, {1,2,3,4}, {5,8}, {5,6,8}, {7,9}.] Best choice $c^*$ per vertex: 1→2, 2→1, 3→2, 4→1, 5→8, 6→8, 7→9, 8→5, 9→7
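
The arithmetic of this example can be replayed in a few lines. This is a sketch under two assumptions: the edge list is inferred from the degrees in the example (3, 2, 3, 4, 4, 4, 4, 3, 1 with 14 edges) rather than read off a slide, and each vertex picks its best choice among its neighbors, with ties going to the lowest vertex id.

```python
# Assumed edge list, consistent with the degrees and m = 14 used in the example.
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]

neighbors = {}
for u, v in EDGES:
    neighbors.setdefault(u, set()).add(v)
    neighbors.setdefault(v, set()).add(u)

m = len(EDGES)
degree = {v: len(ns) for v, ns in neighbors.items()}

def b(i, j):
    """B_ij = A_ij - k_i * k_j / (2m) for an unweighted graph."""
    a_ij = 1 if j in neighbors[i] else 0
    return a_ij - degree[i] * degree[j] / (2 * m)

# Row of B for vertex 1, rounded to two decimals as on the slide.
row_1 = {j: round(b(1, j), 2) for j in sorted(neighbors)}

# Best neighbor per vertex; max() keeps the first (lowest-id) candidate on ties.
c_star = {i: max(sorted(neighbors[i]), key=lambda j: b(i, j))
          for i in sorted(neighbors)}

print(row_1)
print(c_star)
```

Under these assumptions the printed row matches the $B_{1j}$ values of the example, and the best-choice map matches the one listed for the first phase.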
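  27. [Figure: 1st pass, phase 2. The communities {1,2,3,4}, {5,6,8}, {7,9} become the vertices of a new weighted network with self-loop weights 10, 6, 2 and edge weights 2 ({1,2,3,4}–{5,6,8}) and 3 ({5,6,8}–{7,9}).] Best choice $c^*$ per community: {1,2,3,4}→{1,2,3,4}, {5,6,8}→{5,6,8}, {7,9}→{5,6,8}, giving {{5,6,8}, {7,9}}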
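  28. [Figure: 2nd pass. {5,6,8} and {7,9} merge into {5,6,7,8,9}; the new network has self-loop weights 10 on {1,2,3,4} and 14 on {5,6,7,8,9}, joined by an edge of weight 2.] Best choice $c^*$ per community: {1,2,3,4}→{1,2,3,4}, {5,6,7,8,9}→{5,6,7,8,9}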
  29. Example (cont.) [Figure: four panels. Original graph; 1st pass, phase 1; 1st pass, phase 2 with communities {1,2,3,4}, {5,6,8}, {7,9}; 2nd pass, terminate with communities {1,2,3,4}, {5,6,7,8,9}.]
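
Phase 2 of a pass, the aggregation step, can be sketched as follows. The edge list is again an assumed reconstruction of the 9-vertex example graph, and the self-loop convention (each internal edge counted twice, so that weighted degrees are preserved) is the usual Louvain convention rather than something the slides state.

```python
from collections import Counter

def aggregate(edges, community):
    """Collapse each community into one vertex of a weighted network.

    Returns a Counter keyed by frozenset({c_u, c_v}); a one-element key is a
    self-loop. Internal edges add 2 to the self-loop, other edges add 1.
    """
    weights = Counter()
    for u, v in edges:
        cu, cv = community[u], community[v]
        weights[frozenset((cu, cv))] += 2 if cu == cv else 1
    return weights

# Assumed edge list for the 9-vertex example graph.
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]

# Communities after the first pass: A = {1,2,3,4}, B = {5,6,8}, C = {7,9}.
first_pass = {1: "A", 2: "A", 3: "A", 4: "A", 5: "B", 6: "B", 8: "B", 7: "C", 9: "C"}
# Communities after the second pass: B and C merged.
second_pass = {v: ("A" if c == "A" else "BC") for v, c in first_pass.items()}

w1 = aggregate(EDGES, first_pass)
w2 = aggregate(EDGES, second_pass)
print(dict(w1))
print(dict(w2))
```

With this convention the total weight seen from every community stays equal to the sum of its members' degrees, so the modularity formula can be applied unchanged to the aggregated network.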
  30. INTEREST KEYWORDS HIERARCHY

  31. Closure Table

  32. Interest Keywords Hierarchy [Figure: keyword tree with nodes SNSD, Taeyeon, Bo Peep Bo Peep, Twinkle, YoonA, Gee, Girls Generation, PSY, Gangnam Style]
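
A closure table stores one row for every ancestor/descendant pair of a tree, so that a whole subtree can be fetched with a single query. The sketch below uses an in-memory SQLite database; the table and column names are hypothetical, and the parent/child links between keywords are made up for illustration, since the transcript does not preserve the shape of the tree.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE keyword (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE closure (
        ancestor   INTEGER REFERENCES keyword(id),
        descendant INTEGER REFERENCES keyword(id),
        depth      INTEGER,
        PRIMARY KEY (ancestor, descendant)
    );
""")

def add_keyword(name, parent=None):
    """Insert a keyword and link it to every ancestor of its parent."""
    node = conn.execute("INSERT INTO keyword (name) VALUES (?)", (name,)).lastrowid
    conn.execute("INSERT INTO closure VALUES (?, ?, 0)", (node, node))  # self row
    if parent is not None:
        conn.execute(
            "INSERT INTO closure (ancestor, descendant, depth) "
            "SELECT ancestor, ?, depth + 1 FROM closure WHERE descendant = ?",
            (node, parent),
        )
    return node

def descendants(node):
    """All keywords below `node`, nearest first."""
    rows = conn.execute(
        "SELECT k.name FROM closure c JOIN keyword k ON k.id = c.descendant "
        "WHERE c.ancestor = ? AND c.depth > 0 ORDER BY c.depth, k.name",
        (node,),
    )
    return [name for (name,) in rows]

# Illustrative hierarchy only; the real tree shape is not in the transcript.
snsd = add_keyword("SNSD")
taeyeon = add_keyword("Taeyeon", snsd)
add_keyword("YoonA", snsd)
add_keyword("Gee", snsd)
add_keyword("Twinkle", taeyeon)

print(descendants(snsd))
```

The trade-off is extra rows on insert (one per ancestor) in exchange for subtree and ancestor lookups that need no recursion.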
  33. CRAWLING SYSTEM How to dump Plurk.com?

  34. None
  35. Overview of Crawling System

  36. ZeroMQ: The Intelligent Transport Layer

  37. Work-flow of Crawling Task Queue

  38. Plurk API • Plurk API 2.0 is based on the OAuth Core 1.0a standard • Requests should be signed using HMAC-SHA1 • API returns JSON-encoded data • No request rate limit
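
To give a feel for what HMAC-SHA1 request signing involves, here is a minimal Python sketch of the OAuth 1.0a signature computation using only the standard library. It is not the plurk-oauth implementation: the URL, keys, and parameter values are placeholders, and the sketch ignores repeated parameter names and query strings embedded in the URL.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote

def pct(value):
    """Percent-encode as OAuth 1.0a requires: only unreserved chars stay literal."""
    return quote(str(value), safe="-._~")

def signature_base_string(method, url, params):
    """METHOD & encoded URL & encoded, sorted, '&'-joined parameters."""
    pairs = sorted((pct(k), pct(v)) for k, v in params.items())
    normalized = "&".join(f"{k}={v}" for k, v in pairs)
    return "&".join([method.upper(), pct(url), pct(normalized)])

def sign_hmac_sha1(base_string, consumer_secret, token_secret=""):
    """Base64 of HMAC-SHA1 keyed with 'consumer_secret&token_secret'."""
    key = f"{pct(consumer_secret)}&{pct(token_secret)}".encode()
    digest = hmac.new(key, base_string.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()

# Placeholder request: none of these values are real credentials or endpoints.
params = {
    "oauth_consumer_key": "ck",
    "oauth_nonce": "n1",
    "oauth_signature_method": "HMAC-SHA1",
    "oauth_timestamp": "1354000000",
    "oauth_token": "tk",
    "oauth_version": "1.0",
    "limit": "20",
}
base = signature_base_string("GET", "https://api.example.com/resource", params)
signature = sign_hmac_sha1(base, "consumer-secret", "token-secret")
print(base)
print(signature)
```

Every request repeats this encode, sort, and hash sequence, which is consistent with HMAC-SHA1 showing up as a hot spot once many requests run concurrently.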
  39. Plurk API Library • Original provider – plurk-oauth by clsung • Performance bottlenecks – HTTP persistent connection – JSON decoding – HMAC-SHA1 • Enhancements – HTTP connection pool – C extensions for JSON and HMAC-SHA1
  40. Performance Comparison (seconds, by concurrency) • Concurrency: 8, 16, 32, 64, 128, 256, 512 • Original: 53.71, 27.49, 15.44, 15.50, 14.97, 13.21, 13.13 • Enhanced: 52.77, 26.74, 14.10, 11.17, 9.45, 7.94, 7.08
  41. An Example of a Plurk

  42. Plurk Attributes • _id – The unique plurk id, used for identification of the plurk • owner – The owner/poster of this plurk • content – The formatted and filtered content, e.g. URLs are turned into text tags and emoticons are filtered out • content_raw – The raw content as the user entered it • posted – The date this plurk was posted, in ISODate format
  43. Plurks Preprocessing

  44. URL Filtering

  45. URL Filtering (cont.)

  46. URL Filtering (cont.)

  47. Normalization

  48. Tokenization
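
The preprocessing slides survive only as titles, so the following is a guess at the general shape of such a pipeline rather than the thesis's actual rules: strip URLs, normalize the text, then split it into tokens. The regular expressions, the choice of NFKC normalization, and lowercasing are all assumptions.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")   # assumed URL pattern
TOKEN_RE = re.compile(r"\w+")          # runs of letters, digits, underscore

def preprocess(content_raw):
    """URL filtering -> normalization -> tokenization (illustrative only)."""
    text = URL_RE.sub(" ", content_raw)          # URL filtering
    text = unicodedata.normalize("NFKC", text)   # fold full-width forms to ASCII
    text = text.lower()                          # case normalization
    return TOKEN_RE.findall(text)                # naive tokenization

tokens = preprocess("Watching Ｇｅｅ by SNSD http://youtu.be/abc123 !!")
print(tokens)
```

The `\w+` tokenizer returns an unbroken run of Chinese characters as a single token, so a real pipeline for Chinese plurks would need a proper word segmenter at the last step.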

  49. Celery Task Queue

  50. Celery Task Queue

  51. Datastore Architecture • Why MongoDB? – Auto-sharding – Replica sets • MongoDB cluster – mongos – Config servers – Shard servers • Deployed to the Delta cloud cluster
  52. MongoDB Server Layout

  53. Cluster Configuration

  54. Delta Cloud Server

  55. Delta Cloud Server (cont.)

  56. EXPERIMENTS

  57. Environment

  58. Experiment • Sample 40 public plurkers • public: get the top-64 most frequent interest keywords • private: treat the plurker as private, derive his interest keywords from communities, and get the top-64 most frequent interest keywords • len(intersect(public, private))
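
The evaluation measure, `len(intersect(public, private))`, can be written out as a short Python sketch. The function names and the toy keyword lists are hypothetical; only the top-64 cutoff and the set intersection come from the slide.

```python
from collections import Counter

def top_keywords(keywords, k=64):
    """The k most frequent keywords in an iterable of keyword occurrences."""
    return {word for word, _ in Counter(keywords).most_common(k)}

def matching(public_keywords, derived_keywords, k=64):
    """Size of the overlap between the two top-k keyword sets."""
    return len(top_keywords(public_keywords, k) & top_keywords(derived_keywords, k))

# Toy data: keywords from a plurker's own plurks vs. keywords derived
# from the plurker's communities.
public = ["kpop"] * 3 + ["snsd"] * 2 + ["psy"]
derived = ["snsd"] * 4 + ["psy"] * 2 + ["baseball"]

print(matching(public, derived, k=2))
print(matching(public, derived))
```

A score close to 64 would mean the community-derived keywords reproduce nearly all of the keywords the plurker actually uses.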
  59. Result [Histogram: number of plurkers by # matching keywords] • 21~25: 3 • 26~30: 6 • 31~35: 7 • 36~40: 16 • 41~45: 4 • 46~50: 3 • 51~55: 1
  60. LIVE DEMO Never

  61. None
  62. None
  63. None
  64. None
  65. None
  66. None
  67. None
  68. CONCLUSIONS AND FUTURE WORKS

  69. Conclusions • Construct an online SNSD system for Plurk users to find interest topics and relationships • Develop a new scalable crawling framework based on ZeroMQ • Patch the plurk-oauth library • Build a website for visualizing interests and relationships with D3.js
  70. Future Works • Interest hierarchy: – Manageable UI – Recommend by users • Apply the SNSD system to Twitter for Western languages and Sina Weibo for mainland China • Employ other community detection algorithms and optimize NetworkX
  71. Future Works (cont.) • Consider responses to a plurk and fan relationships in interest derivation • Serve as a Plurk full-text search engine
  72. Q & A Thank you for listening.

  73. CS Workstation Architecture

  74. Delta Cluster Architecture