Mining Interest Topics from Plurk

Ken Lee

May 23, 2013

Transcript

  1. Outline
    • Introduction
      – Why and what we do in this thesis?
    • The SNSD system
      – Community detection
      – Interest hierarchy
    • Implementation
      – Preprocessing
      – Celery task queue
    • Experiments
    • Conclusions and future work
  2. The Go!Plurk Project
    Issues:
    1. Unable to analyze private users
    2. The pie chart is too simple and shows no detailed interest information
  3. Social Networking Service Discovery
    • Discover users’ interest topics via
      1. Contents (plurks) posted by users
      2. Interest information aggregated from communities, for private users
    • Have to prepare
      – Relationships
      – Plurks
  4. Modularity
    • Q = (number of edges within communities) − (expected number of edges within communities)
    • Idea:
      – dense internal connections between the nodes within modules
      – sparse connections between different modules
    • Works as a measure of the quality of partitions and as an objective function to optimize.
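  5. Definition of Modularity
    $$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$$
    – $A_{ij}$ = the weight of the edge between $i$ and $j$
    – $k_i$ = degree of vertex $i$
    – $m = \frac{1}{2} \sum_{i,j} A_{ij}$, number of edges of the graph
    – $\delta(c_i, c_j) = 1$ if $c_i = c_j$, and $0$ otherwise
    – $c_i$ is the community of vertex $i$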
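The definition can be checked numerically. The following is a minimal sketch, not code from the thesis: the function name `modularity`, the 9-vertex edge list, and the two partitions are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical undirected, unweighted graph with 14 edges (an assumption).
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]


def modularity(edges, community_of):
    """Q = 1/(2m) * sum_ij [A_ij - k_i*k_j/(2m)] * delta(c_i, c_j)."""
    adjacency = defaultdict(float)  # A_ij, stored for (i, j) and (j, i)
    degree = defaultdict(float)     # k_i
    for i, j in edges:
        adjacency[(i, j)] += 1.0
        adjacency[(j, i)] += 1.0
        degree[i] += 1.0
        degree[j] += 1.0
    two_m = sum(degree.values())    # 2m equals the sum of all degrees
    total = 0.0
    for i in degree:
        for j in degree:
            if community_of[i] == community_of[j]:  # delta(c_i, c_j) = 1
                total += adjacency.get((i, j), 0.0) - degree[i] * degree[j] / two_m
    return total / two_m


# Two communities versus everything in one community.
split = {1: "a", 2: "a", 3: "a", 4: "a", 5: "b", 6: "b", 7: "b", 8: "b", 9: "b"}
q_split = modularity(EDGES, split)
q_single = modularity(EDGES, {v: "all" for v in range(1, 10)})
print(f"two communities: {q_split:.4f}, one community: {q_single:.4f}")
```

Putting every vertex in one community gives Q = 0, because the observed and expected edge counts cancel; a partition only scores above zero when its communities hold more internal edges than expected.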
  6. Expected Number of Edges Between Two Nodes
    • $E(i \to j) = k_i \times P(i \to j) = k_i \times \frac{k_j}{2m}$
    Lei Tang, Huan Liu, Community Detection and Mining in Social Media, 2010
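As a sanity check on this null model, summing the expected count $k_i k_j / 2m$ over all ordered pairs of vertices should give back the total degree $2m$, the same value as $\sum_{i,j} A_{ij}$. The short derivation below assumes the notation of the modularity definition ($k_i$ for degree, $m$ for the number of edges).

```latex
\sum_{i}\sum_{j} \frac{k_i k_j}{2m}
  = \frac{1}{2m} \Big(\sum_{i} k_i\Big) \Big(\sum_{j} k_j\Big)
  = \frac{(2m)(2m)}{2m}
  = 2m
```

So the expected edges only redistribute the same total weight, which is why a single community containing every vertex has modularity zero.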
  7. Louvain Algorithm
    • The Louvain algorithm is a greedy heuristic method based on modularity optimization
    • Each pass of the algorithm consists of two phases
      1. Look for small communities by optimizing modularity locally
      2. Aggregate vertices in the same community and build a new network whose vertices are the communities
    • Repeat the passes until a maximum of modularity is attained
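  8. Example
    • $Q_{ij} = A_{ij} - \frac{k_i k_j}{2m}$
    • $\Delta Q$ = modularity after a merge − modularity before the merge
    • $c_i^* = \arg\max_j \{ \Delta Q \mid j \in n(i) \}$, where $n(i)$ is the set of neighbors of vertex $i$
    [Figure: example graph with vertices 1 to 9]
    • $Q_{11} = A_{11} - \frac{k_1 k_1}{2m} = 0 - \frac{3 \times 3}{2 \times 14} = -0.32$
    • $Q_{12} = A_{12} - \frac{k_1 k_2}{2m} = 1 - \frac{3 \times 2}{2 \times 14} = 0.79$
    • $Q_{13} = 1 - \frac{3 \times 3}{2 \times 14} = 0.68$
    • $Q_{14} = 1 - \frac{3 \times 4}{2 \times 14} = 0.57$
    • $Q_{15} = 0 - \frac{3 \times 4}{2 \times 14} = -0.43$
    • $Q_{16} = 0 - \frac{3 \times 4}{2 \times 14} = -0.43$
    • $Q_{17} = 0 - \frac{3 \times 4}{2 \times 14} = -0.43$
    • $Q_{18} = 0 - \frac{3 \times 3}{2 \times 14} = -0.32$
    • $Q_{19} = 0 - \frac{3 \times 1}{2 \times 14} = -0.11$
    • $c_1^* = 2$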
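The arithmetic on this slide can be reproduced in a few lines. This is a sketch under stated assumptions: the graph figure is not part of the transcript, so the edge list below is a guess chosen to match the degrees and the edge count (m = 14) used in the calculations. Scoring each neighbor by Q_ij and breaking ties toward the lowest-numbered vertex is also an assumption. The helper name `pair_scores` is hypothetical.

```python
# Assumed edge list: it gives k_1=3, k_2=2, k_3=3, k_4=4, k_5=4, k_6=4,
# k_7=4, k_8=3, k_9=1 and m=14, matching the numbers on the slide.
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]


def pair_scores(edges):
    """Return the neighbor sets and a function q(i, j) = A_ij - k_i*k_j/(2m)."""
    neighbors = {}
    for i, j in edges:
        neighbors.setdefault(i, set()).add(j)
        neighbors.setdefault(j, set()).add(i)
    two_m = 2 * len(edges)
    degree = {v: len(ns) for v, ns in neighbors.items()}

    def q(i, j):
        a_ij = 1 if j in neighbors[i] else 0
        return a_ij - degree[i] * degree[j] / two_m

    return neighbors, q


neighbors, q = pair_scores(EDGES)

# Q_1j for j = 1..9, rounded to two decimals as on the slide.
q_row_1 = [round(q(1, j), 2) for j in range(1, 10)]

# Best neighbor for every vertex; max() keeps the first of any tied candidates.
best = {i: max(sorted(neighbors[i]), key=lambda j: q(i, j))
        for i in sorted(neighbors)}

print(q_row_1)
print(best)
```

With this assumed edge list, the best neighbor of vertex 1 is vertex 2, and the full row of best neighbors is 2 1 2 1 8 8 9 5 7, the same sequence of numbers that appears in the table on the following slide.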
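  9. [Figure: the example graph with vertices 1 to 9, before and after grouping]
    • Best merge target $c_i^*$ for each vertex $i$:
      – $i$: 1 2 3 4 5 6 7 8 9
      – $c_i^*$: 2 1 2 1 8 8 9 5 7
    • Communities shown as they grow: {1,2}, {1,2,3}, {1,2,3,4}; {5,8}, {5,6,8}; {7,9}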
  10. [Figure: aggregated network with vertices {1,2,3,4}, {5,6,8}, and {7,9}; weights shown: 2, 6, 10, 2, 3]
    • Best merge target $c^*$ for each community:
      – {1,2,3,4}: {1,2,3,4}
      – {5,6,8}: {5,6,8}
      – {7,9}: {5,6,8}
    • Resulting merge: {{5,6,8}, {7,9}}
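  11. [Figure: aggregated network with vertices {1,2,3,4}, {5,6,8}, and {7,9} (weights shown: 2, 6, 10, 2, 3), and the network after merging, with vertices {1,2,3,4} and {5,6,7,8,9} (weights shown: 10, 14, 2)]
    • Best merge target $c^*$ for each community:
      – {1,2,3,4}: {1,2,3,4}
      – {5,6,7,8,9}: {5,6,7,8,9}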
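  12. Example (cont.)
    [Figure: four panels summarizing the example]
    • original: the graph with vertices 1 to 9
    • 1st pass, phase 1: the graph with vertices 1 to 9
    • 1st pass, phase 2: {1,2,3,4}, {5,6,8}, {7,9} (weights shown: 2, 6, 10, 2, 3)
    • 2nd pass, terminate: {1,2,3,4}, {5,6,7,8,9} (weights shown: 10, 14, 2)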
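Phase 2 of each pass, the aggregation step, can be sketched as follows. The edge list is an assumption (the figure is not in the transcript), the function name `aggregate` is hypothetical, and counting each internal edge twice in a community's self-loop is the usual Louvain convention rather than something the slides state.

```python
from collections import defaultdict

# Assumed 14-edge example graph; see the lead-in for the caveat.
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]


def aggregate(edges, community_of):
    """Collapse each community into one vertex and sum the edge weights."""
    weights = defaultdict(int)
    for i, j in edges:
        a, b = community_of[i], community_of[j]
        if a == b:
            # Internal edge: contributes 2 to the self-loop, once per endpoint.
            weights[(a, a)] += 2
        else:
            weights[tuple(sorted((a, b)))] += 1
    return dict(weights)


# Communities after the 1st pass, phase 1.
first_pass = {1: "1234", 2: "1234", 3: "1234", 4: "1234",
              5: "568", 6: "568", 8: "568", 7: "79", 9: "79"}
network_1 = aggregate(EDGES, first_pass)

# Communities after the 2nd pass, where {5,6,8} and {7,9} have merged.
second_pass = {v: ("1234" if v <= 4 else "56789") for v in range(1, 10)}
network_2 = aggregate(EDGES, second_pass)

print(network_1)
print(network_2)
```

Under these assumptions the first aggregation yields self-loops of 10, 6, and 2 with connecting weights 2 and 3, and the second yields self-loops of 10 and 14 with a connecting weight of 2. These are the same sets of numbers that appear in the figures.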
  13. Interest Keywords Hierarchy
    [Figure: keyword hierarchy with the labels SNSD, Taeyeon, Bo Peep Bo Peep, Twinkle, YoonA, Gee, Girls Generation, PSY, Gangnam Style]
  14. Plurk API
    • Plurk API 2.0 is based on the OAuth Core 1.0a standard
    • Requests should be signed using HMAC-SHA1
    • The API returns JSON-encoded data
    • No request rate limit
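To make the signing requirement concrete, here is a minimal sketch of OAuth 1.0a HMAC-SHA1 signing using only the Python standard library. It is not the plurk-oauth implementation. The helper names, the credentials, the nonce, the timestamp, and the endpoint URL are placeholders chosen for illustration.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def percent_encode(value):
    # OAuth 1.0a uses RFC 3986 encoding: only unreserved characters stay raw.
    return quote(str(value), safe="-._~")


def signature_base_string(method, url, params):
    # Encode, sort, and join the parameters, then join the three parts with "&".
    pairs = sorted((percent_encode(k), percent_encode(v)) for k, v in params.items())
    normalized = "&".join(f"{k}={v}" for k, v in pairs)
    return "&".join([method.upper(), percent_encode(url), percent_encode(normalized)])


def sign_hmac_sha1(base_string, consumer_secret, token_secret=""):
    # The key is the two encoded secrets joined by "&".
    key = f"{percent_encode(consumer_secret)}&{percent_encode(token_secret)}"
    digest = hmac.new(key.encode("ascii"), base_string.encode("ascii"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


# Hypothetical credentials and request parameters, for illustration only.
params = {
    "oauth_consumer_key": "APP_KEY",
    "oauth_token": "ACCESS_TOKEN",
    "oauth_signature_method": "HMAC-SHA1",
    "oauth_timestamp": "1369267200",
    "oauth_nonce": "abc123",
    "oauth_version": "1.0",
    "plurk_id": "123",
}
url = "https://www.plurk.com/APP/Timeline/getPlurk"  # illustrative endpoint
base_string = signature_base_string("POST", url, params)
signature = sign_hmac_sha1(base_string, "APP_SECRET", "ACCESS_TOKEN_SECRET")
print(signature)
```

The resulting value would be sent as the `oauth_signature` parameter. Because every request needs one HMAC-SHA1 computation, this step is a plausible hot spot when many requests are issued concurrently.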
  15. Plurk API Library
    • Original provider
      – plurk-oauth by clsung
    • Performance bottlenecks
      – HTTP persistent connection
      – JSON decode
      – HMAC-SHA1
    • Enhancements
      – HTTP connection pool
      – C extension for JSON and HMAC-SHA1
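The connection-pool enhancement can be illustrated with a small sketch. The deck does not show the patched library, so the class below is a hypothetical stand-in built on the standard library: it keeps a fixed number of `http.client.HTTPSConnection` objects and hands them out for reuse, so persistent connections are not reopened for every request.

```python
import http.client
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Hypothetical fixed-size pool of reusable HTTPS connections."""

    def __init__(self, host, size=4):
        self._idle = queue.LifoQueue(maxsize=size)
        for _ in range(size):
            # The socket is opened lazily, on the first request.
            self._idle.put(http.client.HTTPSConnection(host, timeout=10))

    @contextmanager
    def connection(self):
        conn = self._idle.get()   # blocks while every connection is in use
        try:
            yield conn
        finally:
            self._idle.put(conn)  # return the connection for reuse


pool = ConnectionPool("www.plurk.com", size=2)
with pool.connection() as first:
    pass  # a real caller would issue first.request(...) here
with pool.connection() as second:
    pass
print(first is second)  # the most recently returned connection is handed out again
```

A LIFO queue is used so that the most recently used connection, which is the one most likely to still be open, is reused first. A production pool would also need to replace connections that the server has closed.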
  16. Performance Comparison
    [Figure: elapsed time in seconds versus concurrency, Original and Enhanced]
    • Concurrency: 8, 16, 32, 64, 128, 256, 512
    • Original (seconds): 53.71, 27.49, 15.44, 15.50, 14.97, 13.21, 13.13
    • Enhanced (seconds): 52.77, 26.74, 14.10, 11.17, 9.45, 7.94, 7.08
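The gap between the two series is easier to read as a speedup ratio. The snippet below is a small worked calculation on the chart values. It assumes that the first series of numbers belongs to Original and the second to Enhanced, following the legend order, and that the values line up with the concurrency levels from left to right.

```python
CONCURRENCY = [8, 16, 32, 64, 128, 256, 512]
ORIGINAL_S = [53.71, 27.49, 15.44, 15.50, 14.97, 13.21, 13.13]
ENHANCED_S = [52.77, 26.74, 14.10, 11.17, 9.45, 7.94, 7.08]

# Speedup = original time / enhanced time at the same concurrency level.
speedup = {c: o / e for c, o, e in zip(CONCURRENCY, ORIGINAL_S, ENHANCED_S)}

for c in CONCURRENCY:
    print(f"concurrency {c:>3}: {speedup[c]:.2f}x")
```

Under that pairing the two libraries are nearly equal at low concurrency, and the enhanced library pulls ahead as concurrency grows, while the original library stops improving beyond a concurrency of about 32.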
  17. Plurk Attributes
    • _id – The unique plurk id, used for identification of the plurk
    • owner – The owner/poster of this plurk
    • content – The formatted and filtered content, e.g. URLs are turned into text tags and emoticons are filtered out
    • content_raw – The raw content as the user entered it
    • posted – The date this plurk was posted, in ISODate format
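A stored plurk with these attributes might be built as in the sketch below. The input field names (`plurk_id`, `owner_id`, and an RFC 1123 style `posted` string) are assumptions about the shape of an API response, and `to_document` is a hypothetical helper. Only the five output attribute names come from the slide.

```python
from email.utils import parsedate_to_datetime


def to_document(api_plurk):
    """Map an assumed API response dict to the stored attribute names."""
    return {
        "_id": api_plurk["plurk_id"],
        "owner": api_plurk["owner_id"],
        "content": api_plurk["content"],
        "content_raw": api_plurk["content_raw"],
        # Parse the date string into a datetime so the datastore can keep it as a date.
        "posted": parsedate_to_datetime(api_plurk["posted"]),
    }


# Made-up sample data, for illustration only.
sample = {
    "plurk_id": 123,
    "owner_id": 456,
    "content": "Listening to Gee",
    "content_raw": "Listening to Gee :)",
    "posted": "Thu, 23 May 2013 08:00:00 GMT",
}
doc = to_document(sample)
print(doc["_id"], doc["posted"].isoformat())
```

Using the plurk id as `_id` means that crawling the same plurk twice overwrites the existing record instead of creating a duplicate.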
  18. Datastore Architecture
    • Why MongoDB?
      – Auto-sharding
      – Replica sets
    • MongoDB cluster
      – mongos
      – Config servers
      – Shard servers
    • Deploy to Delta cloud cluster
  19. Experiment
    • Sample 40 public plurkers
    • public: get the top-64 most frequent interest keywords
    • private: treat the plurker as private, derive his or her interest keywords from communities, and get the top-64 most frequent interest keywords
    • Measure: len(intersect(public, private))
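The comparison metric can be written out as a short sketch. The function names and the toy keyword lists are assumptions. The experiment uses the top 64 keywords, while the usage example uses n=2 to stay readable.

```python
from collections import Counter


def top_keywords(keywords, n=64):
    """Return the set of the n most frequent keywords."""
    return {word for word, _ in Counter(keywords).most_common(n)}


def matching(public_keywords, private_keywords, n=64):
    """len(intersect(public, private)) over the two top-n keyword sets."""
    return len(top_keywords(public_keywords, n) & top_keywords(private_keywords, n))


# Toy data: keywords from a plurker's own plurks versus keywords derived
# from the plurker's communities.
public = ["snsd"] * 3 + ["psy"] * 2 + ["gee"]
private = ["gee"] * 3 + ["snsd"] * 2 + ["taeyeon"]
n_match = matching(public, private, n=2)
print(n_match)
```

With n=64 the score ranges from 0 (no overlap) to 64 (identical top keyword sets), so a higher score means the community-derived keywords approximate the plurker's own interests more closely.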
  20. Result
    [Figure: histogram of # matching; each bin is listed with its count]
    • 21 ~ 25: 3
    • 26 ~ 30: 6
    • 31 ~ 35: 7
    • 36 ~ 40: 16
    • 41 ~ 45: 4
    • 46 ~ 50: 3
    • 51 ~ 55: 1
  21. Conclusions
    • Construct an online SNSD system for Plurk users to find interesting topics and relationships
    • Develop a new scalable crawling framework based on ZeroMQ
    • Patch the plurk-oauth library
    • Build a website for visualizing interests and relationships with D3.js
  22. Future Work
    • Interest hierarchy:
      – Manageable UI
      – Recommend by users
    • Apply the SNSD system to Twitter for Western languages and Sina Weibo for mainland China
    • Employ other community detection algorithms and optimize NetworkX
  23. Future Work (cont.)
    • Consider responses to a plurk and fan relationships in interest derivation
    • Serve as a Plurk full-text search engine