Mining Interest Topics from Plurk

Ken Lee

May 23, 2013

Transcript

  1. Outline
    • Introduction
      – Why and what we do in this thesis?
    • The SNSD system
      – Community detection
      – Interest hierarchy
    • Implementation
      – Preprocessing
      – Celery task queue
    • Experiments
    • Conclusions and future works
  2. The Go!Plurk Project
    • Issues:
      1. Unable to analyze private users
      2. The pie chart is too simple and shows no detailed interest information
  3. Social Networking Service Discovery
    • Discover users' interest topics via
      1. Content (plurks) posted by users
      2. Interest information aggregated from communities, for private users
    • Data to prepare
      – Relationships
      – Plurks
  4. Modularity
    • Q = (number of edges within communities) − (expected number of edges within communities)
    • Idea:
      – dense internal connections between the nodes within modules
      – sparse connections between different modules
    • Serves as a quality measure for partitions and as an objective function to optimize
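  5. Definition of Modularity
    $$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$$
    – $A_{ij}$ = the weight of the edge between $i$ and $j$
    – $k_i$ = degree of vertex $i$
    – $m = \frac{1}{2} \sum_{i,j} A_{ij}$, the number of edges of the graph
    – $\delta(u, v) = 1$ if $u = v$, $0$ otherwise
    – $c_i$ is the community of vertex $i$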
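The definition above can be evaluated directly as a double sum. The following is a minimal sketch, not the thesis implementation: the function name `modularity`, the edge-list input format, and the assumption of an undirected graph without self-loops are all illustrative choices.

```python
from collections import defaultdict

def modularity(edges, community):
    """Q = (1/2m) * sum_ij [A_ij - k_i*k_j/(2m)] * delta(c_i, c_j).

    edges: iterable of (i, j, weight) for an undirected graph (no self-loops assumed).
    community: dict mapping each vertex to its community label.
    """
    # Build a symmetric weighted adjacency map A[i][j].
    adj = defaultdict(lambda: defaultdict(float))
    for i, j, w in edges:
        adj[i][j] += w
        adj[j][i] += w
    degree = {i: sum(nbrs.values()) for i, nbrs in adj.items()}
    two_m = sum(degree.values())  # 2m equals the sum of all degrees
    q = 0.0
    for i in adj:
        for j in adj:
            if community[i] == community[j]:  # delta(c_i, c_j)
                q += adj[i].get(j, 0.0) - degree[i] * degree[j] / two_m
    return q / two_m
```

For example, on a hypothetical graph made of two triangles joined by a single edge, placing each triangle in its own community gives Q = 5/14, while placing every vertex in one community gives Q = 0.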
  6. Expected Number of Edges Between Two Nodes
    • Expected number of edges $i \to j$ = $k_i \times P(\to j) = k_i \times \frac{k_j}{2m}$
    Lei Tang, Huan Liu, Community Detection and Mining in Social Media, 2010
  7. Louvain Algorithm
    • The Louvain algorithm is a greedy heuristic method based on modularity optimization
    • Each pass of the algorithm consists of two phases:
      1. Look for small communities by optimizing modularity locally
      2. Aggregate vertices in the same community and build a new network whose vertices are the communities
    • The passes are repeated until a maximum of modularity is attained
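The two phases can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the implementation used in the thesis: the function names (`local_moving`, `aggregate`, `louvain`) and the edge-list input are hypothetical, vertices are visited in insertion order rather than at random, and the input graph is assumed to be undirected with no self-loops.

```python
from collections import defaultdict

def local_moving(adj):
    """Phase 1: move each vertex to the neighbouring community with the best gain."""
    degree = {u: sum(nbrs.values()) for u, nbrs in adj.items()}
    two_m = sum(degree.values())
    comm = {u: u for u in adj}   # every vertex starts in its own community
    tot = dict(degree)           # tot[c] = sum of degrees inside community c
    moved_any, improved = False, True
    while improved:
        improved = False
        for u in adj:
            old = comm[u]
            tot[old] -= degree[u]            # take u out of its community
            links = defaultdict(float)       # weight from u to each community
            for v, w in adj[u].items():
                if v != u:                   # self-loops do not count as links
                    links[comm[v]] += w
            best, best_gain = old, links[old] - tot[old] * degree[u] / two_m
            for c, k_in in links.items():
                gain = k_in - tot[c] * degree[u] / two_m
                if gain > best_gain + 1e-12:
                    best, best_gain = c, gain
            comm[u] = best
            tot[best] += degree[u]
            if best != old:
                improved = moved_any = True
    return comm, moved_any

def aggregate(adj, comm, members):
    """Phase 2: build a new graph whose vertices are the communities."""
    new_adj = defaultdict(lambda: defaultdict(float))
    new_members = defaultdict(set)
    for u, nbrs in adj.items():
        new_members[comm[u]] |= members[u]
        for v, w in nbrs.items():
            # internal edges accumulate on the diagonal as a self-loop
            new_adj[comm[u]][comm[v]] += w
    return new_adj, new_members

def louvain(edges):
    """Repeat both phases until phase 1 no longer moves any vertex."""
    adj = defaultdict(lambda: defaultdict(float))
    for u, v, w in edges:
        adj[u][v] += w
        adj[v][u] += w
    members = {u: {u} for u in adj}
    while True:
        comm, moved = local_moving(adj)
        if not moved:
            return list(members.values())
        adj, members = aggregate(adj, comm, members)
```

The gain used in `local_moving` is the usual simplified form (links into the candidate community minus the expected links), which is enough to rank candidate communities for a single vertex.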
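  8. Example
    [Figure: example graph with nine vertices, numbered 1 to 9]
    • $Q_{ij} = A_{ij} - \frac{k_i k_j}{2m}$
    • $\Delta Q$ = [expression not recoverable from the transcript]
    • $j^* = \arg\max \{\Delta Q\}$ [constraint set not recoverable from the transcript]
    • $Q_{11} = A_{11} - \frac{k_1 k_1}{2m} = 0 - \frac{3 \times 3}{2 \times 14} = -0.32$
    • $Q_{12} = A_{12} - \frac{k_1 k_2}{2m} = 1 - \frac{3 \times 2}{2 \times 14} = 0.79$
    • $Q_{13} = 1 - \frac{3 \times 3}{2 \times 14} = 0.68$
    • $Q_{14} = 1 - \frac{3 \times 4}{2 \times 14} = 0.57$
    • $Q_{15} = 0 - \frac{3 \times 4}{2 \times 14} = -0.43$
    • $Q_{16} = 0 - \frac{3 \times 4}{2 \times 14} = -0.43$
    • $Q_{17} = 0 - \frac{3 \times 4}{2 \times 14} = -0.43$
    • $Q_{18} = 0 - \frac{3 \times 3}{2 \times 14} = -0.32$
    • $Q_{19} = 0 - \frac{3 \times 1}{2 \times 14} = -0.11$
    • $j^*(1) = 2$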
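The arithmetic on this slide can be checked with a few lines of Python. The degrees and the neighbours of vertex 1 below are read off the slide's own numbers (2m = 28); the variable and function names are illustrative.

```python
# Degrees k_1..k_9 as they appear in the slide's arithmetic (they sum to 2m = 28).
degree = {1: 3, 2: 2, 3: 3, 4: 4, 5: 4, 6: 4, 7: 4, 8: 3, 9: 1}
neighbors_of_1 = {2, 3, 4}  # A_1j = 1 exactly for these j on the slide
two_m = sum(degree.values())

def q(i, j, neighbors_i):
    """Q_ij = A_ij - k_i * k_j / (2m) for an unweighted graph."""
    a_ij = 1 if j in neighbors_i else 0
    return a_ij - degree[i] * degree[j] / two_m

# Per-vertex scores for vertex 1, rounded to two decimals as on the slide.
scores = {j: round(q(1, j, neighbors_of_1), 2) for j in degree}

# Best partner for vertex 1 among the other vertices.
best = max((j for j in degree if j != 1), key=lambda j: q(1, j, neighbors_of_1))
```

Here `scores` reproduces the values listed on the slide and `best` is vertex 2, which agrees with the slide's final line.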
  9. [Figure: the example graph during the first pass; communities form step by step as {1,2}, {1,2,3}, {1,2,3,4}, {5,8}, {5,6,8}, {7,9}]
    • Best choice $j^*$ for each vertex $i$:
      $i$:   1 2 3 4 5 6 7 8 9
      $j^*$: 2 1 2 1 8 8 9 5 7
  10. [Figure: aggregated graph with the vertices {1,2,3,4}, {5,6,8} and {7,9}; weight labels 2, 6, 10, 2, 3]
    • Best choice for each community:
      {1,2,3,4} → {1,2,3,4}
      {5,6,8} → {5,6,8}
      {7,9} → {5,6,8}
    • {5,6,8} and {7,9} are grouped: {{5,6,8}, {7,9}}
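  11. [Figure: the aggregated graph of {1,2,3,4}, {5,6,8} and {7,9} (weight labels 2, 6, 10, 2, 3) reduced to {1,2,3,4} and {5,6,7,8,9} (weight labels 10, 14, 2)]
    • Best choice for each community:
      {1,2,3,4} → {1,2,3,4}
      {5,6,7,8,9} → {5,6,7,8,9}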
  12. Example (cont.)
    [Figure: four panels. Original graph; 1st pass, phase 1; 1st pass, phase 2 with the vertices {1,2,3,4}, {5,6,8}, {7,9}; 2nd pass, terminate, with the vertices {1,2,3,4}, {5,6,7,8,9}]
  13. Interest Keywords Hierarchy
    [Figure: keyword hierarchy with the nodes SNSD, Taeyeon, Bo Peep Bo Peep, Twinkle, YoonA, Gee, Girls Generation, PSY, Gangnam Style]
  14. Plurk API
    • Plurk API 2.0 is based on the OAuth Core 1.0a standard
    • Requests should be signed using HMAC-SHA1
    • API returns JSON encoded data
    • No request rate limit
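As a rough illustration of what HMAC-SHA1 signing under OAuth 1.0a involves, the sketch below builds the signature base string and signs it using only the Python standard library. It is not the plurk-oauth code: the function name `sign_request` is hypothetical, the URL must be passed without a query string, and a real request also needs the standard `oauth_*` parameters (consumer key, nonce, timestamp and so on) in `params`.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote

def _enc(value):
    # RFC 3986 percent-encoding: only unreserved characters stay as-is.
    return quote(str(value), safe="-._~")

def sign_request(method, url, params, consumer_secret, token_secret=""):
    """Return the base64-encoded HMAC-SHA1 signature for an OAuth 1.0a request."""
    # 1. Normalise parameters: encode, sort by name then value, join with '&'.
    pairs = sorted((_enc(k), _enc(v)) for k, v in params.items())
    param_string = "&".join(f"{k}={v}" for k, v in pairs)
    # 2. Signature base string: METHOD & enc(url) & enc(parameter string).
    base_string = "&".join([method.upper(), _enc(url), _enc(param_string)])
    # 3. Signing key: enc(consumer_secret) & enc(token_secret).
    key = f"{_enc(consumer_secret)}&{_enc(token_secret)}"
    digest = hmac.new(key.encode(), base_string.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()
```

The result would be sent as the `oauth_signature` parameter. Because the parameters are sorted before signing, the order in which they are supplied does not affect the signature.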
  15. Plurk API Library
    • Original provider
      – plurk-oauth by clsung
    • Performance bottlenecks
      – HTTP persistent connection
      – JSON decode
      – HMAC-SHA1
    • Enhancements
      – HTTP connection pool
      – C extension for JSON and HMAC-SHA1
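The idea behind an HTTP connection pool is to keep a fixed set of persistent connections and hand them out to workers instead of opening a new connection per request. The sketch below is an assumption-laden illustration, not the patch applied to plurk-oauth: the class name `ConnectionPool`, the pool size, and the timeout are all hypothetical.

```python
import http.client
import queue
from contextlib import contextmanager

class ConnectionPool:
    """A fixed-size pool of reusable HTTPS connections to one host."""

    def __init__(self, host, size=4):
        self._idle = queue.LifoQueue(maxsize=size)
        for _ in range(size):
            # HTTPSConnection opens its socket lazily, on the first request.
            self._idle.put(http.client.HTTPSConnection(host, timeout=10))

    @contextmanager
    def connection(self):
        conn = self._idle.get()  # blocks while every connection is in use
        try:
            yield conn
        finally:
            self._idle.put(conn)  # hand the connection back for reuse
```

A worker would wrap each API call in `with pool.connection() as conn:` and issue the request on `conn`. The LIFO order means the most recently used connection, which is the most likely to still be open, is reused first.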
  16. Performance Comparison
    • Elapsed time in seconds at each concurrency level
    • Concurrency: 8, 16, 32, 64, 128, 256, 512
    • Original: 53.71, 27.49, 15.44, 15.50, 14.97, 13.21, 13.13
    • Enhanced: 52.77, 26.74, 14.10, 11.17, 9.45, 7.94, 7.08
  17. Plurk Attributes
    • _id – The unique plurk id, used for identification of the plurk
    • owner – The owner/poster of this plurk
    • content – The formatted and filtered content, e.g. URLs are turned into text tags and emoticons are filtered out
    • content_raw – The raw content as the user entered it
    • posted – The date this plurk was posted, in ISODate format
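A stored plurk with these attributes could look like the dictionary built below. This is only a sketch: the helper name `to_document` and every value in the example are made up for illustration, and the use of a timezone-aware `datetime` for `posted` assumes a driver such as PyMongo, which stores Python datetimes as BSON dates.

```python
from datetime import datetime, timezone

def to_document(plurk_id, owner, content, content_raw, posted):
    """Build a document with the five attributes listed on the slide."""
    if not isinstance(posted, datetime):
        raise TypeError("posted must be a datetime")
    return {
        "_id": plurk_id,             # unique plurk id
        "owner": owner,              # id of the poster
        "content": content,          # formatted and filtered text
        "content_raw": content_raw,  # text exactly as the user entered it
        "posted": posted,            # posting time
    }

# Hypothetical example values, not real Plurk data.
example = to_document(
    plurk_id=1234567890,
    owner=42,
    content="listening to Gee",
    content_raw="listening to Gee :-)",
    posted=datetime(2013, 5, 23, 12, 0, tzinfo=timezone.utc),
)
```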
  18. Datastore Architecture
    • Why MongoDB?
      – Auto-sharding
      – Replica sets
    • MongoDB cluster
      – mongos
      – Config servers
      – Shard servers
    • Deploy to Delta cloud cluster
  19. Experiment
    • Sample 40 public plurkers
    • public: take the top-64 most frequent interest keywords from the plurker's own plurks
    • private: treat the same plurker as private, derive the interest keywords from the plurker's communities, and take the top-64 most frequent interest keywords
    • Measure: len(intersect(public, private))
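The measure `len(intersect(public, private))` can be sketched with `collections.Counter`. The function names below are illustrative, and the inputs are assumed to be flat lists of keyword occurrences, one entry per occurrence. How keywords are extracted from plurks is outside this sketch.

```python
from collections import Counter

def top_keywords(keywords, n=64):
    """Return the set of the n most frequent keywords."""
    return {kw for kw, _ in Counter(keywords).most_common(n)}

def matching(public_keywords, private_keywords, n=64):
    """Number of keywords shared by the two top-n sets."""
    return len(top_keywords(public_keywords, n) & top_keywords(private_keywords, n))
```

With n = 64 the score ranges from 0 (no overlap) to 64 (the community-derived keywords fully recover the plurker's own top keywords).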
  20. Result
    • Histogram of the number of matching keywords (# matching) over the sampled plurkers:
      – 21 ~ 25: 3
      – 26 ~ 30: 6
      – 31 ~ 35: 7
      – 36 ~ 40: 16
      – 41 ~ 45: 4
      – 46 ~ 50: 3
      – 51 ~ 55: 1
  21. Conclusions
    • Constructed an online SNSD system for Plurk users to discover interest topics and relationships
    • Developed a new scalable crawling framework based on ZeroMQ
    • Patched the plurk-oauth library
    • Built a website for visualizing interests and relationships with D3.js
  22. Future Works
    • Interest hierarchy:
      – Manageable UI
      – Recommendations by users
    • Apply the SNSD system to Twitter for Western languages and to Sina Weibo for mainland China
    • Employ other community detection algorithms and optimize NetworkX
  23. Future Works (cont.)
    • Consider the responses to a plurk and fan relationships in interest derivation
    • Serve as a full-text search engine for Plurk