Ken Lee
May 23, 2013

# Mining Interest Topics from Plurk


## Transcript

2. ### Outline
• Introduction
  – Why and what we do in this thesis?
• The SNSD system
  – Community detection
  – Interest hierarchy
• Implementation
  – Preprocessing
  – Celery task queue
• Experiments
• Conclusions and future works

9. ### The Go!Plurk Project
Issues:
  1. Unable to analyze private users
  2. The pie chart is too simple and gives no detailed interest information
11. ### Social Networking Service Discovery
• Discover users' interest topics via
  1. Posted contents (plurks) from users
  2. Aggregated interest information from communities for the private users
• Have to prepare
  – Relationships
  – Plurks

19. ### Community Detection
• Snowball sampling
• Louvain algorithm
• Filtering
  – Karma
  – Gender
  – Privacy
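Snowball sampling can be sketched as a breadth-first expansion from a few seed users. The snippet below is a minimal illustration, not the crawler used in the thesis: the `FRIENDS` dictionary, the user ids, and the `max_depth` parameter are all hypothetical, and the karma, gender, and privacy filters are left out.

```python
from collections import deque

def snowball_sample(friends, seeds, max_depth):
    """Start from the seeds, then repeatedly add the friends of users
    already sampled, for at most max_depth waves."""
    sampled = set(seeds)
    frontier = deque((seed, 0) for seed in seeds)
    while frontier:
        user, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not expand users found in the last wave
        for friend in friends.get(user, ()):
            if friend not in sampled:
                sampled.add(friend)
                frontier.append((friend, depth + 1))
    return sampled

# Hypothetical friendship lists (user id -> friend ids).
FRIENDS = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a"],
    "d": ["b", "e"],
    "e": ["d"],
}

wave1 = snowball_sample(FRIENDS, ["a"], max_depth=1)
wave2 = snowball_sample(FRIENDS, ["a"], max_depth=2)
```

With one wave only the seed's direct friends are collected; each extra wave reaches one hop further into the friendship graph.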

21. ### Modularity
• Q = (number of edges within communities) - (expected number of edges within communities)
• Idea:
  – dense internal connections between the nodes within modules
  – sparse connections between different modules
• Works as a measurement of the quality of partitions and as an objective function to optimize.
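22. ### Definition of Modularity
$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$$
– $A_{ij}$ = the weight of the edge between $i$ and $j$
– $k_i$ = degree of vertex $i$
– $m = \frac{1}{2} \sum_{i,j} A_{ij}$, the number of edges of the graph
– $\delta(c_i, c_j) = 1$ if $c_i = c_j$, and $0$ otherwise
– $c_i$ is the community of vertex $i$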
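The definition can be evaluated directly on a small graph. The following is a minimal sketch for an undirected, unweighted graph; the function name `modularity` and the toy graph (two triangles joined by one edge) are illustrative assumptions, not data from the slides.

```python
from collections import defaultdict

def modularity(edges, community):
    """Q = 1/(2m) * sum_ij [A_ij - k_i*k_j/(2m)] * delta(c_i, c_j)."""
    m = len(edges)
    degree = defaultdict(int)
    adjacency = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        adjacency[(u, v)] += 1
        adjacency[(v, u)] += 1
    total = 0.0
    for i in degree:
        for j in degree:
            if community[i] == community[j]:  # delta(c_i, c_j) = 1
                total += adjacency.get((i, j), 0) - degree[i] * degree[j] / (2 * m)
    return total / (2 * m)

# Toy graph: triangles {0,1,2} and {3,4,5} connected by the edge (2, 3).
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
single = {v: "all" for v in range(6)}

q_split = modularity(EDGES, split)    # 5/14, roughly 0.357
q_single = modularity(EDGES, single)  # 0.0
```

Putting every vertex in one community always gives Q = 0, which is why Q is useful as a score: a partition is only rewarded for having more internal edges than expected by chance.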
23. ### Expected Number of Edges Between Two Nodes
• $E(i \to j) = k_i \times P(\to j) = k_i \times \frac{k_j}{2m}$

Lei Tang, Huan Liu, Community Detection and Mining in Social Media, 2010
24. ### Louvain Algorithm
• The Louvain algorithm is a heuristic greedy method based on modularity optimization
• The Louvain algorithm consists of two phases
  1. Look for small communities by optimizing modularity locally
  2. Aggregate vertices in the same community and build a new network whose vertices are the communities
• The two phases are repeated until a maximum of modularity is attained
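25. ### Example
(Figure: example graph with vertices 1 to 9.)
• $B_{ij} = A_{ij} - \frac{k_i k_j}{2m}$
• $\Delta Q$ = (modularity after the move) - (modularity before the move)
• $c^*(i) = \arg\max \{ \Delta Q \mid j \in n \}$
• $B_{11} = A_{11} - \frac{k_1 k_1}{2m} = 0 - \frac{3 \times 3}{2 \times 14} = -0.32$
• $B_{12} = A_{12} - \frac{k_1 k_2}{2m} = 1 - \frac{3 \times 2}{2 \times 14} = 0.79$
• $B_{13} = 1 - \frac{3 \times 3}{2 \times 14} = 0.68$
• $B_{14} = 1 - \frac{3 \times 4}{2 \times 14} = 0.57$
• $B_{15} = 0 - \frac{3 \times 4}{2 \times 14} = -0.43$
• $B_{16} = 0 - \frac{3 \times 4}{2 \times 14} = -0.43$
• $B_{17} = 0 - \frac{3 \times 4}{2 \times 14} = -0.43$
• $B_{18} = 0 - \frac{3 \times 3}{2 \times 14} = -0.32$
• $B_{19} = 0 - \frac{3 \times 1}{2 \times 14} = -0.11$
• $c^*(1) = 2$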
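The arithmetic for vertex 1 in this example can be checked with a few lines of Python. The sketch below evaluates $B_{1j} = A_{1j} - k_1 k_j / 2m$ for every $j$ using the degrees, the adjacency row of vertex 1, and $m = 14$ as they appear in the worked values; the function name `modularity_row` is a hypothetical helper.

```python
# Values read off the worked example: degrees k_1..k_9, row A_1j, m = 14 edges.
K = [3, 2, 3, 4, 4, 4, 4, 3, 1]
A_ROW_1 = [0, 1, 1, 1, 0, 0, 0, 0, 0]
M = 14

def modularity_row(a_row, k_i, degrees, m):
    """Return B_ij = A_ij - k_i * k_j / (2m) for a fixed i and every j."""
    return [a - k_i * k_j / (2 * m) for a, k_j in zip(a_row, degrees)]

b_row = modularity_row(A_ROW_1, K[0], K, M)
rounded = [round(b, 2) for b in b_row]

# Vertex (1-based) with the largest entry in the row.
best = max(range(len(b_row)), key=b_row.__getitem__) + 1
```

The degrees sum to 28, which equals $2m$, and the largest entry is $B_{12}$, matching the choice $c^*(1) = 2$.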
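26. ### Example (cont.)
(Figure: the example graph, vertices 1 to 9, with communities growing step by step: {1,2}, {1,2,3}, {1,2,3,4}; {5,8}, {5,6,8}; {7,9}.)

| $i$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| $c^*(i)$ | 2 | 1 | 2 | 1 | 8 | 8 | 9 | 5 | 7 |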
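27. ### Example (cont.)
(Figure: the original graph next to the aggregated graph with vertices {1,2,3,4}, {5,6,8}, {7,9}; weights shown: 10, 6, 2, 2, 3. The merge {{5,6,8}, {7,9}} is highlighted.)

| community | {1,2,3,4} | {5,6,8} | {7,9} |
|---|---|---|---|
| $c^*$ | {1,2,3,4} | {5,6,8} | {5,6,8} |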
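28. ### Example (cont.)
(Figure: the aggregated graph with vertices {1,2,3,4}, {5,6,8}, {7,9} and weights 10, 6, 2, 2, 3, next to the graph after the merge with vertices {1,2,3,4}, {5,6,7,8,9} and weights 10, 14, 2.)

| community | {1,2,3,4} | {5,6,7,8,9} |
|---|---|---|
| $c^*$ | {1,2,3,4} | {5,6,7,8,9} |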
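29. ### Example (cont.)
(Figure: overview of the whole run in four panels.)
• original: the graph with vertices 1 to 9
• 1st pass, phase 1: communities {1,2,3,4}, {5,6,8}, {7,9}
• 1st pass, phase 2: aggregated graph with weights 10, 6, 2, 2, 3
• 2nd pass, terminate: communities {1,2,3,4} and {5,6,7,8,9}, weights 10, 14, 2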
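The aggregation phase of the example can be reproduced in a short sketch. Two things here are assumptions rather than statements from the slides: the assignment of the weights 10, 6, 2 to self-loops and 2, 3 to edges between communities (inferred because it is the only assignment consistent with the vertex degrees), and the convention that a self-loop stores twice the internal edge count. The names `aggregate` and `modularity_of_vertices` are hypothetical.

```python
from collections import defaultdict

def aggregate(self_loops, edges, merge_map):
    """Collapse vertices into super-vertices according to merge_map.
    An edge of weight w between two merged vertices adds 2*w to the
    new self-loop (assumed convention: self-loop = 2 * internal weight)."""
    new_loops = defaultdict(int)
    new_edges = defaultdict(int)
    for v, w in self_loops.items():
        new_loops[merge_map[v]] += w
    for (u, v), w in edges.items():
        cu, cv = merge_map[u], merge_map[v]
        if cu == cv:
            new_loops[cu] += 2 * w
        else:
            new_edges[tuple(sorted((cu, cv)))] += w
    return dict(new_loops), dict(new_edges)

def modularity_of_vertices(self_loops, edges):
    """Modularity when every (super-)vertex is its own community:
    sum over c of in_c/2m - (tot_c/2m)^2."""
    total = defaultdict(int)
    for v, w in self_loops.items():
        total[v] += w
    for (u, v), w in edges.items():
        total[u] += w
        total[v] += w
    two_m = sum(total.values())
    return sum(self_loops.get(v, 0) / two_m - (total[v] / two_m) ** 2 for v in total)

# Aggregated graph after the first pass (weight assignment inferred, see above).
LOOPS = {"1234": 10, "568": 6, "79": 2}
EDGES = {("1234", "568"): 2, ("568", "79"): 3}
MERGE = {"1234": "1234", "568": "56789", "79": "56789"}

new_loops, new_edges = aggregate(LOOPS, EDGES, MERGE)
q_before = modularity_of_vertices(LOOPS, EDGES)
q_after = modularity_of_vertices(new_loops, new_edges)
```

Under these assumptions the merged graph has the weights 10, 14, and 2, and the modularity rises from about 0.273 to about 0.347, so merging {5,6,8} with {7,9} is an improvement.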

32. ### Interest Keywords Hierarchy
(Figure: interest keyword hierarchy with the nodes SNSD, Taeyeon, Bo Peep Bo Peep, Twinkle, YoonA, Gee, Girls Generation, PSY, Gangnam Style.)

38. ### Plurk API
• Plurk API 2.0 is based on the OAuth Core 1.0a standard
• Requests should be signed using HMAC-SHA1
• API returns JSON encoded data
• No request rate limit
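Request signing under OAuth 1.0a with HMAC-SHA1 can be sketched with the Python standard library alone. This is an illustration of the general scheme, not the code of any Plurk client: the helper names, the endpoint URL, and every key, token, and secret below are placeholder assumptions.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote

def percent_encode(value):
    # OAuth 1.0a leaves only letters, digits, '-', '.', '_' and '~' unescaped.
    return quote(str(value), safe="-._~")

def signature_base_string(method, url, params):
    """METHOD & encoded URL & encoded, sorted parameter string."""
    pairs = sorted((percent_encode(k), percent_encode(v)) for k, v in params.items())
    normalized = "&".join(f"{k}={v}" for k, v in pairs)
    return "&".join([method.upper(), percent_encode(url), percent_encode(normalized)])

def sign_hmac_sha1(base_string, consumer_secret, token_secret=""):
    """Key is 'consumer_secret&token_secret'; output is base64(HMAC-SHA1)."""
    key = f"{percent_encode(consumer_secret)}&{percent_encode(token_secret)}"
    digest = hmac.new(key.encode("utf-8"), base_string.encode("utf-8"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")

# Placeholder request; all values are made up for illustration.
params = {
    "oauth_consumer_key": "example-key",
    "oauth_nonce": "abc123",
    "oauth_signature_method": "HMAC-SHA1",
    "oauth_timestamp": "1369267200",
    "oauth_token": "example-token",
    "oauth_version": "1.0",
    "user_id": "42",
}
base = signature_base_string("GET", "https://www.plurk.com/APP/Profile/getPublicProfile", params)
signature = sign_hmac_sha1(base, "example-consumer-secret", "example-token-secret")
```

The resulting `signature` would be sent as the `oauth_signature` parameter. Since the HMAC is computed for every request, it is a plausible hot spot in a high-volume crawler.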
39. ### Plurk API Library
• Original provider
  – plurk-oauth by clsung
• Performance Bottleneck
  – HTTP persistent connection
  – JSON decode
  – HMAC-SHA1
• Enhancements
  – HTTP connection pool
  – C extension for JSON and HMAC-SHA1
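The idea of an HTTP connection pool can be sketched with the standard library. This is a generic illustration under stated assumptions, not the patch applied to plurk-oauth: the `ConnectionPool` class, its `factory` parameter, and the pool size are hypothetical.

```python
import http.client
import queue
from contextlib import contextmanager

class ConnectionPool:
    """Fixed-size pool of connections shared by worker threads, so that
    each request can reuse an open connection instead of reconnecting."""

    def __init__(self, factory, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def connection(self):
        conn = self._idle.get()   # blocks while every connection is in use
        try:
            yield conn
        finally:
            self._idle.put(conn)  # hand the connection back for reuse

# No socket is opened here; http.client connects lazily on the first request.
pool = ConnectionPool(
    lambda: http.client.HTTPSConnection("www.plurk.com", timeout=10),
    size=4,
)
```

A worker would wrap each API call in `with pool.connection() as conn:`. A production pool would also need to replace connections that fail, which this sketch omits.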
40. ### Performance Comparison

| Concurrency | Original (seconds) | Enhanced (seconds) |
|---|---|---|
| 8 | 53.71 | 52.77 |
| 16 | 27.49 | 26.74 |
| 32 | 15.44 | 14.10 |
| 64 | 15.50 | 11.17 |
| 128 | 14.97 | 9.45 |
| 256 | 13.21 | 7.94 |
| 512 | 13.13 | 7.08 |

42. ### Plurk Attributes
• _id – The unique plurk id, used to identify the plurk
• owner – The owner/poster of this plurk
• content – The formatted and filtered content, e.g. URLs are turned into text tags and emoticons are filtered out
• content_raw – The raw content as the user entered it
• posted – The date this plurk was posted, in ISODate format

51. ### Datastore Architecture
• Why MongoDB?
  – Auto-sharding
  – Replica sets
• MongoDB cluster
  – mongos
  – Config servers
  – Shard servers
• Deploy to Delta cloud cluster

58. ### Experiment
• Sampling 40 public plurkers
• public: get the top-64 most frequent interest keywords
• private: regard the plurker as private, derive his interest keywords from communities, and get the top-64 most frequent interest keywords
• len(intersect(public, private))
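The measure `len(intersect(public, private))` can be written out as follows. This is a minimal sketch: the function names and the tiny keyword lists are hypothetical, and `n` is lowered to 2 in the example only to keep it readable (the experiment uses 64).

```python
from collections import Counter

def top_keywords(keywords, n=64):
    """Set of the n most frequent keywords in a list of keyword occurrences."""
    return {word for word, _ in Counter(keywords).most_common(n)}

def matching(public_keywords, community_keywords, n=64):
    """Number of keywords shared by the two top-n sets."""
    return len(top_keywords(public_keywords, n) & top_keywords(community_keywords, n))

# Hypothetical keyword occurrences for one plurker.
from_own_plurks = ["snsd", "snsd", "psy", "psy", "gee"]
from_communities = ["snsd", "snsd", "gee", "gee", "psy"]

score = matching(from_own_plurks, from_communities, n=2)
```

A higher score means the keywords derived from a plurker's communities agree more closely with the keywords mined from the plurker's own posts.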
59. ### Result
(Figure: histogram of the number of matching keywords.)

| # matching | Count |
|---|---|
| 21 ~ 25 | 3 |
| 26 ~ 30 | 6 |
| 31 ~ 35 | 7 |
| 36 ~ 40 | 16 |
| 41 ~ 45 | 4 |
| 46 ~ 50 | 3 |
| 51 ~ 55 | 1 |

69. ### Conclusions
• Construct an online SNSD system for Plurk users to find interest topics and relationships
• Develop a new scalable crawling framework based on ZeroMQ
• Patch the plurk-oauth library
• Build a website for visualizing interests and relationships with D3.js
70. ### Future Works
• Interest hierarchy:
  – Manageable UI
  – Recommendations by users
• Apply the SNSD system to Twitter for Western languages and Sina Weibo for mainland China
• Employ other community detection algorithms and optimize NetworkX
71. ### Future Works (cont.)
• Consider responses in a plurk and fan relationships in interest derivation
• Serve as a Plurk full-text search engine