Slide 1

CLUEWEB12++
SHRIPHANI PALAKODETY
[email protected]

Slide 2

BACKGROUND
• CLUEWEB12
  – Crawled February 2012 – May 2012
  – 720m web pages
  – Archival Web
  – Shipped early 2013

Slide 3

CLUEWEB12 TRIVIA
• 10m Pages / Day
• 7 Machines – 64 GB Memory Each
• 2.8 TB
• Archival Corpus

Slide 4

EPHEMERAL WEB
• Social Media
• Discussion Forums
• Blogs
• Newsgroups
• Product Reviews

Slide 5

GOAL
Snapshot of the ephemeral web from February – May 2012

Slide 6

DISCUSSIONS
• Discussion Forums
  – Examples: Gaia Online, final-gear
• Social Media
  – Examples: Reddit, StackOverflow
• Newsgroups
• Product Reviews
• Article / Blog Comments

Slide 7

WEB CRAWL
• Traditionally:
  – Start from a seed page (get out-links)
  – Process out-links
  – Download processed links
  – Repeat till hop limit (sketch below)
• CLUEWEB09 / CLUEWEB12 used this
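
A minimal sketch of that hop-limited, breadth-first crawl (the seed URL, hop limit, and crude href regex are illustrative assumptions, not the actual ClueWeb crawler):

```python
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

def bfs_crawl(seed, hop_limit=2):
    """Traditional crawl: fetch a page, queue its out-links, repeat until the hop limit."""
    seen = {seed}
    frontier = deque([(seed, 0)])
    pages = {}
    while frontier:
        url, hops = frontier.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except Exception:
            continue                      # dead link, timeout, non-HTML, ...
        pages[url] = html
        if hops >= hop_limit:
            continue                      # do not expand past the hop limit
        # Crude out-link extraction; a real crawler would use an HTML parser.
        for href in re.findall(r'href="([^"#]+)"', html):
            link = urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append((link, hops + 1))
    return pages

# pages = bfs_crawl("http://example-forum.org/", hop_limit=2)   # hypothetical seed
```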

Slide 8

CRAWLING DISCUSSION FORUMS
• Seeds list:
  – No curated (or otherwise) forum list
  – No clear starting point
• Temporal restrictions:
  – Do not want entire site – just a slice
• General crawling issues

Slide 9

SEEDS LIST
• Start with small compiled site lists
• Search a large web corpus (sketch below)
  – Search a web-search engine
  – ClueWeb12
• Mine other forum datasets
  – TREC KBA corpus
• Spam community maintains lists
  – Download one
• Other?
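
One way to mine seeds from a large corpus such as ClueWeb12 or the KBA data is to scan its URL lists for paths typical of forum software and keep the unique domains; a minimal sketch, assuming the corpus URLs are available as a plain-text file (the filename and the pattern set are illustrative):

```python
import re
from urllib.parse import urlparse

# URL paths typical of common forum software (vBulletin, phpBB, SMF, ...).
FORUM_PATTERNS = re.compile(
    r"(forumdisplay\.php|viewforum\.php|showthread\.php|viewtopic\.php|/forums?/)",
    re.IGNORECASE,
)

def forum_domains(url_file):
    """Collect unique domains whose URLs look like forum pages."""
    domains = set()
    with open(url_file) as fh:
        for line in fh:
            url = line.strip()
            if FORUM_PATTERNS.search(url):
                domains.add(urlparse(url).netloc)
    return domains

# seeds = forum_domains("corpus_urls.txt")   # hypothetical URL dump of the corpus
```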

Slide 10

ISSUES
• Compiled seed lists don’t scale
• Search engines
  – Scraping violates TOS
  – Impractical limits on # of results
  – inurl operator not implemented by most
• Blekko:
  – Query “forumdisplay.php”
  – 43 unique domains in results
  – Cost per query too high
• Spam lists
  – Poor quality of seeds

Slide 11

MINING KBA CORPUS
• Copyright restrictions prevent sharing content permalink
• Mined forum directories (common URL subdomains / domains), large discussion sources
• Statistics about scale of crawl needed
  – ~50,000 discussion forums supplied 30 million discussions
  – Largest discussion sources are social media sites
  – vBulletin most used forum software
  – Largest contributor is 4chan
• Truly ephemeral – posts 404 after a certain time period
• Impossible for us to crawl

Slide 12

MINING USER PROFILES
• Active theming/modding community behind vBulletin/phpBB/Xenforo
• Member profiles link to portfolios
• Scrape member lists from these sites to get more seeds (sketch below)
• Modding community forum list compiled manually
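
A rough sketch of harvesting seeds from such a member list (the `?start=` pagination, the `viewprofile` link pattern, and the idea that off-site links point at members' own forums are assumptions about typical phpBB markup, which varies by theme):

```python
import re
import urllib.request
from urllib.parse import urljoin, urlparse

def fetch(url):
    return urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")

def profile_urls(memberlist_url, pages=5, page_size=25):
    """Walk paginated memberlist pages and collect member-profile URLs."""
    profiles = set()
    for start in range(0, pages * page_size, page_size):
        html = fetch(f"{memberlist_url}?start={start}")        # phpBB-style offset paging
        for href in re.findall(r'href="([^"]*viewprofile[^"]*)"', html):
            profiles.add(urljoin(memberlist_url, href.replace("&amp;", "&")))
    return profiles

def external_sites(profiles, home_host):
    """From each profile page, keep off-site links -- likely the member's own forum."""
    seeds = set()
    for purl in profiles:
        for href in re.findall(r'href="(https?://[^"]+)"', fetch(purl)):
            if urlparse(href).netloc != home_host:
                seeds.add(href)
    return seeds

# profiles = profile_urls("http://www.phpbb.org/community/memberlist.php")
# seeds = external_sites(profiles, "www.phpbb.org")
```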

Slide 13

MINING USER PROFILES
Screenshot from http://www.phpbb.org/community/memberlist.php

Slide 14

CRAWL STRATEGY
• Generic crawl strategy (n hops from seed) DOESN’T WORK
  – Content desired is from specific time-frame
  – Large sites get several hundred entries per hour
• Reddit
  – 200,000 submissions in a week
  – Need several thousand hops to get content
  – Massive bandwidth / storage costs

Slide 15

GENERIC FORUM CRAWL
• Page #2000 on http://ubuntuforums.org contains posts in the desired time-range
• Forums can be used as a CMS
• A 2000-hop crawl collects posts, (potentially) articles, blogs, and other content that we don’t need, from time-frames we are not interested in

Slide 16

Extra Content

Slide 17

GENERIC FORUM CRAWL
• Added problems of identifying / extracting downloaded content
• Numbers
  – Ran such a crawl from January to February
  – 2.8 TB of data crawled
  – Majority of data useless

Slide 18

DISCUSSION FORUM CATEGORIES
• Download using API
  – Reddit
  – DISQUS comments
• Web-based discussion forums
  – Self-hosted: vBulletin, phpBB, myBB
  – Sites hosted on forum directories: Yuku, Zetaboards

Slide 19

API-DRIVEN SITES
• Straightforward to download data
• Parameters to API calls used to extract exactly the data wanted
• Issues:
  – How to store data
    • Deeply nested data unraveled and stored in a relational DB (sketch below)
    • Flat data stored in flat files
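
As one illustration of the storage point, a deeply nested API response (e.g. a comment tree) can be unravelled into rows with a parent pointer; a minimal sketch using SQLite, with generic field names that are not the actual corpus schema:

```python
import sqlite3

db = sqlite3.connect("discussions.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS comments "
    "(id TEXT PRIMARY KEY, thread_id TEXT, parent_id TEXT,"
    " author TEXT, created INTEGER, body TEXT)"
)

def store_comment_tree(thread_id, comments, parent_id=None):
    """Recursively flatten a nested comment tree into relational rows;
    the parent_id column lets the tree be reassembled later."""
    for c in comments:
        db.execute(
            "INSERT OR REPLACE INTO comments VALUES (?, ?, ?, ?, ?, ?)",
            (c["id"], thread_id, parent_id, c["author"], c["created"], c["body"]),
        )
        store_comment_tree(thread_id, c.get("replies", []), parent_id=c["id"])

# import json
# thread = json.load(open("thread.json"))          # one API response per file
# store_comment_tree(thread["id"], thread["comments"])
# db.commit()
```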

Slide 20

INDEX PAGES

Slide 21

INDEX PAGES

Slide 22

INDEX PAGES
• Crawl exhaustively
• Extract links to discussions from crawl
• Identify if discussion is in the time-frame (sketch below)
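
A minimal sketch of the last step, checking whether a downloaded discussion page falls in the February – May 2012 window (the ISO date regex is an illustrative assumption; real forums print dates in many formats):

```python
import re
from datetime import datetime

# Target window for ClueWeb12++ content.
WINDOW_START = datetime(2012, 2, 1)
WINDOW_END = datetime(2012, 5, 31)

# ISO-style post dates (YYYY-MM-DD); an assumption -- real forums vary widely.
DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")

def in_time_frame(html):
    """True if any post date found on the page falls inside the target window."""
    for year, month, day in DATE_RE.findall(html):
        try:
            when = datetime(int(year), int(month), int(day))
        except ValueError:
            continue
        if WINDOW_START <= when <= WINDOW_END:
            return True
    return False
```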

Slide 23

IDENTIFY INDEX PAGES
• URL format:
  – forumdisplay.php, viewforum.php, thread.php, t.php, showthread.php
• robots.txt (sketch below):
  – Points to sitemap.xml
  – Provides a path for a search-engine bot to follow (update frequency, path to page)
  – Index pages have high update frequency
• Identify pagination for exhaustive index page crawl
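
A rough sketch of that robots.txt / sitemap heuristic (the change-frequency threshold and the URL hints are assumptions; sitemap index files that point at further sitemaps are not handled here):

```python
import re
import urllib.request
import xml.etree.ElementTree as ET

INDEX_URL_HINTS = re.compile(
    r"(forumdisplay\.php|viewforum\.php|thread\.php|t\.php|showthread\.php)", re.I
)

def sitemaps_from_robots(site):
    """robots.txt usually lists the site's sitemap(s) in 'Sitemap:' lines."""
    resp = urllib.request.urlopen(site.rstrip("/") + "/robots.txt", timeout=10)
    text = resp.read().decode("utf-8", "ignore")
    return re.findall(r"(?im)^sitemap:\s*(\S+)", text)

def likely_index_pages(sitemap_url):
    """Keep sitemap entries whose URL looks like a forum index page or whose
    declared change frequency is high (hourly / daily)."""
    xml = urllib.request.urlopen(sitemap_url, timeout=10).read()
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    pages = []
    for entry in ET.fromstring(xml).findall("sm:url", ns):
        loc = entry.findtext("sm:loc", default="", namespaces=ns)
        freq = entry.findtext("sm:changefreq", default="", namespaces=ns)
        if INDEX_URL_HINTS.search(loc) or freq in ("hourly", "daily"):
            pages.append(loc)
    return pages

# for sm in sitemaps_from_robots("http://forum.example.com"):   # hypothetical site
#     print(likely_index_pages(sm))
```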

Slide 24

INDEX PAGE OBSERVATIONS
• Forum directories have consistent URLs
• vBulletin / phpBB / Xenforo vary based on version / plugin / site-admin’s whim

Slide 25

INDEX PAGE PITFALLS
• Sort order of posts (see the sketch below)
  – Maintain a list of URL modifiers (path / query) that lead to reordering
    • ?sort=replies, /sort_REPLIES_##, …
  – List built by eyeballing some initial crawls
  – “prev” page link might be different
    • Current page: http://a.b.c/pg=3
    • Next page: http://a.b.c/pg=4
    • Prev page: http://a.b.c/pg=-1
• Not a very frequent problem
• But frequent enough to require attention
• Not handling these can cause the same content to be downloaded 5x or 6x
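
One concrete defence against the sort-order pitfall is to canonicalize index-page URLs before de-duplication; a minimal sketch, where the modifier list is an illustrative stand-in for the hand-built one mentioned above:

```python
import re
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

# Query parameters and path segments that only reorder an index page's contents
# (assumed list, built by eyeballing crawls as described above).
SORT_PARAMS = {"sort", "order", "sortby", "daysprune"}
SORT_PATH = re.compile(r"/sort_[A-Za-z]+_?\d*", re.IGNORECASE)

def canonical(url):
    """Strip sort modifiers so reordered copies of an index page collapse to one URL."""
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k.lower() not in SORT_PARAMS]
    path = SORT_PATH.sub("", parts.path)
    return urlunparse(parts._replace(path=path, query=urlencode(query), fragment=""))

seen = set()

def should_download(url):
    """Skip URLs whose canonical form has already been fetched."""
    key = canonical(url)
    if key in seen:
        return False
    seen.add(key)
    return True

# canonical("http://a.b.c/forum?page=3&sort=replies") == canonical("http://a.b.c/forum?page=3")
```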

Slide 26

INDEX PAGE PITFALLS
• No real solution apart from downloading entire collection of index pages
• Yield not gauged till entire index-page set is done
  – 1 month of Yahoo Groups crawls led to less than 100,000 posts (very spammy)
• Cannot easily distribute this crawl across machines
• While index page crawl is running, cannot perform a discussion crawl in parallel (bandwidth restrictions)

Slide 27

WHAT YOU GET
• API Sites

  Site                  Content Type              Count
  Reddit                Submissions               8.5m
                        Top-Level Discussions     ~22.9m downloaded (about 2x this number expected)
  DISQUS Comments       Submissions               30m
                        Top-Level Comments        ~3m downloaded (overall count uncertain)
  NYT Article Comments  Articles with discussion  ~26K

Slide 28

REDDIT EXAMPLES
[Screenshots: Reddit submission; top-level comments]

Slide 29

DISQUS EXAMPLES

Slide 30

WHAT YOU GET
• Web-Based Discussion Sites

  Site                  Content Type               Count
  Nabble                Generic Forum Discussions  ~380K
  Usenet Newsgroups     Newsgroup threads          1.5m threads
  Stack-Exchange Sites  Questions                  ~1m
                        Comments                   ~1.4m
                        Posts overall              ~4m

Slide 31

STACK-EXCHANGE EXAMPLES

Slide 32

WHAT YOU GET
• Web-Based Discussion Sites

  Site                                  Content Type               Count
  Forum Directories (Yuku, Zetaboards)  Generic Forum Discussions  ~0.5m
  vBulletin, phpBB, Xenforo             Generic Forum Discussions  ~5m post pages
  Gaia Online                           Comments                   ~0.5m downloaded (2m expected)

Slide 33

WHAT YOU GET
• Product Reviews

  Site        Content Type        Count
  Amazon.com  Reviews             ~0.5m downloaded (realistic estimates not possible)
              Review Discussions  ~50K downloaded (cannot estimate how many more)

Slide 34

AMAZON EXAMPLES

Slide 35

WHAT YOU GET
• Preserve all site-specific metadata:
  – Social networks
  – Upvotes / Karma / Downvotes
  – Purchase info
• Links from ClueWeb12++ to ClueWeb12
• Documents (DISQUS posts, Reddit submissions)

Slide 36

HARDWARE USED
• 64 GB RAM, 8 cores (x 3)
  – Fast-growing crawls + API-specific sites
  – CMU cluster
• 8 GB RAM, 4 cores (x 1)
  – Moderate-size crawl + API-specific sites
  – Personal workstation
• 512 MB RAM, 1 core (x 2)
  – API-based sites and optimized scrapers only
  – AWS Micro instance

Slide 37

ACKNOWLEDGEMENTS
• Reddit development team
  – Advice on crawling the site with their API
• Amazon
  – Advice on optimizing site crawl
• LiveJournal
  – Unbanned several times
• Blekko
  – API access to search engine

Slide 38

QUESTIONS / FEATURE-REQUESTS / SUGGESTIONS
NOW OR OFFLINE ([email protected])