
LTI Summer Talk


Talk on the Clueweb12++ dataset

Shriphani Palakodety

August 16, 2013



Transcript

  1. BACKGROUND
     • CLUEWEB12
       – Crawled February 2012 – May 2012
       – 720m web pages
       – Archival Web
       – Shipped early 2013
  2. CLUEWEB12 TRIVIA
     • 10m Pages / Day
     • 7 Machines
       – 64 GB Memory Each
     • 2.8 TB
     • Archival Corpus
  3. DISCUSSIONS
     • Discussion Forums
       – Examples: Gaia Online, final-gear
     • Social Media
       – Examples: Reddit, StackOverflow
     • Newsgroups
     • Product Reviews
     • Article / Blog comments
  4. WEB CRAWL
     • Traditionally (see the sketch below):
       – Seed page (get out-links)
       – Process out-links
       – Download processed links
       – Repeat till hop limit
     • ClueWeb09 / ClueWeb12 used this approach
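
To make the traditional strategy concrete, here is a minimal breadth-first sketch in Python. It is not the actual ClueWeb crawler; the requests-based fetching, the naive href regex, and the default hop limit are all illustrative assumptions.

```python
# Minimal sketch of a hop-limited breadth-first crawl (illustrative only,
# not the ClueWeb crawler). Assumes requests plus a naive href regex.
import re
from collections import deque
from urllib.parse import urljoin

import requests

HREF_RE = re.compile(r'href="(http[^"]+)"')

def crawl(seed, hop_limit=2):
    seen = {seed}
    frontier = deque([(seed, 0)])
    pages = {}
    while frontier:
        url, hops = frontier.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        pages[url] = resp.text                      # store the fetched page
        if hops >= hop_limit:
            continue                                # stop expanding at the hop limit
        for link in HREF_RE.findall(resp.text):     # extract out-links
            link = urljoin(url, link)
            if link not in seen:
                seen.add(link)
                frontier.append((link, hops + 1))
    return pages
```
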
  5. CRAWLING DISCUSSION FORUMS
     • Seeds list:
       – No curated (or otherwise) forum list
       – No clear starting point
     • Temporal restrictions:
       – Do not want the entire site – just a slice
     • General crawling issues
  6. SEEDS LIST
     • Start with small compiled site lists
     • Search a large web corpus (see the sketch below)
       – Search a web-search engine
       – Clueweb12
     • Mine other forum datasets
       – TREC KBA corpus
     • Spam community maintains lists
       – Download one
     • Other?
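
One way to mine seeds from a large corpus is to scan its URL list for forum-software URL patterns and keep the matching domains. A minimal sketch under that assumption; the input file of URLs and the pattern list are hypothetical.

```python
# Sketch: mine candidate forum domains from a flat file of URLs
# (e.g. URLs extracted from a corpus such as ClueWeb12 or TREC KBA).
# The filename and patterns are illustrative assumptions.
import re
from urllib.parse import urlparse

FORUM_PATTERNS = re.compile(
    r"(forumdisplay\.php|viewforum\.php|viewtopic\.php|showthread\.php)", re.I
)

def mine_seed_domains(url_file):
    domains = set()
    with open(url_file) as fh:
        for line in fh:
            url = line.strip()
            if FORUM_PATTERNS.search(url):
                domains.add(urlparse(url).netloc)   # keep the forum's domain as a seed
    return domains

# Example: seeds = mine_seed_domains("corpus_urls.txt")
```
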
  7. ISSUES
     • Compiled seed lists don’t scale
     • Search engines
       – Scraping violates TOS
       – Impractical limits on # of results
       – inurl operator not implemented by most
     • Blekko:
       – Query “forumdisplay.php”
       – 43 unique domains in results
       – Cost per query too high
     • Spam-lists
       – Poor quality of seeds
  8. MINING KBA CORPUS
     • Copyright restrictions prevent sharing content permalink
     • Mined forum directories (common URL subdomains / domains), large discussion sources
     • Statistics about scale of crawl needed
       – ~50,000 discussion forums supplied 30 million discussions
       – Largest discussion sources are social media sites
       – vBulletin most used forum software
       – Largest contributor is 4chan
         • Truly ephemeral – posts 404 after a certain time period
         • Impossible for us to crawl
  9. MINING USER PROFILES
     • Active theming/modding community behind vBulletin/phpBB/Xenforo
     • Member profiles link to portfolio
     • Scrape member-list from these sites to get more seeds (see the sketch below)
     • Modding community forum list compiled manually
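
A hedged sketch of the member-list idea: fetch a few member-list pages from a modding-community forum and collect the off-site links that members use as portfolios. The memberlist.php URL template, page count, and link regex are assumptions; real vBulletin/phpBB/Xenforo installs differ.

```python
# Sketch: pull outbound links from a forum's member-list pages to find more
# forum seeds. URL template, page range, and parsing are illustrative assumptions.
import re

import requests

LINK_RE = re.compile(r'href="(https?://[^"]+)"')

def portfolio_links(forum_base, pages=5):
    links = set()
    for page in range(1, pages + 1):
        # memberlist.php is the common phpBB-style endpoint; treat it as an assumption
        url = f"{forum_base}/memberlist.php?page={page}"
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        for href in LINK_RE.findall(html):
            if forum_base not in href:          # keep only off-site (portfolio) links
                links.add(href)
    return links
```
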
  10. CRAWL STRATEGY
      • Generic crawl strategy (n hops from seed) DOESN’T WORK
        – Content desired is from a specific time-frame
        – Large sites get several hundred entries per hour
          • Reddit – 200,000 submissions in a week
        – Need several thousand hops to get content
        – Massive bandwidth / storage costs
  11. GENERIC FORUM CRAWL
      • Page #2000 on http://ubuntuforums.org contains posts in the desired time-range
      • Forums can be used as a CMS
      • A 2000-hop crawl collects posts, (potentially) articles, blogs and other content that we don’t need, from time-frames we are not interested in
  12. GENERIC FORUM CRAWL
      • Added problems of identifying / extracting downloaded content
      • Numbers
        – Ran such a crawl from January to February
        – 2.8 TB of data crawled
        – Majority of data useless
  13. DISCUSSION FORUM CATEGORIES
      • Download using API
        – Reddit
        – DISQUS comments
      • Web-based discussion forums
        – Self-hosted: vBulletin, phpBB, myBB
        – Sites hosted on forum directories: Yuku, Zetaboards
  14. API DRIVEN SITES
      • Straightforward to download data
      • Parameters to API calls used to extract exactly the data wanted (see the sketch below)
      • Issues:
        – How to store data
          • Deeply nested data unraveled and stored in a relational DB
          • Flat data stored in flat files
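
As an illustration of driving an API with parameters and storing the flat records, here is a sketch against Reddit's public JSON listing endpoint. The subreddit, limit, User-Agent string, and JSON-lines output path are illustrative choices, not the project's actual pipeline.

```python
# Sketch: pull a listing from an API-driven site and append the flat records
# to a JSON-lines file. Uses Reddit's public JSON listing endpoint as an example.
import json

import requests

def fetch_new_submissions(subreddit, out_path, limit=100):
    url = f"https://www.reddit.com/r/{subreddit}/new.json"
    resp = requests.get(
        url,
        params={"limit": limit},
        headers={"User-Agent": "clueweb12pp-example/0.1"},  # Reddit asks for a real UA
        timeout=10,
    )
    resp.raise_for_status()
    children = resp.json()["data"]["children"]
    with open(out_path, "a") as fh:
        for child in children:                   # flat records -> flat file (JSON lines)
            fh.write(json.dumps(child["data"]) + "\n")

# Example: fetch_new_submissions("python", "reddit_submissions.jsonl")
```
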
  15. INDEX PAGES
      • Crawl exhaustively
      • Extract links to discussions from the crawl
      • Identify if the discussion is in the time-frame (see the sketch below)
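
A rough sketch of the link-extraction and time-frame check on a downloaded index page, assuming vBulletin-style showthread.php links and MM-DD-YYYY dates appearing near them; the regexes are crude, per-template assumptions rather than a general extractor.

```python
# Sketch: extract thread links from an index page and keep only threads whose
# nearby date falls in the crawl window. Regexes and date format are assumptions.
import re
from datetime import datetime

ROW_RE = re.compile(
    r'href="([^"]*showthread\.php[^"]*)".*?(\d{2}-\d{2}-\d{4})', re.S
)

def threads_in_range(index_html, start, end):
    keep = []
    for link, date_str in ROW_RE.findall(index_html):
        posted = datetime.strptime(date_str, "%m-%d-%Y")
        if start <= posted <= end:
            keep.append(link)
    return keep

# Example: threads_in_range(html, datetime(2012, 2, 1), datetime(2013, 2, 28))
```
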
  16. IDENTIFY INDEX PAGES
      • URL format:
        – forumdisplay.php, viewforum.php, thread.php, t.php, showthread.php
      • robots.txt (see the sketch below):
        – Points to sitemap.xml
        – Provides a path for a search-engine bot to follow (update frequency, path to page)
        – Index pages have high update frequency
      • Identify pagination for exhaustive index page crawl
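
A sketch of the robots.txt route: read the Sitemap: entries, parse sitemap.xml, and keep forum-style URLs that the site marks as frequently updated. The changefreq threshold and the URL patterns mirror the slide but are still assumptions; sitemap index files and per-site quirks are not handled.

```python
# Sketch: follow robots.txt to sitemap.xml and pull out frequently updated
# URLs that look like forum index pages.
import re
import xml.etree.ElementTree as ET

import requests

INDEX_RE = re.compile(
    r"(forumdisplay\.php|viewforum\.php|thread\.php|t\.php|showthread\.php)", re.I
)
SITEMAP_RE = re.compile(r"^Sitemap:\s*(\S+)", re.I | re.M)
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def index_pages(site):
    robots = requests.get(f"{site}/robots.txt", timeout=10).text
    pages = []
    for sitemap_url in SITEMAP_RE.findall(robots):
        root = ET.fromstring(requests.get(sitemap_url, timeout=10).content)
        for url_el in root.findall("sm:url", NS):
            loc = url_el.findtext("sm:loc", default="", namespaces=NS)
            freq = url_el.findtext("sm:changefreq", default="", namespaces=NS)
            # keep forum-style index URLs that the site says change often
            if INDEX_RE.search(loc) and freq in ("hourly", "daily", "always"):
                pages.append(loc)
    return pages
```
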
  17. INDEX PAGE OBSERVATIONS
      • Forum directories have consistent URLs
      • vBulletin / phpBB / Xenforo vary based on version / plugin / site-admin’s whim
  18. INDEX PAGE PITFALLS
      • Sort order of posts
        – Maintain a list of URL modifiers (path / query) that lead to reordering, e.g. ?sort=replies, /sort_REPLIES_## (see the sketch below)
        – List built by eyeballing some initial crawls
      • “prev” page link might be different
        – Current page: http://a.b.c/pg=3
        – Next page: http://a.b.c/pg=4
        – Prev page: http://a.b.c/pg=-1
      • Not a very frequent problem
      • But frequent enough to require attention
      • Not handling these can cause the same content to be downloaded 5x or 6x
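
A minimal sketch of the de-duplication idea: strip the reorder-only modifiers before deciding whether to fetch a URL. The modifier list mirrors the examples above but is a hand-built assumption; a real crawl would extend it per forum package.

```python
# Sketch: canonicalize index-page URLs so sort-order variants of the same
# page are fetched only once. The modifier list is an illustrative assumption.
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

SORT_QUERY_KEYS = {"sort", "order", "daysprune"}        # query params that only reorder
SORT_PATH_RE = re.compile(r"/sort_[A-Z]+_\d+", re.I)    # path segments like /sort_REPLIES_10

def canonicalize(url):
    scheme, netloc, path, query, _ = urlsplit(url)
    path = SORT_PATH_RE.sub("", path)
    kept = [(k, v) for k, v in parse_qsl(query) if k.lower() not in SORT_QUERY_KEYS]
    return urlunsplit((scheme, netloc, path, urlencode(kept), ""))

seen = set()
def should_fetch(url):
    """Return True the first time a canonical form of this URL is seen."""
    canon = canonicalize(url)
    if canon in seen:
        return False
    seen.add(canon)
    return True
```
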
  19. INDEX PAGE PITFALLS
      • No real solution apart from downloading the entire collection of index pages
      • Yield cannot be gauged until the entire index-page set is done
        – 1 month of Yahoo Groups crawls led to fewer than 100,000 posts (very spammy)
      • Cannot easily distribute this crawl across machines
      • While the index page crawl is running, cannot perform a discussion crawl in parallel (bandwidth restrictions)
  20. WHAT YOU GET
      • API sites (Site / Content Type / Count):
        – Reddit: Submissions – 8.5m; Top-Level Discussions – ~22.9m downloaded (about 2x this number expected)
        – DISQUS Comments: Submissions – 30m; Top-Level Comments – ~3m downloaded (overall count uncertain)
        – NYT Article Comments: Articles with discussion – ~26K
  21. WHAT YOU GET
      • Web-based discussion sites (Site / Content Type / Count):
        – Nabble: Generic Forum Discussions – ~380K
        – Usenet Newsgroups: Newsgroup threads – 1.5m threads
        – Stack-Exchange Sites: Questions – ~1m; Comments – ~1.4m; Posts overall – ~4m
  22. WHAT YOU GET
      • Web-based discussion sites (Site / Content Type / Count):
        – Forum Directories (Yuku, Zetaboards): Generic Forum Discussions – ~0.5m
        – vBulletin, phpBB, Xenforo: Generic Forum Discussions – ~5m post pages
        – Gaia Online: Comments – ~0.5m downloaded (2m expected)
  23. WHAT YOU GET
      • Product reviews (Site / Content Type / Count):
        – Amazon.com: Reviews – ~0.5m downloaded (realistic estimates not possible); Review Discussions – ~50K downloaded (cannot estimate how many more)
  24. WHAT YOU GET
      • Preserve all site-specific metadata:
        – Social networks
        – Upvotes / Karma / Downvotes
        – Purchase info
      • Links from Clueweb12++ to Clueweb12
      • Documents (DISQUS posts, Reddit submissions)
  25. HARDWARE USED
      • 64 GB RAM, 8 cores (x 3)
        – Fast-growing crawls + API-specific sites
        – CMU cluster
      • 8 GB RAM, 4 cores (x 1)
        – Moderate-size crawl + API-specific sites
        – Personal workstation
      • 512 MB RAM, 1 core (x 2)
        – API-based sites and optimized scrapers only
        – AWS Micro instance
  26. ACKNOWLEDGEMENTS
      • Reddit development team – advice on crawling the site with their API
      • Amazon – advice on optimizing the site crawl
      • LiveJournal – unbanned us several times
      • Blekko – API access to their search engine