
LTI Summer Talk


Talk on the Clueweb12++ dataset

Shriphani Palakodety

August 16, 2013



Transcript

  1. BACKGROUND
     • CLUEWEB12
       – Crawled February 2012 – May 2012
       – 720m web pages
       – Archival Web
       – Shipped early 2013
  2. CLUEWEB12 TRIVIA
     • 10m Pages / Day
     • 7 Machines
       – 64 GB Memory Each
     • 2.8 TB
     • Archival Corpus
  3. DISCUSSIONS
     • Discussion Forums
       – Examples: Gaia Online, final-gear
     • Social Media
       – Examples: Reddit, StackOverflow
     • Newsgroups
     • Product Reviews
     • Article / Blog comments
  4. WEB CRAWL
     • Traditionally (see the sketch below):
       – Seed page (get out-links)
       – Process out-links
       – Download processed links
       – Repeat till hop limit
     • ClueWeb09 / ClueWeb12 used this approach
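
To make the traditional strategy concrete, here is a minimal breadth-first sketch in Python. It is not the actual ClueWeb crawler; the requests-based fetching, the naive href regex, and the default hop limit are all illustrative assumptions.

```python
# Minimal sketch of a hop-limited breadth-first crawl (illustrative only,
# not the ClueWeb crawler). Assumes requests plus a naive href regex.
import re
from collections import deque
from urllib.parse import urljoin

import requests

HREF_RE = re.compile(r'href="(http[^"]+)"')

def crawl(seed, hop_limit=2):
    seen = {seed}
    frontier = deque([(seed, 0)])
    pages = {}
    while frontier:
        url, hops = frontier.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        pages[url] = resp.text                      # store the fetched page
        if hops >= hop_limit:
            continue                                # stop expanding at the hop limit
        for link in HREF_RE.findall(resp.text):     # extract out-links
            link = urljoin(url, link)
            if link not in seen:
                seen.add(link)
                frontier.append((link, hops + 1))
    return pages
```
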
  5. CRAWLING DISCUSSION FORUMS
     • Seeds list:
       – No curated (or otherwise) forum list
       – No clear starting point
     • Temporal restrictions:
       – Do not want the entire site – just a slice
     • General crawling issues
  6. SEEDS LIST
     • Start with small compiled site lists
     • Search a large web corpus (see the sketch below)
       – Search a web-search engine
       – Clueweb12
     • Mine other forum datasets
       – TREC KBA corpus
     • Spam community maintains lists
       – Download one
     • Other?
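
One way to mine seeds from a large corpus is to scan its URL list for forum-software URL patterns and keep the matching domains. A minimal sketch under that assumption; the input file of URLs and the pattern list are hypothetical.

```python
# Sketch: mine candidate forum domains from a flat file of URLs
# (e.g. URLs extracted from a corpus such as ClueWeb12 or TREC KBA).
# The filename and patterns are illustrative assumptions.
import re
from urllib.parse import urlparse

FORUM_PATTERNS = re.compile(
    r"(forumdisplay\.php|viewforum\.php|viewtopic\.php|showthread\.php)", re.I
)

def mine_seed_domains(url_file):
    domains = set()
    with open(url_file) as fh:
        for line in fh:
            url = line.strip()
            if FORUM_PATTERNS.search(url):
                domains.add(urlparse(url).netloc)   # keep the forum's domain as a seed
    return domains

# Example: seeds = mine_seed_domains("corpus_urls.txt")
```
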
  7. ISSUES
     • Compiled seed lists don’t scale
     • Search engines
       – Scraping violates TOS
       – Impractical limits on # of results
       – inurl operator not implemented by most
     • Blekko:
       – Query “forumdisplay.php”
       – 43 unique domains in results
       – Cost per query too high
     • Spam-lists
       – Poor quality of seeds
  8. MINING KBA CORPUS
     • Copyright restrictions prevent sharing content permalink
     • Mined forum directories (common URL subdomains / domains), large discussion sources
     • Statistics about scale of crawl needed
       – ~50,000 discussion forums supplied 30 million discussions
       – Largest discussion sources are social media sites
       – vBulletin most used forum software
       – Largest contributor is 4chan
         • Truly ephemeral – posts 404 after a certain time period
         • Impossible for us to crawl
  9. MINING USER PROFILES
     • Active theming/modding community behind vBulletin/phpBB/Xenforo
     • Member profiles link to portfolio
     • Scrape member-list from these sites to get more seeds (see the sketch below)
     • Modding community forum list compiled manually
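
A hedged sketch of the member-list idea: fetch a few member-list pages from a modding-community forum and collect the off-site links that members use as portfolios. The memberlist.php URL template, page count, and link regex are assumptions; real vBulletin/phpBB/Xenforo installs differ.

```python
# Sketch: pull outbound links from a forum's member-list pages to find more
# forum seeds. URL template, page range, and parsing are illustrative assumptions.
import re

import requests

LINK_RE = re.compile(r'href="(https?://[^"]+)"')

def portfolio_links(forum_base, pages=5):
    links = set()
    for page in range(1, pages + 1):
        # memberlist.php is the common phpBB-style endpoint; treat it as an assumption
        url = f"{forum_base}/memberlist.php?page={page}"
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        for href in LINK_RE.findall(html):
            if forum_base not in href:          # keep only off-site (portfolio) links
                links.add(href)
    return links
```
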
  10. CRAWL STRATEGY
      • Generic crawl strategy (n hops from seed) DOESN’T WORK
        – Content desired is from a specific time-frame
        – Large sites get several hundred entries per hour
          • Reddit – 200,000 submissions in a week
        – Need several thousand hops to get content
        – Massive bandwidth / storage costs
  11. GENERIC FORUM CRAWL
      • Page #2000 on http://ubuntuforums.org contains posts in the desired time-range
      • Forums can be used as a CMS
      • A 2000-hop crawl collects posts, (potentially) articles, blogs and other content that we don’t need, from time-frames we are not interested in
  12. GENERIC FORUM CRAWL
      • Added problems of identifying / extracting downloaded content
      • Numbers
        – Ran such a crawl from January to February
        – 2.8 TB of data crawled
        – Majority of data useless
  13. DISCUSSION FORUM CATEGORIES
      • Download using API
        – Reddit
        – DISQUS comments
      • Web-based discussion forums
        – Self-hosted: vBulletin, phpBB, myBB
        – Sites hosted on forum directories: Yuku, Zetaboards
  14. API DRIVEN SITES
      • Straightforward to download data
      • Parameters to API calls used to extract exactly the data wanted (see the sketch below)
      • Issues:
        – How to store data
          • Deeply nested data unraveled and stored in a relational DB
          • Flat data stored in flat files
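
As an illustration of driving an API with parameters and storing the flat records, here is a sketch against Reddit's public JSON listing endpoint. The subreddit, limit, User-Agent string, and JSON-lines output path are illustrative choices, not the project's actual pipeline.

```python
# Sketch: pull a listing from an API-driven site and append the flat records
# to a JSON-lines file. Uses Reddit's public JSON listing endpoint as an example.
import json

import requests

def fetch_new_submissions(subreddit, out_path, limit=100):
    url = f"https://www.reddit.com/r/{subreddit}/new.json"
    resp = requests.get(
        url,
        params={"limit": limit},
        headers={"User-Agent": "clueweb12pp-example/0.1"},  # Reddit asks for a real UA
        timeout=10,
    )
    resp.raise_for_status()
    children = resp.json()["data"]["children"]
    with open(out_path, "a") as fh:
        for child in children:                   # flat records -> flat file (JSON lines)
            fh.write(json.dumps(child["data"]) + "\n")

# Example: fetch_new_submissions("python", "reddit_submissions.jsonl")
```
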
  15. INDEX PAGES
      • Crawl exhaustively
      • Extract links to discussions from the crawl
      • Identify if the discussion is in the time-frame (see the sketch below)
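
A rough sketch of the link-extraction and time-frame check on a downloaded index page, assuming vBulletin-style showthread.php links and MM-DD-YYYY dates appearing near them; the regexes are crude, per-template assumptions rather than a general extractor.

```python
# Sketch: extract thread links from an index page and keep only threads whose
# nearby date falls in the crawl window. Regexes and date format are assumptions.
import re
from datetime import datetime

ROW_RE = re.compile(
    r'href="([^"]*showthread\.php[^"]*)".*?(\d{2}-\d{2}-\d{4})', re.S
)

def threads_in_range(index_html, start, end):
    keep = []
    for link, date_str in ROW_RE.findall(index_html):
        posted = datetime.strptime(date_str, "%m-%d-%Y")
        if start <= posted <= end:
            keep.append(link)
    return keep

# Example: threads_in_range(html, datetime(2012, 2, 1), datetime(2013, 2, 28))
```
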
  16. IDENTIFY INDEX PAGES
      • URL format:
        – forumdisplay.php, viewforum.php, thread.php, t.php, showthread.php
      • robots.txt (see the sketch below):
        – Points to sitemap.xml
        – Provides a path for a search-engine bot to follow (update frequency, path to page)
        – Index pages have high update frequency
      • Identify pagination for exhaustive index page crawl
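
A sketch of the robots.txt route: read the Sitemap: entries, parse sitemap.xml, and keep forum-style URLs that the site marks as frequently updated. The changefreq threshold and the URL patterns mirror the slide but are still assumptions; sitemap index files and per-site quirks are not handled.

```python
# Sketch: follow robots.txt to sitemap.xml and pull out frequently updated
# URLs that look like forum index pages.
import re
import xml.etree.ElementTree as ET

import requests

INDEX_RE = re.compile(
    r"(forumdisplay\.php|viewforum\.php|thread\.php|t\.php|showthread\.php)", re.I
)
SITEMAP_RE = re.compile(r"^Sitemap:\s*(\S+)", re.I | re.M)
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def index_pages(site):
    robots = requests.get(f"{site}/robots.txt", timeout=10).text
    pages = []
    for sitemap_url in SITEMAP_RE.findall(robots):
        root = ET.fromstring(requests.get(sitemap_url, timeout=10).content)
        for url_el in root.findall("sm:url", NS):
            loc = url_el.findtext("sm:loc", default="", namespaces=NS)
            freq = url_el.findtext("sm:changefreq", default="", namespaces=NS)
            # keep forum-style index URLs that the site says change often
            if INDEX_RE.search(loc) and freq in ("hourly", "daily", "always"):
                pages.append(loc)
    return pages
```
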
  17. INDEX PAGE OBSERVATIONS
      • Forum directories have consistent URLs
      • vBulletin / phpBB / Xenforo vary based on version / plugin / site-admin’s whim
  18. INDEX PAGE PITFALLS
      • Sort order of posts
        – Maintain a list of URL modifiers (path / query) that lead to reordering, e.g. ?sort=replies, /sort_REPLIES_## (see the sketch below)
        – List built by eyeballing some initial crawls
      • “prev” page link might be different
        – Current page: http://a.b.c/pg=3
        – Next page: http://a.b.c/pg=4
        – Prev page: http://a.b.c/pg=-1
      • Not a very frequent problem
      • But frequent enough to require attention
      • Not handling these can cause the same content to be downloaded 5x or 6x
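
A minimal sketch of the de-duplication idea: strip the reorder-only modifiers before deciding whether to fetch a URL. The modifier list mirrors the examples above but is a hand-built assumption; a real crawl would extend it per forum package.

```python
# Sketch: canonicalize index-page URLs so sort-order variants of the same
# page are fetched only once. The modifier list is an illustrative assumption.
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

SORT_QUERY_KEYS = {"sort", "order", "daysprune"}        # query params that only reorder
SORT_PATH_RE = re.compile(r"/sort_[A-Z]+_\d+", re.I)    # path segments like /sort_REPLIES_10

def canonicalize(url):
    scheme, netloc, path, query, _ = urlsplit(url)
    path = SORT_PATH_RE.sub("", path)
    kept = [(k, v) for k, v in parse_qsl(query) if k.lower() not in SORT_QUERY_KEYS]
    return urlunsplit((scheme, netloc, path, urlencode(kept), ""))

seen = set()
def should_fetch(url):
    """Return True the first time a canonical form of this URL is seen."""
    canon = canonicalize(url)
    if canon in seen:
        return False
    seen.add(canon)
    return True
```
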
  19. INDEX PAGE PITFALLS
      • No real solution apart from downloading the entire collection of index pages
      • Yield cannot be gauged until the entire index-page set is done
        – 1 month of Yahoo Groups crawls led to fewer than 100,000 posts (very spammy)
      • Cannot easily distribute this crawl across machines
      • While the index page crawl is running, cannot perform a discussion crawl in parallel (bandwidth restrictions)
  20. WHAT YOU GET
      • API sites (Site / Content Type / Count):
        – Reddit: Submissions – 8.5m; Top-Level Discussions – ~22.9m downloaded (about 2x this number expected)
        – DISQUS Comments: Submissions – 30m; Top-Level Comments – ~3m downloaded (overall count uncertain)
        – NYT Article Comments: Articles with discussion – ~26K
  21. WHAT YOU GET
      • Web-based discussion sites (Site / Content Type / Count):
        – Nabble: Generic Forum Discussions – ~380K
        – Usenet Newsgroups: Newsgroup threads – 1.5m threads
        – Stack-Exchange Sites: Questions – ~1m; Comments – ~1.4m; Posts overall – ~4m
  22. WHAT YOU GET
      • Web-based discussion sites (Site / Content Type / Count):
        – Forum Directories (Yuku, Zetaboards): Generic Forum Discussions – ~0.5m
        – vBulletin, phpBB, Xenforo: Generic Forum Discussions – ~5m post pages
        – Gaia Online: Comments – ~0.5m downloaded (2m expected)
  23. WHAT YOU GET
      • Product reviews (Site / Content Type / Count):
        – Amazon.com: Reviews – ~0.5m downloaded (realistic estimates not possible); Review Discussions – ~50K downloaded (cannot estimate how many more)
  24. WHAT YOU GET
      • Preserve all site-specific metadata:
        – Social networks
        – Upvotes / Karma / Downvotes
        – Purchase info
      • Links from Clueweb12++ to Clueweb12
      • Documents (DISQUS posts, Reddit submissions)
  25. HARDWARE USED
      • 64 GB RAM, 8 cores (x 3)
        – Fast-growing crawls + API-specific sites
        – CMU cluster
      • 8 GB RAM, 4 cores (x 1)
        – Moderate-size crawl + API-specific sites
        – Personal workstation
      • 512 MB RAM, 1 core (x 2)
        – API-based sites and optimized scrapers only
        – AWS Micro instance
  26. ACKNOWLEDGEMENTS
      • Reddit development team – advice on crawling the site with their API
      • Amazon – advice on optimizing the site crawl
      • LiveJournal – unbanned us several times
      • Blekko – API access to their search engine