
Seeing In The Dark: Discovery and data-mining of restricted web archives

Andy Jackson

April 25, 2013

Transcript

  1. Seeing In The Dark: Discovery and data-mining of restricted web archives
    Andrew Jackson, Web Archiving Technical Lead
    IIPC GENERAL ASSEMBLY | 25-04-2013 | LJUBLJANA
  2. The JISC UK Web Domain Dataset
    - Internet Archive UK Domain Dataset
    - 1996-2010
    - Millions of websites
    - 2.5 billion resources
    - > 35TB
    - No direct access
    - No bulk downloads
    - Open metadata datasets
    - Analytical access
  3. Host Linkage Dataset
    1996|appserver.ed.ac.uk|portico.bl.uk 1
    1996|art-www.acorn.co.uk|portico.bl.uk 1
    1996|astra.ich.ucl.ac.uk|portico.bl.uk 1
    1996|back.niss.ac.uk|portico.bl.uk 1
    1996|beta.bids.ac.uk|portico.bl.uk 2
    1996|blaiseweb.bl.uk|blaiseweb.bl.uk 4
    1996|bonsai.iielr.dmu.ac.uk|portico.bl.uk 4
    1996|dominica.lshtm.ac.uk|portico.bl.uk 1
    1996|dux.dundee.ac.uk|portico.bl.uk 2
    1996|eisv01.lancs.ac.uk|portico.bl.uk 1
    http://www.webarchive.org.uk/datasets/ukwa.ds.2/linkage/
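
The sample rows on this slide can be read with a few lines of Python. This is a minimal sketch, not the dataset's official tooling: it assumes each row is `year|source host|target host` followed by whitespace and a link count, which is an interpretation of the rows shown rather than something the slide states. The names `HostLink` and `parse_link` are made up for illustration.

```python
from collections import Counter
from typing import NamedTuple


class HostLink(NamedTuple):
    year: int
    source: str   # assumed: host the links come from
    target: str   # assumed: host the links point to
    count: int    # assumed: number of links for that year


def parse_link(line: str) -> HostLink:
    # Assumed layout: "<year>|<source>|<target> <count>".
    # rsplit on whitespace copes with either a space or a tab before the count.
    key, count = line.rsplit(None, 1)
    year, source, target = key.split("|")
    return HostLink(int(year), source, target, int(count))


# The ten rows shown on the slide.
SAMPLE = """\
1996|appserver.ed.ac.uk|portico.bl.uk 1
1996|art-www.acorn.co.uk|portico.bl.uk 1
1996|astra.ich.ucl.ac.uk|portico.bl.uk 1
1996|back.niss.ac.uk|portico.bl.uk 1
1996|beta.bids.ac.uk|portico.bl.uk 2
1996|blaiseweb.bl.uk|blaiseweb.bl.uk 4
1996|bonsai.iielr.dmu.ac.uk|portico.bl.uk 4
1996|dominica.lshtm.ac.uk|portico.bl.uk 1
1996|dux.dundee.ac.uk|portico.bl.uk 2
1996|eisv01.lancs.ac.uk|portico.bl.uk 1
"""

links = [parse_link(row) for row in SAMPLE.splitlines() if row.strip()]

# Total in-link count per target host across the sample.
in_links = Counter()
for link in links:
    in_links[link.target] += link.count

for host, total in in_links.most_common():
    print(host, total)
```

Aggregating by target host like this is one simple way to rank hosts by in-links; the full dataset would need streaming rather than an in-memory list.
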
  4. WATs
    - Web Archive Transformation (WAT)
      - https://webarchive.jira.com/wiki/display/Iresearch/Web+Archive+Transformation+(WAT)+Specification,+Utilities,+and+Usage+Overview
    - Contains links and anchor text.
    - Size & distribution:
      - 6TB of compressed JSON in WARC packaging
      - Looking at hosting options
      - CC0 licence
    - Working with the Oxford Internet Institute
      - http://www.oii.ox.ac.uk/research/projects/?id=88
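
To illustrate "links and anchor text" in JSON form, here is a hedged sketch of pulling anchor links out of one WAT record payload. The nesting used (`Envelope`, `Payload-Metadata`, `HTTP-Response-Metadata`, `HTML-Metadata`, `Links`) is an assumption based on the commonly published WAT layout and should be checked against the specification linked above. The sample record, its hosts, and the helper `extract_links` are invented for illustration, and unpacking the enclosing WARC container is left out.

```python
import json

# Invented sample payload; real WAT records carry many more fields.
SAMPLE_WAT_PAYLOAD = json.dumps({
    "Envelope": {
        "WARC-Header-Metadata": {
            "WARC-Target-URI": "http://source.example.ac.uk/index.html"
        },
        "Payload-Metadata": {
            "HTTP-Response-Metadata": {
                "HTML-Metadata": {
                    "Links": [
                        {"path": "A@/href",
                         "url": "http://target.example.org/",
                         "text": "Example target"},
                        {"path": "IMG@/src",
                         "url": "logo.gif"},
                    ]
                }
            }
        },
    }
})


def extract_links(payload: str):
    """Return (source URI, link URL, anchor text) for each <a href> link."""
    envelope = json.loads(payload).get("Envelope", {})
    source = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI")
    html = (envelope.get("Payload-Metadata", {})
            .get("HTTP-Response-Metadata", {})
            .get("HTML-Metadata", {}))
    return [
        (source, link["url"], link.get("text", ""))
        for link in html.get("Links", [])
        # Assumed convention: anchor links are tagged with path "A@/href".
        if link.get("path") == "A@/href" and "url" in link
    ]


for source, url, text in extract_links(SAMPLE_WAT_PAYLOAD):
    print(source, "->", url, repr(text))
```

Every lookup uses `.get` with a default so that records without HTML metadata (images, redirects) simply yield no links instead of raising.
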
  5. Analytical Access to the Domain Dark Archive (AADDA)
    - http://domaindarkarchive.blogspot.co.uk/
    - http://www.webarchive.org.uk/aadda-discovery/browse
  6. Integrated, Global Discovery
    - Exploit existing APIs
      - Use item hash values via Wayback to compare our archives or validate independent archives
      - Expose more information alongside the Memento API
      - Improve prototype Memento browser plugin(s)
    - Develop new APIs
      - Expose link information via Wayback and/or Memento
      - Lookup by fields other than host and timestamp, e.g.
        - In-links
        - Hash values
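
For context on what "lookup by host and timestamp" looks like today, the sketch below builds (but does not send) a Memento-style datetime negotiation request. The `Accept-Datetime` header is part of the Memento protocol; the TimeGate base URL `http://archive.example.org/timegate/` and the helper `build_timegate_request` are placeholders invented for this example, not endpoints named in the talk.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

# Hypothetical TimeGate base URL; substitute a real archive's endpoint.
TIMEGATE_BASE = "http://archive.example.org/timegate/"


def build_timegate_request(target_uri: str, when: datetime) -> Request:
    """Build a request asking for the capture of target_uri closest to `when`."""
    # Memento expects an HTTP-date in GMT, e.g. "Thu, 25 Apr 2013 00:00:00 GMT".
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return Request(
        TIMEGATE_BASE + target_uri,
        headers={"Accept-Datetime": http_date},
        method="GET",
    )


request = build_timegate_request(
    "http://portico.bl.uk/",
    datetime(2013, 4, 25, tzinfo=timezone.utc),
)
print(request.full_url)
print(request.header_items())
```

The lookup key here is only the URL plus a timestamp, which is the limitation the slide points at: finding captures by in-links or by hash value would need the new APIs it proposes.
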
  7. Summary: Inside-Out Archives
    - CC0 open datasets
    - Analytical access services
    - Richer APIs
    - Integrated, contextualized, global discovery