Seeing In The Dark: Discovery and data-mining of restricted web archives

Andy Jackson

April 25, 2013

Transcript

  1. Seeing In The Dark: Discovery and data-mining of restricted web archives
    Andrew Jackson, Web Archiving Technical Lead
    IIPC GENERAL ASSEMBLY | 25-04-2013 | LJUBLJANA
  2. The JISC UK Web Domain Dataset
    §  Internet Archive UK Domain Dataset
    §  1996-2010
    §  Millions of websites
    §  2.5 billion resources
    §  > 35TB
    §  No direct access
    §  No bulk downloads
    §  Open metadata datasets
    §  Analytical access
  3. Host Linkage Dataset
    1996|appserver.ed.ac.uk|portico.bl.uk 1
    1996|art-www.acorn.co.uk|portico.bl.uk 1
    1996|astra.ich.ucl.ac.uk|portico.bl.uk 1
    1996|back.niss.ac.uk|portico.bl.uk 1
    1996|beta.bids.ac.uk|portico.bl.uk 2
    1996|blaiseweb.bl.uk|blaiseweb.bl.uk 4
    1996|bonsai.iielr.dmu.ac.uk|portico.bl.uk 4
    1996|dominica.lshtm.ac.uk|portico.bl.uk 1
    1996|dux.dundee.ac.uk|portico.bl.uk 2
    1996|eisv01.lancs.ac.uk|portico.bl.uk 1
    http://www.webarchive.org.uk/datasets/ukwa.ds.2/linkage/
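The sample records on the Host Linkage Dataset slide appear to follow the pattern year|source host|target host, followed by a link count. The sketch below parses lines of that shape and totals the in-links per target host. The field names (`year`, `source`, `target`, `count`), the reading of the last column as a count, and the whitespace separator before it are assumptions drawn from the slide, not from the dataset's documentation.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class HostLink(NamedTuple):
    """One record: links from `source` to `target` observed in `year` (assumed meaning)."""
    year: int
    source: str
    target: str
    count: int


def parse_line(line: str) -> HostLink:
    """Parse 'year|source|target<whitespace>count' into a HostLink."""
    # Split the trailing count off first; this works for a space or a tab.
    key, count = line.rsplit(None, 1)
    year, source, target = key.split("|")
    return HostLink(int(year), source, target, int(count))


def inlinks_by_target(lines: Iterable[str]) -> Counter:
    """Sum the link counts per target host, skipping blank lines."""
    totals: Counter = Counter()
    for line in lines:
        if line.strip():
            link = parse_line(line)
            totals[link.target] += link.count
    return totals


# Three records copied from the slide.
SAMPLE = """\
1996|appserver.ed.ac.uk|portico.bl.uk 1
1996|beta.bids.ac.uk|portico.bl.uk 2
1996|blaiseweb.bl.uk|blaiseweb.bl.uk 4
"""

totals = inlinks_by_target(SAMPLE.splitlines())
print(totals.most_common())
```

The same loop could be pointed at a downloaded file object instead of `SAMPLE.splitlines()`, since it consumes lines one at a time.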
  4. WATs
    §  Web Archive Transformation (WAT)
    §  https://webarchive.jira.com/wiki/display/Iresearch/Web+Archive+Transformation+(WAT)+Specification,+Utilities,+and+Usage+Overview
    §  Contains links and anchor text.
    §  Size & distribution:
        §  6TB of compressed JSON in WARC packaging
        §  Looking at hosting options
        §  CC0 licence
    §  Working with the Oxford Internet Institute
    §  http://www.oii.ox.ac.uk/research/projects/?id=88
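To illustrate how links and anchor text might be pulled out of a single WAT JSON payload, here is a minimal sketch. The nesting of keys (`Envelope`, `Payload-Metadata`, `HTTP-Response-Metadata`, `HTML-Metadata`, `Links`) and the per-link `url` and `text` fields are an assumption about the WAT layout and should be checked against the specification linked on the slide. The sample payload is made up for illustration, and reading the surrounding WARC packaging is left out.

```python
import json
from typing import Iterator, Tuple

# Hypothetical WAT payload, invented for this example (not real archive data).
SAMPLE_WAT_PAYLOAD = """
{
  "Envelope": {
    "Payload-Metadata": {
      "HTTP-Response-Metadata": {
        "HTML-Metadata": {
          "Links": [
            {"path": "A@/href", "url": "http://portico.bl.uk/", "text": "Portico"},
            {"path": "IMG@/src", "url": "/images/logo.gif"}
          ]
        }
      }
    }
  }
}
"""

# Assumed path from the top of the record down to the HTML metadata.
HTML_METADATA_PATH = (
    "Envelope",
    "Payload-Metadata",
    "HTTP-Response-Metadata",
    "HTML-Metadata",
)


def iter_links(payload: str) -> Iterator[Tuple[str, str]]:
    """Yield (url, anchor_text) pairs; anchor_text is '' when absent."""
    node = json.loads(payload)
    for key in HTML_METADATA_PATH:
        # Records without HTML metadata simply yield nothing.
        node = node.get(key, {})
    for link in node.get("Links", []):
        if "url" in link:
            yield link["url"], link.get("text", "")


links = list(iter_links(SAMPLE_WAT_PAYLOAD))
for url, text in links:
    print(url, repr(text))
```

Walking the path with `.get(key, {})` keeps the sketch tolerant of records that carry no HTML metadata, such as non-HTML responses.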
  5. Analytical Access to the Dark Domain Archive (AADDA)
    http://domaindarkarchive.blogspot.co.uk/
    §  http://www.webarchive.org.uk/aadda-discovery/browse
  6. Integrated, Global Discovery
    §  Exploit existing APIs
        §  Use item hash values via Wayback to compare our archives or validate independent archives
        §  Expose more information alongside the Memento API
        §  Improve prototype Memento browser plugin(s)
    §  Develop new APIs
        §  Expose link information via Wayback and/or Memento
        §  Lookup by fields other than host and timestamp, e.g.
            §  In-links
            §  Hash values
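The idea of comparing archives by item hash can be sketched as follows. This assumes the hashes are SHA-1 payload digests written in Base32 with a `sha1:` prefix, which is the form commonly seen in WARC `WARC-Payload-Digest` headers; the slide does not say which digest form is meant. The function names are hypothetical, and fetching the digests from a Wayback service is out of scope here.

```python
import base64
import hashlib


def payload_digest(payload: bytes) -> str:
    """Return a 'sha1:<BASE32>' digest of the payload bytes (assumed format)."""
    raw = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(raw).decode("ascii")


def same_content(digest_a: str, digest_b: str) -> bool:
    """Compare two digests, ignoring letter case and an optional 'sha1:' prefix."""
    def normalise(digest: str) -> str:
        value = digest.strip()
        if value.lower().startswith("sha1:"):
            value = value[len("sha1:"):]
        return value.upper()

    return normalise(digest_a) == normalise(digest_b)


# Two hypothetical captures of the same resource held by different archives.
ours = payload_digest(b"<html><body>Portico</body></html>")
theirs = payload_digest(b"<html><body>Portico</body></html>")
changed = payload_digest(b"<html><body>Portico (updated)</body></html>")

print(same_content(ours, theirs))   # identical bytes give identical digests
print(same_content(ours, changed))  # any change in the bytes changes the digest
```

Matching digests only show that the payload bytes agree; they say nothing about capture time or response headers, which would need to be compared separately.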
  7. Summary: Inside-Out Archives
    §  CC0 open datasets
    §  Analytical access services
    §  Richer APIs
    §  Integrated, contextualized, global discovery