Seeing In The Dark: Discovery and data-mining of restricted web archives

Andy Jackson

April 25, 2013

Transcript

  1. Seeing In The Dark: Discovery and data-mining of restricted web archives
    Andrew Jackson, Web Archiving Technical Lead
    IIPC GENERAL ASSEMBLY | 25-04-2013 | LJUBLJANA
  2. RESTRICTED ARCHIVES Seeing in the dark

  3. Discovery in the dark

  4. The JISC UK Web Domain Dataset
    §  Internet Archive UK Domain Dataset
    §  1996-2010
    §  Millions of websites
    §  2.5 billion resources
    §  > 35TB
    §  No direct access
    §  No bulk downloads
    §  Open metadata datasets
    §  Analytical access
  5. OPEN DATASETS

  6. GeoIndexes – Discovering Local Web History http://data.webarchive.org.uk/opendata/ukwa.ds.2/geo/

  7. Format Profiles – HTML Version Analysis http://data.webarchive.org.uk/opendata/ukwa.ds.2/fmt/

  8. Top-Level Linkage Analysis http://www.webarchive.org.uk/ukwa/visualisation/ukwa.ds.2/linkage

  9. Host Linkage Dataset
    1996|appserver.ed.ac.uk|portico.bl.uk 1
    1996|art-www.acorn.co.uk|portico.bl.uk 1
    1996|astra.ich.ucl.ac.uk|portico.bl.uk 1
    1996|back.niss.ac.uk|portico.bl.uk 1
    1996|beta.bids.ac.uk|portico.bl.uk 2
    1996|blaiseweb.bl.uk|blaiseweb.bl.uk 4
    1996|bonsai.iielr.dmu.ac.uk|portico.bl.uk 4
    1996|dominica.lshtm.ac.uk|portico.bl.uk 1
    1996|dux.dundee.ac.uk|portico.bl.uk 2
    1996|eisv01.lancs.ac.uk|portico.bl.uk 1
    http://www.webarchive.org.uk/datasets/ukwa.ds.2/linkage/
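The sample records on the Host Linkage Dataset slide can be read with a few lines of Python. This is a minimal sketch, assuming each record is `year|linking host|linked-to host` followed by whitespace and a link count; that field interpretation is inferred from the sample rows, not stated on the slide. The `HostLink` type and `parse_link_line` helper are hypothetical names used only for illustration.

```python
from collections import Counter
from typing import NamedTuple


class HostLink(NamedTuple):
    year: int
    source: str   # assumed: the host that contains the links
    target: str   # assumed: the host being linked to
    count: int    # assumed: number of links observed for that year


def parse_link_line(line: str) -> HostLink:
    """Parse one 'year|source|target count' record."""
    key, count = line.rsplit(None, 1)       # count follows the last whitespace
    year, source, target = key.split("|")
    return HostLink(int(year), source, target, int(count))


# Three rows copied from the slide.
sample = """\
1996|appserver.ed.ac.uk|portico.bl.uk 1
1996|beta.bids.ac.uk|portico.bl.uk 2
1996|blaiseweb.bl.uk|blaiseweb.bl.uk 4
"""

links = [parse_link_line(row) for row in sample.splitlines() if row.strip()]

# Sum the in-link counts per target host, ignoring links from a host to itself.
inlinks = Counter()
for link in links:
    if link.source != link.target:
        inlinks[link.target] += link.count

print(inlinks.most_common())
```

Aggregating by target host in this way gives a rough "who was linked to most" ranking per year without needing any access to the archived content itself.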
  10. WATs
    §  Web Archive Transformation (WAT)
      §  https://webarchive.jira.com/wiki/display/Iresearch/Web+Archive+Transformation+(WAT)+Specification,+Utilities,+and+Usage+Overview
    §  Contains links and anchor text.
    §  Size & distribution:
      §  6TB of compressed JSON in WARC packaging
      §  Looking at hosting options
      §  CC0 licence
    §  Working with the Oxford Internet Institute
      §  http://www.oii.ox.ac.uk/research/projects/?id=88
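Since the WAT files hold links and anchor text as JSON, pulling them out of a single record needs only the standard library. The sketch below is illustrative and rests on assumptions: the key path (`Envelope`, `Payload-Metadata`, `HTTP-Response-Metadata`, `HTML-Metadata`, `Links`) and the `path`/`url`/`text` fields follow the WAT layout as commonly documented and should be checked against the specification linked on the slide. The sample record is made up, and unpacking the WARC container itself is left out.

```python
import json


def extract_links(wat_json: str):
    """Yield (source_url, target_url, anchor_text) from one WAT JSON payload.

    The key names below are assumptions about the WAT layout; verify them
    against the WAT specification before relying on this.
    """
    record = json.loads(wat_json)
    envelope = record.get("Envelope", {})
    source = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI")
    html = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
    )
    for link in html.get("Links", []):
        # Keep only hyperlinks (assumed to be marked with an 'A@...' path).
        if link.get("path", "").startswith("A@"):
            yield source, link.get("url"), link.get("text", "")


# Made-up record for illustration only.
sample = json.dumps({
    "Envelope": {
        "WARC-Header-Metadata": {"WARC-Target-URI": "http://example.org/"},
        "Payload-Metadata": {
            "HTTP-Response-Metadata": {
                "HTML-Metadata": {
                    "Links": [
                        {"path": "A@/href", "url": "http://example.com/a", "text": "Example A"},
                        {"path": "IMG@/src", "url": "http://example.org/logo.png"},
                    ]
                }
            }
        },
    }
})

found = list(extract_links(sample))
print(found)
```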
  11. DATA SERVICES

  12. Full-text Search: Prime Ministers http://www.webarchive.org.uk/ukwa

  13. Analytical Access to the Dark Domain Archive (AADDA)
    http://domaindarkarchive.blogspot.co.uk/
    §  http://www.webarchive.org.uk/aadda-discovery/browse
  14. GLOBAL INTEGRATION

  15. Memento
    §  [Mementos Screenshot] http://www.webarchive.org.uk/mementos/search

  16. Integrated, Global Discovery
    §  Exploit existing APIs
      §  Use item hash values via Wayback to compare our archives or validate independent archives
      §  Expose more information alongside the Memento API
      §  Improve prototype Memento browser plugin(s)
    §  Develop new APIs
      §  Expose link information via Wayback and/or Memento
      §  Lookup by fields other than host and timestamp, e.g.
        §  In-links
        §  Hash values
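The idea of using item hash values to compare or validate archives can be sketched as follows. Everything here is a hypothetical illustration: the slide does not specify a hash algorithm or an API, so the sketch assumes a Base32-encoded SHA-1 payload digest (a convention often seen in Wayback-style indexes) and assumes each archive can supply a mapping from `(url, timestamp)` to digest. The names `payload_digest` and `compare_captures` are invented for this example.

```python
import base64
import hashlib


def payload_digest(payload: bytes) -> str:
    """Base32-encoded SHA-1 of a payload (assumed digest convention)."""
    return base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")


def compare_captures(ours: dict, theirs: dict) -> dict:
    """Compare two {(url, timestamp): digest} maps.

    Returns the keys that match, that differ, and that the other archive lacks.
    """
    report = {"match": [], "mismatch": [], "missing": []}
    for key, digest in ours.items():
        if key not in theirs:
            report["missing"].append(key)
        elif theirs[key] == digest:
            report["match"].append(key)
        else:
            report["mismatch"].append(key)
    return report


# Made-up captures for illustration only.
ours = {
    ("http://example.org/", "19961219000000"): payload_digest(b"<html>home</html>"),
    ("http://example.org/a", "19970101000000"): payload_digest(b"<html>a</html>"),
    ("http://example.org/b", "19970101000000"): payload_digest(b"<html>b</html>"),
}
theirs = {
    ("http://example.org/", "19961219000000"): payload_digest(b"<html>home</html>"),
    ("http://example.org/a", "19970101000000"): payload_digest(b"<html>a, edited</html>"),
}

report = compare_captures(ours, theirs)
print(report)
```

Comparing digests rather than content means the check can run against a restricted archive without exposing the archived resources themselves.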
  17. INSIDE-OUT ARCHIVES

  18. Summary: Inside-Out Archives
    §  CC0 open datasets
    §  Analytical access services
    §  Richer APIs
    §  Integrated, contextualized, global discovery