Seeing In The Dark: Discovery and data-mining of restricted web archives

Andy Jackson

April 25, 2013

Transcript

  1. Seeing In The Dark
    Discovery and data-mining of
    restricted web archives
    Andrew Jackson,
    Web Archiving Technical Lead
    IIPC GENERAL ASSEMBLY | 25-04-2013 | LJUBLJANA

  2. RESTRICTED ARCHIVES
    Seeing in the dark

  3. Discovery in the dark

  4. The JISC UK Web Domain Dataset
    §  Internet Archive UK Domain Dataset
    §  1996-2010
    §  Millions of websites
    §  2.5 billion resources
    §  > 35TB
    §  No direct access
    §  No bulk downloads
    §  Open metadata datasets
    §  Analytical access

  5. OPEN DATASETS
    Seeing in the dark

  6. GeoIndexes – Discovering Local Web History
    http://data.webarchive.org.uk/opendata/ukwa.ds.2/geo/

  7. Format Profiles – HTML Version Analysis
    http://data.webarchive.org.uk/opendata/ukwa.ds.2/fmt/

  8. Top-Level Linkage Analysis
    http://www.webarchive.org.uk/ukwa/visualisation/ukwa.ds.2/linkage

  9. Host Linkage Dataset
    1996|appserver.ed.ac.uk|portico.bl.uk 1
    1996|art-www.acorn.co.uk|portico.bl.uk 1
    1996|astra.ich.ucl.ac.uk|portico.bl.uk 1
    1996|back.niss.ac.uk|portico.bl.uk 1
    1996|beta.bids.ac.uk|portico.bl.uk 2
    1996|blaiseweb.bl.uk|blaiseweb.bl.uk 4
    1996|bonsai.iielr.dmu.ac.uk|portico.bl.uk 4
    1996|dominica.lshtm.ac.uk|portico.bl.uk 1
    1996|dux.dundee.ac.uk|portico.bl.uk 2
    1996|eisv01.lancs.ac.uk|portico.bl.uk 1
    http://www.webarchive.org.uk/datasets/ukwa.ds.2/linkage/
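
The sample rows above can be read with a few lines of code. The following is a minimal sketch, not the dataset's official tooling: it assumes each row is `year|source host|target host`, followed by whitespace and a link count, which is an interpretation of the sample rather than something the slide states. The names `HostLink`, `parse_line` and `inlinks_by_target` are hypothetical.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class HostLink(NamedTuple):
    year: int
    source: str  # assumed: host that contains the links
    target: str  # assumed: host that is linked to
    count: int   # assumed: number of links seen in that year


def parse_line(line: str) -> HostLink:
    """Parse one row such as '1996|dux.dundee.ac.uk|portico.bl.uk 2'."""
    # The count is the last whitespace-separated token on the line.
    key, count = line.rsplit(None, 1)
    year, source, target = key.split("|")
    return HostLink(int(year), source, target, int(count))


def inlinks_by_target(lines: Iterable[str]) -> Counter:
    """Sum link counts per (year, target host), skipping blank lines."""
    totals = Counter()
    for line in lines:
        if line.strip():
            link = parse_line(line)
            totals[(link.year, link.target)] += link.count
    return totals


SAMPLE = """\
1996|appserver.ed.ac.uk|portico.bl.uk 1
1996|beta.bids.ac.uk|portico.bl.uk 2
1996|blaiseweb.bl.uk|blaiseweb.bl.uk 4
"""

totals = inlinks_by_target(SAMPLE.splitlines())
for (year, host), n in sorted(totals.items()):
    print(year, host, n)
```

Aggregating by target host is only one possible question to ask of the data; the same parsed rows could equally be grouped by source host or by year.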

  10. WATs
    §  Web Archive Transformation (WAT)
    §  https://webarchive.jira.com/wiki/display/Iresearch/Web+Archive+Transformation+(WAT)+Specification,+Utilities,+and+Usage+Overview
    §  Contains links and anchor text.
    §  Size & distribution:
    §  6TB of compressed JSON in WARC packaging
    §  Looking at hosting options
    §  CC0 licence
    §  Working with the Oxford Internet Institute
    §  http://www.oii.ox.ac.uk/research/projects/?id=88
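
To illustrate the "links and anchor text" point, here is a hedged sketch of pulling link data out of the JSON body of a single WAT record. The key names used (`Envelope`, `WARC-Header-Metadata`, `Payload-Metadata`, `HTTP-Response-Metadata`, `HTML-Metadata`, `Links`, `url`, `text`) follow the layout commonly seen in WAT files and should be checked against the WAT specification linked above. The function name `extract_links` and the `example.org` record are made up for illustration, and reading records out of the WARC packaging is not shown.

```python
import json
from typing import Iterator, Tuple


def extract_links(wat_json: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (source URI, link URL, anchor text) from one WAT JSON payload.

    The nested key names are assumptions about the WAT layout; any missing
    level simply results in no links being yielded.
    """
    envelope = json.loads(wat_json).get("Envelope", {})
    source = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI", "")
    html_meta = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
    )
    for link in html_meta.get("Links", []):
        if "url" in link:
            # Anchor text is absent for some link types (e.g. images).
            yield source, link["url"], link.get("text", "")


# A made-up record for illustration only.
SAMPLE_RECORD = json.dumps({
    "Envelope": {
        "WARC-Header-Metadata": {"WARC-Target-URI": "http://example.org/"},
        "Payload-Metadata": {
            "HTTP-Response-Metadata": {
                "HTML-Metadata": {
                    "Links": [
                        {"path": "A@/href", "url": "http://example.org/about", "text": "About us"},
                        {"path": "IMG@/src", "url": "http://example.org/logo.png"},
                    ]
                }
            }
        },
    }
})

links = list(extract_links(SAMPLE_RECORD))
for source, url, text in links:
    print(source, "->", url, repr(text))
```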

  11. DATA SERVICES
    Seeing in the dark

  12. Full-text Search: Prime Ministers
    http://www.webarchive.org.uk/ukwa

  13. Analytical Access to the Dark Domain Archive (AADDA)
    http://domaindarkarchive.blogspot.co.uk/
    §  http://www.webarchive.org.uk/aadda-discovery/browse

  14. GLOBAL INTEGRATION
    Seeing in the dark

  15. Memento
    §  [Mementos Screenshot]
    http://www.webarchive.org.uk/mementos/search

  16. Integrated, Global Discovery
    §  Exploit existing APIs
        §  Use item hash values via Wayback to compare our archives or validate independent archives
        §  Expose more information alongside the Memento API
        §  Improve prototype Memento browser plugin(s)
    §  Develop new APIs
        §  Expose link information via Wayback and/or Memento
        §  Lookup by fields other than host and timestamp, e.g.
            §  In-links
            §  Hash values
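
The idea of comparing archives by item hash can be sketched as follows. This is an illustration under stated assumptions, not a description of any existing Wayback or Memento API: it assumes each archive can supply a mapping from a capture key to a payload digest, and that digests use the `sha1:` plus Base32 form commonly found in WARC and CDX data. The names `payload_digest` and `compare_holdings`, and the capture keys, are hypothetical.

```python
import base64
import hashlib
from typing import Dict, Set, Tuple


def payload_digest(payload: bytes) -> str:
    """Return a 'sha1:<Base32>' digest of the payload bytes."""
    raw = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(raw).decode("ascii")


def compare_holdings(
    ours: Dict[str, str], theirs: Dict[str, str]
) -> Tuple[Set[str], Set[str]]:
    """Compare two {capture key: digest} maps.

    Returns (matching, differing) for the keys both archives hold.
    Keys held by only one archive are ignored in this sketch.
    """
    shared = ours.keys() & theirs.keys()
    matching = {key for key in shared if ours[key] == theirs[key]}
    return matching, shared - matching


# Hypothetical capture keys and payloads, for illustration only.
ours = {
    "1996 http://example.org/": payload_digest(b"<html>home</html>"),
    "1996 http://example.org/a": payload_digest(b"<html>a</html>"),
}
theirs = {
    "1996 http://example.org/": payload_digest(b"<html>home</html>"),
    "1996 http://example.org/a": payload_digest(b"<html>a, edited</html>"),
    "1996 http://example.org/b": payload_digest(b"<html>b</html>"),
}

matching, differing = compare_holdings(ours, theirs)
print("identical captures:", sorted(matching))
print("captures that differ:", sorted(differing))
```

Matching digests suggest two archives hold the same bytes for a capture; differing digests flag captures worth inspecting, whether the cause is a genuine content change or a problem in one copy.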

  17. INSIDE-OUT ARCHIVES
    Seeing in the dark

  18. Summary: Inside-Out Archives
    §  CC0 open datasets
    §  Analytical access services
    §  Richer APIs
    §  Integrated, contextualized, global discovery
