Digging into the Web Archive at the British Library

Andy Jackson
November 27, 2014

A status update on how we use data mining to support the management and exploitation of our web archives.

Transcript

  1. Digging into the Web Archive
    at the British Library
    Andrew Jackson
    UK Web Archive Technical Lead

  2. www.bl.uk 2
    Collections & Scale
    •  Three collections:
       –  By permission (2004-2013)
          •  c. 200 million URLs
       –  Legal Deposit (2013 onwards)
          •  c. 2 billion URLs/year (30 TB/year)
       –  JISC/IA Historical (1996-2013)
          •  c. 6 billion URLs (57 TB)
    •  Use data-mining to support:
       –  Access
       –  Search
       –  Preservation
       –  Web science
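
As a rough sanity check on the figures above, the quoted totals imply an average size per archived URL. The sketch below assumes decimal terabytes and treats the slide's rounded figures as exact; neither assumption is stated in the slides, and `mean_bytes_per_url` is a name introduced here purely for illustration.

```python
# Rough average size per archived URL, using the figures quoted on the slide.
# Assumption: 1 TB = 10**12 bytes (the slide does not say which unit is meant).
TB = 10**12

# (number of URLs, total bytes) for the two collections with a quoted size.
collections = {
    "Legal Deposit (per year)": (2_000_000_000, 30 * TB),
    "JISC/IA Historical": (6_000_000_000, 57 * TB),
}


def mean_bytes_per_url(urls, total_bytes):
    """Average number of bytes stored per URL."""
    return total_bytes / urls


for name, (urls, total_bytes) in collections.items():
    kb = mean_bytes_per_url(urls, total_bytes) / 1000
    print(f"{name}: ~{kb:.1f} kB per URL")
```

Under these assumptions the averages come out at roughly 15 kB per URL for Legal Deposit and 9.5 kB per URL for the historical collection. The slides do not say whether the quoted sizes are compressed, so these are order-of-magnitude figures only.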

  3. Single-Item Retrieval

  4. Web Archive Architecture

  5. Search & Analytical Access
    • ‘Title-level’ search:
      – Millions of homepages found via metadata
    • Full-text search:
      – Billions of resources
      – Dedicated faceted search service
    • Analytical access:
      – Combine faceted full-text search with:
        • Trend analysis
        • Visualisation tools
      – Working with modern historians to drive development
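
To make "faceted full-text search with trend analysis" concrete, here is a minimal in-memory sketch. It is not the archive's search service, whose implementation the slides do not describe. The records in `DOCS`, the facet on domain suffix, and the `trend` function are all hypothetical, chosen to echo the longitudinal analysis of Prime Ministers on the next slide.

```python
from collections import Counter

# Hypothetical toy records: (crawl_year, domain_suffix, text).
# Made up for illustration; not drawn from the archive.
DOCS = [
    (2005, "co.uk", "tony blair visits school"),
    (2005, "gov.uk", "statement by tony blair"),
    (2008, "co.uk", "gordon brown on the economy"),
    (2008, "gov.uk", "gordon brown budget speech"),
    (2008, "co.uk", "tony blair memoir rumours"),
    (2011, "co.uk", "david cameron coalition news"),
]


def trend(docs, phrase, suffix=None):
    """Fraction of documents per year whose text contains `phrase`.

    If `suffix` is given, only documents with that facet value are counted.
    Normalising by the yearly total keeps years with more crawled
    documents from dominating the trend.
    """
    totals, hits = Counter(), Counter()
    for year, doc_suffix, text in docs:
        if suffix is not None and doc_suffix != suffix:
            continue
        totals[year] += 1
        if phrase in text:
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


print(trend(DOCS, "tony blair"))
print(trend(DOCS, "gordon brown", suffix="gov.uk"))
```

A real service would push both the facet filter and the per-year counts down into the search index rather than scanning documents, but the shape of the result (a normalised count per year, restricted by facet) is the same.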

  6. Longitudinal Analysis (Prime Ministers)

  7. Embedded Licenses

  8. Secondary Datasets
    • Facts about content, including:
      – Crawl index
      – Geo-index
      – Format profiles
      – Link graphs
    • Facilitate independent research
    • Can be made available under CC0
    • Hosted at http://data.webarchive.org.uk/opendata/
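
As an illustration of how a link-graph dataset of this kind might be derived, the sketch below collapses page-level links into host-to-host edges counted per crawl year. The `(year, source_url, target_url)` record layout, the example URLs, and the `host_graph` function are assumptions made for this example; the published datasets may use a different format.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical page-level link records: (crawl_year, source_url, target_url).
LINKS = [
    (1996, "http://www.example.co.uk/news/a.html", "http://www.example.org.uk/index.htm"),
    (1996, "http://www.example.co.uk/news/b.html", "http://www.example.org.uk/about.htm"),
    (1996, "http://www.example.co.uk/news/b.html", "http://www.example.co.uk/sport/"),
    (1997, "http://www.example.ac.uk/", "http://www.example.co.uk/"),
]


def host_graph(links):
    """Count links between distinct hosts, keyed by (year, source, target)."""
    edges = Counter()
    for year, src, dst in links:
        src_host = urlsplit(src).hostname
        dst_host = urlsplit(dst).hostname
        # Skip unparseable URLs and links that stay within one host.
        if src_host and dst_host and src_host != dst_host:
            edges[(year, src_host, dst_host)] += 1
    return edges


for (year, src_host, dst_host), count in sorted(host_graph(LINKS).items()):
    print(year, src_host, "->", dst_host, count)
```

Reducing pages to hosts keeps the derived graph small enough to share, while still supporting the kind of host-level exploration shown on the following slides.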

  9. Exploring Links Between Hosts
    Courtesy of Peter Webster, Rainer Simon and Jules Mataly

  10. Links From 1996

  11. Top-Level Links Over Time

  12. Access Service Spectrum
    • Single-item retrieval
    • ‘Title-level’ search
    • Full-text search
    • Analytics & visualisation (at full scale)
    • Secondary datasets
    • Remote analysis of datasets (an API, e.g. SPARQL)
    • Full computational access service (internal only right now)
    • Not just the web archive?

  13. Thank you!
    Email: [email protected]
    Twitter: @anjacks0n
    UK Web Archive: http://www.webarchive.org.uk
    Blog: http://britishlibrary.typepad.co.uk/webarchive/
    Twitter: @ukwebarchive
