Andy Jackson
November 27, 2014

Digging into the Web Archive at the British Library

Status update on how we use data mining to support the management and exploitation of our web archives.



Transcript

  1. Digging into the Web Archive at the British Library
     Andrew Jackson, UK Web Archive Technical Lead
  2. Collections & Scale (www.bl.uk)
     • Three collections:
       – By permission (2004-2013)
         • c. 200 million URLs
       – Legal Deposit (2013 onwards)
         • c. 2 billion URLs/year (30TB/y)
       – JISC/IA Historical (1996-2013)
         • c. 6 billion URLs (57TB)
     • Use data-mining to support:
       – Access
       – Search
       – Preservation
       – Web science
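
The scale figures on this slide imply a rough average size per archived URL. The sketch below works that out; it assumes decimal terabytes and treats the quoted "c." values as exact, so the results are order-of-magnitude estimates only. The function and variable names are illustrative, not taken from the talk.

```python
# Approximate figures quoted on the slide ("c." values).
TB = 10**12  # assumption: decimal terabytes

collections = {
    "Legal Deposit (per year)": (2_000_000_000, 30 * TB),
    "JISC/IA Historical": (6_000_000_000, 57 * TB),
}

def mean_bytes_per_url(urls: int, total_bytes: int) -> float:
    """Average stored bytes per URL for one collection."""
    return total_bytes / urls

for name, (urls, size) in collections.items():
    print(f"{name}: ~{mean_bytes_per_url(urls, size) / 1000:.1f} kB per URL")
```

On these assumptions the Legal Deposit crawl averages about 15 kB per URL and the JISC/IA Historical collection about 9.5 kB per URL.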
  3. Search & Analytical Access
     • ‘Title-level’ search:
       – Millions of homepages found via metadata
     • Full-text search:
       – Billions of resources
       – Dedicated faceted search service
     • Analytical access:
       – Combine faceted full-text search with:
         • Trend analysis
         • Visualisation tools
       – Working with modern historians to drive development
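
As a toy illustration of the trend analysis mentioned on this slide, the sketch below computes, for each crawl year, the fraction of documents that contain a given term. It is not the archive's actual search service: the function name `term_trend`, the `(year, text)` input shape, and the sample documents are all hypothetical, and a real system would run this kind of query against a full-text index rather than in memory.

```python
from collections import Counter

def term_trend(docs, term):
    """Return {year: fraction of that year's documents containing term}.

    docs is an iterable of (crawl_year, text) pairs (hypothetical shape).
    Matching is case-insensitive on whitespace-separated tokens.
    """
    totals, hits = Counter(), Counter()
    needle = term.lower()
    for year, text in docs:
        totals[year] += 1
        if needle in text.lower().split():
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}

# Made-up sample data, for illustration only.
sample_docs = [
    (2005, "Welcome to our homepage"),
    (2005, "Read our blog for news"),
    (2010, "New blog post published"),
    (2010, "Follow the blog on twitter"),
]
trend = term_trend(sample_docs, "blog")
print(trend)
```

Normalising by the number of documents per year matters here, because crawl volume differs widely between years and raw counts would mostly reflect crawl size.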
  4. Secondary Datasets
     • Facts about content, including:
       – Crawl index
       – Geo-index
       – Format profiles
       – Link graphs
     • Facilitate independent research
     • Can be made available under CC0
     • Hosted at http://data.webarchive.org.uk/opendata/
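
To make the idea of a link-graph dataset concrete, here is a minimal sketch that collapses page-level links into host-to-host edge counts. The input shape (pairs of source and target URLs), the function name `host_link_graph`, and the sample links are assumptions for illustration; the published datasets may use a different schema and granularity.

```python
from collections import Counter
from urllib.parse import urlsplit

def host_link_graph(links):
    """Count links between distinct hosts.

    links is an iterable of (source_url, target_url) pairs (assumed shape).
    Links within a single host, and URLs without a hostname, are skipped.
    """
    edges = Counter()
    for source, target in links:
        src_host = urlsplit(source).hostname
        dst_host = urlsplit(target).hostname
        if src_host and dst_host and src_host != dst_host:
            edges[(src_host, dst_host)] += 1
    return edges

# Made-up sample links, for illustration only.
sample_links = [
    ("http://www.bl.uk/a", "http://www.webarchive.org.uk/"),
    ("http://www.bl.uk/b", "http://www.webarchive.org.uk/ukwa/"),
    ("http://www.bl.uk/a", "http://www.bl.uk/b"),
]
graph = host_link_graph(sample_links)
print(graph)
```

Aggregating to host level keeps the dataset small enough to share and describes facts about the content rather than reproducing the content itself.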
  5. Access Service Spectrum
     • Single-item retrieval
     • ‘Title-level’ search
     • Full-text search
     • Analytics & visualisation (at full scale)
     • Secondary datasets
     • Remote analysis of datasets (an API, e.g. SPARQL)
     • Full computational access service (internal only right now)
     • Not just the web archive?
  6. Thank you!
     Email: [email protected]
     Twitter: @anjacks0n
     UK Web Archive: http://www.webarchive.org.uk
     Blog: http://britishlibrary.typepad.co.uk/webarchive/
     Twitter: @ukwebarchive