Large-Scale Web Archive Discovery & Analytics Using Apache Solr

Presentation given at the 2014 IIPC GA

Andy Jackson

May 20, 2014

Transcript

  1. Large-Scale Web Archive Discovery & Analytics Using Apache Solr
     Andrew Jackson, UK Web Archive Technical Lead
  2. Context (www.bl.uk)
     • Three collections:
       – Selective since 2004
       – Legal Deposit since 2013
       – Historical 1996-2013 from IA
     • Iterative Development:
       – Work directly with researchers
       – Today’s historical research tools provide tomorrow’s reading rooms
     • Using Solr to support:
       – Discovery
       – Preservation
       – Analytics
  3. Discovery
     • Web archives tend to be messy
       – Lots of poor quality content, e.g. from crawler traps.
       – Spam, e.g. link spam from link farms.
       – Utility of PageRank over time is unclear
     • Faceted search
       – Invest in developing facets to allow filtering rather than PageRank or boosts to rank results.
       – e.g. basic facets from embedded metadata:
         • Last-Modified, Author, etc.
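The facet-based filtering described on this slide can be sketched as a Solr request built by hand. This is a minimal sketch, not the UK Web Archive's actual query code: the core URL and the field names `author` and `last_modified_year` are hypothetical, while `q`, `fq`, `facet`, `facet.field` and `facet.mincount` are standard Solr request parameters.

```python
from urllib.parse import urlencode


def build_facet_query(base_url, text, facet_fields, filters=None):
    """Build a Solr /select URL that asks for facet counts and applies filters.

    base_url and all field names are illustrative assumptions.
    """
    params = [
        ("q", text),
        ("rows", "10"),
        ("wt", "json"),
        ("facet", "true"),
        ("facet.mincount", "1"),  # hide facet values with no matches
    ]
    # One facet.field parameter per facet the user can filter on.
    for field in facet_fields:
        params.append(("facet.field", field))
    # Each selected facet value becomes a filter query (fq), which narrows
    # the result set without influencing relevance ranking.
    for field, value in (filters or {}).items():
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        params.append(("fq", f'{field}:"{escaped}"'))
    return f"{base_url}/select?{urlencode(params)}"


url = build_facet_query(
    "http://localhost:8983/solr/webarchive",  # hypothetical core
    "olympics",
    ["author", "last_modified_year"],
    {"author": "BBC"},
)
print(url)
```

Using `fq` rather than folding the filter into `q` keeps the filtering independent of scoring, which matches the slide's preference for facets over boosts.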
  4. Discovery: HTML Links (also)

  5. Discovery: Embedded Licenses

  6. Discovery: Text features
     • No stemming or lemmatization
       – Researchers hated it
     • Natural language detection
       – e.g. gov.uk + fr
     • Postcode-based geoindex
     • Sentiment analysis
     • Similarity hashing via ssdeep
       – To detect similar texts
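As an illustration of the first step of a postcode-based geoindex, the sketch below pulls UK-style postcodes out of page text. The regular expression is a deliberately simplified assumption (it does not validate against the full postcode specification), and mapping each postcode to coordinates would need a lookup table that is not shown here.

```python
import re

# Simplified UK postcode pattern: outward code (e.g. "NW1", "LS23")
# followed by inward code (e.g. "2DB"). An assumption for illustration only.
POSTCODE_RE = re.compile(
    r"\b([A-Z]{1,2}[0-9][A-Z0-9]?)\s*([0-9][A-Z]{2})\b", re.IGNORECASE
)


def extract_postcodes(text):
    """Return distinct postcodes in order of first appearance, normalised
    to upper case with a single space between outward and inward codes."""
    found = []
    for outward, inward in POSTCODE_RE.findall(text):
        postcode = f"{outward.upper()} {inward.upper()}"
        if postcode not in found:
            found.append(postcode)
    return found


def outward_code(postcode):
    """Coarser key (postcode district), useful as a facet value."""
    return postcode.split(" ")[0]


sample = "Euston Road, London NW1 2DB and Boston Spa LS23 7BQ; also nw12db."
print(extract_postcodes(sample))
```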
  7. Discovery: Image features
     • Basic properties:
       – width, height, pixel count
     • Face detection
       – Number of faces & location
     • Dominant colour extraction
       – ‘Characteristic’ colours
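The slide does not say how the dominant colours are computed. One simple possibility, shown here purely as an assumption, is to quantise each pixel into coarse RGB buckets and report the most frequent bucket centres. The sketch takes already-decoded pixels as `(r, g, b)` tuples so that it stays free of image-library dependencies.

```python
from collections import Counter


def basic_properties(width, height):
    """Basic per-image fields of the kind listed on the slide."""
    return {"width": width, "height": height, "pixel_count": width * height}


def dominant_colours(pixels, top_n=3, step=64):
    """Histogram-based dominant colours (an illustrative technique, not
    necessarily the one used in the indexer).

    Each channel is quantised into buckets of size `step`; the centre of
    each of the most common buckets is returned.
    """
    buckets = Counter()
    for r, g, b in pixels:
        centre = tuple((c // step) * step + step // 2 for c in (r, g, b))
        buckets[centre] += 1
    return [colour for colour, _ in buckets.most_common(top_n)]


# Six reddish pixels, three bluish pixels, one near-black pixel.
pixels = [(250, 10, 10)] * 6 + [(10, 10, 250)] * 3 + [(12, 12, 12)]
print(basic_properties(5, 2))
print(dominant_colours(pixels, top_n=2))
```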
  8. Preservation
     • Format analysis:
       – Using extended MIME types (inc. version + charset):
         • Served
         • Apache Tika
         • DROID
       – First-four-bytes
       – File extension
     • Examples
       – Understanding Unidentified Resources
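The two cheapest format signals on this slide, the first four bytes and the file extension, can be derived without any format-identification tool. The sketch below is an assumption about how such fields might be computed; the dictionary keys are hypothetical and are not taken from the webarchive-discovery schema.

```python
import posixpath
from urllib.parse import urlparse


def quick_signature(url, payload):
    """Return the first four payload bytes as hex plus the lower-cased
    file extension taken from the URL path (query string ignored)."""
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower().lstrip(".")
    return {
        "first_four_bytes": payload[:4].hex(),
        "extension": extension,
    }


sig = quick_signature("http://example.org/docs/report.PDF?x=1", b"%PDF-1.4\n")
print(sig)
```

Grouping unidentified resources by these two fields is one way to see which unknown formats occur most often and so deserve a proper signature.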
  9. HTML Versions Over Time

  10. Preservation
      • Deeper characterisation
        – Software identifiers
        – (X)HTML: Elements Used
        – XML: Root Namespace
        – PDF: Apache Preflight
        – Apache Tika's parse errors
        – Will consider adding:
          • DRMLint (SCAPE)
          • JHOVE
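Two of the characterisation fields above, the set of (X)HTML elements used and the XML root namespace, can be sketched with the Python standard library. This is an illustrative stand-in only; the slides do not describe the parsers the indexer itself relies on.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Records the name of every element that is opened in a document."""

    def __init__(self):
        super().__init__()
        self.elements = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names; self-closing tags such as
        # <br/> also reach this method via handle_startendtag.
        self.elements.add(tag)


def elements_used(html_text):
    collector = ElementCollector()
    collector.feed(html_text)
    collector.close()
    return sorted(collector.elements)


def xml_root_namespace(xml_text):
    """Return the namespace URI of the root element, or None if absent."""
    root = ET.fromstring(xml_text)
    if root.tag.startswith("{"):
        return root.tag[1:].split("}", 1)[0]
    return None


html_doc = "<html><body><p>Hi<br/><FONT size=2>x</FONT></p></body></html>"
print(elements_used(html_doc))
print(xml_root_namespace('<svg xmlns="http://www.w3.org/2000/svg"/>'))
```

Indexing the element list as a multi-valued field is what makes a chart such as "Elements Over Time" a plain facet query over crawl year.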
  11. Elements Over Time

  12. PDF/A Validation Errors

  13. Parse Errors

  14. Analytics
      • Researcher Expectations
        – “How big is the UK Web?”
      • From Crawl To Web
        – Crawl schedule, parameters, logs.
        – “Files over 10MB are not archived”
        – De-duplication handling critical
        – Can't forget HTTP 30x, 40x, 50x
      • Compensate via normalisation strategies
        – cf. Google Books Ngram
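One normalisation strategy in the spirit of the Google Books Ngram Viewer is to report matches per year as a fraction of everything archived in that year, so that changes in crawl size are not mistaken for changes in the web itself. The sketch below assumes the per-year counts have already been obtained (for example from facet counts); the numbers are made up for illustration.

```python
def normalise_by_year(match_counts, total_counts):
    """Divide per-year match counts by per-year totals.

    Years with no archived resources are skipped to avoid division by zero;
    years with no matches are reported as 0.0.
    """
    return {
        year: match_counts.get(year, 0) / total
        for year, total in sorted(total_counts.items())
        if total > 0
    }


# Illustrative numbers only: raw matches rise sixfold, yet the share falls.
matches = {2004: 50, 2005: 300}
totals = {2004: 1000, 2005: 10000, 2006: 20000}
shares = normalise_by_year(matches, totals)
print(shares)
```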
  15. Technical Architecture
      • Core indexer can run from CLI or Hadoop
        – Makes development much easier
      • Hadoop indexer has two modes:
        – SolrCloud:
          • Performance acceptable as long as shards map to cores and there's good I/O (1 billion, 1 server, 1 week)
          • Memory issues relating to query complexity
        – Direct to HDFS:
          • Really fast for moderate data volumes
          • Slows down as shards grow
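Reading "1 billion, 1 server, 1 week" as one billion documents indexed on a single server in one week (an interpretation, since the slide gives no further detail), the implied sustained indexing rate can be worked out directly:

```python
def docs_per_second(documents, days):
    """Average throughput needed to index `documents` in `days` days."""
    return documents / (days * 24 * 60 * 60)


rate = docs_per_second(1_000_000_000, 7)
print(f"{rate:.0f} documents per second")
```

Under that reading, the server has to sustain roughly 1,650 documents per second around the clock.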
  16. Scale
      • 1996-2010 tranche of the IA dataset:
        – 2.5 billion HTTP 200 URLs
      • Performance issues:
        – Data quality
        – Robustness
        – Configuration errors
      • Currently re-indexing:
        – with better duplicate handling
        – on three dedicated servers
  17. Open Collaboration
      • Fully open source stack:
        – webarchive-discovery indexer
        – Begun developing an analytics UI
      • Keen to collaborate
        – This community faces a common problem:
          • But not a core SolrCloud/ElasticSearch use case
        – Danish SolrCloud on SSD discovered via Solr mailing list
          • http://sbdevel.wordpress.com/2013/12/06/danish-webscale/
  18. Thank you