
Large-Scale Web Archive Discovery & Analytics Using Apache Solr

Presentation given at the 2014 IIPC GA


Andy Jackson

May 20, 2014


Transcript

  1. Context (www.bl.uk)
     • Three collections:
       – Selective since 2004
       – Legal Deposit since 2013
       – Historical 1996-2013 from IA
     • Iterative development:
       – Work directly with researchers
       – Today's historical research tools provide tomorrow's reading rooms
     • Using Solr to support:
       – Discovery
       – Preservation
       – Analytics
  2. Discovery
     • Web archives tend to be messy:
       – Lots of poor-quality content, e.g. from crawler traps
       – Spam, e.g. link spam from link farms
       – Utility of PageRank over time is unclear
     • Faceted search:
       – Invest in developing facets to allow filtering, rather than using PageRank or boosts to rank results
       – e.g. basic facets from embedded metadata: Last-Modified, Author, etc.
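
The facet-driven filtering described above maps directly onto Solr's standard faceting parameters. Below is a minimal sketch of building such a request's query string; the field names (`author`, `last_modified_year`, `crawl_year`) are illustrative assumptions, not the deck's actual schema.

```python
from urllib.parse import urlencode

def facet_query(q, facet_fields, filters=None):
    """Build the query string for a Solr /select request with facets.

    Field names used by callers (e.g. 'author', 'last_modified_year')
    are illustrative; the real webarchive-discovery schema may differ.
    """
    params = [("q", q), ("facet", "true"), ("facet.mincount", "1")]
    # One facet.field parameter per requested facet.
    params += [("facet.field", f) for f in facet_fields]
    # fq filters narrow the result set without affecting scoring.
    for fq in (filters or []):
        params.append(("fq", fq))
    return urlencode(params)

qs = facet_query("content:olympics",
                 ["author", "last_modified_year"],
                 filters=["crawl_year:2012"])
```

Filtering with `fq` rather than boosting the main query is what makes this approach independent of PageRank-style ranking: each facet value narrows the set, and the facet counts themselves act as the overview.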
  3. Discovery: Text features
     • No stemming or lemmatization:
       – Researchers hated it
     • Natural language detection:
       – e.g. gov.uk + fr
     • Postcode-based geoindex
     • Sentiment analysis
     • Similarity hashing via ssdeep:
       – To detect similar texts
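
A postcode-based geoindex needs a text-extraction step before anything can be geocoded. The sketch below pulls candidate UK postcodes from page text with a deliberately simplified regular expression; the real postcode grammar has more edge cases (e.g. GIR 0AA) than this illustrative version covers, and the function name is an assumption, not part of the indexer.

```python
import re

# Simplified UK postcode pattern: 1-2 letter area, digit, optional
# digit/letter, then the 3-character inward code.
POSTCODE = re.compile(r"\b[A-Z]{1,2}[0-9][0-9A-Z]?\s*[0-9][A-Z]{2}\b")

def extract_postcodes(text):
    """Return distinct postcodes found in a page's text, normalised
    to 'OUTCODE INCODE' form for use as a geoindex field."""
    found = []
    for match in POSTCODE.findall(text.upper()):
        compact = re.sub(r"\s+", "", match)
        # The inward code is always the last three characters.
        norm = compact[:-3] + " " + compact[-3:]
        if norm not in found:
            found.append(norm)
    return found
```

Each normalised postcode can then be resolved to coordinates via a lookup table (e.g. Ordnance Survey open data) and stored in a Solr spatial field.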
  4. Discovery: Image features
     • Basic properties:
       – Width, height, pixel count
     • Face detection:
       – Number of faces & location
     • Dominant colour extraction:
       – 'Characteristic' colours
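
One common way to get a 'characteristic' colour is to quantise pixels into coarse bins and take the most popular bin. The deck doesn't say which method was used, so the following is only a sketch of the general idea; it assumes the image has already been decoded into 8-bit RGB tuples (actual decoding is out of scope here).

```python
from collections import Counter

def dominant_colour(pixels, bits=3):
    """Quantise each (r, g, b) pixel to `bits` bits per channel,
    count the bins, and return the most common bin mapped back to a
    representative RGB value (the midpoint of the bin's range)."""
    shift = 8 - bits
    bins = Counter((r >> shift, g >> shift, b >> shift)
                   for r, g, b in pixels)
    (r, g, b), _ = bins.most_common(1)[0]
    half = 1 << (shift - 1)
    return ((r << shift) + half, (g << shift) + half, (b << shift) + half)

# Five reddish pixels outweigh two bluish ones.
colour = dominant_colour([(250, 10, 10)] * 5 + [(10, 10, 250)] * 2)
```

Coarse binning (here 8 levels per channel) is what makes near-identical shades count as one colour, which is the point when the output is a search facet rather than a precise palette.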
  5. Preservation
     • Format analysis, using extended MIME types (inc. version + charset), from:
       – Served
       – Apache Tika
       – DROID
       – First four bytes
       – File extension
     • Examples:
       – Understanding unidentified resources
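
The "first four bytes" signal is the simplest of the identification sources listed above: a lookup of magic numbers at the start of the payload. Here is a minimal sketch with a tiny illustrative table; real tools such as DROID and Tika use far richer signature databases, and the fallback MIME type shown is an assumption.

```python
# Illustrative magic-number table keyed on the first four bytes.
MAGIC = {
    b"%PDF": "application/pdf",
    b"GIF8": "image/gif",          # covers GIF87a and GIF89a
    b"\x89PNG": "image/png",
    b"PK\x03\x04": "application/zip",
}

def first_four_bytes_type(payload):
    """Guess a MIME type from the first four bytes of a resource,
    as a cheap cross-check against the served Content-Type."""
    return MAGIC.get(payload[:4], "application/octet-stream")
```

Disagreements between this guess, the served type, the file extension, and Tika/DROID are exactly what surfaces the "unidentified resources" worth investigating.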
  6. Preservation
     • Deeper characterisation:
       – Software identifiers
       – (X)HTML: elements used
       – XML: root namespace
       – PDF: Apache Preflight
       – Apache Tika's parse errors
     • Will consider adding:
       – DRMLint (SCAPE)
       – JHOVE
  7. Analytics
     • Researcher expectations:
       – "How big is the UK Web?"
     • From crawl to web:
       – Crawl schedule, parameters, logs
       – "Files over 10MB are not archived"
       – De-duplication handling is critical
       – Can't forget HTTP 30x, 40x, 50x responses
     • Compensate via normalisation strategies:
       – cf. Google Books Ngram
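
The Ngram-style normalisation mentioned above can be sketched very simply: divide each year's raw hit count by the size of that year's crawl, so that changes in crawl schedule or scope don't masquerade as trends in the web itself. The function and its inputs are illustrative assumptions.

```python
def normalise_trend(hits_by_year, totals_by_year):
    """Convert raw per-year hit counts into fractions of that year's
    crawl, in the spirit of Google Books Ngram normalisation.
    Years with no recorded total are dropped rather than guessed."""
    return {
        year: hits / totals_by_year[year]
        for year, hits in hits_by_year.items()
        if totals_by_year.get(year)
    }

# Raw hits quadruple, but so does the crawl: the normalised
# share of a term can still fall.
trend = normalise_trend({2008: 50, 2009: 200},
                        {2008: 1000, 2009: 10000})
```

This is also why the crawl-side caveats on the slide (size limits, de-duplication, non-200 responses) matter: they all change the denominator.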
  8. Technical Architecture
     • Core indexer can run from the CLI or Hadoop:
       – Makes development much easier
     • Hadoop indexer has two modes:
       – SolrCloud:
         • Performance acceptable as long as shards map to cores and there's good I/O (1 billion, 1 server, 1 week)
         • Memory issues relating to query complexity
       – Direct to HDFS:
         • Really fast for moderate data volumes
         • Slows down as shards grow
  9. Scale
     • 1996-2010 tranche of the IA dataset:
       – 2.5 billion HTTP 200 URLs
     • Performance issues:
       – Data quality
       – Robustness
       – Configuration errors
     • Currently re-indexing:
       – With better duplicate handling
       – On three dedicated servers
  10. Open Collaboration
      • Fully open source stack:
        – webarchive-discovery indexer
        – Begun developing an analytics UI
      • Keen to collaborate:
        – This community faces a common problem, but it is not a core SolrCloud/Elasticsearch use case
        – Danish SolrCloud on SSD, discovered via the Solr mailing list:
          http://sbdevel.wordpress.com/2013/12/06/danish-webscale/